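February 12, 2026 · Architecture

Chapter 5: API Design Patterns for AI Services

This chapter covers API design patterns specific to AI services: asynchronous job patterns, multimodal input handling, the Function Calling interface, batch APIs, and structured output.

17 min read · 1,113 characters · 9 sections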

api-design graphql architecture

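Learning Objectives

Compare asynchronous job-handling patterns for AI services (polling, webhooks, SSE)
Learn how to design APIs that accept multimodal input
Understand the protocol behind the Function Calling interface
Become familiar with batch API and structured output patterns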

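Asynchronous Job Patterns

AI inference can take anywhere from a few seconds to several minutes, so a purely synchronous request-response pattern is not enough. Choose the asynchronous pattern that fits the characteristics of the job.

Polling Pattern

The simplest pattern: the client checks the job status at regular intervals.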

polling_pattern.py

python

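from datetime import datetime
from enum import Enum
import uuid

from fastapi import BackgroundTasks, FastAPI, HTTPException
from pydantic import BaseModel

# CompletionInput, CompletionResult, batch_store, and process_batch are
# assumed to be defined elsewhere in the application.
app = FastAPI()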
 
 
class BatchStatus(str, Enum):
    pending = "pending"
    running = "running"
    completed = "completed"
    failed = "failed"
    cancelled = "cancelled"
 
 
class BatchRequest(BaseModel):
    model: str
    inputs: list[CompletionInput]
    metadata: dict[str, str] | None = None
 
 
class BatchResponse(BaseModel):
    id: str
    status: BatchStatus
    total: int
    completed: int
    failed: int
    results: list[CompletionResult] | None = None
    created_at: str
    completed_at: str | None = None
    # Polling interval hint (seconds)
    retry_after: int | None = None
 
 
@app.post("/api/v1/batches", status_code=202)
async def create_batch(
    request: BatchRequest,
    background_tasks: BackgroundTasks,
) -> BatchResponse:
    batch_id = f"batch_{uuid.uuid4().hex[:12]}"
    
    # Process the batch job in the background
    background_tasks.add_task(
        process_batch, batch_id, request
    )
    
    return BatchResponse(
        id=batch_id,
        status=BatchStatus.pending,
        total=len(request.inputs),
        completed=0,
        failed=0,
        created_at=datetime.now().isoformat(),
        retry_after=5,  # suggest checking again after 5 seconds
    )
 
 
@app.get("/api/v1/batches/{batch_id}")
async def get_batch(batch_id: str) -> BatchResponse:
    batch = await batch_store.get(batch_id)
    if not batch:
        raise HTTPException(status_code=404, detail="Batch not found")
    
    response = BatchResponse(**batch)
    
    # Polling interval hint based on status: only unfinished batches
    # need to be polled again
    if batch["status"] in ("pending", "running"):
        response.retry_after = 5
    
    return response
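
On the client side, the polling loop can honor the retry_after hint returned by the status endpoint. The following is a minimal sketch, not part of the original service code: poll_until_done, fetch_status, default_interval, and max_wait are hypothetical names, and fetch_status stands in for whatever HTTP call retrieves the batch as a dict.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_until_done(
    fetch_status: Callable[[], dict],
    *,
    default_interval: float = 5.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the batch reaches a terminal state or max_wait would be exceeded."""
    waited = 0.0
    while True:
        batch = fetch_status()
        if batch["status"] in TERMINAL_STATES:
            return batch
        # Prefer the server's retry_after hint; fall back to a default interval.
        interval = batch.get("retry_after") or default_interval
        if waited + interval > max_wait:
            raise TimeoutError(f"batch not finished after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Passing sleep as a parameter keeps the loop testable without real delays; a production client would probably also add jitter so that many clients do not poll in lockstep.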

Webhook Pattern

The server pushes the result to a client-provided URL when the job completes. This eliminates the unnecessary requests that polling generates.

webhook_pattern.py

python

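import hashlib
import hmac

import httpx

# Reuses the imports and definitions from polling_pattern.py
# (app, BaseModel, BackgroundTasks, BatchResponse, BatchStatus, process_batch).


class BatchRequestWithWebhook(BaseModel):
    model: str
    inputs: list[CompletionInput]
    webhook_url: str  # URL that receives the result on completion
    webhook_secret: str | None = None  # used for the HMAC signature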
 
 
@app.post("/api/v1/batches", status_code=202)
async def create_batch_with_webhook(
    request: BatchRequestWithWebhook,
    background_tasks: BackgroundTasks,
) -> BatchResponse:
    batch_id = f"batch_{uuid.uuid4().hex[:12]}"
    
    background_tasks.add_task(
        process_and_notify, batch_id, request
    )
    
    return BatchResponse(
        id=batch_id,
        status=BatchStatus.pending,
        total=len(request.inputs),
        completed=0,
        failed=0,
        created_at=datetime.now().isoformat(),
    )
 
 
async def process_and_notify(
    batch_id: str,
    request: BatchRequestWithWebhook,
):
    result = await process_batch(batch_id, request)
    
    # Deliver the result via webhook
    payload = result.model_dump_json()
    
    headers = {"Content-Type": "application/json"}
    if request.webhook_secret:
        signature = hmac.new(
            request.webhook_secret.encode(),
            payload.encode(),
            hashlib.sha256,
        ).hexdigest()
        headers["X-Webhook-Signature"] = f"sha256={signature}"
    
    async with httpx.AsyncClient() as client:
        await client.post(
            request.webhook_url,
            content=payload,
            headers=headers,
            timeout=30.0,
        )
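
The signature is only useful if the receiver checks it. As a sketch of the receiving side, assuming the same "sha256=<hex digest>" format in the X-Webhook-Signature header as the sender above (verify_webhook_signature is a hypothetical helper name):

```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw body and compare it to the header."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}"
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The check has to run over the raw request bytes exactly as received; re-serializing parsed JSON can change whitespace or key order and produce a different digest.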

SSE Pattern

SSE (Server-Sent Events) delivers one-way, real-time events from the server to the client. Chapter 6 covers it in depth, but it is also useful for status updates on asynchronous jobs.

sse_status_updates.py

python

import asyncio
import json

from sse_starlette.sse import EventSourceResponse
 
 
@app.get("/api/v1/batches/{batch_id}/stream")
async def stream_batch_status(batch_id: str):
    async def event_generator():
        while True:
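            batch = await batch_store.get(batch_id)
            if not batch:
                # Unknown batch: report the error and close the stream
                yield {
                    "event": "error",
                    "data": json.dumps({"error": "batch not found"}),
                }
                break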
            
            yield {
                "event": "status",
                "data": json.dumps({
                    "status": batch["status"],
                    "completed": batch["completed"],
                    "total": batch["total"],
                }),
            }
            
            if batch["status"] in ("completed", "failed", "cancelled"):
                yield {
                    "event": "done",
                    "data": json.dumps(batch),
                }
                break
            
            await asyncio.sleep(2)
    
    return EventSourceResponse(event_generator())
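
On the wire, each event from the endpoint above is a small text frame: an "event:" line, one or more "data:" lines, and a blank line that ends the event. Browsers parse this with EventSource, but a non-browser client has to do it itself. The following is a simplified sketch (parse_sse is a hypothetical helper that handles only the event and data fields and ignores id and retry):

```python
from typing import Iterable, Iterator


def _field_value(line: str, field: str) -> str:
    value = line[len(field) + 1:]  # skip "field:"
    # A single leading space after the colon is not part of the value.
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Turn raw SSE lines into {"event": ..., "data": ...} dicts."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data"))
```

A client would stop reading once it sees the "done" event, mirroring the break in the server-side generator.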

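Pattern Selection Guide

Pattern	Best suited for	Pros	Cons
Polling	Simple implementations, firewall-restricted clients	Easy to implement, stateless	Unnecessary requests, latency
Webhook	Server-to-server communication, long-running jobs	Real-time notification, efficient	Requires managing an endpoint
SSE	Browser clients, progress display	Real-time, reconnection support	One-way only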

Token Usage Reporting

In an AI API, token usage information is central to cost tracking and budget management.

usage_reporting.py

python

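from pydantic import BaseModel


# Detail models are declared first so that TokenUsage can reference them.
class PromptTokensDetail(BaseModel):
    cached_tokens: int = 0    # cached prompt tokens
    text_tokens: int = 0      # text tokens
    image_tokens: int = 0     # image tokens
    audio_tokens: int = 0     # audio tokens


class CompletionTokensDetail(BaseModel):
    text_tokens: int = 0
    reasoning_tokens: int = 0  # reasoning tokens (o1 family)


class TokenUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

    # Detailed breakdown (optional)
    prompt_tokens_details: PromptTokensDetail | None = None
    completion_tokens_details: CompletionTokensDetail | None = None


class CostInfo(BaseModel):
    prompt_cost: float
    completion_cost: float
    total_cost: float
    currency: str = "USD"


class Choice(BaseModel):
    # Minimal placeholder (an assumption): the shape follows the
    # response examples in the tool calling section of this chapter.
    index: int
    message: dict
    finish_reason: str | None = None


class CompletionResponse(BaseModel):
    id: str
    model: str
    choices: list[Choice]
    usage: TokenUsage

    # Cost information (optional but recommended)
    cost: CostInfo | None = None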

Tip

Returning estimated cost alongside token usage lets developers track spending in real time. The cached_tokens field is especially useful for checking how much prompt caching is actually saving.
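
To make the cost fields concrete, here is a rough sketch of how a server might derive them from a usage object. The Pricing class, the per-million-token prices, and the assumption that cached_tokens is a subset of prompt_tokens billed at a lower rate are all illustrative and not taken from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    prompt_per_mtok: float
    cached_prompt_per_mtok: float
    completion_per_mtok: float


def estimate_cost(usage: dict, pricing: Pricing) -> dict:
    """Build a CostInfo-shaped dict from a TokenUsage-shaped dict."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    # Assumption: cached tokens are counted inside prompt_tokens.
    uncached = usage["prompt_tokens"] - cached
    prompt_cost = (
        uncached * pricing.prompt_per_mtok
        + cached * pricing.cached_prompt_per_mtok
    ) / 1_000_000
    completion_cost = usage["completion_tokens"] * pricing.completion_per_mtok / 1_000_000
    return {
        "prompt_cost": prompt_cost,
        "completion_cost": completion_cost,
        "total_cost": prompt_cost + completion_cost,
        "currency": "USD",
    }
```

Comparing the result with and without the cached_tokens discount shows directly how much a caching strategy saves per request.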

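Multimodal Input Handling

Modern AI models process multiple kinds of input, such as text, images, and audio, at the same time. There are two approaches to designing a multimodal API.

Inline Base64 Approach

Small files (a few MB or less) are embedded directly in the request body.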

multimodal-inline.json

json

{
  "model": "claude-4-vision",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Please analyze this architecture diagram"
        },
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": "iVBORw0KGgoAAAANSUhEUg..."
          }
        }
      ]
    }
  ]
}
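
A client can build the image block shown above with the standard library. This is a sketch under stated assumptions: image_content_block is a hypothetical helper, and the 5 MB inline cutoff is an illustrative value, not a limit defined by this chapter.

```python
import base64

MAX_INLINE_BYTES = 5 * 1024 * 1024  # assumed cutoff for inlining


def image_content_block(data: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw image bytes in a base64 content block."""
    if len(data) > MAX_INLINE_BYTES:
        raise ValueError("image too large to inline; upload it and reference it by URL")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            # b64encode returns bytes; JSON needs a str.
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

Base64 turns every 3 bytes into 4 characters, so the request body grows by roughly a third; that overhead is one reason the inline approach is reserved for small files.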

URL Reference Approach

Large files are uploaded in advance and then referenced by URL.

multimodal-url.json

json

{
  "model": "claude-4-vision",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Please summarize this PDF document"
        },
        {
          "type": "document",
          "source": {
            "type": "url",
            "url": "https://files.example.com/uploads/abc123.pdf"
          }
        }
      ]
    }
  ]
}

File Upload API

Handling large files requires a separate upload endpoint.

file_upload.py

python

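from datetime import datetime

from fastapi import File, UploadFile

# FileResponse here is an application-defined Pydantic model (not
# fastapi.responses.FileResponse). app, APIError, and file_store are
# likewise assumed to be defined elsewhere.
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB limit


@app.post("/api/v1/files")
async def upload_file(
    file: UploadFile = File(...),
    purpose: str = "assistants",
) -> FileResponse:
    # Validate file size (UploadFile.size is None when the size is unknown)
    if file.size is not None and file.size > MAX_FILE_SIZE:
        raise APIError(
            status_code=413,
            error_type="file_too_large",
            message="File size must not exceed 100 MB",
        )

    # Validate against the allowed MIME types
    allowed_types = {
        "image/png", "image/jpeg", "image/webp", "image/gif",
        "application/pdf",
        # MP3 is registered as audio/mpeg; audio/mp3 is kept for lenient clients
        "audio/mpeg", "audio/mp3", "audio/wav", "audio/ogg",
        "text/plain", "text/csv",
    }
    if file.content_type not in allowed_types:
        raise APIError(
            status_code=415,
            error_type="unsupported_media_type",
            message=f"Unsupported file type: {file.content_type}",
        )

    # Store the file and return its metadata
    file_id = await file_store.save(file, purpose)

    return FileResponse(
        id=file_id,
        filename=file.filename,
        size=file.size,
        content_type=file.content_type,
        purpose=purpose,
        created_at=datetime.now().isoformat(),
    )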

Tool Calling Interface

Function Calling is the interface that lets an LLM use external tools. At the API level, you need to define a multi-step conversation protocol.

Tool Definition

tool-definition.json

json

{
  "model": "claude-4",
  "messages": [
    {
      "role": "user",
      "content": "What is the current weather in Seoul?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for the given city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "City name"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["city"]
        }
      }
    }
  ]
}

Multi-Step Conversation Protocol

tool_calling_flow.py

python

# Step 1: initial request -> response containing a tool call
initial_response = {
    "id": "resp_abc123",
    "model": "claude-4",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_xyz789",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "Seoul", "unit": "celsius"}'
                }
            }]
        },
        "finish_reason": "tool_calls"
    }],
    "usage": {"prompt_tokens": 120, "completion_tokens": 25, "total_tokens": 145}
}

# Step 2: follow-up request that includes the tool result
followup_request = {
    "model": "claude-4",
    "messages": [
        {"role": "user", "content": "What is the current weather in Seoul?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_xyz789",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "Seoul", "unit": "celsius"}'
                }
            }]
        },
        {
            "role": "tool",
            "tool_call_id": "call_xyz789",
            "content": '{"temperature": 18, "condition": "clear", "humidity": 45}'
        }
    ],
    "tools": [...]  # same tool definitions as in the initial request
}

# Step 3: final response
final_response = {
    "id": "resp_def456",
    "model": "claude-4",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "The current weather in Seoul is clear, with a temperature of 18°C and 45% humidity."
        },
        "finish_reason": "stop"
    }],
    "usage": {"prompt_tokens": 180, "completion_tokens": 35, "total_tokens": 215}
}

Warning

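In the tool calling interface, tool_call_id is what links each request to its response. With parallel tool calling, several tools can be invoked in the same turn, and the ID is how each result is mapped back to the correct call.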
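
The mapping can be sketched as a small dispatcher that executes every entry in tool_calls and emits one "tool" message per call, each carrying the ID of the call it answers. run_tool_calls and registry are hypothetical names, and returning errors as JSON content is one possible convention, not something the protocol above prescribes.

```python
import json
from typing import Callable


def run_tool_calls(
    tool_calls: list[dict],
    registry: dict[str, Callable[..., dict]],
) -> list[dict]:
    """Execute each tool call and return one tool message per call."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            # Arguments arrive as a JSON string, not as a parsed object.
            args = json.loads(call["function"]["arguments"])
            result = registry[name](**args)
        except Exception as exc:
            # Report the failure to the model instead of aborting the turn.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties this result to its call
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages
```

The returned messages are appended after the assistant message that requested the tools, which produces the follow-up request of step 2.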

Batch API

A batch API pattern for processing large volumes of requests efficiently. It suits analysis, evaluation, and data-processing jobs that do not need a real-time response.

batch_api.py

python

import uuid
from datetime import datetime

from pydantic import BaseModel

# app, BatchResponse (as in polling_pattern.py), and batch_processor are
# assumed to be defined elsewhere.


class BatchInput(BaseModel):
    custom_id: str  # client-side identifier
    method: str     # "POST"
    url: str        # "/v1/chat/completions"
    body: dict      # request body


class BatchOutput(BaseModel):
    custom_id: str
    status_code: int
    body: dict
    error: dict | None = None


@app.post("/api/v1/batches", status_code=202)
async def create_batch(
    inputs: list[BatchInput],
    completion_window: str = "24h",  # guaranteed completion window
    metadata: dict[str, str] | None = None,
) -> BatchResponse:
    """
    The batch API is offered at a 50% discount compared with individual requests.
    A single batch can contain up to 50,000 requests.
    """
    batch_id = f"batch_{uuid.uuid4().hex[:12]}"

    await batch_processor.enqueue(
        batch_id=batch_id,
        inputs=inputs,
        completion_window=completion_window,
        metadata=metadata,
    )

    return BatchResponse(
        id=batch_id,
        status="pending",
        total=len(inputs),
        completed=0,
        failed=0,
        created_at=datetime.now().isoformat(),  # required by BatchResponse
    )
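
The custom_id field exists so that a client can pair each output with the input that produced it without relying on ordering. A minimal client-side sketch, assuming outputs shaped like BatchOutput and using hypothetical helper names (build_batch_inputs, split_outputs):

```python
def build_batch_inputs(prompts: list[str], model: str) -> list[dict]:
    """Create one BatchInput-shaped dict per prompt with a unique custom_id."""
    return [
        {
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model, "messages": [{"role": "user", "content": p}]},
        }
        for i, p in enumerate(prompts)
    ]


def split_outputs(outputs: list[dict]) -> tuple[dict[str, dict], dict[str, dict]]:
    """Index outputs by custom_id, separating successes from failures."""
    succeeded: dict[str, dict] = {}
    failed: dict[str, dict] = {}
    for out in outputs:
        is_error = bool(out.get("error")) or out["status_code"] >= 400
        (failed if is_error else succeeded)[out["custom_id"]] = out
    return succeeded, failed
```

Keeping the failed entries keyed by custom_id makes it straightforward to resubmit only those requests in a follow-up batch.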

Structured Output

This pattern enforces the output format with JSON Schema so that an LLM's non-deterministic output can be handled programmatically.

structured-output-request.json

json

{
  "model": "claude-4",
  "messages": [
    {
      "role": "user",
      "content": "Extract the metadata from the following technical blog post: ..."
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "blog_metadata",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "summary": { "type": "string", "maxLength": 200 },
          "tags": {
            "type": "array",
            "items": { "type": "string" },
            "maxItems": 5
          },
          "difficulty": {
            "type": "string",
            "enum": ["beginner", "intermediate", "advanced"]
          },
          "estimated_reading_minutes": { "type": "integer", "minimum": 1 }
        },
        "required": ["title", "summary", "tags", "difficulty", "estimated_reading_minutes"],
        "additionalProperties": false
      }
    }
  }
}

structured_output_validation.py

python

from pydantic import BaseModel, Field
from typing import Literal
 
 
class BlogMetadata(BaseModel):
    title: str
    summary: str = Field(max_length=200)
    tags: list[str] = Field(max_length=5)
    difficulty: Literal["beginner", "intermediate", "advanced"]
    estimated_reading_minutes: int = Field(ge=1)
 
 
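# Validate the response against the Pydantic model.
# ai_client is assumed to be an async client defined elsewhere; await is only
# valid inside a coroutine, so the call is wrapped in a function.
async def extract_metadata(request: dict) -> BlogMetadata:
    response = await ai_client.complete(request)
    content = response.choices[0].message.content
    return BlogMetadata.model_validate_json(content)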
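
Validation can still fail in practice, for example when the output is cut off before the JSON is complete, so callers often wrap generation and validation in a bounded retry. The sketch below uses only the standard library; parse_with_retry, generate, and validate are hypothetical names, and validate is assumed to raise ValueError on invalid data.

```python
import json
from typing import Callable


def parse_with_retry(
    generate: Callable[[], str],
    validate: Callable[[dict], None],
    max_attempts: int = 3,
) -> dict:
    """Call generate() until its output parses as a JSON object and validates."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            validate(data)
            return data
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error
```

Capping the number of attempts matters because every retry is another billed model call.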

Error Taxonomy

Errors in AI APIs have more varied causes than those in traditional APIs, so a systematic error taxonomy is needed.

error_taxonomy.py

python

ERROR_TAXONOMY = {
    # Client errors (4xx)
    "invalid_request": (400, "The request is malformed"),
    "authentication_error": (401, "Authentication failed"),
    "permission_denied": (403, "You do not have permission to access this resource"),
    "not_found": (404, "The requested resource was not found"),
    "content_policy_violation": (400, "The request violates the content policy"),
    "context_length_exceeded": (400, "The model's context length was exceeded"),
    "invalid_tool_call": (400, "The tool call is malformed"),

    # Rate limits (429)
    "rate_limit_exceeded": (429, "The request limit was exceeded"),
    "token_limit_exceeded": (429, "The token limit was exceeded"),
    "budget_exceeded": (429, "The monthly budget was exceeded"),

    # Server errors (5xx)
    "model_error": (500, "An error occurred during model inference"),
    "model_overloaded": (503, "The model is overloaded"),
    "timeout": (504, "Inference timed out"),
}
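
One practical use of a typed taxonomy is deciding whether a client should retry. The status code alone is not enough: rate_limit_exceeded and budget_exceeded both map to 429, but only the former is likely to succeed a few seconds later. The sketch below is an illustration; RETRYABLE_TYPES, error_response, and backoff_delay are hypothetical names, and which types count as retryable is a policy assumption.

```python
# Assumed policy: only transient conditions are worth retrying.
RETRYABLE_TYPES = {
    "rate_limit_exceeded",
    "token_limit_exceeded",
    "model_overloaded",
    "timeout",
}


def error_response(error_type: str, status: int, message: str) -> dict:
    """Build an error body that tells the client whether a retry makes sense."""
    return {
        "error": {
            "type": error_type,
            "status": status,
            "message": message,
            "retryable": error_type in RETRYABLE_TYPES,
        }
    }


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, capped at cap."""
    return min(cap, base * 2 ** attempt)
```

A client would sleep for backoff_delay(attempt) between attempts only when retryable is true, and surface the error immediately otherwise.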

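Summary

This chapter examined API design patterns specific to AI services: three patterns for handling asynchronous jobs (polling, webhooks, SSE), the inline and URL-reference approaches to multimodal input, the multi-step Function Calling protocol, batch APIs, enforcing structured output with JSON Schema, and an AI-specific error taxonomy.

These patterns are not independent of one another; they are used in combination. For example, you might apply structured output to a batch API and use a webhook to signal completion.

Next Chapter Preview

Chapter 6 takes a deep look at streaming response interfaces, the most important user-experience element of an AI service. It covers the SSE protocol, the OpenAI-compatible streaming format, structured streaming, error handling, and frontend integration patterns.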
