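February 5, 2026 · AI / ML

Chapter 10: Hands-On Project - Building a Production AI Service Pipeline

A comprehensive hands-on project that builds an entire AI service deployment pipeline from start to finish, covering model serving, Kubernetes deployment, autoscaling, and CI/CD.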

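Project Overview

This chapter brings together everything covered so far to build a production-grade AI service deployment pipeline from start to finish. We set up a fictional scenario and complete every component that would actually need to be implemented, step by step.

Scenario

A SaaS startup operates a customer support chatbot service with the following requirements:

A chatbot based on the Llama-3.1-8B-Instruct model
Traffic concentrated on weekdays from 9 a.m. to 6 p.m., with lower traffic at night and on weekends
An average of 100,000 requests per day, peaking at 500 requests per minute
Mean TTFT (time to first token) within 2 seconds, P99 TTFT within 5 seconds
Monthly availability of at least 99.5%
Monthly budget of no more than 5,000 dollars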
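
The requirements above can be turned into a few numbers that guide the later sizing decisions. The following is a minimal sketch of that arithmetic; the function names and the 30-day month are assumptions made for illustration, not part of the project code.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(requests_per_day: int) -> float:
    """Average requests per second implied by a daily request volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_minute: int) -> float:
    """Requests per second implied by a per-minute peak."""
    return requests_per_minute / 60


def downtime_budget_hours(availability: float, days: int = 30) -> float:
    """Hours of allowed downtime per month (assumes a 30-day month)."""
    return days * 24 * (1 - availability)


avg = average_rps(100_000)            # scenario: 100,000 requests per day
peak = peak_rps(500)                  # scenario: 500 requests per minute at peak
budget = downtime_budget_hours(0.995)  # scenario: 99.5% monthly availability

print(f"average load: {avg:.2f} req/s")
print(f"peak load:    {peak:.2f} req/s ({peak / avg:.1f}x the average)")
print(f"downtime budget: {budget:.1f} hours per month")
```

The gap between average and peak load is what motivates the autoscaling range in Step 3, and the downtime budget is what the rolling update and probe settings need to protect.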

Overall Architecture

text

Production AI service architecture:
 
[Users]
    |
    v
[CloudFront CDN]
    |
    v
[ALB (Application Load Balancer)]
    |
    v
[EKS cluster]
    |
    +---> [Namespace: ai-serving]
    |       +---> [Deployment: vllm-chatbot] (GPU Pod x 2-4)
    |       +---> [Service: vllm-service]
    |       +---> [HPA: custom-metric based]
    |       +---> [ConfigMap: serving configuration]
    |       +---> [Secret: credentials]
    |
    +---> [Namespace: monitoring]
    |       +---> [Prometheus]
    |       +---> [Grafana]
    |       +---> [AlertManager]
    |
    +---> [Namespace: kube-system]
            +---> [GPU Operator]
            +---> [Cluster Autoscaler]
            +---> [AWS Node Termination Handler]
 
[S3: model storage]
[ECR: container images]
[GitHub Actions: CI/CD]

Step 1: Project Structure Setup

Repository Structure

text

ai-chatbot-serving/
  src/
    server.py           # API server (wrapper on top of vLLM)
    middleware.py        # Authentication, rate limiting, logging
    preprocessing.py     # Input preprocessing
    config.py            # Configuration management
  tests/
    unit/               # Unit tests
    smoke/              # Smoke tests
    load/               # Load tests
  evals/
    datasets/           # Evaluation datasets
    run_evaluation.py   # Evaluation runner script
    check_gates.py      # Quality gate checks
    thresholds.yaml     # Quality thresholds
  k8s/
    base/               # Kustomize base
    overlays/
      staging/
      production/
  scripts/
    download_model.sh   # Model download
    benchmark.py        # Benchmark script
    monitor_canary.py   # Canary monitoring
  .github/
    workflows/
      ci.yml
      cd.yml
      model-eval.yml
  Dockerfile
  entrypoint.sh         # Container entrypoint that launches vLLM (copied by the Dockerfile)
  requirements.txt
  requirements-dev.txt
  requirements-eval.txt

Application Code

src/config.py

python

import os
from dataclasses import dataclass
 
@dataclass
class ServingConfig:
    model_path: str = os.getenv("MODEL_PATH", "meta-llama/Llama-3.1-8B-Instruct")
    host: str = os.getenv("HOST", "0.0.0.0")
    port: int = int(os.getenv("PORT", "8000"))
    max_model_len: int = int(os.getenv("MAX_MODEL_LEN", "4096"))
    gpu_memory_utilization: float = float(os.getenv("GPU_MEMORY_UTILIZATION", "0.90"))
    max_num_seqs: int = int(os.getenv("MAX_NUM_SEQS", "256"))
    enable_prefix_caching: bool = os.getenv("ENABLE_PREFIX_CACHING", "true").lower() == "true"
 
    # System prompt
    system_prompt: str = os.getenv(
        "SYSTEM_PROMPT",
        "당신은 친절하고 전문적인 고객 지원 AI 어시스턴트입니다. "
        "정확하고 도움이 되는 답변을 제공하세요."
    )
 
    # Rate limiting
    rate_limit_rpm: int = int(os.getenv("RATE_LIMIT_RPM", "60"))

src/middleware.py

python

import time
import logging
from collections import defaultdict
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse
 
logger = logging.getLogger(__name__)
 
class RateLimitMiddleware(BaseHTTPMiddleware):
    """A simple in-memory rate limiter (state is per process, keyed by client IP)."""
 
    def __init__(self, app, rpm: int = 60):
        super().__init__(app)
        self.rpm = rpm
        self.requests: dict[str, list[float]] = defaultdict(list)
 
    async def dispatch(self, request: Request, call_next):
        if request.url.path in ("/health", "/metrics"):
            return await call_next(request)
 
        client_ip = request.client.host if request.client else "unknown"
        now = time.time()
 
        # Keep only the requests made within the last minute
        self.requests[client_ip] = [
            t for t in self.requests[client_ip]
            if now - t < 60
        ]
 
        if len(self.requests[client_ip]) >= self.rpm:
            return JSONResponse(
                status_code=429,
                content={"error": "Rate limit exceeded"}
            )
 
        self.requests[client_ip].append(now)
        return await call_next(request)
 
 
class RequestLoggingMiddleware(BaseHTTPMiddleware):
    """Request/response logging middleware."""
 
    async def dispatch(self, request: Request, call_next):
        start_time = time.time()
        response = await call_next(request)
        duration = time.time() - start_time
 
        if request.url.path not in ("/health", "/metrics"):
            logger.info(
                "request_completed",
                extra={
                    "method": request.method,
                    "path": request.url.path,
                    "status_code": response.status_code,
                    "duration_ms": round(duration * 1000, 2),
                }
            )
 
        return response
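
To see the sliding-window logic of `RateLimitMiddleware` in isolation, here is a minimal sketch without Starlette. `SlidingWindowLimiter` and its `allow` method are hypothetical names introduced for illustration, and the clock is passed in explicitly so the behavior is deterministic.

```python
from collections import defaultdict


class SlidingWindowLimiter:
    """Same idea as RateLimitMiddleware, reduced to a pure function of time."""

    def __init__(self, rpm: int, window_seconds: float = 60.0):
        self.rpm = rpm
        self.window_seconds = window_seconds
        self.requests: dict[str, list[float]] = defaultdict(list)

    def allow(self, key: str, now: float) -> bool:
        # Drop timestamps that have fallen out of the window
        recent = [t for t in self.requests[key] if now - t < self.window_seconds]
        allowed = len(recent) < self.rpm
        if allowed:
            recent.append(now)
        self.requests[key] = recent
        return allowed


limiter = SlidingWindowLimiter(rpm=3)
# Three requests fit in the window, the fourth is rejected,
# and a request after the oldest entries expire is accepted again.
decisions = [limiter.allow("10.0.0.1", now=t) for t in (0.0, 1.0, 2.0, 3.0, 61.5)]
print(decisions)  # [True, True, True, False, True]
```

Because the counters live in process memory, each replica enforces its own limit. A shared store would be needed if the limit has to hold across all Pods.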

src/server.py

python

import logging
import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse  # used by the /ready endpoint below
from src.config import ServingConfig
from src.middleware import RateLimitMiddleware, RequestLoggingMiddleware
 
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)
 
config = ServingConfig()
 
app = FastAPI(title="AI Chatbot Service")
app.add_middleware(RequestLoggingMiddleware)
app.add_middleware(RateLimitMiddleware, rpm=config.rate_limit_rpm)
 
@app.get("/health")
async def health():
    return {"status": "healthy"}
 
@app.get("/ready")
async def ready():
    # Proxy the health check of the vLLM server
    import httpx
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get(
                f"http://localhost:{config.port}/health",
                timeout=5.0,
            )
            if resp.status_code == 200:
                return {"status": "ready"}
        except Exception:
            pass
    return JSONResponse(
        status_code=503,
        content={"status": "not ready"}
    )
 
if __name__ == "__main__":
    uvicorn.run(app, host=config.host, port=8080)

Step 2: Docker Image Build

Dockerfile

dockerfile

FROM vllm/vllm-openai:v0.6.0 AS base
 
# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl && \
    rm -rf /var/lib/apt/lists/*
 
# Python dependencies
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
 
# Application code
COPY src/ /app/src/
 
WORKDIR /app
 
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
    CMD curl -sf http://localhost:8000/health || exit 1
 
EXPOSE 8000
 
# Run the vLLM server (configuration injected through environment variables)
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
 
ENTRYPOINT ["/app/entrypoint.sh"]

entrypoint.sh

bash

#!/bin/bash
set -e
 
MODEL_PATH="${MODEL_PATH:-meta-llama/Llama-3.1-8B-Instruct}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
GPU_MEM_UTIL="${GPU_MEMORY_UTILIZATION:-0.90}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-256}"
 
# `vllm serve` takes the model as its positional argument
ARGS=(
    "$MODEL_PATH"
    "--host" "0.0.0.0"
    "--port" "8000"
    "--max-model-len" "$MAX_MODEL_LEN"
    "--gpu-memory-utilization" "$GPU_MEM_UTIL"
    "--max-num-seqs" "$MAX_NUM_SEQS"
    "--dtype" "bfloat16"
    "--disable-log-requests"
)
 
if [ "$ENABLE_PREFIX_CACHING" = "true" ]; then
    ARGS+=("--enable-prefix-caching")
fi
 
if [ -n "$TENSOR_PARALLEL_SIZE" ]; then
    ARGS+=("--tensor-parallel-size" "$TENSOR_PARALLEL_SIZE")
fi
 
echo "Starting vLLM with args: vllm serve ${ARGS[*]}"
exec vllm serve "${ARGS[@]}"

Step 3: Kubernetes Manifests

Base Manifests

k8s/base/namespace.yaml

yaml

apiVersion: v1
kind: Namespace
metadata:
  name: ai-serving
  labels:
    app: ai-chatbot

k8s/base/configmap.yaml

yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
data:
  MODEL_PATH: "/models/Llama-3.1-8B-Instruct"
  MAX_MODEL_LEN: "4096"
  GPU_MEMORY_UTILIZATION: "0.90"
  MAX_NUM_SEQS: "256"
  ENABLE_PREFIX_CACHING: "true"
  SYSTEM_PROMPT: "당신은 친절하고 전문적인 고객 지원 AI 어시스턴트입니다. 정확하고 도움이 되는 답변을 제공하세요."

k8s/base/deployment.yaml

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-chatbot
  labels:
    app: vllm-chatbot
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: vllm-chatbot
  template:
    metadata:
      labels:
        app: vllm-chatbot
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 120
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-a100
      initContainers:
        - name: model-loader
          image: amazon/aws-cli:2.15
          command:
            - sh
            - -c
            - |
              if [ ! -f /models/Llama-3.1-8B-Instruct/config.json ]; then
                echo "Downloading model from S3..."
                aws s3 sync \
                  s3://ai-chatbot-models/Llama-3.1-8B-Instruct \
                  /models/Llama-3.1-8B-Instruct \
                  --quiet
                echo "Download complete"
              else
                echo "Model already cached"
              fi
          volumeMounts:
            - name: model-cache
              mountPath: /models
          envFrom:
            - secretRef:
                name: aws-credentials
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
      containers:
        - name: vllm
          image: ai-serving:latest
          envFrom:
            - configMapRef:
                name: vllm-config
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-cache
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "24Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "32Gi"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: 50Gi
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
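
The probe settings in this Deployment imply rough time budgets that are worth checking against how long the model actually takes to load. The sketch below computes them from the values above; the helper names are hypothetical, and the results are approximations that ignore probe timeouts and scheduling jitter.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time the container gets to start before it is restarted."""
    return initial_delay + period * failure_threshold


def detection_seconds(period: int, failure_threshold: int) -> int:
    """Approximate time for consecutive probe failures to trigger an action."""
    return period * failure_threshold


# Values taken from the Deployment manifest above
startup = startup_budget_seconds(initial_delay=30, period=10, failure_threshold=30)
readiness = detection_seconds(period=10, failure_threshold=3)
liveness = detection_seconds(period=30, failure_threshold=3)

print(f"startup budget:      about {startup}s")    # 330
print(f"removed from Service: about {readiness}s")  # 30
print(f"restarted by liveness: about {liveness}s")  # 90
```

If loading the weights from the `model-cache` volume takes longer than the startup budget, `failureThreshold` on the startup probe is the value to raise.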

k8s/base/service.yaml

yaml

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-chatbot
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
  type: ClusterIP

k8s/base/hpa.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-chatbot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-chatbot
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
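
For intuition about how this HPA reacts to queue depth, here is a minimal sketch of the replica calculation described in the Kubernetes documentation: the desired count is the current count scaled by the ratio of the observed metric to the target, rounded up and clamped to the min/max range. The function name is hypothetical, the 10% tolerance is the Kubernetes default and may differ in a given cluster, and the `behavior` policies (one Pod per period) are not modeled.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg: float,
    target_avg: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA would aim for, before behavior policies apply."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Target from the manifest: an average of 5 waiting requests per Pod, 2 to 4 replicas
print(desired_replicas(2, 6.0, 5.0, 2, 4))   # 3
print(desired_replicas(2, 12.0, 5.0, 2, 4))  # 4 (capped by maxReplicas)
print(desired_replicas(4, 1.0, 5.0, 2, 4))   # 2 (floored by minReplicas)
print(desired_replicas(3, 5.2, 5.0, 2, 4))   # 3 (within tolerance)
```

With the scale-up policy above, a jump from 2 to 4 replicas would still be applied one Pod at a time, so the queue can keep growing for a few minutes during a sharp spike.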

k8s/base/ingress.yaml

yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: chatbot-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
  tls:
    - hosts:
        - chatbot-api.example.com
      secretName: chatbot-api-tls

k8s/base/kustomization.yaml

yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
 
resources:
  - namespace.yaml
  - configmap.yaml
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - ingress.yaml
 
commonLabels:
  project: ai-chatbot

Step 4: Monitoring Setup

Prometheus Alerting Rules

k8s/base/prometheus-rules.yaml

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: monitoring
spec:
  groups:
    - name: vllm-serving
      rules:
        - alert: HighLatency
          expr: histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "P99 latency has exceeded 10 seconds"
 
        - alert: HighErrorRate
          expr: >
            rate(vllm:request_failure_total[5m])
            / (rate(vllm:request_success_total[5m]) + rate(vllm:request_failure_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate has exceeded 1%"
 
        - alert: GPUMemoryPressure
          # vLLM reports this gauge as a fraction, where 1 means 100% usage
          expr: vllm:gpu_cache_usage_perc > 0.95
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU KV cache usage has exceeded 95%"
 
        - alert: LongRequestQueue
          expr: avg(vllm:num_requests_waiting) > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "The request waiting queue is growing"

Step 5: Load Testing

Before deploying to production, run load tests to find the limits of the system and decide on appropriate settings.

tests/load/load_test.py

python

import asyncio
import time
import statistics
from openai import AsyncOpenAI
 
client = AsyncOpenAI(
    base_url="http://chatbot-api.example.com/v1",
    api_key="not-needed",
)
 
TEST_PROMPTS = [
    "배송 상태를 확인하고 싶습니다. 주문번호는 ABC123입니다.",
    "환불 절차에 대해 안내해 주세요.",
    "제품 사용 중 오류가 발생했습니다. 화면에 에러 코드 E-101이 표시됩니다.",
    "멤버십 등급 혜택이 궁금합니다.",
    "계정 비밀번호를 변경하고 싶습니다.",
]
 
async def send_request(semaphore, prompt):
    async with semaphore:
        start = time.perf_counter()
        first_token_time = None
        token_count = 0
 
        try:
            response = await client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=[
                    {"role": "system", "content": "당신은 고객 지원 AI입니다."},
                    {"role": "user", "content": prompt},
                ],
                max_tokens=256,
                stream=True,
            )
 
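            async for chunk in response:
                # Skip chunks that carry no choices or an empty delta
                if chunk.choices and chunk.choices[0].delta.content:
                    if first_token_time is None:
                        first_token_time = time.perf_counter() - start
                    # Counts streamed content chunks as an approximation of tokens
                    token_count += 1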
 
            total_time = time.perf_counter() - start
            return {
                "success": True,
                "ttft": first_token_time,
                "total_time": total_time,
                "tokens": token_count,
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "total_time": time.perf_counter() - start,
            }
 
async def run_load_test(
    total_requests: int = 500,
    concurrency: int = 50,
    ramp_up_seconds: int = 30,
):
    """Run the load test."""
    print(f"Starting load test: {total_requests} requests, concurrency={concurrency}")
    semaphore = asyncio.Semaphore(concurrency)
 
    prompts = [TEST_PROMPTS[i % len(TEST_PROMPTS)] for i in range(total_requests)]
 
    start_time = time.perf_counter()
    results = await asyncio.gather(
        *[send_request(semaphore, p) for p in prompts]
    )
    total_duration = time.perf_counter() - start_time
 
    # Analyze the results
    successful = [r for r in results if r["success"]]
    failed = [r for r in results if not r["success"]]
 
    if successful:
        ttfts = [r["ttft"] for r in successful if r.get("ttft")]
        total_times = [r["total_time"] for r in successful]
        total_tokens = sum(r["tokens"] for r in successful)
 
        print(f"\n--- Load Test Results ---")
        print(f"Total requests: {total_requests}")
        print(f"Successful: {len(successful)} ({len(successful)/total_requests*100:.1f}%)")
        print(f"Failed: {len(failed)} ({len(failed)/total_requests*100:.1f}%)")
        print(f"Total duration: {total_duration:.1f}s")
        print(f"Throughput: {total_tokens/total_duration:.1f} tokens/s")
        print(f"RPS: {total_requests/total_duration:.1f}")
        print(f"\nTTFT (ms):")
        print(f"  P50: {statistics.median(ttfts)*1000:.0f}")
        print(f"  P90: {sorted(ttfts)[int(len(ttfts)*0.9)]*1000:.0f}")
        print(f"  P99: {sorted(ttfts)[int(len(ttfts)*0.99)]*1000:.0f}")
        print(f"\nTotal latency (ms):")
        print(f"  P50: {statistics.median(total_times)*1000:.0f}")
        print(f"  P90: {sorted(total_times)[int(len(total_times)*0.9)]*1000:.0f}")
        print(f"  P99: {sorted(total_times)[int(len(total_times)*0.99)]*1000:.0f}")
 
if __name__ == "__main__":
    asyncio.run(run_load_test())

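Tip

Run load tests in a staging environment configured identically to production. Results from a local environment can differ significantly from production because of differences in network latency, GPU performance, and similar factors.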
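
Once the load test has produced a list of TTFT samples, they can be checked directly against the scenario targets (mean within 2 seconds, P99 within 5 seconds). The following is a minimal sketch; `check_ttft_slo` and `nearest_rank_percentile` are hypothetical helpers, the sample data is made up for illustration, and the percentile uses the nearest-rank method rather than the index arithmetic in the script above.

```python
import math
import statistics


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_ttft_slo(ttfts: list[float], mean_limit: float = 2.0, p99_limit: float = 5.0) -> dict:
    """Compare TTFT samples (in seconds) with the scenario's latency targets."""
    mean_ttft = statistics.fmean(ttfts)
    p99_ttft = nearest_rank_percentile(ttfts, 99)
    return {
        "mean": mean_ttft,
        "p99": p99_ttft,
        "mean_ok": mean_ttft <= mean_limit,
        "p99_ok": p99_ttft <= p99_limit,
    }


# Illustrative samples only: 97 fast responses and three slow ones
sample = [0.4] * 97 + [1.5, 3.0, 6.5]
result = check_ttft_slo(sample)
print(result)
```

A check like this could be wired into the quality gates so that a run fails when either target is missed, instead of relying on reading the printed percentiles.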

Step 6: Cost Analysis

This is a cost analysis for operating within the scenario's budget of 5,000 dollars per month.

text

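Cost breakdown (mixed strategy):

Base capacity (on-demand, 24/7):
  g5.2xlarge (A10G 24GB) x 2
  AWQ 4-bit quantization applied (8B model -> about 5GB)
  Hourly: $1.21 x 2 = $2.42
  Monthly: $2.42 x 24 x 30 = $1,742

Additional capacity (spot, business hours):
  g5.2xlarge (spot) x 1-2
  Weekdays 9:00-18:00 (about 200 hours per month)
  Hourly: $0.36 x 1.5 (average) = $0.54
  Monthly: $0.54 x 200 = $108

EKS cluster:
  $0.10/hour x 720 hours = $72

Storage + network:
  About $100

Monitoring:
  About $50

Total monthly cost: about $2,072
Share of budget: 41.4% (ample headroom)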
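
The breakdown above is easy to keep as a small script so the total can be recomputed when prices or instance counts change. This is a minimal sketch using the figures quoted in this step; the constant names are chosen for illustration, and actual prices vary by region and over time.

```python
ON_DEMAND_HOURLY = 1.21   # g5.2xlarge on-demand price used above (USD/hour)
SPOT_HOURLY = 0.36        # g5.2xlarge spot price used above (USD/hour)
EKS_HOURLY = 0.10         # EKS control plane (USD/hour)
MONTHLY_BUDGET = 5_000

base = ON_DEMAND_HOURLY * 2 * 24 * 30   # two on-demand nodes, 24/7
burst = SPOT_HOURLY * 1.5 * 200         # 1.5 spot nodes on average, ~200 business hours
eks = EKS_HOURLY * 720
storage_network = 100
monitoring = 50

total = base + burst + eks + storage_network + monitoring
share = total / MONTHLY_BUDGET * 100

print(f"base capacity:  ${base:,.0f}")
print(f"burst capacity: ${burst:,.0f}")
print(f"total:          ${total:,.0f} ({share:.1f}% of budget)")
```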

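Info

Llama-3.1-8B-Instruct with AWQ 4-bit quantization on a g5.2xlarge (A10G 24GB) provides sufficient performance for this scenario. The weights take about 8GB of GPU memory with 8-bit quantization and about 5GB with 4-bit, so the model fits comfortably on an A10G 24GB. Because the budget has headroom, serving in FP16 (16GB) is also an option when quality matters.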
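
The memory figures above follow from the parameter count and the bits stored per weight. The sketch below shows that back-of-the-envelope estimate; it counts raw weights only, so the 4-bit result comes out at 4GB, and the roughly 5GB quoted above presumably includes quantization metadata and layers kept at higher precision. The function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Raw weight size in GB (10^9 bytes), ignoring quantization overhead."""
    return params_billion * bits_per_param / 8


def remaining_for_kv_cache_gb(gpu_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left inside the utilization budget after weights (activations not modeled)."""
    return gpu_gb * utilization - weights_gb


for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    weights = weight_memory_gb(8, bits)
    headroom = remaining_for_kv_cache_gb(24, 0.90, weights)
    print(f"{label:>5}: weights ~{weights:.0f}GB, ~{headroom:.1f}GB left for KV cache")
```

The headroom matters as much as the fit: whatever is left after the weights is what the KV cache can use, which in turn bounds how many concurrent sequences a single A10G can serve.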

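Step 7: Operations Runbook

This step documents the main situations that can arise in production operations and the response procedure for each.

Model Server Unresponsive

text

Symptoms: health check failures, request timeouts
Possible causes: GPU OOM, process deadlock, CUDA errors

Response procedure:
1. Check Pod status
   kubectl get pods -n ai-serving -l app=vllm-chatbot

2. Check Pod logs
   kubectl logs <pod-name> -n ai-serving --tail=100

3. Check GPU status (inside the Pod)
   kubectl exec <pod-name> -n ai-serving -- nvidia-smi

4. Restart the Pod (last resort)
   kubectl delete pod <pod-name> -n ai-serving

5. If the problem recurs, review resource settings
   - Lower GPU memory utilization (0.90 -> 0.85)
   - Reduce the maximum number of concurrent requests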

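Increased Request Latency

text

Symptoms: higher TTFT and TPOT, deepening waiting queue
Possible causes: traffic spike, GPU overload, KV cache saturation

Response procedure:
1. Check current metrics
   - Number of waiting requests, GPU cache usage, request rate

2. Check HPA status
   kubectl get hpa -n ai-serving

3. Scale up manually (in an emergency)
   kubectl scale deployment vllm-chatbot -n ai-serving --replicas=4

4. Analyze traffic sources
   - Check for abnormal bulk requests
   - Check whether rate limiting is being applied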

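Spot Instance Reclamation

text

Symptoms: Pod termination, node drain
Cause: AWS reclaiming spot instance capacity

Response procedure:
1. The Node Termination Handler handles this automatically
   - Pods are evicted safely
   - In-flight requests go through graceful shutdown

2. Confirm that the on-demand Pods are keeping the service up

3. The Cluster Autoscaler provisions a new node
   - Recreated on spot if spot capacity is available
   - Falls back to on-demand if spot is unavailable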

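Series Recap

Over this series, we built the full pipeline for deploying an AI model to production, one step at a time.

Chapter 1 laid out the big picture of AI service deployment and the challenges unique to it.

Chapter 2 compared vLLM and TGI to study the architecture of model serving frameworks and the criteria for choosing one.

Chapter 3 covered techniques for optimizing inference performance through quantization, batching, and KV cache strategies.

Chapter 4 packaged the AI service in GPU-enabled Docker containers and established an efficient image management strategy.

Chapter 5 studied core Kubernetes concepts from the perspective of AI workloads and covered GPU node configuration and cluster design.

Chapter 6 implemented the probe configuration, resource management, and zero-downtime deployment strategies needed for production.

Chapter 7 built an autoscaling strategy using a custom-metric HPA and the Cluster Autoscaler.

Chapter 8 covered cost optimization strategies based on spot instances, quantization, and model routing.

Chapter 9 built a CI/CD pipeline with GitHub Actions and integrated model evaluation into the deployment process.

Chapter 10 brought everything together to build a production-grade AI service pipeline in practice.

AI service deployment is a fast-moving field. Serving frameworks such as vLLM continue to be optimized, and GPU hardware improves with each generation. The architecture patterns and design principles covered in this series provide a foundation that remains valid even as the tools change. Keep learning and experimenting to evolve a deployment pipeline tuned to your own service.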

2026년 2월 5일·AI / ML·

10장: 실전 프로젝트 - 프로덕션 AI 서비스 파이프라인 구축

모델 서빙부터 Kubernetes 배포, 오토스케일링, CI/CD까지 전체 AI 서비스 배포 파이프라인을 처음부터 끝까지 구축하는 종합 실전 프로젝트입니다.

20분1,743자9개 섹션

mlops kubernetes infrastructure performance

ai-deployment10 / 10

1 2 3 4 5 6 7 8 9 10

이전9장: CI/CD 파이프라인 - GitHub Actions로 모델 배포 자동화

프로젝트 개요

시나리오

SaaS 스타트업에서 고객 지원 챗봇 서비스를 운영합니다. 다음과 같은 요구사항이 있습니다.

Llama-3.1-8B-Instruct 모델 기반 챗봇
평일 오전 9시~오후 6시 트래픽 집중, 야간/주말 트래픽 감소
평균 일간 요청량 10만 건, 피크 시 분당 500건
평균 TTFT 2초 이내, P99 TTFT 5초 이내
월간 가용성 99.5% 이상
월간 예산 5,000달러 이내

전체 아키텍처

text

프로덕션 AI 서비스 아키텍처:
 
[사용자]
    |
    v
[CloudFront CDN]
    |
    v
[ALB (Application Load Balancer)]
    |
    v
[EKS 클러스터]
    |
    +---> [Namespace: ai-serving]
    |       +---> [Deployment: vllm-chatbot] (GPU Pod x 2-4)
    |       +---> [Service: vllm-service]
    |       +---> [HPA: 커스텀 메트릭 기반]
    |       +---> [ConfigMap: 서빙 설정]
    |       +---> [Secret: 인증 정보]
    |
    +---> [Namespace: monitoring]
    |       +---> [Prometheus]
    |       +---> [Grafana]
    |       +---> [AlertManager]
    |
    +---> [Namespace: kube-system]
            +---> [GPU Operator]
            +---> [Cluster Autoscaler]
            +---> [AWS Node Termination Handler]
 
[S3: 모델 스토리지]
[ECR: 컨테이너 이미지]
[GitHub Actions: CI/CD]

1단계: 프로젝트 구조 설정

리포지토리 구조

text

ai-chatbot-serving/
  src/
    server.py           # API 서버 (vLLM 위에 래퍼)
    middleware.py        # 인증, 레이트 리미팅, 로깅
    preprocessing.py     # 입력 전처리
    config.py            # 설정 관리
  tests/
    unit/               # 유닛 테스트
    smoke/              # 스모크 테스트
    load/               # 부하 테스트
  evals/
    datasets/           # 평가 데이터셋
    run_evaluation.py   # 평가 실행 스크립트
    check_gates.py      # 품질 게이트 검사
    thresholds.yaml     # 품질 임계값
  k8s/
    base/               # Kustomize 베이스
    overlays/
      staging/
      production/
  scripts/
    download_model.sh   # 모델 다운로드
    benchmark.py        # 벤치마크 스크립트
    monitor_canary.py   # 카나리 모니터링
  .github/
    workflows/
      ci.yml
      cd.yml
      model-eval.yml
  Dockerfile
  requirements.txt
  requirements-dev.txt
  requirements-eval.txt

애플리케이션 코드

src/config.py

python

import os
from dataclasses import dataclass
 
@dataclass
class ServingConfig:
    model_path: str = os.getenv("MODEL_PATH", "meta-llama/Llama-3.1-8B-Instruct")
    host: str = os.getenv("HOST", "0.0.0.0")
    port: int = int(os.getenv("PORT", "8000"))
    max_model_len: int = int(os.getenv("MAX_MODEL_LEN", "4096"))
    gpu_memory_utilization: float = float(os.getenv("GPU_MEMORY_UTILIZATION", "0.90"))
    max_num_seqs: int = int(os.getenv("MAX_NUM_SEQS", "256"))
    enable_prefix_caching: bool = os.getenv("ENABLE_PREFIX_CACHING", "true").lower() == "true"
 
    # 시스템 프롬프트
    system_prompt: str = os.getenv(
        "SYSTEM_PROMPT",
        "당신은 친절하고 전문적인 고객 지원 AI 어시스턴트입니다. "
        "정확하고 도움이 되는 답변을 제공하세요."
    )
 
    # 레이트 리미팅
    rate_limit_rpm: int = int(os.getenv("RATE_LIMIT_RPM", "60"))

src/middleware.py

python

import time
import logging
from collections import defaultdict
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse
 
logger = logging.getLogger(__name__)
 
class RateLimitMiddleware(BaseHTTPMiddleware):
    """간단한 인메모리 레이트 리미터입니다."""
 
    def __init__(self, app, rpm: int = 60):
        super().__init__(app)
        self.rpm = rpm
        self.requests: dict[str, list[float]] = defaultdict(list)
 
    async def dispatch(self, request: Request, call_next):
        if request.url.path in ("/health", "/metrics"):
            return await call_next(request)
 
        client_ip = request.client.host if request.client else "unknown"
        now = time.time()
 
        # 1분 이내의 요청만 유지
        self.requests[client_ip] = [
            t for t in self.requests[client_ip]
            if now - t < 60
        ]
 
        if len(self.requests[client_ip]) >= self.rpm:
            return JSONResponse(
                status_code=429,
                content={"error": "Rate limit exceeded"}
            )
 
        self.requests[client_ip].append(now)
        return await call_next(request)
 
 
class RequestLoggingMiddleware(BaseHTTPMiddleware):
    """요청/응답 로깅 미들웨어입니다."""
 
    async def dispatch(self, request: Request, call_next):
        start_time = time.time()
        response = await call_next(request)
        duration = time.time() - start_time
 
        if request.url.path not in ("/health", "/metrics"):
            logger.info(
                "request_completed",
                extra={
                    "method": request.method,
                    "path": request.url.path,
                    "status_code": response.status_code,
                    "duration_ms": round(duration * 1000, 2),
                }
            )
 
        return response

src/server.py

python

import logging
import uvicorn
from fastapi import FastAPI
from src.config import ServingConfig
from src.middleware import RateLimitMiddleware, RequestLoggingMiddleware
 
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)
 
config = ServingConfig()
 
app = FastAPI(title="AI Chatbot Service")
app.add_middleware(RequestLoggingMiddleware)
app.add_middleware(RateLimitMiddleware, rpm=config.rate_limit_rpm)
 
@app.get("/health")
async def health():
    return {"status": "healthy"}
 
@app.get("/ready")
async def ready():
    # vLLM 서버의 헬스 체크를 프록시
    import httpx
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get(
                f"http://localhost:{config.port}/health",
                timeout=5.0,
            )
            if resp.status_code == 200:
                return {"status": "ready"}
        except Exception:
            pass
    return JSONResponse(
        status_code=503,
        content={"status": "not ready"}
    )
 
if __name__ == "__main__":
    uvicorn.run(app, host=config.host, port=8080)

2단계: Docker 이미지 빌드

Dockerfile

dockerfile

FROM vllm/vllm-openai:v0.6.0 AS base
 
# 시스템 의존성
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl && \
    rm -rf /var/lib/apt/lists/*
 
# Python 의존성
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
 
# 애플리케이션 코드
COPY src/ /app/src/
 
WORKDIR /app
 
# 헬스 체크
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
    CMD curl -sf http://localhost:8000/health || exit 1
 
EXPOSE 8000
 
# vLLM 서버 실행 (환경 변수로 설정 주입)
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
 
ENTRYPOINT ["/app/entrypoint.sh"]

entrypoint.sh

bash

#!/bin/bash
set -e
 
MODEL_PATH="${MODEL_PATH:-meta-llama/Llama-3.1-8B-Instruct}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
GPU_MEM_UTIL="${GPU_MEMORY_UTILIZATION:-0.90}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-256}"
 
ARGS=(
    "--model" "$MODEL_PATH"
    "--host" "0.0.0.0"
    "--port" "8000"
    "--max-model-len" "$MAX_MODEL_LEN"
    "--gpu-memory-utilization" "$GPU_MEM_UTIL"
    "--max-num-seqs" "$MAX_NUM_SEQS"
    "--dtype" "bfloat16"
    "--disable-log-requests"
)
 
if [ "$ENABLE_PREFIX_CACHING" = "true" ]; then
    ARGS+=("--enable-prefix-caching")
fi
 
if [ -n "$TENSOR_PARALLEL_SIZE" ]; then
    ARGS+=("--tensor-parallel-size" "$TENSOR_PARALLEL_SIZE")
fi
 
echo "Starting vLLM with args: vllm serve ${ARGS[*]}"
exec vllm serve "${ARGS[@]}"

3단계: Kubernetes 매니페스트

베이스 매니페스트

k8s/base/namespace.yaml

yaml

apiVersion: v1
kind: Namespace
metadata:
  name: ai-serving
  labels:
    app: ai-chatbot

k8s/base/configmap.yaml

yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
data:
  MODEL_PATH: "/models/Llama-3.1-8B-Instruct"
  MAX_MODEL_LEN: "4096"
  GPU_MEMORY_UTILIZATION: "0.90"
  MAX_NUM_SEQS: "256"
  ENABLE_PREFIX_CACHING: "true"
  SYSTEM_PROMPT: "당신은 친절하고 전문적인 고객 지원 AI 어시스턴트입니다. 정확하고 도움이 되는 답변을 제공하세요."

k8s/base/deployment.yaml

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-chatbot
  labels:
    app: vllm-chatbot
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: vllm-chatbot
  template:
    metadata:
      labels:
        app: vllm-chatbot
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 120
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-a100
      initContainers:
        - name: model-loader
          image: amazon/aws-cli:2.15
          command:
            - sh
            - -c
            - |
              if [ ! -f /models/Llama-3.1-8B-Instruct/config.json ]; then
                echo "Downloading model from S3..."
                aws s3 sync \
                  s3://ai-chatbot-models/Llama-3.1-8B-Instruct \
                  /models/Llama-3.1-8B-Instruct \
                  --quiet
                echo "Download complete"
              else
                echo "Model already cached"
              fi
          volumeMounts:
            - name: model-cache
              mountPath: /models
          envFrom:
            - secretRef:
                name: aws-credentials
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
      containers:
        - name: vllm
          image: ai-serving:latest
          envFrom:
            - configMapRef:
                name: vllm-config
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-cache
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "24Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "32Gi"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: 50Gi
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
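
The probe settings in the manifest above imply time budgets that are easy to misjudge by eye. The sketch below derives them from the manifest values; the function names are hypothetical, and the results are approximations because the kubelet's actual timing also depends on probe timeouts and scheduling jitter. Probes only start once the init container has finished, so the S3 download time is not counted against the startup budget.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate upper bound on boot time before the kubelet restarts the container."""
    return initial_delay + period * failure_threshold


def failure_detection_seconds(period: int, failure_threshold: int) -> int:
    """Approximate time for a probe to register enough consecutive failures."""
    return period * failure_threshold


# Values copied from the Deployment manifest above.
print("startup budget:", startup_budget_seconds(30, 10, 30), "s")
print("readiness detection:", failure_detection_seconds(10, 3), "s")
print("liveness detection:", failure_detection_seconds(30, 3), "s")
```

If loading the model onto the GPU takes longer than the startup budget, raising `failureThreshold` on the startup probe is usually a safer adjustment than loosening the liveness probe.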

k8s/base/service.yaml

yaml

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-chatbot
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
  type: ClusterIP

k8s/base/hpa.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-chatbot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-chatbot
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
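
To reason about how this HPA reacts to queue depth, it helps to work through the documented scaling formula, `desired = ceil(current * currentMetric / targetMetric)`. The sketch below applies it with the target of 5 waiting requests per Pod and the 2 to 4 replica bounds from the manifest. It is a simplification: the function name is hypothetical, the 10% tolerance is the Kubernetes default, and the `behavior` policies (one Pod per period, stabilization windows) are not modeled.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg: float,
    target_avg: float = 5.0,
    min_replicas: int = 2,
    max_replicas: int = 4,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA formula would ask for, before behavior policies apply."""
    ratio = current_avg / target_avg
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 12.0))  # queue well above target
print(desired_replicas(3, 5.2))   # within tolerance
print(desired_replicas(4, 1.0))   # queue nearly empty
```

Because the `scaleUp` policy adds at most one Pod every 120 seconds, the first case still takes two scaling steps to reach the ceiling of 4 replicas.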

k8s/base/ingress.yaml

yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: chatbot-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
  tls:
    - hosts:
        - chatbot-api.example.com
      secretName: chatbot-api-tls

k8s/base/kustomization.yaml

yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
 
resources:
  - namespace.yaml
  - configmap.yaml
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - ingress.yaml
 
commonLabels:
  project: ai-chatbot

Step 4: Monitoring Setup

Prometheus Alert Rules

k8s/base/prometheus-rules.yaml

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: monitoring
spec:
  groups:
    - name: vllm-serving
      rules:
        - alert: HighLatency
          expr: histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "P99 latency has exceeded 10 seconds"
 
        - alert: HighErrorRate
          expr: >
            rate(vllm:request_failure_total[5m])
            / (rate(vllm:request_success_total[5m]) + rate(vllm:request_failure_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate has exceeded 1%"
 
        - alert: GPUMemoryPressure
          # vLLM reports this gauge as a fraction (1.0 = 100%), so compare against 0.95.
          expr: vllm:gpu_cache_usage_perc > 0.95
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU KV cache usage has exceeded 95%"
 
        - alert: LongRequestQueue
          expr: avg(vllm:num_requests_waiting) > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "The request queue is growing"
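
The `HighLatency` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation rather than from raw samples. The sketch below mirrors that interpolation in plain Python so the alert threshold is easier to reason about. The bucket boundaries and counts are made-up illustration values, not vLLM's actual bucket layout, and the function assumes at least two buckets ending with a `+Inf` bucket.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical latency buckets in seconds.
sample = [(1.0, 50), (2.5, 90), (5.0, 97), (10.0, 100), (math.inf, 100)]
print(round(histogram_quantile(0.99, sample), 2))
print(histogram_quantile(0.5, sample))
```

Since the estimate can never exceed the highest finite bucket bound, an alert threshold above that bound would never fire, which is worth checking against the buckets the server actually exports.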

Step 5: Load Testing

Before deploying to production, run a load test to find the system's limits and choose appropriate settings.

tests/load/load_test.py

python

import asyncio
import time
import statistics
from openai import AsyncOpenAI
 
client = AsyncOpenAI(
    base_url="https://chatbot-api.example.com/v1",  # the Ingress terminates TLS and redirects plain HTTP
    api_key="not-needed",
)
 
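TEST_PROMPTS = [
    "I'd like to check my delivery status. My order number is ABC123.",
    "Please walk me through the refund process.",
    "I ran into an error while using the product. The screen shows error code E-101.",
    "I'd like to know about the membership tier benefits.",
    "I want to change my account password.",
]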
 
async def send_request(semaphore, prompt):
    async with semaphore:
        start = time.perf_counter()
        first_token_time = None
        token_count = 0
 
        try:
            response = await client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=[
                    {"role": "system", "content": "You are a customer support AI."},
                    {"role": "user", "content": prompt},
                ],
                max_tokens=256,
                stream=True,
            )
 
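            async for chunk in response:
                # Some chunks (for example a trailing usage chunk) carry no choices.
                if chunk.choices and chunk.choices[0].delta.content:
                    if first_token_time is None:
                        first_token_time = time.perf_counter() - start
                    # Counts streamed chunks, which only approximates generated tokens.
                    token_count += 1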
 
            total_time = time.perf_counter() - start
            return {
                "success": True,
                "ttft": first_token_time,
                "total_time": total_time,
                "tokens": token_count,
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "total_time": time.perf_counter() - start,
            }
 
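async def run_load_test(
    total_requests: int = 500,
    concurrency: int = 50,
    ramp_up_seconds: int = 30,
):
    """Run the load test, spreading request starts across the ramp-up window."""
    print(f"Starting load test: {total_requests} requests, concurrency={concurrency}")
    semaphore = asyncio.Semaphore(concurrency)

    prompts = [TEST_PROMPTS[i % len(TEST_PROMPTS)] for i in range(total_requests)]

    async def delayed_request(index, prompt):
        # Stagger start times so load ramps up instead of arriving all at once.
        await asyncio.sleep(ramp_up_seconds * index / total_requests)
        return await send_request(semaphore, prompt)

    # Note: total_duration includes the ramp-up window.
    start_time = time.perf_counter()
    results = await asyncio.gather(
        *[delayed_request(i, p) for i, p in enumerate(prompts)]
    )
    total_duration = time.perf_counter() - start_time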
 
    # Analyze results
    successful = [r for r in results if r["success"]]
    failed = [r for r in results if not r["success"]]

    # Always print the summary so a run with zero successes is still reported.
    print("\n--- Load Test Results ---")
    print(f"Total requests: {total_requests}")
    print(f"Successful: {len(successful)} ({len(successful)/total_requests*100:.1f}%)")
    print(f"Failed: {len(failed)} ({len(failed)/total_requests*100:.1f}%)")
    print(f"Total duration: {total_duration:.1f}s")
    print(f"RPS: {total_requests/total_duration:.1f}")

    if successful:
        ttfts = sorted(r["ttft"] for r in successful if r["ttft"] is not None)
        total_times = sorted(r["total_time"] for r in successful)
        # "tokens" holds streamed chunk counts, an approximation of generated tokens.
        total_tokens = sum(r["tokens"] for r in successful)
        print(f"Throughput: {total_tokens/total_duration:.1f} tokens/s")

        # Skip TTFT stats if no successful request produced any content.
        if ttfts:
            print("\nTTFT (ms):")
            print(f"  P50: {statistics.median(ttfts)*1000:.0f}")
            print(f"  P90: {ttfts[int(len(ttfts)*0.9)]*1000:.0f}")
            print(f"  P99: {ttfts[int(len(ttfts)*0.99)]*1000:.0f}")
        print("\nTotal latency (ms):")
        print(f"  P50: {statistics.median(total_times)*1000:.0f}")
        print(f"  P90: {total_times[int(len(total_times)*0.9)]*1000:.0f}")
        print(f"  P99: {total_times[int(len(total_times)*0.99)]*1000:.0f}")
 
if __name__ == "__main__":
    asyncio.run(run_load_test())
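
The scenario sets two TTFT targets: an average of at most 2 seconds and a P99 of at most 5 seconds. The following sketch shows one way the TTFT samples collected by a load test could be checked against those targets automatically. The helper names `percentile` and `check_ttft_slo` are hypothetical, and the percentile uses the same simple index-based method as the script above.

```python
import statistics


def percentile(values: list[float], q: float) -> float:
    """Index-based percentile over sorted values, clamped to the last element."""
    ordered = sorted(values)
    index = min(int(len(ordered) * q), len(ordered) - 1)
    return ordered[index]


def check_ttft_slo(ttfts: list[float], mean_limit: float = 2.0, p99_limit: float = 5.0) -> dict:
    """Compare TTFT samples (seconds) with the scenario's mean and P99 targets."""
    mean_ttft = statistics.fmean(ttfts)
    p99_ttft = percentile(ttfts, 0.99)
    return {
        "mean": mean_ttft,
        "p99": p99_ttft,
        "passed": mean_ttft <= mean_limit and p99_ttft <= p99_limit,
    }


# Illustrative samples: most requests are fast, but the slowest 10% take 6 s.
samples = [0.5] * 90 + [6.0] * 10
print(check_ttft_slo(samples))
```

A run like this passes the average target while failing P99, which is why both numbers need to be checked rather than the mean alone.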

Step 6: Cost Analysis

This cost analysis shows how to operate within the scenario's budget of $5,000 per month.

text

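Cost breakdown (mixed strategy):

Baseline capacity (on-demand, 24/7):
  g5.2xlarge (A10G 24GB) x 2
  AWQ 4-bit quantization applied (8B model -> about 5 GB)
  Hourly: $1.21 x 2 = $2.42
  Monthly: $2.42 x 24 x 30 = $1,742

Additional capacity (spot, business hours):
  g5.2xlarge (spot) x 1-2
  Weekdays 9:00-18:00 (about 200 hours per month)
  Hourly: $0.36 x 1.5 (average) = $0.54
  Monthly: $0.54 x 200 = $108

EKS cluster:
  $0.10/hour x 720 hours = $72

Storage + network:
  about $100

Monitoring:
  about $50

Total monthly cost: about $2,072
Share of budget: 41.4% (ample headroom)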
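
The arithmetic above is easy to re-run when prices or instance counts change. The sketch below reproduces it using only the figures quoted in the breakdown; the function name and dictionary keys are hypothetical, and actual prices vary by region and over time.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, as in the breakdown above


def monthly_cost(
    on_demand_hourly: float = 1.21,
    on_demand_nodes: int = 2,
    spot_hourly: float = 0.36,
    spot_nodes_avg: float = 1.5,
    spot_hours: int = 200,
    eks_hourly: float = 0.10,
    storage_network: float = 100.0,
    monitoring: float = 50.0,
) -> dict:
    """Break the monthly bill into the same line items as the text."""
    items = {
        "on_demand": on_demand_hourly * on_demand_nodes * HOURS_PER_MONTH,
        "spot": spot_hourly * spot_nodes_avg * spot_hours,
        "eks": eks_hourly * HOURS_PER_MONTH,
        "storage_network": storage_network,
        "monitoring": monitoring,
    }
    items["total"] = sum(items.values())
    return items


costs = monthly_cost()
budget = 5000
print(f"total: ${costs['total']:,.0f}")
print(f"share of budget: {costs['total'] / budget * 100:.1f}%")
```

Calling `monthly_cost(on_demand_nodes=3)` shows, for instance, what keeping a third node on-demand around the clock would add to the bill.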

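Step 7: Operations Runbook

This runbook documents the main situations that can arise in production and the procedure for responding to each.

Model Server Unresponsive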

text

Symptoms: health check failures, request timeouts
Possible causes: GPU OOM, process deadlock, CUDA errors

Response procedure:
1. Check Pod status
   kubectl get pods -n ai-serving -l app=vllm-chatbot

2. Check Pod logs
   kubectl logs <pod-name> -n ai-serving --tail=100

3. Check GPU status (inside the Pod)
   kubectl exec <pod-name> -n ai-serving -- nvidia-smi

4. Restart the Pod (last resort)
   kubectl delete pod <pod-name> -n ai-serving

5. If the problem recurs, review resource settings
   - Lower GPU memory utilization (0.90 -> 0.85)
   - Reduce the maximum number of concurrent requests

Increased Request Latency

text

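Symptoms: rising TTFT and TPOT, growing request queue
Possible causes: traffic spike, GPU overload, KV cache saturation

Response procedure:
1. Check current metrics
   - Waiting requests, GPU cache usage, request rate

2. Check HPA status
   kubectl get hpa -n ai-serving

3. Scale up manually (in an emergency)
   kubectl scale deployment vllm-chatbot -n ai-serving --replicas=4

4. Analyze traffic sources
   - Check for abnormal bulk requests
   - Check whether rate limiting is being applied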

Spot Instance Reclamation

text

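Symptoms: Pod termination, node drain
Cause: AWS reclaiming spot instance capacity

Response procedure:
1. The Node Termination Handler handles this automatically
   - Pods are evicted safely
   - In-flight requests finish through graceful shutdown

2. Confirm that the on-demand Pods are still serving traffic

3. The Cluster Autoscaler provisions a new node
   - Recreated on spot if spot capacity is available
   - Falls back to on-demand if spot is unavailable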