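April 3, 2026 · Architecture

Chapter 9: Production Streaming Infrastructure

This chapter covers WebSocket upgrades at the load balancer, CDNs and streaming, running streaming services on Kubernetes, monitoring strategy, and the future of HTTP/3 (QUIC) and WebTransport.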

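Learning objectives

Understand how to handle WebSocket upgrades correctly at an L7 load balancer
Identify how CDNs affect streaming and how to bypass them
Learn the key settings for running streaming services on Kubernetes
Cover the core monitoring metrics: connection count, latency, and throughput
Look ahead to the changes HTTP/3 (QUIC) and WebTransport bring to streaming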

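Load balancers and streaming

The load balancer is the first gate that traffic passes through in a streaming infrastructure, and each protocol needs different handling.

SSE load balancing

SSE is plain HTTP, so it works well behind an ordinary L7 load balancer. The two things to watch are response buffering and timeouts on long-lived connections.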

nginx-sse-config.conf

nginx

upstream sse_backend {
    server backend1:3000;
    server backend2:3000;
    server backend3:3000;
}
 
server {
    listen 443 ssl;
    
    location /api/stream {
        proxy_pass http://sse_backend;
        
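        # Core SSE settings: HTTP/1.1 to the upstream, no "Connection: close"
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        
        # Disable buffering: the key to streaming
        proxy_buffering off;
        proxy_cache off;
        
        # Disable chunked transfer encoding
        chunked_transfer_encoding off;
        
        # Connection timeouts (SSE connections are long-lived)
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        
        # X-Accel-Buffering is a response header, so add it to the response
        # (not to the upstream request); any nginx layer in front of this
        # one will then skip buffering as well
        add_header X-Accel-Buffering no;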
    }
}

Warning

proxy_buffering off is the single most important setting for SSE streaming. Without it, Nginx collects the response in a buffer and forwards it in bursts, which makes real-time token delivery impossible.
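On the backend side, the proxy settings above only help if the application emits well-formed events with the right response headers. The following is a minimal sketch, not the series' actual server code: `SSE_HEADERS` and `formatSseEvent` are hypothetical names introduced here. The header set relies on nginx honoring an `X-Accel-Buffering: no` response header from the upstream as a per-response way to turn buffering off.

```typescript
// Response headers an SSE backend typically sends.
// "X-Accel-Buffering: no" asks nginx not to buffer this one response.
const SSE_HEADERS: Record<string, string> = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  "X-Accel-Buffering": "no",
};

// Serialize one event in the text/event-stream wire format.
// A multi-line payload becomes one "data:" line per line of text.
function formatSseEvent(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n"; // a blank line terminates the event
}
```

With Node's built-in `http` module this would be used roughly as `res.writeHead(200, SSE_HEADERS)` followed by `res.write(formatSseEvent(token, "token"))` for each token.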

Handling the WebSocket upgrade

WebSocket switches protocols through an HTTP upgrade. The load balancer has to relay that upgrade to the backend correctly.

nginx-websocket-config.conf

nginx

upstream ws_backend {
    # Sticky sessions: route the same client to the same server
    ip_hash;
    
    server backend1:8080;
    server backend2:8080;
    server backend3:8080;
}
 
map $http_upgrade $connection_upgrade {
    default upgrade;
    ""      close;
}
 
server {
    listen 443 ssl;
    
    location /ws {
        proxy_pass http://ws_backend;
        proxy_http_version 1.1;
        
        # Core headers for the WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        
        # Idle connection timeouts
        proxy_read_timeout 86400s;  # 24 hours
        proxy_send_timeout 86400s;
        
        # Forward the original client IP
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
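To see what the proxy is actually relaying, it helps to look at the handshake itself. The sketch below computes the `Sec-WebSocket-Accept` value a server returns for a client's `Sec-WebSocket-Key` as defined in RFC 6455, and mirrors the decision made by the nginx `map` block above. The function names are hypothetical; in practice a WebSocket library performs this step.

```typescript
import { createHash } from "node:crypto";

// GUID fixed by RFC 6455 for deriving Sec-WebSocket-Accept.
const WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// accept = base64(sha1(key + GUID))
function computeAcceptKey(secWebSocketKey: string): string {
  return createHash("sha1").update(secWebSocketKey + WS_GUID).digest("base64");
}

// Same rule as the nginx map: send "upgrade" upstream when the client
// supplied an Upgrade header, otherwise "close".
function connectionHeaderFor(upgradeHeader: string | undefined): "upgrade" | "close" {
  return upgradeHeader ? "upgrade" : "close";
}
```

If the proxy drops the `Upgrade` or `Connection` header, the backend never sees a handshake request and the client receives an ordinary HTTP response instead of `101 Switching Protocols`.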

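AWS ALB configuration

AWS ALB (Application Load Balancer) supports WebSocket natively. Its idle timeout defaults to 60 seconds, so long-lived connections need either a higher timeout or periodic pings.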

aws-alb-target-group.yaml

yaml

# Example CloudFormation or CDK configuration
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Protocol: HTTP
    Port: 8080
    TargetType: ip
    HealthCheckPath: /health
    # Settings for keeping WebSocket connections alive
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: "300"  # how long to wait while connections drain
      - Key: stickiness.enabled
        Value: "true"
      - Key: stickiness.type
        Value: "lb_cookie"
      - Key: stickiness.lb_cookie.duration_seconds
        Value: "86400"  # 24-hour sticky session

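CDNs and streaming

CDNs are optimized for static content, so they can work against streaming traffic.

When a CDN gets in the way of streaming

Problem	Cause	Fix
Response buffering	The CDN tries to cache the whole response	`Cache-Control: no-cache`
Connection timeout	The CDN has a short idle timeout	Send heartbeats/pings on a fixed period
Chunk coalescing	The CDN merges small chunks	Route streaming paths around the CDN
Compression conflict	The CDN's gzip compression delays the stream	Disable compression on SSE paths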
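The heartbeat fix in the table can be sketched as follows. This is an illustration under assumptions, not a prescribed implementation: the function names and the 0.5 safety factor are choices made here, and the idle timeout has to come from whichever CDN or load balancer on the path has the shortest one.

```typescript
// An SSE comment line: clients ignore it, but it counts as traffic
// for idle timers along the path.
const SSE_HEARTBEAT = ": ping\n\n";

// Pick a heartbeat period safely below the shortest idle timeout on the path.
function heartbeatIntervalMs(idleTimeoutSeconds: number, safetyFactor = 0.5): number {
  if (idleTimeoutSeconds <= 0) throw new RangeError("idle timeout must be positive");
  return Math.floor(idleTimeoutSeconds * 1000 * safetyFactor);
}

// Start sending heartbeats through `write`; returns a function that stops them.
function startHeartbeat(write: (chunk: string) => void, intervalMs: number): () => void {
  const timer = setInterval(() => write(SSE_HEARTBEAT), intervalMs);
  return () => clearInterval(timer);
}
```

For a 60-second idle timeout this yields a ping every 30 seconds. The returned stop function should be called when the stream completes or the client disconnects, so the timer does not outlive the connection.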

CDN bypass architecture

text

[Static assets]
Client ──> CDN ──> Origin server
 
[Streaming API]
Client ──────────> Streaming server (bypasses the CDN)

Tip

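In CloudFront, you can disable caching for a specific path pattern (for example /api/stream/*) and set its TTL to 0 so that streaming traffic passes straight through. This is easier to manage than a full bypass.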
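On the client, the split between the two paths in the diagram can be as small as a URL resolver. The hostnames and path prefixes below are hypothetical placeholders chosen for illustration; a real deployment would take them from configuration.

```typescript
// Hypothetical hostnames: a CDN-fronted host for assets and a direct host for streams.
const CDN_ORIGIN = "https://cdn.example.com";
const STREAM_ORIGIN = "https://stream.example.com";
const STREAM_PREFIXES = ["/api/stream", "/ws"];

// True when the path is a streaming endpoint or sits below one.
function isStreamingPath(path: string): boolean {
  return STREAM_PREFIXES.some((p) => path === p || path.startsWith(p + "/"));
}

// Send streaming paths to the direct host and everything else through the CDN.
function resolveUrl(path: string): string {
  return (isStreamingPath(path) ? STREAM_ORIGIN : CDN_ORIGIN) + path;
}
```

Matching on a whole path segment (`/ws` or `/ws/...`) avoids accidentally routing an unrelated path such as `/wsx` around the CDN.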

Streaming services on Kubernetes

This section looks at the key settings for running a streaming service on Kubernetes.

Ingress configuration

k8s-streaming-ingress.yaml

yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: streaming-ingress
  annotations:
    # Nginx Ingress Controller settings
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    # WebSocket: ingress-nginx proxies upgrades without any annotation.
    # (nginx.org/websocket-services applies only to the NGINX Inc. controller.)
    # Sticky sessions
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "86400"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/stream
            pathType: Prefix
            backend:
              service:
                name: sse-service
                port:
                  number: 3000
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: ws-service
                port:
                  number: 8080

Deployment configuration

These are the points to watch in the Pod spec of a streaming service.

k8s-streaming-deployment.yaml

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: streaming-service
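spec:
  replicas: 3
  selector:
    matchLabels:
      app: streaming-service  # label name is illustrative
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # zero downtime
  template:
    metadata:
      labels:
        app: streaming-service  # must match spec.selector
    spec:
      terminationGracePeriodSeconds: 300  # time allowed for connection draining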
      containers:
        - name: streaming
          image: streaming-service:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Expose metrics for connection-count-based scaling
          env:
            - name: METRICS_PORT
              value: "9090"
          # The service holds long-lived connections,
          # so give the liveness probe generous limits
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 30
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          # Graceful shutdown via a preStop hook
          # (the sleep must fit inside terminationGracePeriodSeconds)
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "curl -X POST localhost:3000/admin/drain && sleep 120"

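HPA (Horizontal Pod Autoscaler)

Autoscaling a streaming service works better on connection count than on CPU or memory, because a Pod that looks idle by CPU can still be holding hundreds of open connections.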

k8s-hpa.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: streaming-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: streaming-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
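    # Custom metric: active connections
    # (requires a custom metrics adapter such as Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: active_connections
        target:
          type: AverageValue
          averageValue: "500"  # average of 500 connections per Pod
    # CPU as a secondary signal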
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # scale down conservatively
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
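To reason about how the HPA above reacts to load, the documented core formula can be worked through in code: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below covers a single metric only. It deliberately ignores the `behavior` policies and stabilization windows, and with several metrics the HPA takes the largest result. The function and parameter names are made up for this example.

```typescript
// Single-metric approximation of the HPA replica calculation.
function desiredReplicas(
  currentReplicas: number,
  currentAvgPerPod: number, // e.g. observed average active connections per Pod
  targetAvgPerPod: number,  // e.g. 500, the averageValue in the manifest
  minReplicas: number,
  maxReplicas: number,
  tolerance = 0.1,
): number {
  const ratio = currentAvgPerPod / targetAvgPerPod;
  // No scaling when the ratio is close enough to 1.0.
  const raw =
    Math.abs(ratio - 1) <= tolerance
      ? currentReplicas
      : Math.ceil(currentReplicas * ratio);
  // Clamp to the configured bounds.
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}
```

With 4 Pods averaging 750 connections against a target of 500, the ratio is 1.5 and the result is 6 replicas. The scaleUp policy in the manifest (2 Pods per 60 seconds) then limits how quickly that number is reached.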

Info

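Scaling down a streaming service needs particular care. When a Pod terminates, every active connection on that Pod is dropped. Use terminationGracePeriodSeconds and a preStop hook so that the Pod shuts down only after in-flight streams have finished.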
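What the application does when the preStop hook calls an endpoint such as `/admin/drain` is left open above. One possible shape, offered as a sketch only, is a small controller that refuses new streams, reports not-ready, and waits for in-flight streams up to a deadline. `DrainController` and its method names are hypothetical.

```typescript
class DrainController {
  private active = 0;
  private draining = false;
  private waiters: Array<() => void> = [];

  // The readiness handler should answer 503 once this returns false.
  isReady(): boolean {
    return !this.draining;
  }

  // Call when a stream opens; returns false if the Pod is draining.
  tryAcquire(): boolean {
    if (this.draining) return false;
    this.active += 1;
    return true;
  }

  // Call when a stream finishes or is cancelled.
  release(): void {
    this.active -= 1;
    if (this.active === 0) {
      this.waiters.forEach((resolve) => resolve());
      this.waiters = [];
    }
  }

  // Stop accepting streams, then wait until all finish or the deadline passes.
  // Resolves to true if every stream finished in time.
  async drain(deadlineMs: number): Promise<boolean> {
    this.draining = true;
    if (this.active === 0) return true;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const finished = new Promise<boolean>((resolve) =>
      this.waiters.push(() => resolve(true)),
    );
    const timedOut = new Promise<boolean>((resolve) => {
      timer = setTimeout(() => resolve(false), deadlineMs);
    });
    const result = await Promise.race([finished, timedOut]);
    clearTimeout(timer);
    return result;
  }
}
```

The deadline passed to `drain` should be shorter than terminationGracePeriodSeconds, so the process can exit on its own terms before Kubernetes sends SIGKILL.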

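Key monitoring metrics

This section summarizes the key metrics to monitor in a streaming system.

Connection metrics

Metric	Description	Alert threshold (example)
Active connections	SSE/WebSocket connections currently open	More than 1,000 per Pod
Connection rate	New connections per minute	More than 500 per minute
Connection duration	Average time a connection stays open	Abnormally short (under 1 second)
Connection failure rate	Share of handshakes that fail	Above 5%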

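Streaming metrics

Metric	Description	Alert threshold (example)
TTFT (P50/P95/P99)	Time to first token	P99 > 3 s
TPOT (P50/P95/P99)	Interval between output tokens	P99 > 100 ms
Stream completion rate	Share of streams that complete normally	Below 90%
Abort rate	Share of streams aborted by the user or the system	Above 20%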
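As a worked illustration of how TTFT, TPOT, and their percentiles relate to raw timestamps, the sketch below derives them from one stream's token arrival times. It uses the nearest-rank percentile definition, which is one of several in common use; Prometheus histograms estimate quantiles from buckets instead. The function names are hypothetical.

```typescript
// Nearest-rank percentile over a list of samples (p in (0, 100]).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Derive TTFT and the inter-token gaps from one stream's timestamps (ms).
function streamLatencies(requestStartMs: number, tokenTimesMs: number[]) {
  if (tokenTimesMs.length === 0) {
    return { ttftMs: null, gapsMs: [] as number[] };
  }
  // Gap i is the time between token i and token i + 1.
  const gapsMs = tokenTimesMs.slice(1).map((t, i) => t - tokenTimesMs[i]);
  return { ttftMs: tokenTimesMs[0] - requestStartMs, gapsMs };
}
```

For a request started at 1000 ms with tokens at 1400, 1430, and 1490 ms, TTFT is 400 ms and the gaps are 30 ms and 60 ms. Feeding the gaps from many streams into `percentile(gaps, 99)` gives the P99 TPOT that the table's threshold refers to.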

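Infrastructure metrics

Metric	Description	Alert threshold (example)
Queue depth	Length of the inference queue	Above 100
GPU utilization	GPU utilization of the inference servers	Above 95% (overloaded) or below 20% (wasted)
Memory usage	Pod memory (consumed per connection)	Above 80% of limits
Error rate (5xx)	Share of server errors	Above 1%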

streaming-metrics.ts

typescript

import { Counter, Histogram, Gauge } from "prom-client";
 
// Prometheus metric definitions
const activeConnections = new Gauge({
  name: "streaming_active_connections",
  help: "Number of active streaming connections",
  labelNames: ["protocol", "endpoint"],
});
 
const ttft = new Histogram({
  name: "streaming_ttft_seconds",
  help: "Time to first token in seconds",
  labelNames: ["model"],
  buckets: [0.1, 0.25, 0.5, 1, 2, 5, 10],
});
 
const tpot = new Histogram({
  name: "streaming_tpot_milliseconds",
  help: "Time per output token in milliseconds",
  labelNames: ["model"],
  buckets: [10, 20, 30, 50, 75, 100, 200],
});
 
const streamCompletions = new Counter({
  name: "streaming_completions_total",
  help: "Total streaming completions",
  labelNames: ["status"], // "success", "cancelled", "error"
});
 
// Usage example
function trackStreaming(model: string) {
  const startTime = Date.now();
  let firstTokenTime: number | null = null;
  let lastTokenTime = startTime;
 
  return {
    // Call once, when the first token arrives
    onFirstToken() {
      firstTokenTime = Date.now();
      // Without this, the first TPOT sample would be measured from startTime
      lastTokenTime = firstTokenTime;
      ttft.labels(model).observe(
        (firstTokenTime - startTime) / 1000
      );
    },
 
    // Call for every token after the first
    onToken() {
      const now = Date.now();
      if (firstTokenTime !== null) {
        tpot.labels(model).observe(now - lastTokenTime);
      }
      lastTokenTime = now;
    },
 
    onComplete(status: "success" | "cancelled" | "error") {
      streamCompletions.labels(status).inc();
    },
  };
}

The impact of HTTP/3 (QUIC)

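As of 2026, HTTP/3 is supported by every major browser and carries a large share of client-server traffic, although the exact share depends heavily on who is measuring and how. Built on the QUIC protocol, HTTP/3 brings several improvements that matter for streaming.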

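Removing head-of-line blocking

HTTP/2 multiplexes streams over a single TCP connection, but because TCP delivers bytes strictly in order, one lost packet stalls every stream on that connection until it is retransmitted. QUIC runs over UDP and tracks ordering per stream, so a lost packet delays only the stream it belongs to.

HTTP/2 vs HTTP/3 under packet loss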

text

[HTTP/2 (TCP)]
Stream A: data ──> data ──> [lost] ──> waiting for retransmit
Stream B: data ──> data ──> [blocked] ──> waiting  ← affected
Stream C: data ──> data ──> [blocked] ──> waiting  ← affected
 
[HTTP/3 (QUIC)]
Stream A: data ──> data ──> [lost] ──> waiting for retransmit
Stream B: data ──> data ──> data ──> ok  ← unaffected
Stream C: data ──> data ──> data ──> ok  ← unaffected
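The difference in the diagram can be reproduced with a toy model. The sketch below makes a simplifying assumption that each packet carries data for exactly one stream (real QUIC packets can carry frames from several streams), and asks which packets the receiver can hand to the application before the lost one is retransmitted. The types and function names are invented for this illustration.

```typescript
interface Packet {
  seq: number;    // connection-wide send order
  stream: string; // logical stream the payload belongs to
}

// TCP: one ordered byte stream, so nothing after the first gap can be delivered.
function deliverableOverTcp(packets: Packet[], lost: Set<number>): Packet[] {
  const out: Packet[] = [];
  for (const p of packets) {
    if (lost.has(p.seq)) break;
    out.push(p);
  }
  return out;
}

// QUIC: ordering is per stream, so a gap only holds back its own stream.
function deliverableOverQuic(packets: Packet[], lost: Set<number>): Packet[] {
  const stalled = new Set<string>();
  const out: Packet[] = [];
  for (const p of packets) {
    if (lost.has(p.seq)) stalled.add(p.stream);
    else if (!stalled.has(p.stream)) out.push(p);
  }
  return out;
}
```

With streams A, B, and C interleaved and one packet of stream A lost, the TCP model stops delivering at the gap, while the QUIC model keeps delivering B and C and holds back only the rest of A.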

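0-RTT connection setup

QUIC's 0-RTT handshake lets a client that has connected to a server before start sending data on reconnection without waiting for a handshake round trip. This is especially useful on mobile when the network switches (from Wi-Fi to cellular). Because 0-RTT data can be replayed by an attacker, it should carry only idempotent requests.

Connection migration

A QUIC connection survives a change of network interface (from Wi-Fi to 5G), because the connection is identified by a Connection ID rather than by the client's IP address and port. This makes streaming on mobile considerably more stable.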
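The reason a Connection ID survives a network switch while an address-based lookup does not can be shown with two lookup tables. This is a deliberately simplified model with made-up values: real QUIC also validates the new path and rotates connection IDs, neither of which is modeled here.

```typescript
interface Session {
  user: string;
}

// TCP-style lookup: the session is keyed by the client's address and port.
const byAddress = new Map<string, Session>();
// QUIC-style lookup: the session is keyed by a connection ID carried in each packet.
const byConnectionId = new Map<string, Session>();

function addressKey(ip: string, port: number): string {
  return `${ip}:${port}`;
}

// The client connects over Wi-Fi (documentation-range addresses, illustrative only).
const session: Session = { user: "alice" };
byAddress.set(addressKey("192.0.2.10", 50000), session);
byConnectionId.set("c1d2e3", session);

// After the switch to cellular, the client's source address changes...
const afterSwitchByAddress = byAddress.get(addressKey("198.51.100.7", 41000));
// ...but the connection ID in its packets does not.
const afterSwitchById = byConnectionId.get("c1d2e3");
```

Here `afterSwitchByAddress` is `undefined`, which corresponds to a TCP connection that has to be re-established, while `afterSwitchById` still resolves to the same session.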

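The outlook for WebTransport

WebTransport is a bidirectional communication protocol that runs over HTTP/3, and it is drawing attention as a successor to WebSocket.

Property	WebSocket	WebTransport
Transport layer	TCP	QUIC (UDP)
Multiplexing	Single stream	Multiple streams
Head-of-line blocking	Yes	No
Unreliable delivery	Not supported	Datagrams supported
Connection migration	Not supported	Supported
0-RTT	Not supported	Supported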

webtransport-example.ts

typescript

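// WebTransport client (supported in Chrome/Edge at the time of writing;
// check current browser support before relying on it)
const transport = new WebTransport(
  "https://api.example.com/webtransport"
);
 
await transport.ready;
 
// Open a bidirectional stream
const stream = await transport.createBidirectionalStream();
const writer = stream.writable.getWriter();
const reader = stream.readable.getReader();
 
// Send the request
await writer.write(
  new TextEncoder().encode(
    JSON.stringify({ message: "Hello" })
  )
);
// Close the write side to signal end of request
// (assumes one request per stream)
await writer.close();
 
// Receive the streamed response.
// A single decoder with { stream: true } keeps multi-byte characters
// intact when they are split across chunks.
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
 
  const text = decoder.decode(value, { stream: true });
  console.log("Received:", text);
}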

Info

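WebTransport is not yet supported in every browser, and the server ecosystem is still maturing. For now the SSE + WebSocket combination is the stable choice, and WebTransport is best reserved for cases with specific performance requirements.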
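That guidance can be turned into a small client-side selection routine: use WebTransport only when the application explicitly opts in and the runtime exposes it, otherwise fall back to WebSocket for bidirectional traffic and SSE for everything else. The names and the shape of the `needs` object are hypothetical, and this is one possible policy rather than a standard one.

```typescript
type TransportKind = "webtransport" | "websocket" | "sse";

// The subset of the global object this check cares about.
interface Capabilities {
  WebTransport?: unknown;
  WebSocket?: unknown;
}

interface Needs {
  bidirectional: boolean;      // does the client send messages mid-stream?
  preferWebTransport: boolean; // explicit opt-in for special performance needs
}

function pickTransport(env: Capabilities, needs: Needs): TransportKind {
  if (needs.preferWebTransport && typeof env.WebTransport === "function") {
    return "webtransport";
  }
  if (needs.bidirectional && typeof env.WebSocket === "function") {
    return "websocket";
  }
  return "sse";
}
```

In a browser this would be called as `pickTransport(globalThis as Capabilities, { bidirectional: true, preferWebTransport: false })`. The feature test relies on `WebTransport` and `WebSocket` being constructors on the global object where they are available.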

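Summary

This chapter covered the infrastructure needed to deploy and operate a streaming system in production.

proxy_buffering off is a required Nginx setting for SSE streaming
WebSocket needs the upgrade headers relayed and sticky sessions
CDNs can interfere with streaming, so streaming paths are managed separately
On Kubernetes, terminationGracePeriodSeconds and a preStop hook implement graceful shutdown
Streaming-specific metrics such as TTFT, TPOT, connection count, and queue depth should be monitored
HTTP/3 improves streaming stability by removing head-of-line blocking and supporting connection migration

The next chapter wraps up the series by pulling together everything covered so far to build a hybrid streaming AI system. It covers an architecture that combines SSE, gRPC, and WebSocket, an end-to-end implementation, performance tuning, and an operations checklist.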
