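January 26, 2026 · AI / ML

Chapter 5: Kubernetes Basics - Designing Clusters for AI Workloads

This chapter explains the core concepts of Kubernetes from the perspective of AI workloads, then designs the GPU node configuration and a cluster architecture suited to AI services.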


mlops kubernetes infrastructure performance


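Why Kubernetes

Docker Compose can run an AI service, but its limits are clear in production. If the single server fails, the whole service goes down, and there is no automatic scaling to absorb traffic growth. Production essentials such as zero-downtime deployment, automatic recovery, and resource isolation are missing.

Kubernetes is a container orchestration platform that addresses these problems. You define the desired state through declarative configuration, and Kubernetes continuously converges the current state toward it. When a container dies it is restarted automatically, and when a node fails its workloads are rescheduled onto other nodes.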
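The convergence described above can be pictured as a control loop that compares desired and observed state and emits corrective actions. The following is a minimal Python sketch of that idea only; the function name `reconcile`, the Pod names, and the action strings are hypothetical and do not correspond to any Kubernetes API.

```python
def reconcile(desired_replicas: int, running_pods: list[str]) -> list[str]:
    """Return the actions needed to move the observed state to the desired state."""
    actions = []
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few Pods: create the missing ones.
        actions.extend(f"create pod-{i}" for i in range(diff))
    elif diff < 0:
        # Too many Pods: delete the surplus from the end of the list.
        actions.extend(f"delete {name}" for name in running_pods[diff:])
    return actions


if __name__ == "__main__":
    # Desired state: 2 replicas, but one Pod has crashed and only one is running.
    print(reconcile(2, ["vllm-a"]))
    # Desired state: 2 replicas, but three are running.
    print(reconcile(2, ["vllm-a", "vllm-b", "vllm-c"]))
```

A real controller runs this comparison repeatedly, so a crash at any moment is corrected on the next pass without anyone issuing a command.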

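Kubernetes matters especially for AI workloads because GPUs are a scarce resource that has to be managed. Its resource scheduling lets you allocate GPUs efficiently, share a GPU pool across multiple services, and minimize idle capacity.

Core Kubernetes Concepts

Pod

A Pod is the smallest deployable unit in Kubernetes. It contains one or more containers, and containers in the same Pod share a network namespace and can share storage volumes.

In an AI service, a Pod typically contains a single vLLM or TGI container. A monitoring agent or proxy container is sometimes placed alongside it using the sidecar pattern.

Basic Pod definition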

yaml

apiVersion: v1
kind: Pod
metadata:
  name: vllm-server
  labels:
    app: vllm
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        - "--max-model-len"
        - "4096"
      ports:
        - containerPort: 8000
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          memory: "24Gi"
          cpu: "4"

Info

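The nvidia.com/gpu resource is available only when the NVIDIA device plugin is running in the cluster; the NVIDIA GPU Operator installs it. GPUs are specified under limits, and if requests is also set it must equal limits. Fractional allocation (for example, 0.5 GPU) is not supported by default.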
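The rules in the note above can be checked mechanically before a manifest is applied. The sketch below is a hypothetical helper (`check_gpu_resources` is not part of any Kubernetes client library) that inspects one container's `resources` dictionary as it would look after parsing a manifest.

```python
GPU = "nvidia.com/gpu"


def check_gpu_resources(resources: dict) -> list[str]:
    """Return a list of problems with the GPU settings of one container."""
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    problems = []
    for section, values in (("limits", limits), ("requests", requests)):
        if GPU in values and not str(values[GPU]).isdigit():
            # GPUs are allocated in whole units only.
            problems.append(f"{section}.{GPU} must be a whole number")
    if GPU in requests and GPU not in limits:
        problems.append(f"requests.{GPU} is set without a matching limit")
    elif GPU in requests and str(requests[GPU]) != str(limits[GPU]):
        problems.append(f"requests.{GPU} must equal limits.{GPU}")
    return problems


if __name__ == "__main__":
    # The resources section of the Pod definition above.
    print(check_gpu_resources({
        "limits": {GPU: 1},
        "requests": {GPU: 1, "memory": "24Gi", "cpu": "4"},
    }))
    # A fractional request, which is rejected by default.
    print(check_gpu_resources({"limits": {GPU: "0.5"}}))
```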

Deployment

A Deployment manages the desired number of Pod replicas and performs rolling updates. When a Pod dies, it automatically creates a new one to maintain the desired replica count.

vLLM Deployment

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  labels:
    app: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            failureThreshold: 3
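With probe settings like these it helps to work out how long a container can stay unhealthy before the kubelet restarts it. The arithmetic below is a rough sketch: it assumes the first probe fires right after `initialDelaySeconds` and that every probe fails immediately, and it ignores probe timeouts and scheduling jitter. The helper name is hypothetical.

```python
def seconds_until_restart(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time from container start to the probe failure that triggers a restart."""
    # First probe at initial_delay, then one probe per period;
    # the restart follows the failure_threshold-th consecutive failure.
    return initial_delay + (failure_threshold - 1) * period


if __name__ == "__main__":
    # livenessProbe above: 180 s initial delay, 30 s period, 3 failures.
    print(seconds_until_restart(180, 30, 3))
```

For the liveness probe above this gives roughly 240 seconds. If a model takes longer than that to load and `/health` does not respond during loading, the container would be restarted before it ever becomes healthy, so `initialDelaySeconds` should be sized against the measured load time.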

Service

A Service provides a stable network endpoint for a set of Pods. Pods are created and destroyed repeatedly and their IP addresses change, but a Service keeps a fixed DNS name and IP.

vLLM Service

yaml

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  type: ClusterIP

Through this Service, clients in the same namespace can reach vLLM at http://vllm-service, and connections are automatically load balanced across the matching Pods.
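The short name only resolves from Pods in the Service's own namespace; from elsewhere the namespace has to be part of the host name. As a small sketch, the hypothetical helper below builds both forms. It assumes the default cluster domain `cluster.local`, and uses the `ai-serving` namespace purely as an example.

```python
def service_url(name: str, namespace: str = "", port: int = 80,
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster HTTP URL for a Service."""
    # Without a namespace, fall back to the short name (same-namespace clients only).
    host = name if not namespace else f"{name}.{namespace}.svc.{cluster_domain}"
    # Omit the port when it is the HTTP default.
    return f"http://{host}" if port == 80 else f"http://{host}:{port}"


if __name__ == "__main__":
    print(service_url("vllm-service"))
    print(service_url("vllm-service", namespace="ai-serving"))
```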

Namespace

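A Namespace is a unit of logical isolation within a cluster. Separating teams, environments (development/staging/production), and services into namespaces makes resource management and access control easier.

Creating a namespace for AI services

bash

kubectl create namespace ai-serving

Deploying a resource into the namespace

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: ai-serving
# ... (rest is the same as above)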

GPU Node Configuration

NVIDIA GPU Operator

The NVIDIA GPU Operator automatically installs and manages the components needed to use GPUs in a Kubernetes cluster, including the GPU driver, the NVIDIA Container Toolkit, and the device plugin.

Installing the GPU Operator with Helm

bash

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
 
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

After installation, check the state of the GPU nodes.

Checking GPU resources

bash

kubectl get nodes -o json | \
  python3 -c "
import sys, json
nodes = json.load(sys.stdin)['items']
for n in nodes:
    name = n['metadata']['name']
    gpus = n['status'].get('capacity', {}).get('nvidia.com/gpu', '0')
    print(f'{name}: {gpus} GPU(s)')
"

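The same node JSON also carries `status.allocatable`, which is the amount the scheduler can actually hand out. As a sketch, the function below summarizes both fields from already-parsed `kubectl get nodes -o json` output; the sample data in the usage example is invented for illustration.

```python
import json

GPU = "nvidia.com/gpu"


def gpu_summary(node_list: dict) -> dict[str, tuple[int, int]]:
    """Map node name -> (GPU capacity, GPU allocatable) for nodes that report GPUs."""
    summary = {}
    for node in node_list.get("items", []):
        status = node.get("status", {})
        capacity = int(status.get("capacity", {}).get(GPU, 0))
        allocatable = int(status.get("allocatable", {}).get(GPU, 0))
        if capacity:
            summary[node["metadata"]["name"]] = (capacity, allocatable)
    return summary


if __name__ == "__main__":
    # Invented sample standing in for real kubectl output.
    sample = json.loads("""
    {"items": [
      {"metadata": {"name": "gpu-node-1"},
       "status": {"capacity": {"nvidia.com/gpu": "8"},
                  "allocatable": {"nvidia.com/gpu": "8"}}},
      {"metadata": {"name": "cpu-node-1"},
       "status": {"capacity": {"cpu": "4"}, "allocatable": {"cpu": "4"}}}
    ]}
    """)
    print(gpu_summary(sample))
```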
Node Labels and Taints

Set labels and taints on GPU nodes so that only GPU workloads are scheduled onto them.

Labeling GPU nodes

bash

kubectl label node gpu-node-1 accelerator=nvidia-a100
kubectl label node gpu-node-2 accelerator=nvidia-a100

Tainting GPU nodes

bash

kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes gpu-node-2 nvidia.com/gpu=present:NoSchedule

Only Pods that tolerate a taint can be scheduled onto a node that carries it. This prevents ordinary CPU workloads from landing on GPU nodes and wasting GPU capacity.

A Pod scheduled onto GPU nodes

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-a100
      containers:
        - name: vllm
          # ... (container settings)
          resources:
            limits:
              nvidia.com/gpu: 1
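How a toleration is matched against a taint can be sketched in a few lines. The Python below is a simplified model of the matching rules (key, operator, value, effect) and is not the scheduler's actual code; the function names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect in the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # With Exists the value is ignored; an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint.get("key")
    # Equal: both key and value must match.
    return (toleration.get("key") == taint.get("key")
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(tolerations: list[dict], taints: list[dict]) -> bool:
    """A Pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.get("effect") == "NoSchedule"
    )


if __name__ == "__main__":
    gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
    vllm_toleration = {"key": "nvidia.com/gpu", "operator": "Equal",
                       "value": "present", "effect": "NoSchedule"}
    print(can_schedule([vllm_toleration], [gpu_taint]))  # the vLLM Pod above
    print(can_schedule([], [gpu_taint]))                 # a Pod with no tolerations
```

A toleration only permits placement on the tainted nodes. It is the nodeSelector in the manifest that steers the Pod toward them, which is why the two are used together.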

Node Pool Design

In cloud environments, separating node pools by purpose is recommended.

text

Cluster node pool layout:
 
1. system-pool (CPU)
   - Kubernetes system components
   - Monitoring, logging
   - Ingress controller
   - Instances: c6i.xlarge x 3
 
2. gpu-a100-pool (GPU)
   - AI model serving (vLLM)
   - Instances: p4d.24xlarge (A100 x 8)
   - Autoscaling: 1-4 nodes
 
3. gpu-spot-pool (GPU, spot)
   - Non-production workloads
   - Batch inference
   - Instances: p4d.24xlarge (spot)
   - Autoscaling: 0-8 nodes
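The autoscaling range of a GPU pool translates directly into how many serving replicas it can hold. The following back-of-the-envelope sketch assumes each replica needs whole GPUs on a single node and ignores CPU, memory, and any GPUs taken by other workloads; the helper name is hypothetical.

```python
GPUS_PER_P4D = 8  # p4d.24xlarge carries 8 A100 GPUs, as listed in the layout above


def replica_capacity(nodes: int, gpus_per_node: int = GPUS_PER_P4D,
                     gpus_per_replica: int = 1) -> int:
    """Number of replicas that fit when a replica cannot span nodes."""
    return nodes * (gpus_per_node // gpus_per_replica)


if __name__ == "__main__":
    # gpu-a100-pool scales between 1 and 4 nodes.
    for nodes in (1, 4):
        print(nodes, "node(s):", replica_capacity(nodes), "single-GPU replicas")
```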

Storage Configuration

Persistent Volumes for Model Storage

Model weight files must survive Pod restarts, so they are stored on a Persistent Volume (PV).

Model storage PVC

yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: ai-serving
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi

Mounting the PVC in a Pod

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  template:
    spec:
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-storage
      containers:
        - name: vllm
          volumeMounts:
            - name: model-volume
              mountPath: /models
              readOnly: true
          args:
            - "--model"
            - "/models/Llama-3.1-8B-Instruct"

Warning

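Loading a model directly from a network file system such as AWS EFS or GCP Filestore can make load times very long. A recommended pattern is to use an init container to copy the model to a local NVMe volume before serving.

Preloading the Model with an Init Container

Copying the model with an init container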

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  template:
    spec:
      initContainers:
        - name: model-loader
          image: amazon/aws-cli:latest
          command:
            - sh
            - -c
            - |
              echo "Downloading model from S3..."
              aws s3 sync s3://my-models/Llama-3.1-8B-Instruct /local-models/Llama-3.1-8B-Instruct
              echo "Model download complete"
          volumeMounts:
            - name: local-model
              mountPath: /local-models
          env:
            - name: AWS_DEFAULT_REGION
              value: "ap-northeast-2"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          volumeMounts:
            - name: local-model
              mountPath: /models
              readOnly: true
          args:
            - "--model"
            - "/models/Llama-3.1-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: local-model
          emptyDir:
            sizeLimit: 50Gi
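Before settling on an emptyDir `sizeLimit`, it is worth estimating how much space the weights need. The sketch below assumes roughly 8 billion parameters stored as 16-bit values (2 bytes each) and ignores tokenizer and config files; both the figures and the helper names are illustrative assumptions.

```python
GIB = 1024 ** 3


def weights_size_gib(params: float, bytes_per_param: int = 2) -> float:
    """Approximate on-disk size of model weights in GiB."""
    return params * bytes_per_param / GIB


def fits(params: float, size_limit_gib: float, bytes_per_param: int = 2) -> bool:
    """Check whether the weights fit under a sizeLimit given in Gi."""
    return weights_size_gib(params, bytes_per_param) <= size_limit_gib


if __name__ == "__main__":
    print(round(weights_size_gib(8e9), 1), "GiB")  # 8B parameters, 16-bit weights
    print(fits(8e9, 50))                           # against sizeLimit: 50Gi
```

Under these assumptions an 8B model needs about 15 GiB, so the 50Gi limit leaves headroom. A much larger model would exceed it, and the Pod would be evicted once the volume grows past the limit.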

Networking

Ingress Configuration

To reach the AI service from outside the cluster, a common approach is to configure an Ingress.

Ingress definition

yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: ai-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: ai-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
  tls:
    - hosts:
        - ai-api.example.com
      secretName: ai-api-tls

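The setting that needs the most care in an Ingress for AI services is the timeout. LLM inference can take tens of seconds, so the proxy read timeout must be long enough. Proxy buffering should also be disabled so that streaming responses reach the client as they are generated.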
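The NGINX read timeout bounds the gap between two successive reads from the upstream, not the total response time, so streaming and non-streaming requests behave differently. The sketch below works through that arithmetic. The decode rate of 20 tokens per second is a made-up figure, queueing and prompt processing time are ignored, and the helper names are hypothetical.

```python
def generation_seconds(max_tokens: int, tokens_per_second: float) -> float:
    """Rough time to generate a full response at a steady decode rate."""
    return max_tokens / tokens_per_second


def exceeds_read_timeout(max_tokens: int, tokens_per_second: float,
                         read_timeout_s: int, streaming: bool) -> bool:
    """Would the proxy see no data for longer than its read timeout?"""
    if streaming:
        # With streaming, the silence the proxy sees is roughly one token interval.
        longest_silence = 1 / tokens_per_second
    else:
        # Without streaming, nothing is sent until generation finishes.
        longest_silence = generation_seconds(max_tokens, tokens_per_second)
    return longest_silence > read_timeout_s


if __name__ == "__main__":
    print(generation_seconds(4096, 20.0))
    print(exceeds_read_timeout(4096, 20.0, 60, streaming=False))   # NGINX default of 60 s
    print(exceeds_read_timeout(4096, 20.0, 300, streaming=False))  # the 300 s set above
```

Under these assumptions a 4096-token non-streaming response takes a little over 200 seconds, which would be cut off at a 60-second timeout but completes within 300 seconds.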

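Load Balancing Strategy

Load balancing for AI services involves considerations that differ from ordinary HTTP services. Because processing time varies widely between requests (short answers versus long ones), least-connections balancing is a better fit than round robin. Note that the manifest below only exposes the Service through an AWS Network Load Balancer with externalTrafficPolicy: Local, which avoids an extra hop between nodes. It does not by itself select a least-connections algorithm; that has to be configured in whichever proxy or load balancer actually distributes the requests.

Exposing the Service through an AWS NLB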

yaml

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
  externalTrafficPolicy: Local
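The difference between the two strategies is easiest to see on a snapshot of in-flight requests. The following sketch illustrates the selection rule only, with invented Pod names and counts; it is not configuration for any particular load balancer.

```python
from itertools import cycle


def pick_least_connections(in_flight: dict[str, int]) -> str:
    """Choose the backend with the fewest in-flight requests (ties go to the first listed)."""
    return min(in_flight, key=in_flight.get)


if __name__ == "__main__":
    # Hypothetical snapshot: pod-a is busy with several long generations.
    in_flight = {"pod-a": 5, "pod-b": 1}
    round_robin = cycle(in_flight)

    print("round robin      :", next(round_robin))  # takes the next in order, ignoring load
    print("least connections:", pick_least_connections(in_flight))
```

Round robin hands the next request to pod-a even though it is already backed up, while least connections sends it to pod-b.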

Cluster Management Tools

Essential kubectl Commands

Frequently used kubectl commands

bash

# Check resource status
kubectl get pods -n ai-serving
kubectl get deployments -n ai-serving
kubectl get services -n ai-serving
 
# Pod details and events
kubectl describe pod vllm-deployment-xxxxx -n ai-serving
 
# Follow Pod logs
kubectl logs vllm-deployment-xxxxx -n ai-serving -f
 
# GPU usage
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
 
# Open a shell inside a Pod
kubectl exec -it vllm-deployment-xxxxx -n ai-serving -- /bin/bash
 
# Resource usage (requires metrics-server)
kubectl top pods -n ai-serving
kubectl top nodes
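When there are many Pods, the JSON output of `kubectl get pods -n ai-serving -o json` is easier to filter in a script than by eye. As a sketch, the hypothetical helper below lists Pods whose Ready condition is not True; the sample data is invented, and completed Pods would also show up as not ready.

```python
def not_ready_pods(pod_list: dict) -> list[str]:
    """Names of Pods whose Ready condition is not True."""
    result = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                    for c in conditions)
        if not ready:
            result.append(pod["metadata"]["name"])
    return result


if __name__ == "__main__":
    # Invented sample standing in for parsed kubectl output.
    sample = {"items": [
        {"metadata": {"name": "vllm-deployment-aaaaa"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "vllm-deployment-bbbbb"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]}
    print(not_ready_pods(sample))
```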

Managed Kubernetes Services

Building a Kubernetes cluster yourself is complex and carries a heavy operational burden. In cloud environments, using a managed Kubernetes service is recommended.

Amazon EKS: AWS's managed Kubernetes. GPU node pools can be built with p4d and p5 instances.
Google GKE: GCP's managed Kubernetes. It supports A100 and H100 GPUs, and Autopilot mode automates node management.
Azure AKS: Azure's managed Kubernetes. It supports GPU workloads with NC/ND-series VMs.

Creating an EKS cluster (eksctl)

bash

eksctl create cluster \
  --name ai-serving-cluster \
  --region ap-northeast-2 \
  --version 1.29 \
  --nodegroup-name system-nodes \
  --node-type c6i.xlarge \
  --nodes 3
 
# Add a GPU node group (same region as the cluster)
eksctl create nodegroup \
  --cluster ai-serving-cluster \
  --region ap-northeast-2 \
  --name gpu-nodes \
  --node-type p4d.24xlarge \
  --nodes-min 1 \
  --nodes-max 4 \
  --node-labels "accelerator=nvidia-a100"

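Summary

This chapter looked at the core concepts of Kubernetes from the perspective of AI workloads. From the basics of Pods, Deployments, and Services through GPU node configuration, storage, and networking, it laid the groundwork needed to deploy AI services.

The next chapter builds on this foundation and walks through deploying an actual AI service on Kubernetes, implementing GPU scheduling optimization, probe configuration, and zero-downtime deployment strategies in concrete terms.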

