January 30, 2026 · AI / ML

Chapter 7: Autoscaling - Scaling GPU Workloads with Traffic

This chapter implements autoscaling strategies for GPU-based AI services on Kubernetes, covering efficient scaling with HPA custom metrics and Cluster Autoscaler.


mlops kubernetes infrastructure performance


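The Unique Challenges of GPU Autoscaling

Autoscaling a traditional web service is relatively simple. When CPU utilization exceeds 70%, you add instances; when it drops below 30%, you remove them. New instances are ready within tens of seconds, so the service can respond quickly even to a traffic spike.

Autoscaling a GPU-based AI service, however, poses fundamentally different challenges.

First, scale-out takes a long time. Provisioning a new GPU node takes 3 to 5 minutes, and loading the model takes another 1 to 5 minutes. That adds up to a delay of roughly 4 to 10 minutes, so scaling reactively after traffic has already spiked is too late.

Second, GPU resources are expensive. A single A100 GPU instance costs roughly $3 to $5 per hour, so unnecessary scale-ups waste a significant amount of money. Operating too conservatively, on the other hand, degrades service quality.

Third, the scaling unit is large. A CPU service can be tuned in steps as small as 0.5 vCPU, but GPUs can only be added in whole units of at least one.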

text

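Scaling comparison:
 
Web service:
  Traffic increase detected --> add instance (30 s) --> serve traffic
  Cost: $0.05/hour/instance
 
AI service:
  Traffic increase detected --> add GPU node (3-5 min) --> load model (1-5 min) --> serve traffic
  Cost: $3-5/hour/instance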
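To make the cost of a purely reactive strategy concrete, the sketch below adds up the provisioning and model-load ranges quoted above and estimates how many requests pile up while a new replica comes online. The excess arrival rate of 2 requests per second is a hypothetical figure chosen for illustration, not a measurement.

```python
def cold_start_range_min(node_min=(3, 5), load_min=(1, 5)):
    """Return the (best, worst) end-to-end delay in minutes.

    node_min and load_min are the (low, high) ranges from the text:
    GPU node provisioning and model loading.
    """
    return node_min[0] + load_min[0], node_min[1] + load_min[1]


def backlog(excess_rps, delay_min):
    """Requests that queue up while capacity is missing.

    excess_rps is the arrival rate beyond what current replicas can serve
    (a hypothetical input for this sketch).
    """
    return int(excess_rps * delay_min * 60)


best, worst = cold_start_range_min()
print(f"cold start: {best}-{worst} minutes")
print(f"backlog at 2 req/s excess: {backlog(2, best)}-{backlog(2, worst)} requests")
```

Even a modest overload produces a backlog of hundreds of requests before the new replica can serve traffic, which is why the rest of the chapter looks at earlier signals and pre-warmed capacity.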

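Horizontal Pod Autoscaler (HPA)

HPA Basics

HPA (Horizontal Pod Autoscaler) is the default autoscaling mechanism in Kubernetes. It periodically observes the specified metrics and automatically adjusts the number of Pod replicas to keep them at their target values.

CPU- or memory-based HPA is a poor fit for AI services. In GPU serving, CPU utilization does not reflect the real load, and GPU utilization is not available as a built-in resource metric. A custom-metric HPA is therefore required.

Configuring Custom Metrics

Among the metrics vLLM exposes, the best suited to autoscaling are the number of waiting requests (num_requests_waiting) and the GPU KV cache usage (gpu_cache_usage_perc).

Using these metrics in an HPA requires the Prometheus Adapter.

Installing Prometheus Adapter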

bash

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f adapter-values.yaml

adapter-values.yaml

yaml

prometheus:
  url: http://prometheus-server.monitoring.svc
  port: 9090
 
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "vllm_requests_waiting"
      metricsQuery: 'avg(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
 
    - seriesQuery: 'vllm:gpu_cache_usage_perc'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "vllm_gpu_cache_usage"
      metricsQuery: 'avg(vllm:gpu_cache_usage_perc{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

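HPA Based on Waiting Requests

The most intuitive scaling signal is the number of requests sitting in the wait queue. A growing queue means the current GPU capacity is insufficient, so Pods are added to increase throughput.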

hpa-requests-waiting.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300

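Here is what this configuration means.

Target: keep the average number of waiting requests per Pod at around 5.
Min/max replicas: always keep at least 2 Pods, and scale out to at most 8.
Scale-up: after a 60-second stabilization window, add at most 2 Pods every 120 seconds.
Scale-down: after a 300-second (5-minute) stabilization window, remove at most 1 Pod every 300 seconds.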
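The replica count behind these settings follows the formula documented for HPA: desired = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (10% by default). The sketch below is a simplified model of that calculation using the values from this manifest; it ignores the behavior rate limits and missing-metric handling, and the function name is made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas=2, max_replicas=8, tolerance=0.1):
    """Simplified HPA calculation for an AverageValue target."""
    ratio = current_avg / target_avg
    # Inside the tolerance band HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 8, 5))    # queue above target: scale out
print(desired_replicas(2, 5.2, 5))  # within tolerance: unchanged
print(desired_replicas(4, 1, 5))    # queue nearly empty: clamp to minReplicas
```

With 3 replicas averaging 8 waiting requests each, the formula asks for 5 replicas; the scaleUp policy then spreads that change so that no more than 2 Pods are added per 120 seconds.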

Info

It is important to make scale-down more conservative than scale-up. If a GPU Pod is removed and then needed again, loading the model takes several minutes, so a hasty scale-down degrades service quality.
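The scale-down stabilization window is what enforces this caution: during scale-down, HPA acts on the highest replica recommendation it computed within the window, so a brief dip in load does not remove Pods right away. A minimal sketch of that rule, with a hypothetical recommendation history of (timestamp in seconds, desired replicas) pairs:

```python
def stabilized_scale_down(history, now, window_seconds=300):
    """Return the replica count HPA would settle on when scaling down.

    history is a list of (timestamp_seconds, desired_replicas) pairs and is
    assumed to include the most recent recommendation.
    """
    recent = [replicas for ts, replicas in history
              if 0 <= now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the window")
    # Scale-down uses the highest recommendation seen in the window.
    return max(recent)


history = [(0, 6), (120, 4), (240, 3), (300, 3)]
print(stabilized_scale_down(history, now=300))  # 6: the t=0 peak is still in the window
print(stabilized_scale_down(history, now=310))  # 4: the peak has aged out
```

A longer window therefore keeps capacity around for longer after a peak, trading GPU cost for protection against reloading models.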

HPA Based on GPU KV Cache Usage

KV cache usage reflects how much GPU memory is actually in use. When cache usage is high, the server struggles to handle more concurrent requests, so it can serve as a scaling signal.

hpa-cache-usage.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa-cache
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage
        target:
          type: AverageValue
          averageValue: "750m"  # vLLM reports this gauge as a 0-1 fraction; 750m = 0.75 (75%)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 1
          periodSeconds: 180
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300

Multi-Metric HPA

In production, combining several metrics is more robust than relying on a single one. An HPA can specify multiple metrics at once; it computes a replica count for each and scales to the highest of them.

hpa-combined.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa-combined
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage
        target:
          type: AverageValue
          averageValue: "800m"  # vLLM reports this gauge as a 0-1 fraction; 800m = 0.8 (80%)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
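The "highest replica count wins" rule for multiple metrics can be sketched as follows. This is a simplified model that leaves out the tolerance band and behavior policies, and it assumes KV cache usage is expressed as a 0 to 1 fraction, which is how vLLM reports it. The metric values in the example are hypothetical.

```python
import math


def replicas_for_metric(current_replicas, current_avg, target_avg):
    """Replica count one metric would ask for on its own."""
    return math.ceil(current_replicas * current_avg / target_avg)


def combined_desired(current_replicas, metrics, min_replicas=2, max_replicas=8):
    """metrics maps a metric name to (current_avg, target_avg).

    Returns (desired_replicas, name_of_the_metric_that_decided).
    """
    proposals = {name: replicas_for_metric(current_replicas, cur, tgt)
                 for name, (cur, tgt) in metrics.items()}
    winner = max(proposals, key=proposals.get)
    desired = max(min_replicas, min(max_replicas, proposals[winner]))
    return desired, winner


desired, winner = combined_desired(4, {
    "vllm_requests_waiting": (3, 5),      # queue is below target
    "vllm_gpu_cache_usage": (0.9, 0.8),   # cache is above target
})
print(desired, winner)
```

Here the queue alone would allow shrinking to 3 replicas, but the cache metric asks for 5, so the HPA scales out. A quiet queue cannot mask memory pressure, which is the main benefit of combining the two signals.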

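Advanced Scaling with KEDA

Introducing KEDA

KEDA (Kubernetes Event-Driven Autoscaling) is a framework for event-driven autoscaling. It lets you define more flexible scaling rules than a plain HPA and, notably, supports scaling down to zero (scale to zero).

For AI services, KEDA's value stands out in non-production environments: when there is no traffic, GPU Pods can be scaled to 0 to save cost.

Installing KEDA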

bash

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

Prometheus-Based KEDA ScaledObject

keda-scaledobject.yaml

yaml

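apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: vllm-llama
  minReplicaCount: 1   # keep one warm Pod; set to 0 for scale-to-zero in non-production
  maxReplicaCount: 8
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:9090
        metricName: vllm_requests_waiting
        # sum, not avg: the threshold is a per-replica target, so KEDA/HPA
        # divides the query result by the replica count
        query: |
          sum(vllm:num_requests_waiting{namespace="ai-serving"})
        threshold: "5"
        activationThreshold: "1"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:9090
        metricName: vllm_request_rate
        query: |
          sum(rate(vllm:request_success_total{namespace="ai-serving"}[5m]))
        threshold: "10"
        activationThreshold: "0.1"

In this configuration, minReplicaCount: 1 keeps at least one "warm" Pod running even when there is no active traffic, which avoids a cold start at the cost of one idle GPU. For full scale-to-zero in non-production environments, set minReplicaCount: 0 instead. KEDA's idleReplicaCount field cannot be used to hold a warm Pod: it must be lower than minReplicaCount, and 0 is currently the only supported value.

activationThreshold is the value a trigger must exceed for KEDA to activate the workload, that is, to scale it from 0 to 1 replica, so it only matters when minReplicaCount is 0. Once the workload is active, threshold acts as the per-replica target that drives scaling between 1 and maxReplicaCount.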
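The interplay between activationThreshold and threshold can be modeled roughly as below. This is a deliberately simplified sketch for a single trigger: it ignores pollingInterval, cooldownPeriod, and HPA stabilization, and it assumes metric_value is the cluster-wide total returned by the Prometheus query. The function name is hypothetical.

```python
import math


def keda_replicas(metric_value, threshold, activation_threshold,
                  min_replicas, max_replicas):
    """Approximate replica count for one KEDA trigger."""
    # Activation phase: with minReplicaCount 0, the workload stays at zero
    # until the metric exceeds activationThreshold.
    if min_replicas == 0 and metric_value <= activation_threshold:
        return 0
    # Scaling phase: threshold is a per-replica target, so the total is
    # divided by it to get the replica count.
    desired = math.ceil(metric_value / threshold)
    lower = max(min_replicas, 1)  # an active workload has at least 1 replica
    return max(lower, min(max_replicas, desired))


print(keda_replicas(0.5, 5, 1, min_replicas=0, max_replicas=8))  # idle, scale to zero
print(keda_replicas(0.5, 5, 1, min_replicas=1, max_replicas=8))  # idle, warm Pod kept
print(keda_replicas(23, 5, 1, min_replicas=1, max_replicas=8))   # 23 waiting in total
```

With 23 waiting requests in total and a per-replica threshold of 5, the model asks for 5 replicas. The same division is why the query should return a sum rather than an average across Pods.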

Cluster Autoscaler

How Pod Scaling and Node Scaling Relate

HPA adjusts the number of Pods, but if the cluster does not have enough GPU nodes, the new Pods cannot be scheduled. Cluster Autoscaler (CA) detects unschedulable (Pending) Pods and automatically adds new nodes.

text

End-to-end autoscaling flow:
 
  Traffic increases
      |
      v
  HPA: decides to add Pods
      |
      v
  New Pod created --> Pending (not enough GPU nodes)
      |
      v
  Cluster Autoscaler: adds a new GPU node
      |
      v
  New node ready (3-5 min)
      |
      v
  Pod scheduled --> model loading (1-5 min)
      |
      v
  Readiness probe passes --> starts receiving traffic

Configuring Cluster Autoscaler on EKS

Installing Cluster Autoscaler

bash

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=ai-serving-cluster \
  --set awsRegion=ap-northeast-2 \
  --set extraArgs.scale-down-delay-after-add=10m \
  --set extraArgs.scale-down-unneeded-time=10m \
  --set extraArgs.skip-nodes-with-local-storage=false

The GPU node group must carry the appropriate tags for Cluster Autoscaler to discover it automatically.

EKS node group tags

bash

# Add tags to the AWS Auto Scaling group
aws autoscaling create-or-update-tags --tags \
  "ResourceId=gpu-node-group-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=gpu-node-group-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/ai-serving-cluster,Value=owned,PropagateAtLaunch=true"

Using Karpenter

Karpenter is a node provisioner that can be used on AWS as an alternative to Cluster Autoscaler. It offers faster node provisioning and more flexible instance selection than Cluster Autoscaler.

karpenter-nodepool.yaml

yaml

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
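  template:
    spec:
      # nodeClassRef is required in v1beta1; the EC2NodeClass name is an
      # assumption and must match a resource that exists in your cluster
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements: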
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge
            - p4de.24xlarge
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
            - spot
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
      taints:
        - key: nvidia.com/gpu
          value: present
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 32
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

Tip

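Karpenter provisions nodes faster than Cluster Autoscaler (about 60% faster, though the exact gain depends on the instance type and cluster setup). It can also choose flexibly among several instance types and mix Spot and On-Demand instances to optimize cost. On AWS, Karpenter is the recommended choice.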

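Predictive Scaling

Reactive scaling alone struggles to overcome the long GPU cold start. Analyzing traffic patterns and scaling up ahead of time, known as predictive scaling, is an effective strategy.

Scheduled Scaling with CronJobs

When the traffic pattern is predictable (for example, traffic rises at 9 a.m. on weekdays and falls off at 3 a.m.), scaling can be scheduled with CronJobs. Note that a CronJob schedule is evaluated in the time zone of the kube-controller-manager, typically UTC, unless spec.timeZone is set.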

scheduled-scale-up.yaml

yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-scale-up
  namespace: ai-serving
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-scaler
          containers:
            - name: scaler
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - hpa
                - vllm-hpa
                - -n
                - ai-serving
                - --type=merge
                - -p
                - '{"spec":{"minReplicas":4}}'
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-scale-down
  namespace: ai-serving
spec:
  schedule: "0 22 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-scaler
          containers:
            - name: scaler
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - hpa
                - vllm-hpa
                - -n
                - ai-serving
                - --type=merge
                - -p
                - '{"spec":{"minReplicas":2}}'
          restartPolicy: OnFailure
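The net effect of the two CronJobs is a time-dependent floor on the replica count: 4 from 08:00 to 22:00 on weekdays and 2 otherwise, including the whole weekend, because the last change before Saturday is Friday's 22:00 scale-down. The sketch below expresses that floor as a function, which can be handy for checking a schedule before deploying it. It assumes `now` is given in the same time zone the CronJobs are evaluated in; the function name is hypothetical.

```python
from datetime import datetime


def scheduled_min_replicas(now: datetime) -> int:
    """minReplicas the two CronJobs leave in place at a given moment."""
    is_weekday = now.weekday() < 5   # Monday=0 ... Friday=4, as in "1-5"
    in_peak = 8 <= now.hour < 22     # between the "0 8" and "0 22" jobs
    return 4 if is_weekday and in_peak else 2


print(scheduled_min_replicas(datetime(2026, 1, 30, 9, 0)))   # Friday morning
print(scheduled_min_replicas(datetime(2026, 1, 30, 22, 0)))  # Friday night
print(scheduled_min_replicas(datetime(2026, 1, 31, 10, 0)))  # Saturday
```

The HPA still scales above this floor on demand; the schedule only guarantees that warm capacity exists before the expected morning ramp.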

Maintaining a Warm Pool

The surest way to eliminate cold starts is to keep pre-warmed Pods, with the model already loaded, idle and ready. This costs money, but it is a sound strategy for critical services.

text

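Warm pool strategy:
 
Active Pods: 4 (serving current traffic)
Warm Pods: 2 (model loaded, not receiving traffic)
 
On a traffic spike:
  Warm Pods start receiving traffic immediately (0 s cold start)
  New warm Pods start provisioning at the same time
 
Cost: 2 warm Pods x $4/hour = $8/hour extra
Effect: cold start eliminated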
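The hourly figure above is easier to weigh once it is extended to a day or a month. A small worked calculation, using the $4 per Pod-hour from the example and assuming a 30-day month with the pool kept running around the clock:

```python
def warm_pool_cost(warm_pods, hourly_usd_per_pod, hours):
    """Extra spend for keeping idle, pre-warmed Pods for a number of hours."""
    return warm_pods * hourly_usd_per_pod * hours


print(warm_pool_cost(2, 4, 1))        # per hour
print(warm_pool_cost(2, 4, 24))       # per day
print(warm_pool_cost(2, 4, 24 * 30))  # per 30-day month
```

At $5,760 per month, an always-on pool of two Pods is a deliberate trade. Keeping the pool only during scheduled peak hours would reduce this proportionally.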

Monitoring and Alerting on Scaling

You should monitor how autoscaling behaves and set up alerts for abnormal situations.

Scaling alert rules

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-scaling-alerts
  namespace: monitoring
spec:
  groups:
    - name: vllm-scaling
      rules:
        - alert: HighRequestQueueDepth
          expr: avg(vllm:num_requests_waiting) > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM request queue is excessively deep"
            description: "The average number of waiting requests has exceeded 20 for more than 5 minutes. Scaling may be insufficient."
 
        - alert: MaxReplicasReached
          expr: kube_horizontalpodautoscaler_status_current_replicas{namespace="ai-serving"} == kube_horizontalpodautoscaler_spec_max_replicas{namespace="ai-serving"}
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "HPA has reached its maximum replicas"
            description: "The vLLM HPA has been at its maximum replica count for more than 10 minutes. Consider raising maxReplicas."
 
        - alert: PendingGPUPods
          expr: kube_pod_status_phase{namespace="ai-serving", phase="Pending"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU Pods are Pending"
            description: "A GPU Pod has been unschedulable for more than 5 minutes. The cluster may be short of GPU nodes."

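Summary

Autoscaling GPU-based AI services comes with its own challenges: long cold starts, high unit cost, and a coarse scaling unit. A combination of custom-metric HPA, event-driven scaling with KEDA, node-level scaling through Cluster Autoscaler or Karpenter, and predictive scaling addresses these challenges.

The next chapter covers strategies for reducing the real cost of operating AI services, with concrete guidance on Spot instances, model sharing, and optimized resource management.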
