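January 31, 2026 · AI / ML

Chapter 9: Integrating the Evaluation Pipeline into CI/CD

This chapter integrates LLM evaluation into the CI/CD pipeline so that quality is verified automatically whenever a prompt changes or a model is swapped.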

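Why Integrate LLM Evaluation into CI/CD

In traditional software, CI/CD automatically verifies that a code change does not break existing functionality. LLM applications need the same level of verification not only for code but also for changes to prompts, model settings, and parameters.

Changing a single line of a prompt can have a bigger impact than changing hundreds of lines of code. Even so, many teams ship prompt changes without code review or automated tests.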

text

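Change types the CI/CD pipeline must verify in an LLM project:

Code change          --> existing unit tests + integration tests
Prompt change        --> LLM evaluation tests (offline metrics)
Model swap           --> rerun the full benchmark
Parameter change     --> regression tests on designated metrics
Data source change   --> rerun the RAG quality evaluation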
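
The mapping above can be turned into a small dispatcher that decides which suites to run for a set of changed files. The following is a minimal sketch, not part of the original pipeline: the path patterns, the suite names, and the `select_suites` function are all hypothetical and would need to match your repository layout.

```python
from fnmatch import fnmatchcase

# Hypothetical (pattern, suite) rules; adjust to your own repository layout.
SUITE_RULES = [
    ("prompts/*", "llm_eval"),
    ("src/llm/config.yaml", "full_benchmark"),
    ("data/sources/*", "rag_eval"),
    ("src/*", "unit_and_integration"),
]


def select_suites(changed_files: list[str]) -> list[str]:
    """Return the evaluation suites triggered by the changed files, in rule order."""
    suites = []
    for path in changed_files:
        for pattern, suite in SUITE_RULES:
            # fnmatchcase's "*" also matches "/" so "prompts/*" covers nested files.
            if fnmatchcase(path, pattern) and suite not in suites:
                suites.append(suite)
    return suites
```

A file can trigger more than one suite: in this sketch a change to `src/llm/config.yaml` selects both the full benchmark and the ordinary code tests.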

Prompt Version Management

Managing Prompts Like Code

Keeping prompts in files separate from the code makes it easier to track their change history and to detect changes in CI/CD.

text

prompts/
  qa/
    system.txt         # system prompt
    user_template.txt  # user prompt template
    config.yaml        # model, temperature, and other settings
  summarize/
    system.txt
    user_template.txt
    config.yaml

prompts/qa/config.yaml

yaml

name: qa-system
version: "2.1.0"
model: claude-sonnet-4-20250514
temperature: 0.3
max_tokens: 1024
description: "Question-answering system prompt v2.1"
changelog:
  - version: "2.1.0"
    date: "2026-04-04"
    changes: "Added instructions for structured answer formatting"
  - version: "2.0.0"
    date: "2026-03-20"
    changes: "Introduced Chain-of-Thought reasoning steps"

python

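import hashlib
from pathlib import Path

import yaml  # PyYAML


class PromptManager:
    """Loads versioned prompts and detects changes between versions."""

    def __init__(self, prompts_dir: str):
        self.prompts_dir = Path(prompts_dir)

    def load_prompt(self, name: str) -> dict:
        """Load a prompt's text files and its configuration."""
        base_path = self.prompts_dir / name

        # Read explicitly as UTF-8 so non-ASCII prompts load the same on every platform.
        system_prompt = (base_path / "system.txt").read_text(encoding="utf-8")
        user_template = (base_path / "user_template.txt").read_text(encoding="utf-8")
        with open(base_path / "config.yaml", encoding="utf-8") as f:
            config = yaml.safe_load(f)

        return {
            "system_prompt": system_prompt,
            "user_template": user_template,
            "config": config,
            # The hash covers the prompt text only; changes to config.yaml
            # (model, temperature) do not alter it.
            "hash": self._compute_hash(system_prompt + user_template),
        }

    def _compute_hash(self, content: str) -> str:
        # A 12-character SHA-256 prefix is enough to tell prompt versions apart.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]

    def detect_changes(self, previous_hash: str, current_hash: str) -> bool:
        """Return True if the prompt content changed between two hashes."""
        return previous_hash != current_hash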
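
Because the hash in `PromptManager` covers only the two prompt files, a change to `model` or `temperature` in `config.yaml` would go undetected. One possible variation, shown here as a sketch under assumptions, is to fingerprint the prompt text together with the settings that affect output. The function name `prompt_fingerprint` and the choice of which config keys to include are illustrative, not taken from the original code.

```python
import hashlib
import json


def prompt_fingerprint(system_prompt: str, user_template: str, config: dict) -> str:
    """Hash the prompt text together with the settings that affect model output."""
    # Only behavior-changing fields are hashed; description and changelog are ignored.
    relevant = {key: config.get(key) for key in ("model", "temperature", "max_tokens")}
    payload = json.dumps(
        {"system": system_prompt, "user": user_template, "config": relevant},
        sort_keys=True,      # stable key order gives a stable hash
        ensure_ascii=False,  # keep non-ASCII prompt text as-is before encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Serializing to JSON with separate fields also avoids the ambiguity of plain concatenation, where moving text from the end of the system prompt to the start of the user template would produce the same hash.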

GitHub Actions-Based Evaluation Pipeline

Basic Workflow

.github/workflows/llm-eval.yml

yaml

name: LLM Evaluation Pipeline
 
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'eval/**'
  push:
    branches: [main]
 
env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
 
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      prompts_changed: ${{ steps.changes.outputs.prompts }}
      model_changed: ${{ steps.changes.outputs.model }}
    steps:
      - uses: actions/checkout@v4
      - id: changes
        uses: dorny/paths-filter@v3
        with:
          filters: |
            prompts:
              - 'prompts/**'
            model:
              - 'src/llm/config.yaml'
 
  quick-eval:
    needs: detect-changes
    if: needs.detect-changes.outputs.prompts_changed == 'true'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # required to post the PR comment
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements-eval.txt

      - name: Run quick evaluation
        run: |
          python -m eval.run \
            --dataset eval/datasets/core-50.json \
            --output results/quick-eval.json \
            --parallel 5

      # Assumes eval/format_results.py has a CLI entry point that prints
      # the output of format_eval_comment() for the given results file.
      - name: Format results
        if: github.event_name == 'pull_request'
        run: python -m eval.format_results results/quick-eval.json > results/comment.md

      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('results/comment.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

      # Runs last so the comment is posted even when the gate fails.
      - name: Check thresholds
        run: python -m eval.check_thresholds results/quick-eval.json
 
  full-eval:
    needs: detect-changes
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
 
      - name: Install dependencies
        run: pip install -r requirements-eval.txt
 
      - name: Run full evaluation
        run: |
          python -m eval.run \
            --dataset eval/datasets/full-500.json \
            --output results/full-eval.json \
            --parallel 10
 
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
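
The workflow calls `python -m eval.check_thresholds`, which this chapter does not show. The sketch below is one way such a module might look, assuming the results file uses the same shape as the PR comment formatter (a `metrics` mapping whose entries carry a `passed` flag). The function names are hypothetical.

```python
import json
import sys


def failed_metrics(results: dict) -> list[str]:
    """Return the names of metrics whose `passed` flag is missing or false."""
    return [
        name
        for name, data in results.get("metrics", {}).items()
        if not data.get("passed", False)
    ]


def main(path: str) -> int:
    """Read a results file and return a process exit code (0 = gate passed)."""
    with open(path, encoding="utf-8") as f:
        results = json.load(f)
    failed = failed_metrics(results)
    for name in failed:
        print(f"threshold not met: {name}", file=sys.stderr)
    # A non-zero exit code fails the CI step.
    return 1 if failed else 0
```

To use this from the command line, the module would end by calling `sys.exit(main(sys.argv[1]))`. The failing step only blocks a merge if the repository marks the job as a required status check.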

Posting Evaluation Results as a PR Comment

eval/format_results.py

python

def format_eval_comment(results: dict) -> str:
    """Format evaluation results as a Markdown comment for a GitHub PR."""
    status_icon = "[PASS]" if results["overall_pass"] else "[FAIL]"

    lines = [
        f"## LLM Evaluation Results {status_icon}",
        "",
        "| Metric | Score | Threshold | Status |",
        "|--------|-------|-----------|--------|",
    ]

    for metric_name, data in results["metrics"].items():
        score = round(data["mean"], 3)
        passed = "Pass" if data["passed"] else "Fail"
        lines.append(
            f"| {metric_name} | {score} | {data['threshold']} | {passed} |"
        )

    lines.append("")

    failures = results.get("failures")
    if failures:
        lines.append(f"### Failed Cases ({len(failures)})")
        lines.append("")
        # Show at most five failures to keep the comment readable.
        for failure in failures[:5]:
            lines.append(
                f"- **{failure['metric']}** on case `{failure['case_id']}`: "
                f"{round(failure['score'], 3)} (threshold: {failure['threshold']})"
            )

    if results.get("comparison"):
        lines.append("")
        lines.append("### Comparison with Previous Version")
        for metric, change in results["comparison"].items():
            # Assumes higher is better; a zero change is reported as unchanged.
            if change == 0:
                lines.append(f"- {metric}: unchanged")
                continue
            direction = "improved" if change > 0 else "degraded"
            lines.append(f"- {metric}: {direction} by {abs(round(change, 3))}")

    return "\n".join(lines)

Evaluation Gate Design

Stage-Specific Gates

python

class EvalGate:
    """Defines an evaluation gate for a CI/CD pipeline."""

    def __init__(self, gate_config: dict):
        self.config = gate_config

    def check(self, results: dict) -> dict:
        """Decide whether the results pass the gate."""
        failures = []
        metrics = results.get("metrics", {})

        for metric, criteria in self.config["thresholds"].items():
            actual = metrics.get(metric, {}).get("mean")
            if actual is None:
                failures.append({
                    "metric": metric,
                    "reason": "metric result missing",
                })
                continue

            if "min" in criteria and actual < criteria["min"]:
                failures.append({
                    "metric": metric,
                    "actual": actual,
                    "required_min": criteria["min"],
                    "reason": "below minimum threshold",
                })

            if "max" in criteria and actual > criteria["max"]:
                failures.append({
                    "metric": metric,
                    "actual": actual,
                    "required_max": criteria["max"],
                    "reason": "above maximum threshold",
                })

        # Regression check against the previous version
        if self.config.get("regression_check") and results.get("previous"):
            max_regression = self.config["regression_check"]["max_regression"]
            for metric in self.config["regression_check"]["metrics"]:
                current = metrics.get(metric, {}).get("mean", 0)
                previous = results["previous"].get(metric, {}).get("mean", 0)

                # Compare the relative drop; skip metrics with no positive previous score.
                if previous > 0 and (previous - current) / previous > max_regression:
                    failures.append({
                        "metric": metric,
                        "current": current,
                        "previous": previous,
                        "regression_pct": round(
                            (previous - current) / previous * 100, 1
                        ),
                        "reason": "exceeds allowed regression",
                    })

        return {
            "passed": len(failures) == 0,
            "failures": failures,
            "gate_name": self.config["name"],
        }


# Example gate configuration
pr_gate_config = {
    "name": "PR Quick Gate",
    "thresholds": {
        "answer_relevancy": {"min": 0.75},
        "faithfulness": {"min": 0.80},
        "toxicity": {"max": 0.05},
        "latency_p95_seconds": {"max": 5.0},
    },
    "regression_check": {
        "metrics": ["answer_relevancy", "faithfulness"],
        "max_regression": 0.05,  # reject relative drops of more than 5%
    },
}
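
The regression check in the gate is relative to the previous score, so a metric can fail it while still meeting its absolute minimum. The short calculation below works through that case with made-up scores; `relative_drop` is a helper introduced here only for illustration.

```python
def relative_drop(previous: float, current: float) -> float:
    """Fractional drop from previous to current (positive means worse)."""
    return (previous - current) / previous


max_regression = 0.05  # same value as in the example gate configuration

# 0.80 -> 0.77 is a 3.75% relative drop: within the 5% budget.
small = relative_drop(0.80, 0.77)

# 0.80 -> 0.75 is a 6.25% relative drop: over the budget, even though
# 0.75 still satisfies the absolute minimum of 0.75 for answer_relevancy.
large = relative_drop(0.80, 0.75)

print(small > max_regression)
print(large > max_regression)
```

This is why the gate reports both kinds of failure separately: the threshold protects an absolute floor, while the regression check protects against sliding toward it.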

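Separating Quick and Full Evaluation

text

CI/CD evaluation strategy:

PR stage (quick evaluation):
  - Dataset: 50 core cases (core-50)
  - Duration: 2-5 minutes
  - Purpose: prevent obvious regressions
  - Blocking: merge is blocked if the gate fails

After merge (full evaluation):
  - Dataset: all 500 cases (full-500)
  - Duration: 15-30 minutes
  - Purpose: detailed quality analysis, per-slice performance
  - Blocking: notification only (already deployed)

Nightly evaluation (comprehensive benchmark):
  - Dataset: full set + edge cases, 1,000 cases
  - Duration: 1-2 hours
  - Purpose: model drift detection, long-term trend analysis
  - Blocking: report the next day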
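
The three tiers can be selected from the event that triggered the run. The sketch below assumes the standard GitHub Actions event names (`pull_request`, `push`, `schedule`); the `EvalTier` type, the `select_tier` function, and the nightly dataset path are hypothetical, while the other two dataset paths come from the workflow shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTier:
    name: str
    dataset: str
    blocks_merge: bool


# The nightly dataset path is a placeholder, not a file from the original project.
TIERS = {
    "pull_request": EvalTier("quick", "eval/datasets/core-50.json", True),
    "push": EvalTier("full", "eval/datasets/full-500.json", False),
    "schedule": EvalTier("nightly", "eval/datasets/nightly.json", False),
}


def select_tier(event_name: str) -> EvalTier:
    """Map a GitHub Actions event name to an evaluation tier."""
    try:
        return TIERS[event_name]
    except KeyError:
        raise ValueError(f"no evaluation tier for event: {event_name}") from None
```

In a workflow, the event name is available to scripts through the `GITHUB_EVENT_NAME` environment variable, so a single entry point could call `select_tier(os.environ["GITHUB_EVENT_NAME"])`.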

Regression Testing Strategy

Prompt Regression Tests

eval/regression.py

python

import json


class RegressionTester:
    """Runs regression tests for prompt changes against a stored baseline."""

    def __init__(self, baseline_results_path: str):
        self.baseline = self._load_baseline(baseline_results_path)

    def _load_baseline(self, path: str) -> dict:
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def run_regression_test(
        self,
        current_results: dict,
        allowed_regression: float = 0.03,
    ) -> dict:
        """Compare current results with the baseline.

        allowed_regression is an absolute score difference, not a percentage.
        """
        comparisons = []

        for metric in self.baseline.get("metrics", {}):
            baseline_score = self.baseline["metrics"][metric].get("mean", 0)
            # A metric missing from the current results counts as 0.
            current_score = current_results.get("metrics", {}).get(
                metric, {}
            ).get("mean", 0)

            delta = current_score - baseline_score
            pct_change = (delta / baseline_score * 100) if baseline_score else 0

            comparisons.append({
                "metric": metric,
                "baseline": round(baseline_score, 4),
                "current": round(current_score, 4),
                "delta": round(delta, 4),
                "pct_change": round(pct_change, 2),  # informational only
                "regressed": delta < -allowed_regression,
                "improved": delta > allowed_regression,
            })

        regressed_metrics = [c for c in comparisons if c["regressed"]]

        return {
            "passed": len(regressed_metrics) == 0,
            "comparisons": comparisons,
            "regressed_count": len(regressed_metrics),
            "improved_count": len([c for c in comparisons if c["improved"]]),
        }

    def update_baseline(self, new_results: dict, output_path: str) -> None:
        """Overwrite the baseline file with new results."""
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(new_results, f, ensure_ascii=False, indent=2)

Case-Level Regression Tracking

Even when the aggregate metrics pass, individual cases can still suffer a large drop in quality.

python

def case_level_regression(
    baseline_cases: list,
    current_cases: list,
    threshold: float = 0.2
) -> list:
    """Detect regressions at the individual case level."""
    baseline_map = {c["case_id"]: c for c in baseline_cases}
    regressions = []
 
    for current in current_cases:
        case_id = current["case_id"]
        baseline = baseline_map.get(case_id)
 
        if not baseline:
            continue
 
        for metric in current.get("metrics", {}):
            curr_score = current["metrics"][metric]
            base_score = baseline.get("metrics", {}).get(metric, curr_score)
 
            if base_score - curr_score > threshold:
                regressions.append({
                    "case_id": case_id,
                    "metric": metric,
                    "baseline_score": round(base_score, 3),
                    "current_score": round(curr_score, 3),
                    "drop": round(base_score - curr_score, 3),
                    "input_preview": current.get("input", "")[:80],
                })
 
    return sorted(regressions, key=lambda x: x["drop"], reverse=True)
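
To see how an unchanged average can hide a case-level regression, consider the following small example. The scores are invented for illustration: four cases improve slightly and one collapses, so the mean does not move while a single case drops by far more than the 0.2 threshold used above.

```python
from statistics import mean

# Hypothetical per-case faithfulness scores for the same five cases.
baseline = {"c1": 0.80, "c2": 0.80, "c3": 0.80, "c4": 0.80, "c5": 0.90}
current = {"c1": 0.90, "c2": 0.90, "c3": 0.90, "c4": 0.90, "c5": 0.50}

# Aggregate view: both means are 0.82, so a mean-based check sees no regression.
mean_delta = mean(current.values()) - mean(baseline.values())

# Case-level view: keep only drops larger than 0.2.
big_drops = {
    case_id: round(baseline[case_id] - score, 3)
    for case_id, score in current.items()
    if baseline[case_id] - score > 0.2
}

print(abs(mean_delta) < 0.03)
print(big_drops)
```

A gate that looks only at means would pass this change, while the case-level check surfaces `c5` for review.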

Warning

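LLM evaluation incurs API costs. Running the full evaluation on every PR makes those costs grow quickly. A cost-efficient strategy is to evaluate only the core cases at the PR stage and run the full evaluation after merging to the main branch or overnight.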

Cost Optimization

Managing Evaluation Cost

python

class EvalCostTracker:
    """Tracks the cost of the evaluation pipeline."""

    def __init__(self, monthly_budget: float):
        self.monthly_budget = monthly_budget
        self.monthly_spent = 0.0
        self.eval_costs = []

    def estimate_eval_cost(
        self,
        dataset_size: int,
        avg_input_tokens: int,
        avg_output_tokens: int,
        model: str,
        judge_model: str | None = None,
    ) -> float:
        """Estimate the cost of an evaluation run before executing it."""
        # calculate_cost(input_tokens, output_tokens, model) is not defined in
        # this chapter; it is assumed to return the cost in dollars.
        # Cost of the model under test
        app_cost = calculate_cost(
            avg_input_tokens * dataset_size,
            avg_output_tokens * dataset_size,
            model
        )

        # Judge model cost (when using LLM-as-Judge)
        judge_cost = 0.0
        if judge_model:
            # The judge reads both the input and the generated output.
            judge_cost = calculate_cost(
                (avg_input_tokens + avg_output_tokens) * dataset_size,
                200 * dataset_size,  # judge responses are usually short
                judge_model
            )

        total = app_cost + judge_cost
        return round(total, 2)

    def record_eval_cost(self, cost: float) -> None:
        """Record the actual cost of a completed run against the budget."""
        self.eval_costs.append(cost)
        self.monthly_spent += cost

    def should_run_eval(self, estimated_cost: float) -> dict:
        """Check whether the evaluation fits within the remaining budget."""
        remaining = self.monthly_budget - self.monthly_spent

        if estimated_cost > remaining:
            return {
                "should_run": False,
                "reason": "Monthly budget exceeded (remaining: $"
                          + str(round(remaining, 2)) + ")",
                "suggestion": "Reduce the dataset size or run next month",
            }

        return {
            "should_run": True,
            "estimated_cost": estimated_cost,
            "remaining_after": round(remaining - estimated_cost, 2),
        }
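
The tracker relies on a `calculate_cost` helper that this chapter does not define. A minimal sketch of such a helper follows, assuming per-million-token pricing with separate input and output rates. The model names and prices are placeholders made up for the arithmetic, not real provider pricing.

```python
# Placeholder prices in dollars per one million tokens: (input, output).
# These numbers are invented for illustration only.
PRICES_PER_MILLION = {
    "example-app-model": (3.0, 15.0),
    "example-judge-model": (1.0, 5.0),
}


def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Return the dollar cost of a token volume under the price table above."""
    if model not in PRICES_PER_MILLION:
        raise ValueError(f"no price configured for model: {model}")
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# 50 cases with roughly 800 input and 300 output tokens each, plus a judge pass
# that reads input and output and writes about 200 tokens per case.
app = calculate_cost(800 * 50, 300 * 50, "example-app-model")
judge = calculate_cost((800 + 300) * 50, 200 * 50, "example-judge-model")
print(round(app + judge, 2))
```

With real prices substituted in, the same arithmetic shows how the judge pass contributes to the total, which is worth checking before enabling LLM-as-Judge on the larger datasets.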

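Overall Pipeline Architecture

Summary

Integrating LLM evaluation into CI/CD lets you automatically check that prompt changes and model swaps do not degrade existing quality. An effective approach is a multi-stage strategy: version prompts like code, block regressions with a quick evaluation at the PR stage, and run a full evaluation after merge for detailed analysis.

To keep evaluation costs under control, separate the core dataset from the full dataset and automate budget tracking.

The next chapter brings together all the concepts covered so far in a hands-on project that builds a comprehensive evaluation and monitoring system.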
