March 6, 2026 · AI / ML

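Chapter 3: A Deep Dive into lm-evaluation-harness

A deep dive into EleutherAI's lm-evaluation-harness. This chapter covers its 200+ tasks, its 25+ model backends, and its role as the backend of the HuggingFace leaderboard, and offers a hands-on guide from installation to writing custom tasks.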


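What you will learn in this chapter

The origins of lm-evaluation-harness and its place in the ecosystem
The framework's internal architecture and core modules
Installation, configuration, and basic usage
The structure of task YAML files and how to write them
A practical workflow for building custom tasks
How the framework relates to the HuggingFace Open LLM Leaderboard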

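The framework that became the de facto standard

EleutherAI's lm-evaluation-harness has become the de facto standard for evaluating large language models. Major AI companies and research organizations, including NVIDIA, Cohere, BigScience, and MosaicML, use it to evaluate their own models, and it has been adopted as the backend engine of HuggingFace's Open LLM Leaderboard.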

There are clear reasons for this broad adoption.

200+ predefined tasks: major benchmarks such as MMLU, HellaSwag, ARC, TruthfulQA, and GSM8K run out of the box
25+ model backends: HuggingFace Transformers, GPT-NeoX, Megatron-DeepSpeed, vLLM, the OpenAI API, and more
Tokenization-agnostic design: models can be compared fairly regardless of their tokenizer
YAML-based task definitions: new tasks can be added without changing framework code

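Internal architecture

The internals of lm-evaluation-harness closely follow the architectural principles covered in Chapter 2.

Core class structure

The framework is built around three abstractions.

Task: manages the full lifecycle of an evaluation task, from loading the dataset and building prompts to extracting answers and computing metrics.

LM (Language Model): the abstract interface for model backends. It defines three methods: generate_until, loglikelihood, and loglikelihood_rolling.

Evaluator: the orchestrator that connects a Task to an LM and runs the evaluation.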
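To make the division of labor concrete, here is a minimal, self-contained sketch of the three abstractions. It does not import lm-evaluation-harness: `ToyLM`, `ToyTask`, and `evaluate` are hypothetical stand-ins, and the real `LM` methods work on batches of request objects rather than plain strings.

```python
import math
from abc import ABC, abstractmethod


class LM(ABC):
    """Hypothetical stand-in for the harness's model interface."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Log-probability of `continuation` given `context`."""

    @abstractmethod
    def loglikelihood_rolling(self, text: str) -> float:
        """Log-probability of the whole text (used for perplexity)."""

    @abstractmethod
    def generate_until(self, context: str, until: list[str]) -> str:
        """Generate text until one of the stop strings appears."""


class ToyLM(LM):
    """Toy model: every character costs log(1/27); it always generates 'B'."""

    def loglikelihood(self, context, continuation):
        return len(continuation) * math.log(1 / 27)

    def loglikelihood_rolling(self, text):
        return self.loglikelihood("", text)

    def generate_until(self, context, until):
        return "B"


class ToyTask:
    """Owns the documents, the prompt format, and the metric."""

    docs = [
        {"question": "1 + 1 = ?", "choices": ["1", "2"], "answer": 1},
        {"question": "2 + 2 = ?", "choices": ["4", "22"], "answer": 0},
    ]

    def doc_to_text(self, doc):
        return f"Question: {doc['question']}\nAnswer:"

    def process(self, doc, scores):
        # Correct when the highest-scoring choice is the gold one
        # (the first choice wins on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        return float(best == doc["answer"])


def evaluate(lm: LM, task: ToyTask) -> float:
    """Orchestrator: builds requests, queries the model, aggregates."""
    per_doc = []
    for doc in task.docs:
        ctx = task.doc_to_text(doc)
        scores = [lm.loglikelihood(ctx, " " + c) for c in doc["choices"]]
        per_doc.append(task.process(doc, scores))
    return sum(per_doc) / len(per_doc)


accuracy = evaluate(ToyLM(), ToyTask())
print(accuracy)  # 0.5 for this toy setup
```

The point of the split is that `evaluate` never needs to know which backend it is talking to, and the model never needs to know which benchmark is being run.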

Installation and basic usage

Installation

terminal

bash

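# Basic install
pip install lm-eval

# Install with a specific model backend
# (quote the extras so shells such as zsh do not expand the brackets)
pip install "lm-eval[vllm]"
pip install "lm-eval[api]"  # API backends such as OpenAI and Anthropic

# Development install (from source)
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"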

Basic usage from the CLI

terminal

bash

# Run a single task
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 16 \
    --output_path results/
 
# Run several tasks in one invocation
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2 \
    --batch_size auto \
    --output_path results/
 
# Use the vLLM backend
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto

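Tip

With --batch_size auto, the framework chooses a batch size that fits in GPU memory for you. It probes for the largest batch size that runs without an out-of-memory error, which is usually more robust than guessing a fixed value by hand.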
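One common way to implement this kind of probing is to back off whenever a trial batch runs out of memory. The sketch below illustrates that general idea only: `find_batch_size` and `fake_forward` are hypothetical names, the memory limit is simulated, and the harness's own logic may differ in its details.

```python
def find_batch_size(try_batch, start: int = 64) -> int:
    """Return the first size in start, start/2, start/4, ... that runs."""
    size = start
    while size >= 1:
        try:
            try_batch(size)
            return size
        except MemoryError:
            size //= 2  # back off and retry with half the batch
    raise RuntimeError("even batch size 1 does not fit in memory")


def fake_forward(batch_size: int, limit: int = 24) -> None:
    """Simulated forward pass that runs out of memory above `limit`."""
    if batch_size > limit:
        raise MemoryError(f"batch of {batch_size} exceeds the simulated limit")


best = find_batch_size(fake_forward)
print(best)  # 16 with the simulated limit of 24
```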

Running from the Python API

run_eval.py

python

import lm_eval
 
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu", "hellaswag"],
    num_fewshot=5,
    batch_size=16,
    log_samples=True,
)
 
# Inspect the results
for task_name, task_result in results["results"].items():
    print(f"{task_name}: {task_result}")
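Each entry in `results["results"]` is a dictionary of metric values. In recent 0.4.x releases the keys combine the metric and filter name, such as `acc,none`. The sketch below assumes that layout and uses made-up numbers, so treat both the key format and the values as illustrative. `summarize` is a hypothetical helper.

```python
def summarize(results: dict) -> list[tuple[str, str, float]]:
    """Flatten {task: {"metric,filter": value}} into (task, metric, value) rows."""
    rows = []
    for task, metrics in sorted(results.items()):
        for key, value in metrics.items():
            metric, _, _filter_name = key.partition(",")
            # Skip stderr entries and non-numeric fields such as "alias".
            if metric.endswith("_stderr") or not isinstance(value, (int, float)):
                continue
            rows.append((task, metric, float(value)))
    return rows


# Made-up numbers in the assumed layout, standing in for results["results"].
sample = {
    "hellaswag": {
        "alias": "hellaswag",
        "acc,none": 0.59,
        "acc_stderr,none": 0.005,
        "acc_norm,none": 0.78,
    },
    "mmlu": {"alias": "mmlu", "acc,none": 0.68, "acc_stderr,none": 0.004},
}

rows = summarize(sample)
for task, metric, value in rows:
    print(f"{task:<12}{metric:<10}{value:.3f}")
```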

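Task YAML structure

One of the most powerful features of lm-evaluation-harness is its YAML-based task definition. A new benchmark can be added without writing Python code, as long as the built-in prompt templating, metrics, and filters cover what the task needs.

Basic structure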

tasks/my_benchmark/my_task.yaml

yaml

# Task metadata
task: my_custom_task
group: my_benchmark_group
dataset_path: my_org/my_dataset
dataset_name: default
test_split: test
validation_split: validation

# Prompt construction
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"

# Evaluation settings
output_type: generate_until
generation_kwargs:
  max_gen_toks: 128
  temperature: 0.0
  do_sample: false

# Few-shot settings
num_fewshot: 5
fewshot_split: validation

# Metric definitions
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: mean
    higher_is_better: true

# Metadata
metadata:
  version: 1.0

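Key fields

Field	Description
`task`	Unique identifier for the task
`group`	Task group, used to bundle related tasks (recent releases favor `tag` for this inside task YAMLs)
`dataset_path`	HuggingFace dataset path or local path
`doc_to_text`	Jinja2 template that turns an input document into prompt text
`doc_to_target`	Jinja2 template that extracts the gold answer
`output_type`	Evaluation type (`generate_until`, `loglikelihood`, `loglikelihood_rolling`, `multiple_choice`)
`metric_list`	Metrics to compute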
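To see how `doc_to_text`, `doc_to_target`, and `num_fewshot` interact, the sketch below renders a prompt with a deliberately simplified `{{field}}` substitution instead of real Jinja2 and joins the few-shot examples with blank lines. `render` and `build_prompt` are hypothetical helpers, and the delimiters the harness uses are configurable, so this shows the shape of the prompt rather than its exact bytes.

```python
import re

DOC_TO_TEXT = "Question: {{question}}\nAnswer:"
DOC_TO_TARGET = "{{answer}}"


def render(template: str, doc: dict) -> str:
    """Replace each {{field}} with doc[field]; a tiny stand-in for Jinja2."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(doc[m.group(1)]), template)


def build_prompt(doc: dict, fewshot_docs: list[dict]) -> str:
    """Prepend solved few-shot examples, then the unsolved target question."""
    shots = [
        render(DOC_TO_TEXT, d) + " " + render(DOC_TO_TARGET, d)
        for d in fewshot_docs
    ]
    return "\n\n".join(shots + [render(DOC_TO_TEXT, doc)])


fewshot = [{"question": "2 + 2?", "answer": "4"}]
target = {"question": "3 + 5?", "answer": "8"}

prompt = build_prompt(target, fewshot)
print(prompt)
```

The model sees one solved example followed by the open question, and its continuation is then compared against the rendered `doc_to_target`.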

Multiple-choice task example

Multiple-choice questions use the multiple_choice output type, which scores each answer choice by its log-likelihood under the model.

tasks/custom_mc/custom_mc.yaml

yaml

task: custom_multiple_choice
dataset_path: my_org/my_mc_dataset
test_split: test
 
output_type: multiple_choice

doc_to_text: |
  Choose the most appropriate answer to the following question.

  Question: {{question}}
  A) {{choices[0]}}
  B) {{choices[1]}}
  C) {{choices[2]}}
  D) {{choices[3]}}
  Answer:
 
doc_to_target: "{{['A', 'B', 'C', 'D'][answer]}}"
doc_to_choice: ["A", "B", "C", "D"]
 
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true

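Info

acc_norm is length-normalized accuracy. Each choice's log-likelihood is divided by the length of the choice text (its byte length, not its token count) before the best choice is picked. This reduces the penalty that longer choices otherwise incur and keeps the metric independent of the tokenizer.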
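The difference between the two metrics is easiest to see with numbers. The sketch below uses made-up log-likelihoods and normalizes each score by the length of its choice (character length here, an assumption of this sketch). `acc` and `acc_norm` are hypothetical helper functions written for illustration, not the library's code.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def acc(logliks, gold):
    """1.0 if the choice with the highest raw log-likelihood is the gold one."""
    return float(argmax(logliks) == gold)


def acc_norm(logliks, choices, gold):
    """Same check after dividing each log-likelihood by its choice's length."""
    normalized = [ll / len(c) for ll, c in zip(logliks, choices)]
    return float(argmax(normalized) == gold)


choices = ["Lyon", "Paris, the capital city of France"]
logliks = [-4.0, -12.0]  # made-up summed log-probabilities
gold = 1                 # the long choice is the correct one

raw = acc(logliks, gold)
norm = acc_norm(logliks, choices, gold)
print(raw)   # 0.0: the short choice wins on raw log-likelihood
print(norm)  # 1.0: per-character, the long choice scores higher
```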

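Hands-on: writing a custom task

Let's walk through building a custom task step by step.

Scenario: Korean commonsense reasoning evaluation

We will build a custom task that evaluates commonsense reasoning in Korean.

Step 1: Create the directory structure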

terminal

bash

mkdir -p lm_eval/tasks/korean_reasoning
touch lm_eval/tasks/korean_reasoning/_default_template.yaml
touch lm_eval/tasks/korean_reasoning/korean_commonsense.yaml

Step 2: Define the shared group template

lm_eval/tasks/korean_reasoning/_default_template.yaml

yaml

group: korean_reasoning
dataset_path: my_org/korean_commonsense_qa
output_type: multiple_choice
 
num_fewshot: 5
fewshot_split: validation
 
metadata:
  version: 1.0

Step 3: Write the task YAML

lm_eval/tasks/korean_reasoning/korean_commonsense.yaml

yaml

include: _default_template.yaml
task: korean_commonsense_qa
 
dataset_name: commonsense
test_split: test
validation_split: validation
 
doc_to_text: !function utils.format_question
doc_to_target: "{{answer}}"
doc_to_choice: "{{choices}}"
 
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
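Conceptually, `include` loads the template first and then lets the task file override or extend its keys. The sketch below models that with plain dictionaries. It is a simplification (a shallow merge written for illustration, with `resolve_include` as a hypothetical helper), not the harness's loader.

```python
def resolve_include(template: dict, task: dict) -> dict:
    """Shallow merge: keys in the task file win over keys in the template."""
    merged = dict(template)
    merged.update({k: v for k, v in task.items() if k != "include"})
    return merged


# The two YAML files from steps 2 and 3, abbreviated as dictionaries.
default_template = {
    "dataset_path": "my_org/korean_commonsense_qa",
    "num_fewshot": 5,
    "fewshot_split": "validation",
}
korean_commonsense = {
    "include": "_default_template.yaml",
    "task": "korean_commonsense_qa",
    "dataset_name": "commonsense",
    "test_split": "test",
}

config = resolve_include(default_template, korean_commonsense)
print(config["task"], config["num_fewshot"])
```

Keeping shared settings in the template means a second task in the same directory only has to state what differs, such as its `task` name and `dataset_name`.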

Step 4: Write the utility function

lm_eval/tasks/korean_reasoning/utils.py

python

def format_question(doc: dict) -> str:
    """Convert a dataset document into a Korean prompt ("질문" = question, "정답" = answer)."""
    question = doc["question"]
    choices = doc["choices"]
    
    prompt = f"질문: {question}\n"
    for i, choice in enumerate(choices):
        label = chr(ord("A") + i)
        prompt += f"{label}) {choice}\n"
    prompt += "정답:"
    
    return prompt

Step 5: Run

terminal

bash

lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks korean_commonsense_qa \
    --num_fewshot 5 \
    --include_path ./lm_eval/tasks/korean_reasoning \
    --output_path results/korean/

Warning

When you write a custom task, pass the task directory with the --include_path flag so the framework can find it. Without the flag, only the built-in task list is searched.

HuggingFace Open LLM Leaderboard

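lm-evaluation-harness has served as the evaluation engine of the HuggingFace Open LLM Leaderboard. Models submitted to the leaderboard are evaluated automatically with this framework against a standardized benchmark suite.

Leaderboard v2 benchmark composition

Open LLM Leaderboard v2 uses the following six benchmarks.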

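Benchmark	Domain	Metric	Difficulty
MMLU-Pro	Academic knowledge	accuracy	High
GPQA	Graduate-level science	accuracy	Very high
MuSR	Multi-step reasoning	accuracy	High
MATH	Math problem solving	exact_match	High
IFEval	Instruction following	strict_acc	Medium
BBH	Complex reasoning	accuracy	High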

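Thanks to this standardized setup, models built by different research teams can be compared fairly. Prompt format, few-shot settings, and output parsing are all fixed, so result differences caused by implementation details are kept to a minimum.

Advanced features

Task groups and inheritance

Related tasks can be bundled into a group and run together.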

tasks/my_suite/_group.yaml

yaml

group: my_evaluation_suite
task:
  - mmlu
  - hellaswag
  - arc_challenge
  - korean_commonsense_qa
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true

terminal

bash

# Run the whole group
lm_eval --model hf \
    --model_args pretrained=my-model \
    --tasks my_evaluation_suite
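With `weight_by_size: true`, larger subtasks count for more in the group score. The sketch below contrasts a size-weighted mean with a plain mean, using made-up accuracies and sample counts. `aggregate` is a hypothetical helper, not the harness's implementation.

```python
def aggregate(scores: dict, sizes: dict, weight_by_size: bool = True) -> float:
    """Mean of subtask scores, optionally weighted by number of samples."""
    if not weight_by_size:
        return sum(scores.values()) / len(scores)
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Made-up per-task accuracies and test-set sizes.
scores = {"task_a": 0.50, "task_b": 0.90}
sizes = {"task_a": 9000, "task_b": 1000}

micro = aggregate(scores, sizes, weight_by_size=True)
macro = aggregate(scores, sizes, weight_by_size=False)
print(round(micro, 2), round(macro, 2))  # 0.54 0.7
```

The weighted figure answers "what fraction of all questions did the model get right", while the unweighted one treats every subtask as equally important regardless of size.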

Filter chains

Filters that post-process model output can be chained together.

filter_chain_example.yaml

yaml

filter_list:
  - name: extraction_pipeline
    filter:
      - function: regex
        regex_pattern: "답: ([A-D])"
        group_select: 0
      - function: map
        mapping_dict:
          A: 0
          B: 1
          C: 2
          D: 3
      - function: take_first

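This filter chain extracts the answer label from the model's free-form output (the pattern matches "답: X", Korean for "Answer: X") and maps it to a numeric index. It is a key mechanism for making output parsing robust in generation-based evaluation.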
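The same two-stage idea can be tried outside the harness in a few lines of Python. In this sketch `regex_filter` and `map_filter` are hypothetical helpers that mimic the two stages; the `[invalid]` fallback marker and the `-1` default are assumptions made for illustration.

```python
import re

PATTERN = re.compile(r"답: ([A-D])")  # "답" is Korean for "answer"
MAPPING = {"A": 0, "B": 1, "C": 2, "D": 3}


def regex_filter(response: str, fallback: str = "[invalid]") -> str:
    """Return the first captured label, or the fallback if nothing matches."""
    matches = PATTERN.findall(response)
    return matches[0] if matches else fallback


def map_filter(label: str, default: int = -1) -> int:
    """Map a letter label to its index; unknown labels get the default."""
    return MAPPING.get(label, default)


def extraction_pipeline(response: str) -> int:
    """Run the two stages in order, like a filter chain."""
    return map_filter(regex_filter(response))


outputs = [
    "생각해 보면 답: C 입니다.",  # "Thinking it over, the answer is C."
    "잘 모르겠습니다.",            # "I'm not sure." (no label to extract)
]
indices = [extraction_pipeline(o) for o in outputs]
print(indices)  # [2, -1]
```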

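Key takeaways

lm-evaluation-harness is the de facto standard framework for LLM evaluation, supporting 200+ tasks and 25+ model backends.
YAML-based task definitions let you add new benchmarks without changing framework code.
doc_to_text, doc_to_target, output_type, and metric_list are the four core fields of a task YAML.
As the backend of the HuggingFace Open LLM Leaderboard, it provides a common basis for fair comparison between models.
When writing a custom task, pass the task directory with the --include_path flag.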

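Next chapter preview

Chapter 4 looks at HELM, the framework developed by Stanford CRFM. Rather than measuring accuracy alone, HELM evaluates models holistically along seven dimensions (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency). We will analyze that approach along with its 16 core scenarios.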
