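February 24, 2026 · Infrastructure

Chapter 9: AI Service Observability

Learn AI observability through LLM call tracing, token usage and cost monitoring, AI agent behavior tracing, and the OpenTelemetry (OTel) integrations for LangChain and LlamaIndex.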

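Learning objectives

Understand the core signals for LLM calls (tokens, latency, cost)
Learn how to trace an AI agent's behavior patterns with traces
Grasp the design principles of prompt-response logging
Practice the OTel integrations for LangChain and LlamaIndex
Build metrics and dashboards dedicated to AI services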

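Why AI observability is needed

AI services have observability requirements that differ fundamentally from those of traditional web services.

Traditional service	AI service
Deterministic responses	Non-deterministic responses (different output for the same input)
Fixed processing cost	Variable, token-based cost
Latency in milliseconds	Latency in seconds to tens of seconds
Code-based logic	Prompt-based logic
Simple request-response	Multi-step reasoning (agents, RAG, chaining)

Because of these differences, conventional HTTP request monitoring alone cannot tell you enough about the quality, cost, and stability of an AI service.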

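LLM call tracing

Core signals

Collect the following signals for every LLM API call.

Signal	Description	Metric type
Input tokens	Tokens consumed by the prompt	Counter
Output tokens	Tokens generated in the response	Counter
Total tokens	Input + output	Counter
Call latency	Time to first token (TTFT) and total response time	Histogram
Call cost	Cost computed from token counts	Counter
Error rate	Share of API failures and timeouts	Counter
Model name/version	Identifies the model used	Attribute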
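The latency row covers two different measurements, and TTFT is only observable when the response is streamed. The following is a minimal sketch, assuming the client exposes the stream as a plain Python iterator of text chunks. `measure_stream` and `fake_stream` are hypothetical names used for illustration, not part of any SDK.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[List[str], Optional[float], float]:
    """Consume a stream and return (chunks, ttft_seconds, total_seconds).

    ttft_seconds is None when the stream yields nothing.
    """
    start = time.perf_counter()
    ttft: Optional[float] = None
    received: List[str] = []
    for chunk in chunks:
        if ttft is None:
            # The first chunk has arrived: this is the time to first token
            ttft = time.perf_counter() - start
        received.append(chunk)
    total = time.perf_counter() - start
    return received, ttft, total


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical)."""
    time.sleep(0.01)  # simulated delay before the first token
    yield "Hello"
    time.sleep(0.01)  # simulated delay between tokens
    yield " world"


chunks, ttft, total = measure_stream(fake_stream())
print("".join(chunks), ttft, total)
```

In a real service both values would be recorded into histograms, one for TTFT and one for total duration, so that slow starts and slow generation can be told apart.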

Manual instrumentation

llm_instrumentation.py

python

import time
from opentelemetry import trace, metrics
from opentelemetry.trace import SpanKind
 
tracer = trace.get_tracer("ai-service.llm")
meter = metrics.get_meter("ai-service.llm")
 
# Metric definitions
token_counter = meter.create_counter(
    "llm.token.usage",
    description="Total tokens used in LLM calls",
    unit="tokens",
)
llm_duration = meter.create_histogram(
    "llm.request.duration",
    description="Duration of LLM API calls",
    unit="s",
)
llm_cost = meter.create_counter(
    "llm.request.cost",
    description="Estimated cost of LLM API calls",
    unit="USD",
)
 
# Price per token by model in USD (per-1K-token price / 1000); check current provider pricing
MODEL_PRICING = {
    "gpt-4o": {"input": 0.0025 / 1000, "output": 0.01 / 1000},
    "gpt-4o-mini": {"input": 0.00015 / 1000, "output": 0.0006 / 1000},
    "claude-sonnet-4-20250514": {"input": 0.003 / 1000, "output": 0.015 / 1000},
}
 
 
def traced_llm_call(client, model: str, messages: list, **kwargs):
    """Wrapper around an LLM call with OTel instrumentation applied."""
    
    with tracer.start_as_current_span(
        "llm.chat.completion",
        kind=SpanKind.CLIENT,
    ) as span:
        # Set request attributes
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.provider", "openai")
        span.set_attribute("llm.request.temperature", kwargs.get("temperature", 1.0))
        span.set_attribute("llm.request.max_tokens", kwargs.get("max_tokens", 0))
        
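        # perf_counter() is monotonic, so the measured duration is unaffected by wall-clock adjustments
        start_time = time.perf_counter()

        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs,
            )
            duration = time.perf_counter() - start_time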
            
            # Set response attributes
            usage = response.usage
            span.set_attribute("llm.usage.input_tokens", usage.prompt_tokens)
            span.set_attribute("llm.usage.output_tokens", usage.completion_tokens)
            span.set_attribute("llm.usage.total_tokens", usage.total_tokens)
            span.set_attribute("llm.response.finish_reason", response.choices[0].finish_reason)
            
            # Record metrics
            attrs = {"llm.model": model, "llm.provider": "openai"}
            
            token_counter.add(usage.prompt_tokens, {**attrs, "llm.token.type": "input"})
            token_counter.add(usage.completion_tokens, {**attrs, "llm.token.type": "output"})
            llm_duration.record(duration, attrs)
            
            # Compute cost (models missing from MODEL_PRICING are counted as zero)
            pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
            cost = (
                usage.prompt_tokens * pricing["input"]
                + usage.completion_tokens * pricing["output"]
            )
            llm_cost.add(cost, attrs)
            span.set_attribute("llm.cost.estimated_usd", cost)
            
            return response
            
        except Exception as e:
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
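To sanity-check the cost formula, the same arithmetic can be run on its own. This sketch reuses the `gpt-4o` entry from `MODEL_PRICING` above. The helper name `estimate_cost` and the token counts are illustrative assumptions.

```python
MODEL_PRICING = {
    "gpt-4o": {"input": 0.0025 / 1000, "output": 0.01 / 1000},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    # Unknown models fall back to zero, mirroring the wrapper above
    pricing = MODEL_PRICING.get(model, {"input": 0.0, "output": 0.0})
    return input_tokens * pricing["input"] + output_tokens * pricing["output"]


# 1,200 input tokens and 300 output tokens:
# 1200 * 0.0000025 + 300 * 0.00001 = 0.003 + 0.003 = 0.006 USD
cost = estimate_cost("gpt-4o", 1200, 300)
print(f"{cost:.4f} USD")
```

Because an unknown model is priced at zero, a model name missing from the table silently under-reports cost. Logging a warning in that branch may be worth considering.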

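Info

OpenTelemetry's semantic conventions are gaining attributes for GenAI (Generative AI). Standard attribute names such as gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens are defined, so following the semantic conventions instead of custom attributes pays off in the long run. These conventions are still evolving, so check the current specification for the exact names.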
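One way to migrate is to rename attributes at a single point before they are set on a span. The sketch below is an assumption-laden illustration: the left-hand keys are this chapter's custom names, the right-hand names are GenAI semantic-convention attributes as understood at the time of writing, and `to_genai_attributes` is a hypothetical helper. Verify every name against the convention version you target.

```python
# Custom attribute name -> GenAI semantic-convention name (verify against the spec)
CUSTOM_TO_GENAI = {
    "llm.provider": "gen_ai.system",
    "llm.model": "gen_ai.request.model",
    "llm.request.temperature": "gen_ai.request.temperature",
    "llm.request.max_tokens": "gen_ai.request.max_tokens",
    "llm.usage.input_tokens": "gen_ai.usage.input_tokens",
    "llm.usage.output_tokens": "gen_ai.usage.output_tokens",
}


def to_genai_attributes(custom_attrs: dict) -> dict:
    """Rename known keys and pass unknown keys through unchanged."""
    return {CUSTOM_TO_GENAI.get(key, key): value for key, value in custom_attrs.items()}


attrs = to_genai_attributes({
    "llm.provider": "openai",
    "llm.model": "gpt-4o",
    "llm.cost.estimated_usd": 0.006,  # no standard equivalent assumed, kept as-is
})
print(attrs)
```

Keeping the mapping in one dictionary means a later rename in the conventions only needs a one-line change.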

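AI agent behavior tracing

An AI agent is not a single LLM call: it involves multi-step reasoning, tool calls, and iterative loops. Recording each step as a span makes the agent's behavior patterns available for analysis.

Agent tracing implementation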

agent_tracing.py

python

from opentelemetry import trace
 
tracer = trace.get_tracer("ai-service.agent")
 
 
class TracedAgent:
    def __init__(self, llm_client):
        self.llm = llm_client
        self.max_iterations = 10
    
    def run(self, user_query: str):
        with tracer.start_as_current_span("agent.run") as root_span:
            root_span.set_attribute("agent.query", user_query)
            root_span.set_attribute("agent.max_iterations", self.max_iterations)
            
            iteration = 0
            result = None
            
            while iteration < self.max_iterations:
                iteration += 1
                
                # Reasoning step
                with tracer.start_as_current_span("agent.think") as think_span:
                    think_span.set_attribute("agent.iteration", iteration)
                    action = self._plan_next_action(user_query, result)
                    think_span.set_attribute("agent.action.type", action["type"])
                
                if action["type"] == "final_answer":
                    root_span.set_attribute("agent.total_iterations", iteration)
                    root_span.set_attribute("agent.status", "completed")
                    return action["content"]
                
                # Tool call step
                with tracer.start_as_current_span("agent.tool_call") as tool_span:
                    tool_span.set_attribute("agent.tool.name", action["tool"])
                    tool_span.set_attribute("agent.tool.input", str(action["input"]))
                    
                    result = self._execute_tool(action["tool"], action["input"])
                    
                    tool_span.set_attribute("agent.tool.output_length", len(str(result)))
                    tool_span.add_event("tool-executed", {
                        "tool.name": action["tool"],
                        "tool.success": True,
                    })
            
            root_span.set_attribute("agent.status", "max_iterations_reached")
            root_span.set_status(trace.StatusCode.ERROR, "Max iterations reached")
            return None

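Prompt-response logging

Logging prompts and responses helps with debugging and quality control, but storage cost and privacy must be weighed against that benefit.

Logging strategy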

prompt_logging.py

python

import logging
import hashlib
 
logger = logging.getLogger("ai-service.prompts")
 
 
def log_llm_interaction(
    model: str,
    messages: list,
    response_text: str,
    log_content: bool = False,
):
    """Log an LLM interaction with privacy in mind."""
    
    log_data = {
        "llm.model": model,
        "llm.message_count": len(messages),
        "llm.response_length": len(response_text),
    }
    
    if log_content:
        # Log prompt content only in development/staging environments
        log_data["llm.prompt"] = messages[-1]["content"][:1000]  # at most 1,000 characters
        log_data["llm.response"] = response_text[:1000]
    else:
        # In production, record only hashes
        prompt_text = messages[-1]["content"]
        log_data["llm.prompt_hash"] = hashlib.sha256(prompt_text.encode()).hexdigest()[:16]
        log_data["llm.response_hash"] = hashlib.sha256(response_text.encode()).hexdigest()[:16]
    
    logger.info("LLM call completed", extra=log_data)

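Warning

Logging full prompts and responses in production sharply increases storage cost and can violate privacy regulations such as GDPR and Korea's Personal Information Protection Act. In production, log hashes or a sample of requests, and restrict full-content logging to development and debugging environments.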
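Sampled logging can be made deterministic so that the same request always gets the same decision across services. The following is a minimal sketch, assuming each request carries a stable identifier. `should_log_content`, `request_id`, and the 10% rate are hypothetical choices for illustration.

```python
import hashlib


def should_log_content(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of requests for content logging."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Roughly 10% of 10,000 synthetic request IDs are selected
selected = sum(should_log_content(f"req-{i}", 0.1) for i in range(10_000))
print(selected)
```

The result could be passed as the `log_content` argument of `log_llm_interaction`. Sampled content can still contain personal data, so redaction may be needed before it is stored.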

Model drift detection

Model drift, where the quality of LLM responses changes over time, can be detected with metrics.

Detection signals

drift_detection.py

python

from opentelemetry import metrics

meter = metrics.get_meter("ai-service.quality")

# Response length distribution -- a sudden shift suggests drift
response_length = meter.create_histogram(
    "llm.response.length",
    description="Length of LLM responses in characters",
    unit="characters",
)

# Structured-response parse outcomes -- a rising JSON parse failure rate is a sign of drift
parse_success = meter.create_counter(
    "llm.response.parse.count",
    description="Count of response parsing attempts",
)

# Quality score based on user feedback
quality_score = meter.create_histogram(
    "llm.response.quality_score",
    description="Quality score of LLM responses (0-1)",
)

# Token efficiency -- rising token usage for the same task indicates drift
token_efficiency = meter.create_histogram(
    "llm.token.efficiency",
    description="Tokens used per unit of useful output",
    unit="tokens/char",
)
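The parse counter is only useful for alerting if each attempt is labeled with its outcome. The sketch below assumes a `result` attribute with the values `"success"` and `"failure"`, matching the label that the alert rule in the next section filters on. `parse_result_label` is a hypothetical helper.

```python
import json


def parse_result_label(response_text: str) -> str:
    """Return "success" if the text parses as JSON, otherwise "failure"."""
    try:
        json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return "failure"
    return "success"


# With the counter defined above, each attempt would be recorded as:
#   parse_success.add(1, {"result": parse_result_label(text)})
for text in ['{"answer": 42}', "Sure! Here is the JSON you asked for:"]:
    print(parse_result_label(text))
```

Counting both outcomes on one counter keeps the failure rate computable as failures divided by all attempts.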

Prometheus alert rules

ai-alert-rules.yaml

yaml

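groups:
  - name: ai-drift-detection
    rules:
      # Mean response length over the last hour differs from the 7-day mean by more than 50%.
      # A histogram is exported as _sum and _count series, so the mean is sum / count.
      - alert: LLMResponseLengthDrift
        expr: |
          abs(
            sum(rate(llm_response_length_characters_sum[1h]))
              / sum(rate(llm_response_length_characters_count[1h]))
            - sum(rate(llm_response_length_characters_sum[7d]))
              / sum(rate(llm_response_length_characters_count[7d]))
          )
          / (
            sum(rate(llm_response_length_characters_sum[7d]))
              / sum(rate(llm_response_length_characters_count[7d]))
          ) > 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "LLM response length drift detected"

      # JSON parse failure rate exceeds 5%.
      # sum() drops the result label so both sides of the division match.
      - alert: LLMParseFailureRate
        expr: |
          sum(rate(llm_response_parse_count_total{result="failure"}[5m]))
          / sum(rate(llm_response_parse_count_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical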
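The response-length rule compares a recent mean against a longer baseline. Its arithmetic can be checked offline with a small sketch. The function names and the sample values (an 800-character baseline) are illustrative assumptions.

```python
def relative_change(recent_mean: float, baseline_mean: float) -> float:
    """Absolute change of the recent mean relative to the baseline mean."""
    if baseline_mean <= 0:
        raise ValueError("baseline_mean must be positive")
    return abs(recent_mean - baseline_mean) / baseline_mean


def is_length_drift(recent_mean: float, baseline_mean: float, threshold: float = 0.5) -> bool:
    """Mirror the alert condition: relative change above the threshold."""
    return relative_change(recent_mean, baseline_mean) > threshold


# Baseline of 800 characters, last hour averaging 1,300: |1300 - 800| / 800 = 0.625
print(relative_change(1300, 800), is_length_drift(1300, 800))
# Last hour averaging 1,000: |1000 - 800| / 800 = 0.25, below the 0.5 threshold
print(relative_change(1000, 800), is_length_drift(1000, 800))
```

Because the change is taken as an absolute value, responses that become much shorter trigger the alert as well as ones that become much longer.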

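LangChain OTel integration

LangChain is instrumented for OTel through its built-in callback system. The `LangchainInstrumentor` used here comes from a separate instrumentation package (`opentelemetry-instrumentation-langchain`) that must be installed alongside LangChain.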

langchain_otel.py

python

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
 
# Enable LangChain auto-instrumentation
LangchainInstrumentor().instrument()

# Ordinary LangChain code -- spans are created automatically
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert technical writer."),
    ("human", "Please explain {topic}."),
])

model = ChatOpenAI(model="gpt-4o", temperature=0.3)
chain = prompt | model | StrOutputParser()

# A trace is created automatically on execution.
# Span names below are illustrative; actual names depend on the instrumentation version.
# - chain.invoke span
#   - prompt.format span
#   - llm.call span (includes token usage)
#   - output_parser.parse span
result = chain.invoke({"topic": "Kubernetes networking"})

LlamaIndex OTel integration

llamaindex_otel.py

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from opentelemetry.instrumentation.llamaindex import LlamaIndexInstrumentor
 
# LlamaIndex auto-instrumentation
LlamaIndexInstrumentor().instrument()

# RAG pipeline -- each stage is traced automatically
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Trace produced by a query (span names are illustrative):
# - query span
#   - retrieval span (vector search)
#     - embedding span (embedding generation)
#   - synthesis span (LLM call)
query_engine = index.as_query_engine()
response = query_engine.query("What are the advantages of OpenTelemetry?")

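AI service dashboard

Build a dashboard dedicated to AI services in Grafana.

Core panels

Panel	PromQL	Purpose
Calls by model	`sum by (llm_model) (rate(llm_request_duration_seconds_count[5m]))`	Model usage trend
Estimated cost per hour	`sum(rate(llm_request_cost_USD_total[1h])) * 3600`	Cost monitoring
p99 latency	`histogram_quantile(0.99, sum by (le) (rate(llm_request_duration_seconds_bucket[5m])))`	Performance monitoring
Agent iterations (p95)	`histogram_quantile(0.95, sum by (le) (rate(agent_iterations_bucket[5m])))`	Agent efficiency (assumes an `agent.iterations` histogram, which this chapter does not define)
Tokens per call	`sum(rate(llm_token_usage_tokens_total[5m])) / sum(rate(llm_request_duration_seconds_count[5m]))`	Average tokens per call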

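Summary

This chapter covered the observability requirements unique to AI services and how to meet them with OpenTelemetry. It walked through tracking tokens, cost, and latency for LLM calls, tracing the multi-step behavior of AI agents, prompt-response logging strategies, and model drift detection, and it practiced the OTel integrations for LangChain and LlamaIndex.

The next chapter covers SLO-based alert design: the concepts of SLI, SLO, and error budgets, burn-rate alerting strategies, Prometheus alert rules, and Grafana alert channel configuration.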
