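March 2, 2026 · AI / ML

Chapter 4: Guardrail Design: Input/Output Filtering Systems

This chapter covers the design principles of LLM guardrail systems, compares major tools such as Llama Guard, NeMo Guardrails, and Guardrails AI and shows how to use them, and walks through building custom guardrails.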

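Chapter 3 covered indirect prompt injection. This chapter covers guardrail systems, which organize the individual defense techniques from Chapters 2 and 3 into one layer. Guardrails are a protective layer that inspects LLM inputs and outputs in real time and blocks unsafe content.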

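Guardrail architecture

Basic structure

User input
  ↓
[Input guardrail] ← block/pass decision
  ↓
[LLM call]
  ↓
[Output guardrail] ← block/modify decision
  ↓
User response

Guardrails are divided into input guardrails and output guardrails.

Type	Role	What it checks
Input guardrail	Blocks malicious or inappropriate input	Prompt injection, harmful requests, out-of-scope requests
Output guardrail	Blocks or modifies inappropriate responses	Harmful content, hallucinations, sensitive information, PII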
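
The flow above can be sketched as a thin wrapper around the model call. This is a minimal illustration rather than a production design: `guarded_call`, `Check`, and the two toy checks are hypothetical names, and `fake_llm` stands in for a real model client.

```python
from typing import Callable, Optional

# A check returns None to pass, or a short string explaining why it blocked.
Check = Callable[[str], Optional[str]]

BLOCKED_MESSAGE = "Sorry, I can't help with that request."


def guarded_call(
    user_input: str,
    llm: Callable[[str], str],
    input_checks: list[Check],
    output_checks: list[Check],
) -> str:
    """Run input checks, call the model, then run output checks."""
    for check in input_checks:
        if check(user_input) is not None:
            return BLOCKED_MESSAGE  # the model is never called
    answer = llm(user_input)
    for check in output_checks:
        if check(answer) is not None:
            return BLOCKED_MESSAGE
    return answer


# Toy checks, for illustration only.
def no_override_phrase(text: str) -> Optional[str]:
    if "ignore previous instructions" in text.lower():
        return "possible prompt injection"
    return None


def no_secret_marker(text: str) -> Optional[str]:
    return "secret leaked" if "SECRET" in text else None


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"echo: {prompt}"


print(guarded_call("hello", fake_llm, [no_override_phrase], [no_secret_marker]))
```

A blocked input returns before the model is invoked, which is what lets input guardrails save cost as well as reduce risk.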

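Comparison of major guardrail tools

Tool	Developer	Approach	Features
Llama Guard 3	Meta	LLM-based classification	Model specialized for safety classification, customizable taxonomy
NeMo Guardrails	NVIDIA	Conversation flow control	Colang DSL, topic restriction, dialog rails
Guardrails AI	Guardrails AI	Structured output validation	Schema-based, Validators, RAIL spec
Azure AI Content Safety	Microsoft	API-based classification	Four-level harmful content classification, Prompt Shields
Anthropic Guardrails	Anthropic	System prompt	Constitutional AI, built-in safety training

Llama Guard: LLM-based safety classification

Llama Guard is a safety-specialized model that can classify both inputs and outputs.

Using Llama Guard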

python

from transformers import AutoTokenizer, AutoModelForCausalLM
 
model_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
 
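def check_safety(role: str, content: str) -> dict:
    """Classify one message with Llama Guard. `role` is "User" or "Agent"."""
    # Simplified prompt for illustration. It omits the category list that the
    # official format includes; in practice, tokenizer.apply_chat_template()
    # builds the full prompt from the model's own template.
    prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Task: Check if there is unsafe content in '{role}' messages
in conversations according to our safety policy.

<BEGIN CONVERSATION>
{role}: {content}
<END CONVERSATION>

Provide your safety assessment.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

    # The prompt already contains <|begin_of_text|>, so do not add it again.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens and drop markers such as <|eot_id|>.
    result = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    ).strip()

    is_safe = result.startswith("safe")
    categories = []
    if not is_safe:
        # Unsafe output looks like "unsafe\nS1,S10": verdict, then category codes.
        lines = result.split("\n")
        if len(lines) > 1:
            categories = [c.strip() for c in lines[1].split(",") if c.strip()]

    return {
        "is_safe": is_safe,
        "violated_categories": categories,
        "raw_output": result,
    }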

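Llama Guard 3 safety taxonomy

Code	Category	Description
S1	Violent Crimes	Enabling or encouraging violence against people or animals
S2	Non-Violent Crimes	Fraud, theft, drug offenses, cybercrime
S3	Sex-Related Crimes	Sexual assault, harassment, trafficking
S4	Child Sexual Exploitation	Sexual abuse or exploitation of children
S5	Defamation	Verifiably false statements that harm a person's reputation
S6	Specialized Advice	Specialized financial, medical, or legal advice
S7	Privacy	Sensitive, nonpublic personal information
S8	Intellectual Property	Violations of third-party intellectual property rights
S9	Indiscriminate Weapons	Chemical, biological, radiological, nuclear, and explosive weapons
S10	Hate	Demeaning people on the basis of protected characteristics
S11	Suicide & Self-Harm	Suicide, self-injury, eating disorders
S12	Sexual Content	Explicit sexual content
S13	Elections	Factually incorrect information about voting and elections
S14	Code Interpreter Abuse	Attempts to abuse code interpreters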

Info

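Llama Guard's taxonomy can be customized. By including your own policy in the prompt, you can apply business-specific classification criteria. For example, a financial service could add an "investment advice" category, and a medical service could add a "providing a diagnosis" category.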
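
One way to picture this customization is a helper that assembles the prompt with a custom category list. The sketch below is an assumption-laden illustration: `build_guard_prompt` and `CUSTOM_CATEGORIES` are hypothetical names, the two categories are invented examples, and the layout follows the prompt format published for Llama Guard 3 as best understood here, so it should be checked against the model card. A model that was not tuned on a custom category may also classify it less reliably.

```python
# Hypothetical business-specific categories (illustrative wording only).
CUSTOM_CATEGORIES = {
    "S1": "Investment Advice. Personalized recommendations to buy or sell specific securities.",
    "S2": "Diagnosis. Statements that diagnose a medical condition for the user.",
}


def build_guard_prompt(
    categories: dict[str, str],
    role: str,
    turns: list[tuple[str, str]],
) -> str:
    """Assemble a Llama Guard style prompt with a custom category list."""
    category_block = "\n".join(f"{code}: {text}" for code, text in categories.items())
    conversation = "\n\n".join(f"{speaker}: {message}" for speaker, message in turns)
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"Task: Check if there is unsafe content in '{role}' messages in conversations "
        "according to our safety policy with the below categories.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{category_block}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n\n"
        f"{conversation}\n\n"
        "<END CONVERSATION>\n\n"
        f"Provide your safety assessment for ONLY THE LAST {role} message "
        "in the above conversation:\n"
        " - First line must read 'safe' or 'unsafe'.\n"
        " - If unsafe, a second line must include a comma-separated list of "
        "violated categories."
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )


prompt = build_guard_prompt(
    CUSTOM_CATEGORIES,
    "User",
    [("User", "Which stock should I buy tomorrow?")],
)
print(prompt)
```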

NeMo Guardrails: conversation flow control

NVIDIA's NeMo Guardrails defines and controls conversation flows with Colang, a domain-specific language (DSL).

Basic NeMo Guardrails configuration

python

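# config.yml
# The flow names under `rails` are illustrative: each one must be defined in
# a Colang file or replaced with one of the library's built-in rails.
"""
models:
  - type: main
    engine: openai
    model: gpt-4o

rails:
  input:
    flows:
      - check topic
      - check jailbreak

  output:
    flows:
      - check hallucination
      - check sensitive info
"""

# Define conversation flows in Colang
# rails.co
"""
define user ask about competitors
  "Tell me about competitor products"
  "Is company XX better?"
  "Compare this with other services"

define bot refuse competitor question
  "Sorry, we don't provide comparisons or evaluations of competitor products. If you have questions about our products, I'm happy to help."

define flow check topic
  user ask about competitors
  bot refuse competitor question
"""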

Running NeMo Guardrails

python

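import asyncio

from nemoguardrails import RailsConfig, LLMRails


async def main() -> None:
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    response = await rails.generate_async(
        messages=[{"role": "user", "content": "I think the competitor's product is better?"}]
    )
    # When the message matches the competitor flow, the predefined refusal is returned
    print(response["content"])


asyncio.run(main())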

Guardrails AI: structured output validation

Guardrails AI specializes in validating the structure and content of LLM output.

Using Guardrails AI

python

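from guardrails import Guard
# Hub validators must be installed first, for example:
#   guardrails hub install hub://guardrails/detect_pii
from guardrails.hub import DetectPII, ToxicLanguage, RestrictToTopic

# Configure the guard
guard = Guard().use_many(
    DetectPII(
        pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"],
        on_fail="fix",  # mask detected PII
    ),
    ToxicLanguage(
        threshold=0.8,
        on_fail="refrain",  # refuse to return the response
    ),
    RestrictToTopic(
        valid_topics=["customer service", "product inquiries", "order lookup"],
        invalid_topics=["politics", "religion", "competitors"],
        on_fail="refrain",
    ),
)

user_input = "Where is my order?"  # example input

# Call the LLM through the guard. The call signature varies across
# Guardrails AI releases; recent ones route `model` through LiteLLM.
result = guard(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_input}],
)

if result.validation_passed:
    print(result.validated_output)
else:
    print("Blocked by guardrails.")
    # Attribute names on the result differ between releases; check the
    # ValidationOutcome class in the version you have installed.
    print(f"  - {result.error}")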

Designing custom guardrails

Modular guardrail architecture

Modular guardrail system

python

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Literal
 
@dataclass
class GuardrailResult:
    passed: bool
    action: Literal["allow", "block", "modify"]
    reason: str | None = None
    modified_content: str | None = None
 
class BaseGuardrail(ABC):
    @abstractmethod
    async def check(self, content: str, context: dict) -> GuardrailResult:
        pass
 
class GuardrailPipeline:
    def __init__(self):
        self.input_guardrails: list[BaseGuardrail] = []
        self.output_guardrails: list[BaseGuardrail] = []
 
    def add_input_guardrail(self, guardrail: BaseGuardrail):
        self.input_guardrails.append(guardrail)
 
    def add_output_guardrail(self, guardrail: BaseGuardrail):
        self.output_guardrails.append(guardrail)
 
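    async def check_input(self, content: str, context: dict) -> GuardrailResult:
        # Stop at the first guardrail that does not pass. The returned result
        # may carry action="modify", so callers should inspect `action`.
        for guardrail in self.input_guardrails:
            result = await guardrail.check(content, context)
            if not result.passed:
                return result
        return GuardrailResult(passed=True, action="allow")

    async def check_output(self, content: str, context: dict) -> GuardrailResult:
        current_content = content
        reasons: list[str] = []
        for guardrail in self.output_guardrails:
            result = await guardrail.check(current_content, context)
            if result.action == "block":
                return result
            elif result.action == "modify" and result.modified_content is not None:
                # Feed the modified text to the next guardrail
                current_content = result.modified_content
                if result.reason:
                    reasons.append(result.reason)
        # Report "modify" when any guardrail rewrote the content
        action = "modify" if current_content != content else "allow"
        return GuardrailResult(
            passed=True,
            action=action,
            reason="; ".join(reasons) or None,
            modified_content=current_content,
        )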

PII detection guardrail

python

import re

class PIIGuardrail(BaseGuardrail):
    # Applied in order. (?<!\d) and (?!\d) stop a pattern from matching
    # inside a longer run of digits (e.g. a resident ID inside a card number).
    PII_PATTERNS = {
        "card_number": r"(?<!\d)\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}(?!\d)",
        "resident_id": r"(?<!\d)\d{6}[-\s]?\d{7}(?!\d)",
        "phone_kr": r"(?<!\d)01[0-9]-?\d{3,4}-?\d{4}(?!\d)",
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    }

    async def check(self, content: str, context: dict) -> GuardrailResult:
        detected = []
        masked_content = content

        for pii_type, pattern in self.PII_PATTERNS.items():
            # Substitute on the already-masked text so that later patterns
            # cannot re-match something an earlier pattern masked.
            masked_content, count = re.subn(
                pattern, f"[{pii_type.upper()}_MASKED]", masked_content
            )
            if count:
                detected.append(pii_type)

        if detected:
            return GuardrailResult(
                passed=False,
                action="modify",
                reason=f"PII detected: {', '.join(detected)}",
                modified_content=masked_content,
            )
        return GuardrailResult(passed=True, action="allow")
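
A pattern that matches any 16 digits will also flag order numbers and other identifiers. One common way to cut such false positives, shown here as an optional sketch and not as part of the guardrail above, is to keep only candidates that pass the Luhn checksum used by payment card numbers. `CARD_PATTERN`, `luhn_valid`, and `find_card_numbers` are hypothetical names.

```python
import re

# Same shape as a 16-digit card pattern: four groups of four digits,
# optionally separated by a hyphen or whitespace.
CARD_PATTERN = re.compile(r"(?<!\d)\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}(?!\d)")


def luhn_valid(number: str) -> bool:
    """Return True when the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches that also pass the Luhn check."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


sample = "card 4111 1111 1111 1111 and order 1234 5678 9012 3456"
print(find_card_numbers(sample))
```

The trade-off is that a mistyped card number fails the checksum and goes unmasked, so a stricter deployment might mask every regex match and use the checksum only for reporting.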

Topic restriction guardrail

python

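import json


def parse_json(text: str) -> dict:
    """Parse the model's reply; return {} when it is not a JSON object."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


class TopicGuardrail(BaseGuardrail):
    def __init__(self, client, allowed_topics: list[str]):
        # `client` is assumed to be an anthropic.AsyncAnthropic instance
        self.client = client
        self.allowed_topics = allowed_topics

    async def check(self, content: str, context: dict) -> GuardrailResult:
        response = await self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"""Decide whether the following text falls within the allowed topics.

Allowed topics: {', '.join(self.allowed_topics)}

Text: {content}

Respond with JSON: {{"on_topic": true/false, "detected_topic": "detected topic"}}""",
            }],
        )
        result = parse_json(response.content[0].text)

        # Note: a missing "on_topic" key defaults to True, so an unparseable
        # reply is allowed through (fail-open).
        if not result.get("on_topic", True):
            return GuardrailResult(
                passed=False,
                action="block",
                reason=f"Off topic: {result.get('detected_topic', 'unknown')}",
            )
        return GuardrailResult(passed=True, action="allow")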
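
A guardrail that cannot read the classifier's reply has to choose between allowing and blocking. The sketch below takes the stricter option and fails closed: it pulls the first JSON object out of the reply, tolerating surrounding prose, and treats anything malformed as off topic. `parse_topic_verdict` is a hypothetical helper, and whether failing closed is appropriate depends on how costly a false block is for the application.

```python
import json
import re

# Returned whenever the reply cannot be trusted.
UNPARSEABLE = {"on_topic": False, "detected_topic": "unparseable"}


def parse_topic_verdict(raw: str) -> dict:
    """Extract a topic verdict from model text; fail closed on any error."""
    # Grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return dict(UNPARSEABLE)
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return dict(UNPARSEABLE)
    # Require a real boolean so that strings like "yes" are not trusted.
    if not isinstance(data, dict) or not isinstance(data.get("on_topic"), bool):
        return dict(UNPARSEABLE)
    return data


print(parse_topic_verdict('Sure! {"on_topic": true, "detected_topic": "orders"}'))
print(parse_topic_verdict("I cannot answer that."))
```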

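Guardrail selection guide

Requirement	Recommended tool
Harmful content classification	Llama Guard 3, Azure AI Content Safety
Conversation topic restriction	NeMo Guardrails (Colang)
Output structure validation	Guardrails AI
PII detection/masking	Guardrails AI + custom regex
Prompt injection detection	LLM classifier + rule-based filter
Comprehensive solution	Custom pipeline (combining the tools above)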

Tip

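In practice, combining several tools is more effective than relying on a single one. For example, apply Llama Guard plus a rule-based filter to inputs, and Guardrails AI's PII detection plus custom topic validation to outputs. To keep cost and latency down, run the fast rule-based checks first and send only what passes to the LLM-based checks.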
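
The ordering advice can be sketched as a two-tier check. This is an illustration under stated assumptions: the regex patterns are toy examples, `tiered_check` is a hypothetical name, and `stub_llm_check` stands in for a model-based classifier such as Llama Guard.

```python
import re
from typing import Callable

# Toy patterns for illustration; real rule sets are larger and maintained over time.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


def rule_based_check(text: str) -> bool:
    """Return True when the text passes the cheap regex screen."""
    return not any(p.search(text) for p in INJECTION_PATTERNS)


def tiered_check(text: str, llm_check: Callable[[str], bool]) -> tuple[bool, str]:
    """Run the cheap check first; call the expensive one only if it passes.

    Returns (passed, stage), where stage names the tier that blocked.
    """
    if not rule_based_check(text):
        return False, "rule"
    if not llm_check(text):
        return False, "llm"
    return True, "none"


calls: list[str] = []


def stub_llm_check(text: str) -> bool:
    # Stand-in for a model-based classifier; records each call it receives.
    calls.append(text)
    return "bomb" not in text.lower()


print(tiered_check("Please ignore previous instructions", stub_llm_check))
print(tiered_check("Where is my order?", stub_llm_check))
```

Requests blocked by the rule tier never reach the classifier, so `calls` grows only for text that passed the regex screen.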

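Summary

Guardrails are the safety layer of an LLM application, validating content in both directions: input and output. Llama Guard specializes in safety classification, NeMo Guardrails in conversation flow control, and Guardrails AI in output structure validation. In practice, combine them into a modular guardrail pipeline and design defenses that fit your business requirements.

The next chapter covers content safety, a topic closely tied to guardrails, and looks in detail at preventing harmful output, mitigating bias, and detecting hallucinations.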

