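March 23, 2026 · AI / ML

Chapter 5: LLM-Based Entity Extraction and Relationship Generation

Covers how to build an end-to-end pipeline that uses an LLM to extract entities and relationships from unstructured text, then parses the JSON output, resolves entities, and loads the result into Neo4j.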


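Learning objectives

Understand the full process of turning unstructured text into a structured knowledge graph
Learn how to extract entities and relationships through LLM prompt design
Understand strategies for parsing and validating JSON output
Understand why Entity Resolution is needed and which techniques apply
Compare the Neo4j LLM Graph Builder with building a custom pipeline

From unstructured to structured

The value of a knowledge graph depends on the quality of its data. Most of the world's knowledge, however, lives in unstructured text such as documents, papers, and web pages. With LLMs, structuring that text automatically has become practical.

The pipeline runs from text chunking through LLM extraction, JSON parsing and validation, and entity resolution to loading into Neo4j. Each stage is covered in detail below.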

Text chunking

Long documents are split into appropriately sized pieces, taking into account both the LLM's context window and extraction quality.

text_chunking.py

python

from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_document(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        # Tried in order: paragraph, line, sentence, word
        separators=["\n\n", "\n", ". ", " "],
        length_function=len,  # sizes are measured in characters, not tokens
    )
    return splitter.split_text(text)

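Keep the following in mind when chunking.

Chunk size: chunks that are too large lower extraction quality, and chunks that are too small lose context. Sizes here are counted in characters, and roughly 1,500 to 3,000 characters is a reasonable starting range
Overlap: keeps entities and relationships from being cut off at chunk boundaries
Separator priority: natural boundaries are tried in the order paragraph > sentence > word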
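To make the role of the overlap concrete, here is a minimal dependency-free sketch of fixed-size windows that share their boundary characters. The function name `sliding_chunks` is hypothetical, and unlike `RecursiveCharacterTextSplitter` it ignores separators entirely; it only illustrates how consecutive chunks repeat the text around each boundary.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Neo4j is a graph database. " * 10
chunks = sliding_chunks(sample, chunk_size=100, overlap=20)

# The tail of each chunk is repeated at the head of the next one,
# so a name that straddles a boundary appears whole in at least one chunk.
for left, right in zip(chunks, chunks[1:]):
    assert left[-20:] == right[:20]
```

A larger overlap lowers the chance of splitting an entity mention, at the cost of sending more repeated text to the LLM.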

LLM prompt design

Entity extraction prompt

The prompt extracts entities and relationships in a single pass. The key is to define the output schema explicitly.

extraction_prompt.py

python

EXTRACTION_SYSTEM_PROMPT = """You are an expert at extracting entities and relationships from text.

Analyze the given text and produce JSON in the following format.

## Entity types
- Person: a person (name, role, affiliation)
- Technology: a technology, framework, library, or tool (name, category, version)
- Concept: an abstract concept or methodology (name, description)
- Organization: an organization, company, or institution (name, kind)

## Relationship types
- USES: a Person/Organization uses a Technology
- DEVELOPED_BY: a Technology is developed by an Organization
- DEPENDS_ON: a Technology depends on another Technology
- IMPLEMENTS: a Technology implements a Concept
- RELATED_TO: a Concept is related to another Concept

## Output format
Always follow the JSON schema below:

```json
{
  "entities": [
    {
      "id": "unique identifier (lowercase, hyphen-separated)",
      "type": "entity type",
      "name": "display name",
      "properties": {}
    }
  ],
  "relationships": [
    {
      "source": "source entity id",
      "target": "target entity id",
      "type": "relationship type",
      "properties": {}
    }
  ]
}
```

## Rules
- Extract only entities and relationships explicitly mentioned in the text
- Do not add inferences or outside knowledge
- Use a single id for the same entity
- Omit ambiguous relationships
"""

Extraction function implementation

entity_extraction.py

python

import json

from anthropic import Anthropic

from extraction_prompt import EXTRACTION_SYSTEM_PROMPT

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_entities(text: str) -> dict:
    """Extract entities and relationships from text."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=EXTRACTION_SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Extract the entities and relationships from the following text:\n\n{text}"
            }
        ]
    )

    content = response.content[0].text
    # Pull the JSON out of a Markdown code block if the model wrapped it in one
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1].split("```")[0]

    # Raises json.JSONDecodeError when the reply is not valid JSON
    return json.loads(content.strip())
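Splitting on code fences fails when the model adds prose around unfenced JSON. As a hedged alternative, the sketch below scans for the first position where a JSON object decodes cleanly, using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_json_object` is hypothetical and not part of the original pipeline.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object found in `raw`, ignoring any surrounding text."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and tolerates trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)  # try the next opening brace
    raise ValueError("no JSON object found in model output")


reply = 'Sure, here is the result: {"entities": [], "relationships": []} Let me know if you need more.'
data = extract_json_object(reply)
```

This does not repair truncated or malformed JSON; a reply cut off by `max_tokens` still raises, which is usually the right signal to retry.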

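Using structured output

Recent LLMs offer Structured Output features that can enforce a JSON schema. The example below takes a simpler route that does not depend on such a feature: it parses the reply as before and validates it against Pydantic models, so a reply that does not match the schema raises an error instead of propagating downstream.

structured_extraction.py

python

from anthropic import Anthropic
from pydantic import BaseModel

from extraction_prompt import EXTRACTION_SYSTEM_PROMPT

client = Anthropic()


class Entity(BaseModel):
    id: str
    type: str
    name: str
    properties: dict = {}


class Relationship(BaseModel):
    source: str
    target: str
    type: str
    properties: dict = {}


class ExtractionResult(BaseModel):
    entities: list[Entity]
    relationships: list[Relationship]


def extract_with_schema(text: str) -> ExtractionResult:
    """Extract entities and relationships, then validate them against the Pydantic schema."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=EXTRACTION_SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": f"Text:\n\n{text}"}
        ]
    )
    content = response.content[0].text
    # Strip a Markdown code fence if the model wrapped its JSON in one
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1].split("```")[0]

    # Raises pydantic.ValidationError when the reply does not match the schema
    return ExtractionResult.model_validate_json(content.strip())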

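Tip

Extraction quality depends heavily on the prompt. Spelling out domain-specific entity and relationship types tends to improve accuracy considerably. An instruction such as "extract entities of type Person, Technology, and Concept" works much better than a generic "extract all entities".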
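One way to keep domain-specific types explicit is to define them once and render the prompt section from that definition. The sketch below is an illustration only: `ENTITY_TYPES`, `render_type_section`, and `ALLOWED_ENTITY_TYPES` are hypothetical names, and the descriptions mirror the extraction prompt shown earlier.

```python
ENTITY_TYPES = {
    "Person": "a person (name, role, affiliation)",
    "Technology": "a technology, framework, library, or tool (name, category, version)",
    "Concept": "an abstract concept or methodology (name, description)",
    "Organization": "an organization, company, or institution (name, kind)",
}


def render_type_section(title: str, types: dict[str, str]) -> str:
    """Render one '## title' prompt section with a bullet per type."""
    lines = [f"## {title}"]
    lines.extend(f"- {name}: {description}" for name, description in types.items())
    return "\n".join(lines)


# The same dictionary drives both the prompt text and the validator's allow-list,
# so adding a type in one place cannot leave the other out of date.
ALLOWED_ENTITY_TYPES = set(ENTITY_TYPES)
prompt_section = render_type_section("Entity types", ENTITY_TYPES)
print(prompt_section)
```

Adapting the pipeline to a new domain then amounts to editing the dictionary rather than the prompt string and the validation code separately.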

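JSON output parsing and validation

LLM output is not always well-formed, so parsing and validation need to be robust.

validation.py

python

import re

from pydantic import BaseModel, ValidationError, field_validator

ALLOWED_ENTITY_TYPES = {"Person", "Technology", "Concept", "Organization"}
ALLOWED_RELATIONSHIP_TYPES = {"USES", "DEVELOPED_BY", "DEPENDS_ON", "IMPLEMENTS", "RELATED_TO"}


def normalize_id(value: str) -> str:
    """Lowercase the id and collapse every run of non-alphanumeric characters into a hyphen."""
    cleaned = re.sub(r"[\W_]+", "-", value.strip().lower()).strip("-")
    if not cleaned:
        raise ValueError(f"id is empty after normalization: {value!r}")
    return cleaned


class ValidatedEntity(BaseModel):
    id: str
    type: str
    name: str
    properties: dict = {}

    @field_validator("type")
    @classmethod
    def validate_type(cls, v: str) -> str:
        if v not in ALLOWED_ENTITY_TYPES:
            raise ValueError(f"entity type not allowed: {v}")
        return v

    @field_validator("id")
    @classmethod
    def validate_id(cls, v: str) -> str:
        return normalize_id(v)


class ValidatedRelationship(BaseModel):
    source: str
    target: str
    type: str
    properties: dict = {}

    @field_validator("type")
    @classmethod
    def validate_type(cls, v: str) -> str:
        if v not in ALLOWED_RELATIONSHIP_TYPES:
            raise ValueError(f"relationship type not allowed: {v}")
        return v

    @field_validator("source", "target")
    @classmethod
    def validate_endpoint(cls, v: str) -> str:
        # Normalize the same way as entity ids so the membership check below can match
        return normalize_id(v)


def validate_extraction(data: dict) -> tuple[list[ValidatedEntity], list[ValidatedRelationship]]:
    """Validate the extraction result and return only the valid items."""
    valid_entities = []
    valid_relationships = []
    entity_ids = set()

    # Validate entities
    for e in data.get("entities", []):
        try:
            entity = ValidatedEntity(**e)
            valid_entities.append(entity)
            entity_ids.add(entity.id)
        except (ValidationError, TypeError) as err:
            print(f"Entity validation failed: {e} - {err}")

    # Validate relationships (both endpoints must be valid entities)
    for r in data.get("relationships", []):
        try:
            rel = ValidatedRelationship(**r)
            if rel.source in entity_ids and rel.target in entity_ids:
                valid_relationships.append(rel)
            else:
                print(f"Relationship refers to a missing entity: {r}")
        except (ValidationError, TypeError) as err:
            print(f"Relationship validation failed: {r} - {err}")

    return valid_entities, valid_relationships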
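Validation tells you that a reply is unusable, but not what to do next. A common follow-up, sketched here under assumptions, is to retry with the error fed back to the model. `extract_with_retry` and its `call_model` parameter are hypothetical: `call_model` stands in for any function that sends a prompt to an LLM and returns the raw reply text.

```python
import json
from typing import Callable


def extract_with_retry(call_model: Callable[[str], str], text: str, max_attempts: int = 3) -> dict:
    """Call the model until its reply parses as a JSON object with the expected keys."""
    last_error = None
    prompt = text
    for _attempt in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict) or not {"entities", "relationships"} <= data.keys():
                raise ValueError("missing 'entities' or 'relationships' key")
            return data
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            last_error = err
            # Feed the error back so the next attempt has a chance to correct itself.
            prompt = f"{text}\n\nThe previous reply was invalid ({err}). Reply with valid JSON only."
    raise RuntimeError(f"extraction failed after {max_attempts} attempts") from last_error
```

Keeping `max_attempts` small bounds the cost of a chunk that the model cannot handle; the caller can then log the failure and skip that chunk.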

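Entity Resolution

Entities extracted from different chunks can refer to the same thing under different names. Entity Resolution is the process of identifying and merging these duplicates.

Why entity resolution is needed

Consider how a single technology can be referred to by many names.

"Neo4j", "neo4j", "Neo4J", "네오포제이" (Korean transliteration)
"Knowledge Graph", "지식 그래프" (Korean), "KG"
"GraphRAG", "Graph RAG", "그래프 RAG"

Stored as separate entities, these break the graph's connectivity and make query results inaccurate.

Resolution strategy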

entity_resolution.py

python

from difflib import SequenceMatcher


class EntityResolver:
    """Resolves entity names to canonical names."""

    def __init__(self, similarity_threshold: float = 0.85):
        self.threshold = similarity_threshold
        self.canonical_map: dict[str, str] = {}  # normalized alias -> canonical name
        self.entities: dict[str, dict] = {}       # canonical name -> entity data

    def normalize(self, name: str) -> str:
        """Normalize a name: trim, lowercase, and treat hyphens as spaces."""
        return name.strip().lower().replace("-", " ")

    def similarity(self, a: str, b: str) -> float:
        """Compute the similarity of two strings (0.0 to 1.0)."""
        return SequenceMatcher(None, self.normalize(a), self.normalize(b)).ratio()

    def resolve(self, entity: dict) -> str:
        """Match an entity against known entries and return its canonical name."""
        name = entity["name"]
        normalized = self.normalize(name)

        # 1. Check for an alias that is already mapped to a canonical name
        if normalized in self.canonical_map:
            return self.canonical_map[normalized]

        # 2. Check for a sufficiently similar existing entity (first match wins)
        for canonical_name in self.entities:
            if self.similarity(name, canonical_name) >= self.threshold:
                self.canonical_map[normalized] = canonical_name
                return canonical_name

        # 3. Register as a new entity
        self.canonical_map[normalized] = name
        self.entities[name] = entity
        return name

    def resolve_batch(self, entities: list[dict]) -> list[dict]:
        """Resolve a list of entities in one pass."""
        resolved = []
        for entity in entities:
            canonical = self.resolve(entity)
            resolved_entity = {**entity, "name": canonical, "original_name": entity["name"]}
            resolved.append(resolved_entity)
        return resolved
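To see where a 0.85 threshold draws the line, the following sketch applies the same normalization and `SequenceMatcher` ratio to the alias examples from this section. `name_similarity` is a hypothetical standalone helper that mirrors the resolver's `similarity` method.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity after trimming, lowercasing, and replacing hyphens with spaces."""
    def norm(s: str) -> str:
        return s.strip().lower().replace("-", " ")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


pairs = [
    ("Neo4j", "Neo4J"),            # case variant
    ("GraphRAG", "Graph RAG"),     # spacing variant
    ("Knowledge Graph", "KG"),     # abbreviation
    ("Neo4j", "네오포제이"),         # transliteration
]
for a, b in pairs:
    score = name_similarity(a, b)
    verdict = "merge" if score >= 0.85 else "keep separate"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
```

Case and spacing variants clear the threshold, while the abbreviation and the transliteration fall well below it. Lowering the threshold enough to catch them would also merge unrelated names, which is why those cases need a different technique.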

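LLM-based entity resolution

String similarity alone struggles with abbreviations such as "KG" and "Knowledge Graph". An LLM can be used to resolve those cases.

llm_entity_resolution.py

python

# Literal JSON braces are doubled so that str.format(entities_list=...) fills only the placeholder.
RESOLUTION_PROMPT = """Group the items in the following entity list that refer to the same thing.

Entity list:
{entities_list}

For each group, choose the most appropriate canonical name
and respond in the following JSON format:

```json
{{
  "groups": [
    {{
      "canonical": "canonical name",
      "aliases": ["alias 1", "alias 2"]
    }}
  ]
}}
```
"""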


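<Callout type="warning">
Entity resolution is the biggest bottleneck for knowledge graph quality. Automated resolution typically reaches around 80 to 90% accuracy, so domains that demand high quality still need human review. Log the resolution results and set up a process for reviewing them periodically.
</Callout>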
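Once the LLM returns its groups, they still have to be applied to the extracted names. The sketch below assumes the reply has already been parsed into the `groups` list described by the resolution prompt; `build_alias_map` and `canonicalize` are hypothetical helper names.

```python
def build_alias_map(groups: list[dict]) -> dict[str, str]:
    """Map every alias (and the canonical name itself) to its canonical name."""
    alias_map = {}
    for group in groups:
        canonical = group["canonical"]
        for alias in [canonical, *group.get("aliases", [])]:
            # Keys are trimmed and lowercased so lookups ignore case and stray spaces.
            alias_map[alias.strip().lower()] = canonical
    return alias_map


def canonicalize(name: str, alias_map: dict[str, str]) -> str:
    """Return the canonical name, or the input unchanged when it is unknown."""
    return alias_map.get(name.strip().lower(), name)


# Example of a parsed LLM reply in the format requested by the prompt.
groups = [
    {"canonical": "Knowledge Graph", "aliases": ["KG", "지식 그래프"]},
    {"canonical": "Neo4j", "aliases": ["neo4j", "네오포제이"]},
]
alias_map = build_alias_map(groups)
print(canonicalize("kg", alias_map))
print(canonicalize("LangChain", alias_map))
```

Persisting the alias map alongside the graph gives reviewers a concrete artifact to audit, which fits the logging and periodic review recommended above.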

Loading into Neo4j

The extracted and validated entities and relationships are loaded into Neo4j.

neo4j_loader.py

python

from neo4j import GraphDatabase


class KnowledgeGraphLoader:
    """Loads extracted entities and relationships into Neo4j (requires the APOC plugin)."""

    def __init__(self, uri: str, auth: tuple[str, str]):
        self.driver = GraphDatabase.driver(uri, auth=auth)

    def load_entities(self, entities: list[dict]) -> int:
        """Merge entities as nodes keyed by name and return the number of rows processed."""
        # apoc.merge.node accepts the label as data, which plain MERGE cannot do.
        # entity.properties is applied only when the node is first created.
        query = """
        UNWIND $entities AS entity
        CALL apoc.merge.node(
            [entity.type],
            {name: entity.name},
            entity.properties,
            {}
        ) YIELD node
        RETURN count(node) AS created
        """
        # Accept both plain dicts and Pydantic models
        payload = [e.model_dump() if hasattr(e, "model_dump") else e for e in entities]
        records, _, _ = self.driver.execute_query(query, entities=payload)
        return records[0]["created"]

    def load_relationships(self, relationships: list[dict]) -> int:
        """Merge relationships between existing nodes and return the number of rows processed."""
        # Each item must carry source_name and target_name (entity names, not extraction ids).
        # MATCH without a label scans every node; add a label or index for large graphs.
        query = """
        UNWIND $rels AS rel
        MATCH (source {name: rel.source_name})
        MATCH (target {name: rel.target_name})
        CALL apoc.merge.relationship(
            source,
            rel.type,
            {},
            rel.properties,
            target
        ) YIELD rel AS created
        RETURN count(created) AS count
        """
        records, _, _ = self.driver.execute_query(query, rels=relationships)
        return records[0]["count"]

    def close(self):
        self.driver.close()
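Neo4j property values must be primitives or homogeneous lists of primitives; nested maps cannot be stored as properties. Since the LLM is free to return arbitrary JSON under `properties`, a small sanitizing step before loading avoids write errors. The sketch below is one possible approach, with the hypothetical helper `to_neo4j_properties` serializing anything unsupported to a JSON string.

```python
import json

PRIMITIVES = (str, int, float, bool)


def to_neo4j_properties(props: dict) -> dict:
    """Return a copy of `props` containing only values Neo4j can store as properties."""
    clean = {}
    for key, value in props.items():
        if value is None:
            continue  # a null property is equivalent to an absent one
        if isinstance(value, PRIMITIVES):
            clean[key] = value
        elif isinstance(value, list) and all(
            isinstance(v, PRIMITIVES) and type(v) is type(value[0]) for v in value
        ):
            clean[key] = value  # homogeneous list of primitives
        else:
            # Nested maps and mixed lists are kept as a JSON string instead of being dropped.
            clean[key] = json.dumps(value, ensure_ascii=False)
    return clean


props = {
    "version": "5.x",
    "tags": ["graph", "database"],
    "meta": {"license": "GPL"},
    "homepage": None,
}
print(to_neo4j_properties(props))
```

Serializing to a string keeps the information but makes it opaque to Cypher filters; promoting important nested fields to top-level properties is the better choice when they need to be queried.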

Batch loading optimization

When loading large volumes of data, batching is important: each loader call runs as a single query, so fixed-size batches keep every transaction small.

batch_loader.py

python

def load_in_batches(loader: KnowledgeGraphLoader,
                    entities: list[dict],
                    relationships: list[dict],
                    batch_size: int = 500) -> None:
    """Load a large dataset in fixed-size batches."""
    # Load entities in batches
    for i in range(0, len(entities), batch_size):
        batch = entities[i:i + batch_size]
        count = loader.load_entities(batch)
        print(f"Entity batch {i // batch_size + 1}: loaded {count}")

    # Load relationships in batches (only after all entities are loaded)
    for i in range(0, len(relationships), batch_size):
        batch = relationships[i:i + batch_size]
        count = loader.load_relationships(batch)
        print(f"Relationship batch {i // batch_size + 1}: loaded {count}")

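Neo4j LLM Graph Builder

Neo4j LLM Graph Builder is an open-source tool from Neo4j that builds a knowledge graph from unstructured text through a UI.

Key features

Multiple sources: PDFs, web pages, YouTube videos, S3, and more
LLM choice: supports a range of LLMs including OpenAI, Anthropic, and Google
Schema definition: the entity and relationship types to extract can be defined up front
Visualization: the resulting graph can be explored in the browser
Chat interface: the resulting graph can be queried in natural language

Custom pipeline vs LLM Graph Builder

Criterion	Custom pipeline	LLM Graph Builder
Flexibility	High	Medium
Development cost	High	Low
Production suitability	High	Best suited to prototyping
Schema control	Full control	Can be predefined
Pipeline customization	Unlimited	Limited

Info

An effective strategy is to check results quickly with LLM Graph Builder during prototyping and then move to a custom pipeline for production. The hands-on project in Chapter 10 uses both approaches.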

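Putting the pipeline together

The stages covered so far are combined into a single pipeline.

extraction_pipeline.py

python

from text_chunking import chunk_document
from entity_extraction import extract_entities
from validation import validate_extraction
from entity_resolution import EntityResolver
from neo4j_loader import KnowledgeGraphLoader


class KGExtractionPipeline:
    """Pipeline that builds a knowledge graph from unstructured text."""

    def __init__(self, neo4j_uri: str, neo4j_auth: tuple[str, str]):
        self.resolver = EntityResolver(similarity_threshold=0.85)
        self.loader = KnowledgeGraphLoader(neo4j_uri, neo4j_auth)

    def process_document(self, text: str) -> dict:
        """Process a single document and load it into the knowledge graph."""
        # 1. Chunking
        chunks = chunk_document(text, chunk_size=2000, overlap=200)
        print(f"Chunks: {len(chunks)}")

        all_entities = []
        all_relationships = []

        # 2. Extract entities and relationships from each chunk
        for i, chunk in enumerate(chunks):
            try:
                result = extract_entities(chunk)
                entities, relationships = validate_extraction(result)
                all_entities.extend(entities)
                all_relationships.extend(relationships)
                print(f"Chunk {i + 1}: {len(entities)} entities, {len(relationships)} relationships")
            except Exception as err:
                # A failed chunk is logged and skipped so the rest of the document still loads
                print(f"Chunk {i + 1} extraction failed: {err}")

        # 3. Entity resolution
        resolved_entities = self.resolver.resolve_batch(
            [e.model_dump() for e in all_entities]
        )
        print(f"Unique entities after resolution: {len(set(e['name'] for e in resolved_entities))}")

        # Relationships reference entity ids, but the loader matches nodes by name,
        # so map each id to the canonical name chosen during resolution.
        id_to_name = {e["id"]: e["name"] for e in resolved_entities}
        relationships = [
            {**r.model_dump(),
             "source_name": id_to_name[r.source],
             "target_name": id_to_name[r.target]}
            for r in all_relationships
        ]

        # 4. Load into Neo4j
        entity_count = self.loader.load_entities(resolved_entities)
        rel_count = self.loader.load_relationships(relationships)

        return {
            "chunks": len(chunks),
            "entities": entity_count,
            "relationships": rel_count
        }

    def close(self):
        self.loader.close()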
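Because chunks overlap, the same relationship can be extracted from two neighbouring chunks. MERGE absorbs such repeats inside the database, but removing them first shrinks the payload and makes the reported counts easier to interpret. A minimal sketch, assuming relationships are dicts with the `source`, `type`, and `target` keys used throughout this chapter (`dedupe_relationships` is a hypothetical helper):

```python
def dedupe_relationships(relationships: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (source, type, target) triple, preserving order."""
    seen = set()
    unique = []
    for rel in relationships:
        key = (rel["source"], rel["type"], rel["target"])
        if key in seen:
            continue  # same triple already collected from an earlier chunk
        seen.add(key)
        unique.append(rel)
    return unique


extracted = [
    {"source": "graphrag", "type": "DEPENDS_ON", "target": "neo4j", "properties": {}},
    {"source": "graphrag", "type": "DEPENDS_ON", "target": "neo4j", "properties": {}},
    {"source": "neo4j", "type": "DEVELOPED_BY", "target": "neo4j-inc", "properties": {}},
]
print(len(dedupe_relationships(extracted)))
```

The direction of a relationship is part of the key, so `A DEPENDS_ON B` and `B DEPENDS_ON A` are kept as distinct edges.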

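Summary

This chapter covered the full pipeline for building a knowledge graph from unstructured text.

For text chunking, roughly 1,500 to 3,000 characters balances extraction quality against context preservation
Including a clear schema definition in the LLM prompt considerably improves extraction accuracy
JSON parsing and validation are essential safeguards against the unpredictability of LLM output
Entity resolution is the biggest bottleneck for graph quality; string similarity and LLM-based resolution are used together
The MERGE pattern loads data into Neo4j safely without creating duplicates

Next chapter preview: Chapter 6 covers GraphRAG, which puts the knowledge graph built here to use. It looks at Microsoft GraphRAG's community summaries, global and local search, and hybrid search strategies.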

