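February 15, 2026 · AI / ML

Chapter 5: Voice AI (STT, TTS, and Real-Time Voice Conversation)

This chapter covers the principles and implementation of speech recognition (STT), speech synthesis (TTS), and real-time voice conversation systems. You will learn about Whisper, the OpenAI Audio API, and design patterns for voice agents.

11 min read · 667 characters · 6 sections

Chapter 4 covered document understanding and OCR integration. This chapter turns to voice AI, another core pillar of multimodal AI. We look at the design and implementation of speech recognition (STT), speech synthesis (TTS), and the real-time voice conversation systems that combine them.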

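Components of Voice AI

Audio input → [STT] → text → [LLM] → response text → [TTS] → audio output

A traditional voice AI pipeline consists of three stages: STT, LLM, and TTS. Unified models that process speech natively, such as GPT-4o, have appeared more recently, but the pipeline approach still has advantages in flexibility and control, since each stage can be swapped, logged, or tuned independently.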
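
The three stages can be wired together as plain callables, which is where much of the pipeline's flexibility comes from. The sketch below is only an illustration: `VoicePipeline` and the stand-in stage functions are hypothetical names, not part of any library, and real stages would call an STT model, an LLM, and a TTS service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains STT -> LLM -> TTS; each stage is an injectable callable."""

    stt: Callable[[bytes], str]  # audio bytes -> transcript
    llm: Callable[[str], str]    # transcript -> reply text
    tts: Callable[[str], bytes]  # reply text -> audio bytes

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stand-in stages for illustration only (hypothetical, no real audio involved)
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
print(pipeline.respond(b"hello"))  # prints b'You said: hello'
```

Because each stage is just a function, replacing one of them (for example, a local Whisper model with a hosted API) does not affect the other two.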

STT (Speech-to-Text): Speech Recognition

Whisper: A General-Purpose Speech Recognition Model

OpenAI's Whisper is one of the most widely used open-source speech recognition models today.

Running Whisper Locally

python

import whisper
 
model = whisper.load_model("large-v3")
 
result = model.transcribe(
    "meeting_recording.mp3",
    language="ko",
    task="transcribe",
)
 
print(result["text"])
 
# Segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.1f}s - {end:.1f}s] {text}")

Whisper Model Size Comparison

Model	Parameters	Relative speed	Korean WER (approx.)
tiny	39M	32x	~25%
base	74M	16x	~18%
small	244M	6x	~12%
medium	769M	2x	~8%
large-v3	1.55B	1x	~5%
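
For readers unfamiliar with the WER column: word error rate is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation using word-level edit distance; `word_error_rate` is a name chosen for this example, and whitespace tokenization is an assumption (for Korean, where spacing is variable, character error rate is often reported alongside WER).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```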

Speech Recognition via API

OpenAI Whisper API

python

from openai import OpenAI
 
client = OpenAI()
 
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="ko",
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )
 
for segment in transcription.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
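
Segment timestamps like these are commonly turned into subtitles. As a small sketch, the helper below formats `(start, end, text)` tuples as SubRip (SRT) text; `format_srt_timestamp` and `segments_to_srt` are hypothetical names for this example, not part of the OpenAI SDK.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from an iterable of (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)


print(segments_to_srt([(0.0, 1.5, " Hello"), (1.5, 3.0, " world")]))
```

With the API response above, the input could be built as `[(s.start, s.end, s.text) for s in transcription.segments]`.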

Speech Recognition Optimization Strategies

Processing Long Audio

python

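import os
import tempfile

import whisper
from pydub import AudioSegment

model = whisper.load_model("large-v3")


def split_audio(file_path: str, chunk_duration_ms: int = 30_000) -> list:
    """Split long audio into fixed-length chunks (30 seconds by default)."""
    audio = AudioSegment.from_file(file_path)
    # len(audio) and slice indices are both in milliseconds
    return [
        audio[i:i + chunk_duration_ms]
        for i in range(0, len(audio), chunk_duration_ms)
    ]


def transcribe_with_context(chunk_path: str, prev_text: str) -> str:
    """Transcribe one chunk, passing the previous text as Whisper's initial_prompt.

    One possible implementation of this helper, based on local Whisper.
    """
    result = model.transcribe(
        chunk_path,
        language="ko",
        initial_prompt=prev_text or None,
    )
    return result["text"].strip()


def transcribe_long_audio(file_path: str) -> str:
    """Process long audio chunk by chunk."""
    chunks = split_audio(file_path)
    full_text = []

    # Temporary chunk files are removed automatically when the block exits
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, chunk in enumerate(chunks):
            chunk_path = os.path.join(tmp_dir, f"chunk_{i}.mp3")
            chunk.export(chunk_path, format="mp3")

            # Provide the previous text as a prompt to maintain continuity
            prev_text = full_text[-1] if full_text else ""
            full_text.append(transcribe_with_context(chunk_path, prev_text))

    return " ".join(full_text)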

Tip

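When processing long audio, adding about 1 to 2 seconds of overlap between chunks helps prevent words from being lost at chunk boundaries. Passing the tail of the previous chunk's transcript as the next chunk's initial_prompt also improves contextual continuity.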
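
The overlap described in the tip can be sketched as a function that computes chunk boundaries in milliseconds. `overlapping_ranges` is a hypothetical helper written for this example; the default values (30-second chunks, 1.5-second overlap) are assumptions chosen to match the tip.

```python
def overlapping_ranges(
    total_ms: int,
    chunk_ms: int = 30_000,
    overlap_ms: int = 1_500,
) -> list:
    """Return (start, end) millisecond ranges where each chunk overlaps the previous one."""
    if not 0 <= overlap_ms < chunk_ms:
        raise ValueError("overlap_ms must be non-negative and smaller than chunk_ms")

    step = chunk_ms - overlap_ms  # how far each new chunk advances
    ranges = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        ranges.append((start, end))
        if end == total_ms:
            break  # the last chunk reached the end of the audio
        start += step
    return ranges


print(overlapping_ranges(70_000, chunk_ms=30_000, overlap_ms=2_000))
# prints [(0, 30000), (28000, 58000), (56000, 70000)]
```

Each range can be used directly as a pydub slice, for example `audio[start:end]`. Note that overlapping audio produces duplicated words at the boundaries, so the transcripts need a deduplication step when they are joined.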

TTS (Text-to-Speech): Speech Synthesis

Cloud TTS APIs

OpenAI TTS API

python

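from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream the response body to disk instead of buffering it in memory
with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Hello. This is an example of converting text to speech.",
    speed=1.0,
) as speech_response:
    speech_response.stream_to_file(Path("output.mp3"))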

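Streaming TTS

Streaming TTS is essential for real-time responses: synthesizing sentence by sentence lets playback begin before the LLM has finished generating.

Streaming TTS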

python

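import re

from openai import OpenAI

client = OpenAI()

# Sentence-ending punctuation used as the split point
SENTENCE_END = re.compile(r"[.?!]")


def synthesize(sentence: str) -> bytes:
    """Convert one sentence to audio bytes."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=sentence,
    )
    return response.content


# Receive the LLM response as a stream and pass it to TTS
def stream_speech(text_chunks):
    """Convert streamed text chunks to speech, one sentence at a time."""
    buffer = ""

    for chunk in text_chunks:
        buffer += chunk

        # Emit every complete sentence in the buffer, earliest boundary first
        while (match := SENTENCE_END.search(buffer)):
            sentence = buffer[:match.end()]
            buffer = buffer[match.end():].lstrip()
            yield synthesize(sentence)

    # Flush trailing text that did not end with punctuation
    if buffer.strip():
        yield synthesize(buffer)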

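Voice Selection Guide

Voice	Characteristics	Suitable for
alloy	Neutral, balanced	General purpose, announcements
echo	Low, calm	Narration, audiobooks
fable	Bright, lively	Education, presentations
onyx	Deep, authoritative	News, official announcements
nova	Warm, friendly	Conversation, customer service
shimmer	Bright, clear	Guidance, assistants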

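Real-Time Voice Conversation Systems

Architecture Design

User speech → [VAD] → [STT] → text → [LLM] → response → [TTS] → audio output
                ↑
     Voice Activity Detection

VAD (Voice Activity Detection)

Detecting whether the user is speaking is central to a real-time voice system: it decides when to start transcribing and when the user's turn has ended.

Voice Detection with Silero VAD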

python

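import torch

SAMPLING_RATE = 16000  # read_audio resamples to this rate

model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad",
    model="silero_vad",
)
(get_speech_timestamps, _, read_audio, *_) = utils

audio = read_audio("recording.wav", sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(audio, model, sampling_rate=SAMPLING_RATE)

# Timestamps are in samples; divide by the sampling rate to get seconds
for ts in speech_timestamps:
    print(f"Speech: {ts['start'] / SAMPLING_RATE:.1f}s - {ts['end'] / SAMPLING_RATE:.1f}s")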
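
The example above finds speech regions in a finished recording. In a live conversation, a VAD is more typically used for endpointing, that is, deciding that the user has stopped talking once enough silence has followed speech. The sketch below shows that logic on a sequence of per-frame speech probabilities. `find_end_of_utterance` is a hypothetical helper, and the 32 ms frame length and 500 ms silence window are assumed values, not recommendations from any library.

```python
from typing import Iterable, Optional


def find_end_of_utterance(
    speech_probs: Iterable[float],
    frame_ms: int = 32,
    threshold: float = 0.5,
    min_silence_ms: int = 500,
) -> Optional[int]:
    """Return the frame index at which the turn is considered finished, or None."""
    silence_ms = 0
    speech_started = False

    for index, prob in enumerate(speech_probs):
        if prob >= threshold:
            # Speech frame: the turn is active and the silence counter resets
            speech_started = True
            silence_ms = 0
        elif speech_started:
            # Silence only counts after the user has started speaking
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return index
    return None


# 5 silent frames, 10 speech frames, then 20 silent frames
probs = [0.1] * 5 + [0.9] * 10 + [0.1] * 20
print(find_end_of_utterance(probs))  # prints 30
```

A shorter silence window makes the system respond faster but raises the risk of cutting the user off during a natural pause.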

GPT-4o Real-Time Voice API

The GPT-4o Realtime API combines STT, LLM, and TTS behind a single WebSocket connection.

Real-Time Voice Conversation (Concept)

python

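import asyncio
import base64
import json
import os

import websockets

# Model name is an assumption; use whichever realtime model your account offers
MODEL = "gpt-4o-realtime-preview"


async def send_audio_stream(ws) -> None:
    """Placeholder: capture microphone audio and send it to the server.

    A real implementation sends base64-encoded PCM16 chunks as
    "input_audio_buffer.append" events.
    """


def play_audio_chunk(pcm_bytes: bytes) -> None:
    """Placeholder: pass decoded PCM16 audio to your playback device."""


async def realtime_voice_chat():
    """Real-time voice conversation with GPT-4o (beta realtime=v1 event names)."""
    url = f"wss://api.openai.com/v1/realtime?model={MODEL}"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    # websockets >= 14 uses additional_headers; older releases call it extra_headers
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "Please chat in a friendly tone in Korean.",
                # The Realtime voice list differs from the TTS one; "nova" is not in it
                "voice": "alloy",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                },
            },
        }))

        # Start streaming audio; keep a reference so the task is not garbage-collected
        sender = asyncio.create_task(send_audio_stream(ws))

        try:
            # Receive responses
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    # Audio arrives base64-encoded; decode it before playback
                    play_audio_chunk(base64.b64decode(event["delta"]))
                elif event["type"] == "response.audio_transcript.delta":
                    # Transcript of the spoken response (for captions)
                    print(event["delta"], end="", flush=True)
        finally:
            sender.cancel()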

Info

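The GPT-4o Realtime API has a reported voice-to-voice latency of roughly 300 ms, which comes close to natural conversation. Compared with the 1 to 3 seconds typical of a traditional STT → LLM → TTS pipeline, this is a substantial improvement.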
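
The `send_audio_stream` helper called in the example above is left to the application. Its core job is to wrap raw audio in `input_audio_buffer.append` events, with the audio base64-encoded. The sketch below shows one way to build such an event from 16-bit samples. `pcm16_append_event` is a hypothetical name, and the little-endian mono PCM16 layout is an assumption based on the API's default input format; check the current API reference for the expected sample rate.

```python
import base64
import json
import struct


def pcm16_append_event(samples: list) -> str:
    """Pack 16-bit integer samples as little-endian PCM and wrap them in an append event."""
    pcm_bytes = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


event = pcm16_append_event([0, 1, -1])
print(event)
```

Inside `send_audio_stream`, each captured microphone chunk would be converted this way and sent with `await ws.send(event)`.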

Audio Pre- and Post-Processing

Noise Removal

Background Noise Removal

python

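import noisereduce as nr
import soundfile as sf

# Load the audio
audio, sr = sf.read("noisy_audio.wav")

# Downmix to mono: soundfile returns (frames, channels) for multichannel files
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Extract a noise profile (assumes the first second contains only noise)
noise_sample = audio[:sr]

# Remove noise
cleaned_audio = nr.reduce_noise(
    y=audio,
    sr=sr,
    y_noise=noise_sample,
    stationary=True,  # use the fixed noise profile above
    prop_decrease=0.8,  # remove 80% of the estimated noise
)

sf.write("cleaned_audio.wav", cleaned_audio, sr)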

Speaker Diarization

Speaker Diarization (Concept)

python

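# Speaker diarization with pyannote.audio
import os

from pyannote.audio import Pipeline

# The model is gated on Hugging Face: accept its terms and supply an access token.
# HF_TOKEN is an assumed variable name; pyannote.audio 3.x takes use_auth_token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)
diarization = pipeline("meeting.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")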

Combining speaker diarization with STT makes automatic meeting-minutes generation possible.

Meeting Minutes Generation Pipeline

python

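def generate_meeting_minutes(audio_path: str, llm_client) -> str:
    """Generate meeting minutes automatically from audio.

    llm_client is expected to be an Anthropic client (messages.create interface).
    diarize() and transcribe_with_speakers() are helpers you need to supply.
    """
    # 1. Speaker diarization
    diarization = diarize(audio_path)

    # 2. STT (per speaker)
    transcript = transcribe_with_speakers(audio_path, diarization)

    # 3. Write the minutes with an LLM
    response = llm_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"""Write formal meeting minutes based on the following meeting transcript.

Transcript:
{transcript}

Include the following in the minutes:
1. Attendees
2. Agenda summary
3. Key discussion points
4. Decisions
5. Action items (owner, deadline)""",
        }],
    )
    return response.content[0].text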
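
The `transcribe_with_speakers` step is not defined in this chapter. One common way to approach it is to transcribe the whole recording once and then label each transcript segment with the speaker whose diarization turn overlaps it the most. The sketch below shows only that matching step; `assign_speakers` is a hypothetical helper, and the tuple layouts are assumptions made for this example.

```python
def assign_speakers(transcript_segments, speaker_turns) -> str:
    """Label each transcript segment with the speaker whose turn overlaps it most.

    transcript_segments: iterable of (start, end, text)
    speaker_turns: iterable of (start, end, speaker)
    """
    turns = list(speaker_turns)
    lines = []
    for seg_start, seg_end, text in transcript_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time interval shared by the segment and the turn
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        lines.append(f"[{seg_start:.1f}s] {best_speaker}: {text.strip()}")
    return "\n".join(lines)


segments = [(0.0, 2.0, " Hello"), (2.0, 5.0, " Hi there"), (10.0, 11.0, " Bye")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 6.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

The speaker turns could come from pyannote as `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]`, and the segments from Whisper's `result["segments"]`. Segments with no overlapping turn are labeled `UNKNOWN` rather than guessed.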

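Summary

Voice AI is built from a combination of STT, TTS, and VAD, and the GPT-4o Realtime API now makes integrated voice conversation possible. In practice, pre- and post-processing such as noise removal, speaker diarization, and context preservation determines quality. Voice AI plays a central role in applications such as meeting minutes generation, customer service, and voice assistants.

The next chapter covers video understanding and analysis, the temporal dimension of multimodal AI. You will learn techniques for extracting information from video and reasoning along the time axis.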
