docs: add comprehensive Speechmatics STT integration notes

Document all findings from the integration process directly in the
source code for future reference:

1. Language code mapping: Speechmatics uses ISO 639-3 "cmn" for
   Mandarin, but LiveKit LanguageCode auto-normalizes it to "zh".
   Must override stt._stt_options.language after construction.

2. Turn detection modes (critical):
   - EXTERNAL: unusable — LiveKit never sends FlushSentinel, only
     pushes silence frames, so FINAL_TRANSCRIPT never arrives
   - ADAPTIVE: unusable — client-side Silero VAD conflicts with
     LiveKit's own VAD, produces zero transcription output
   - SMART_TURN: correct choice — server-side intelligent turn
     detection, auto-emits FINAL_TRANSCRIPT, fully compatible

3. Speaker diarization: is_active flag distinguishes primary speaker
   from TTS echo, solving the "speaker confusion" problem

4. Docker deployment: SPEECHMATICS_API_KEY in .env, watch for
   COPY layer cache when rebuilding
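
Item 3 can be illustrated with a minimal sketch. The `Segment` dataclass and
the `user_segments` helper are hypothetical stand-ins, not types from
livekit-plugins-speechmatics; only the `speaker_id` and `is_active` field
names come from the notes in this commit.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a diarized transcript segment."""

    text: str
    speaker_id: str
    is_active: bool  # True: primary speaker (user); False: passive speaker (e.g. TTS echo)


def user_segments(segments: list[Segment]) -> list[Segment]:
    """Keep only the segments attributed to the active (primary) speaker."""
    return [s for s in segments if s.is_active]


segments = [
    Segment("hello, can you hear me", "S1", True),
    Segment("yes, I can hear you", "S2", False),  # agent's own TTS picked up by the mic
]
print([s.text for s in user_segments(segments)])
```

Filtering on `is_active` rather than on a fixed `speaker_id` is assumed here
because the notes describe the flag, not the ID, as the signal that separates
the user from echo.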

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hailin 2026-03-03 04:47:33 -08:00
parent f30aa414dd
commit e8a3e07116
1 changed file with 58 additions and 11 deletions


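@@ -6,6 +6,44 @@ Mandarin recognition with speaker diarization support.
 The SPEECHMATICS_API_KEY environment variable is read automatically
 by the livekit-plugins-speechmatics package.
+===========================================================================
+Integration notes (2026-03-03)
+===========================================================================
+1. Language code mapping
+   - Speechmatics uses ISO 639-3 language codes; Mandarin Chinese is "cmn".
+   - LiveKit's LanguageCode class automatically normalizes "cmn" to ISO 639-1 "zh"
+     (see livekit/agents/_language_data.py: ISO_639_3_TO_1["cmn"] = "zh").
+   - The Speechmatics API does not accept "zh" and fails with
+     "lang pack [zh] is not supported".
+   - Fix: after constructing the STT, manually override
+     stt._stt_options.language = "cmn".
+2. Turn detection mode selection (critical)
+   Observed behavior of the three modes under the LiveKit framework:
+   - EXTERNAL: the client must call client.finalize() manually before a
+     FINAL_TRANSCRIPT is produced. In the LiveKit agents framework (v1.4.4),
+     once VAD detects end of speech the framework does not call
+     stream.flush() (no FlushSentinel is sent); it pushes silence frames and
+     waits for the FINAL event.
+     Result: only INTERIM_TRANSCRIPT arrives and never FINAL, so the framework
+     hits its 2-second timeout and the user gets no reply.
+   - ADAPTIVE: uses the Silero VAD built into the Speechmatics SDK for
+     client-side turn detection. LiveKit also runs its own Silero VAD, and the
+     two VADs conflict.
+     Result: zero transcription output, complete silence.
+   - SMART_TURN (recommended): Speechmatics performs intelligent turn
+     detection on the server side. It decides from semantics and pauses
+     whether the user has finished speaking and proactively emits AddSegment
+     (FINAL_TRANSCRIPT). No client intervention is needed, and the mode is
+     fully compatible with the LiveKit framework.
+   Official docs: https://docs.speechmatics.com/integrations-and-sdks/livekit
+3. Speaker diarization
+   - With enable_diarization=True, each segment carries speaker_id and
+     is_active fields.
+   - is_active=True marks the primary speaker (the user); is_active=False
+     marks a passive speaker (e.g. TTS echo).
+   - This solves the "speaker confusion" problem: the agent does not treat the
+     echo of its own TTS as user input.
+4. Docker deployment notes
+   - SPEECHMATICS_API_KEY is configured in the server's .env and passed into
+     the container via docker-compose.yml.
+   - After every change to files under src/, run
+     docker compose build voice-agent and watch out for COPY layer caching.
+     If a change does not take effect, add --no-cache.
 """
 import logging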
@@ -13,14 +51,15 @@ from livekit.plugins.speechmatics import STT, TurnDetectionMode
 logger = logging.getLogger(__name__)
-# Map Whisper language codes to Speechmatics language codes
+# Whisper language code → Speechmatics language code mapping
+# Speechmatics uses ISO 639-3 "cmn" for Mandarin rather than ISO 639-1 "zh"
 _LANG_MAP = {
-    "zh": "cmn",
-    "en": "en",
-    "ja": "ja",
-    "ko": "ko",
-    "de": "de",
-    "fr": "fr",
+    "zh": "cmn",  # Mandarin Chinese
+    "en": "en",  # English
+    "ja": "ja",  # Japanese
+    "ko": "ko",  # Korean
+    "de": "de",  # German
+    "fr": "fr",  # French
 }
@@ -35,15 +74,23 @@ def create_speechmatics_stt(language: str = "cmn") -> STT:
         Configured speechmatics.STT instance with speaker diarization enabled.
     """
     sm_lang = _LANG_MAP.get(language, language)
     stt = STT(
         language=sm_lang,
         include_partials=True,
+        # SMART_TURN: server-side intelligent turn detection; emits FINAL_TRANSCRIPT automatically
+        # Do not use EXTERNAL (requires manual finalize) or ADAPTIVE (conflicts with LiveKit's VAD)
         turn_detection_mode=TurnDetectionMode.SMART_TURN,
+        # Speaker diarization: distinguish user speech from TTS echo
         enable_diarization=True,
     )
-    # Workaround: LiveKit's LanguageCode normalizes ISO 639-3 "cmn" back to
-    # ISO 639-1 "zh", but Speechmatics expects "cmn". Override the internal
-    # option after construction so the raw Speechmatics code is sent.
+    # Bypass LiveKit LanguageCode's automatic ISO 639-3 → 639-1 normalization:
+    # LanguageCode("cmn") becomes "zh", but Speechmatics only accepts "cmn"
     stt._stt_options.language = sm_lang  # type: ignore[assignment]
-    logger.info("Speechmatics STT created: language=%s (input=%s), mode=SMART_TURN, diarization=True", sm_lang, language)
+    logger.info(
+        "Speechmatics STT created: language=%s (input=%s), mode=SMART_TURN, diarization=True",
+        sm_lang, language,
+    )
     return stt
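
The lookup-then-override pattern in `create_speechmatics_stt` can be tried
without LiveKit installed. In the sketch below, `normalize_language`,
`FakeOptions`, and `build_options` are hypothetical stand-ins that only mimic
the normalization described in the integration notes; they are not LiveKit's
actual classes, and the one-entry `_ISO_639_3_TO_1` table is an assumed subset.

```python
from dataclasses import dataclass

_LANG_MAP = {"zh": "cmn", "en": "en", "ja": "ja", "ko": "ko", "de": "de", "fr": "fr"}

# Assumed subset of the ISO 639-3 -> ISO 639-1 table the notes refer to.
_ISO_639_3_TO_1 = {"cmn": "zh"}


def normalize_language(code: str) -> str:
    """Stand-in for the automatic normalization: "cmn" collapses to "zh"."""
    return _ISO_639_3_TO_1.get(code, code)


@dataclass
class FakeOptions:
    """Stand-in for the STT's internal options object."""

    language: str


def build_options(language: str) -> FakeOptions:
    # Unknown codes fall through the map unchanged.
    sm_lang = _LANG_MAP.get(language, language)
    # Construction normalizes the code, so "cmn" would be stored as "zh".
    opts = FakeOptions(language=normalize_language(sm_lang))
    # Override after construction so the raw Speechmatics code is kept.
    opts.language = sm_lang
    return opts


print(build_options("zh").language)
print(build_options("cmn").language)
print(build_options("es").language)
```

Because `_LANG_MAP.get(language, language)` passes unknown codes through, a
caller can supply either a Whisper-style code such as "zh" or a Speechmatics
code such as "cmn" and end up with the same stored value.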