docs: add comprehensive Speechmatics STT integration notes

Document all findings from the integration process directly in the source code for future reference:

1. Language code mapping: Speechmatics uses ISO 639-3 "cmn" for Mandarin, but LiveKit LanguageCode auto-normalizes it to "zh". Must override stt._stt_options.language after construction.

2. Turn detection modes (critical):
   - EXTERNAL: unusable. LiveKit never sends FlushSentinel, only pushes silence frames, so FINAL_TRANSCRIPT never arrives
   - ADAPTIVE: unusable. Client-side Silero VAD conflicts with LiveKit's own VAD, produces zero transcription output
   - SMART_TURN: correct choice. Server-side intelligent turn detection, auto-emits FINAL_TRANSCRIPT, fully compatible

3. Speaker diarization: is_active flag distinguishes primary speaker from TTS echo, solving the "speaker confusion" problem

4. Docker deployment: SPEECHMATICS_API_KEY in .env, watch for COPY layer cache when rebuilding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 04:47:33 -08:00 · e8a3e07116
parent f30aa414dd
commit e8a3e07116
1 changed files with 58 additions and 11 deletions
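
Item 3 of the commit message can be illustrated with a rough sketch of filtering on `is_active`. The `Segment` dataclass and the `user_segments` helper are hypothetical stand-ins, not types from livekit-plugins-speechmatics; only the field names `speaker_id` and `is_active` come from the notes in this commit.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one diarized transcript segment."""
    speaker_id: str
    is_active: bool
    text: str


def user_segments(segments: list[Segment]) -> list[Segment]:
    """Keep only segments attributed to the active (primary) speaker."""
    return [s for s in segments if s.is_active]


# Made-up sample data: one user utterance and one TTS echo picked up by the mic.
segments = [
    Segment(speaker_id="S1", is_active=True, text="turn on the lights"),
    Segment(speaker_id="S2", is_active=False, text="okay, turning on the lights"),
]
kept = user_segments(segments)
print([s.text for s in kept])
```

Dropping passive-speaker segments before they reach the agent is one way to keep the agent from replying to its own TTS output; whether the real plugin filters or only tags them is not stated in this commit.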
--- a/packages/services/voice-agent/src/plugins/speechmatics_stt.py
+++ b/packages/services/voice-agent/src/plugins/speechmatics_stt.py
@@ -6,6 +6,44 @@ Mandarin recognition with speaker diarization support.

 The SPEECHMATICS_API_KEY environment variable is read automatically
 by the livekit-plugins-speechmatics package.
+
+===========================================================================
+Integration notes (2026-03-03)
+===========================================================================
+
+1. Language code mapping
+   - Speechmatics uses the ISO 639-3 code "cmn" for Mandarin Chinese
+   - LiveKit's LanguageCode class automatically normalizes "cmn" to the ISO 639-1 code "zh"
+     (see livekit/agents/_language_data.py: ISO_639_3_TO_1["cmn"] = "zh")
+   - The Speechmatics API does not accept "zh" and fails with "lang pack [zh] is not supported"
+   - Fix: after constructing the STT, manually override stt._stt_options.language = "cmn"
+
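+2. Turn detection mode selection (critical!)
+   How the three modes actually behave under the LiveKit framework:
+
+   - EXTERNAL: the client must call client.finalize() explicitly to get a FINAL_TRANSCRIPT.
+     But the LiveKit agents framework (v1.4.4) does not call stream.flush() (sends no FlushSentinel)
+     after its VAD detects end of speech; it pushes silence frames and waits for a FINAL event.
+     Result: only INTERIM_TRANSCRIPT, never FINAL → framework times out after 2 s → user gets no reply.
+
+   - ADAPTIVE: uses the Silero VAD built into the Speechmatics SDK for client-side turn detection.
+     LiveKit also runs its own Silero VAD, and the two VADs conflict.
+     Result: zero transcription output, complete silence.
+
+   - SMART_TURN (recommended): the Speechmatics server performs intelligent turn detection,
+     using semantics and pauses to decide whether the user has finished, then emits AddSegment (FINAL_TRANSCRIPT).
+     No client intervention is needed; fully compatible with the LiveKit framework.
+     Official docs: https://docs.speechmatics.com/integrations-and-sdks/livekit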
+
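+3. Speaker diarization
+   - With enable_diarization=True, every segment carries speaker_id and is_active flags
+   - is_active=True marks the primary speaker (the user); is_active=False marks a passive speaker (e.g. TTS echo)
+   - Solves the "speaker confusion" problem: the agent does not treat the echo of its own TTS as user input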
+
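+4. Docker deployment notes
+   - SPEECHMATICS_API_KEY is set in the server's .env and passed into the container via docker-compose.yml
+   - After changing files under src/, run docker compose build voice-agent (watch for COPY layer
+     caching; add --no-cache if a change does not take effect)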
 """
 import logging

@@ -13,14 +51,15 @@ from livekit.plugins.speechmatics import STT, TurnDetectionMode

 logger = logging.getLogger(__name__)

-# Map Whisper language codes to Speechmatics language codes
+# Whisper language code → Speechmatics language code mapping
+# Speechmatics expects ISO 639-3 "cmn" for Mandarin, not ISO 639-1 "zh"; the other codes pass through unchanged
 _LANG_MAP = {
-    "zh": "cmn",
-    "en": "en",
-    "ja": "ja",
-    "ko": "ko",
-    "de": "de",
-    "fr": "fr",
+    "zh": "cmn",   # Mandarin Chinese
+    "en": "en",    # English
+    "ja": "ja",    # Japanese
+    "ko": "ko",    # Korean
+    "de": "de",    # German
+    "fr": "fr",    # French
 }


@@ -35,15 +74,23 @@ def create_speechmatics_stt(language: str = "cmn") -> STT:
        Configured speechmatics.STT instance with speaker diarization enabled.
    """
    sm_lang = _LANG_MAP.get(language, language)
+
    stt = STT(
        language=sm_lang,
        include_partials=True,
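+        # SMART_TURN: server-side intelligent turn detection; emits FINAL_TRANSCRIPT automatically
+        # Do not use EXTERNAL (requires manual finalize) or ADAPTIVE (conflicts with LiveKit's VAD)
        turn_detection_mode=TurnDetectionMode.SMART_TURN,
+        # Speaker diarization: distinguishes user speech from TTS echo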
        enable_diarization=True,
    )
-    # Workaround: LiveKit's LanguageCode normalizes ISO 639-3 "cmn" back to
-    # ISO 639-1 "zh", but Speechmatics expects "cmn".  Override the internal
-    # option after construction so the raw Speechmatics code is sent.
+
+    # Work around LiveKit LanguageCode's automatic ISO 639-3 → 639-1 normalization:
+    # LanguageCode("cmn") becomes "zh", but Speechmatics accepts only "cmn"
    stt._stt_options.language = sm_lang  # type: ignore[assignment]
-    logger.info("Speechmatics STT created: language=%s (input=%s), mode=SMART_TURN, diarization=True", sm_lang, language)
+
+    logger.info(
+        "Speechmatics STT created: language=%s (input=%s), mode=SMART_TURN, diarization=True",
+        sm_lang, language,
+    )
    return stt
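
The language handling in this diff can be exercised without the LiveKit or Speechmatics packages. The following is a minimal sketch under stated assumptions: `LANG_MAP` copies the table from `_LANG_MAP`, `to_speechmatics_lang` is a hypothetical helper name, and `normalize` is a toy model containing only the single `ISO_639_3_TO_1["cmn"] = "zh"` entry quoted in the docstring, not LiveKit's actual implementation.

```python
# Same table as _LANG_MAP in the diff above.
LANG_MAP = {
    "zh": "cmn",
    "en": "en",
    "ja": "ja",
    "ko": "ko",
    "de": "de",
    "fr": "fr",
}

# Toy model of the normalization described in the docstring (one entry only).
ISO_639_3_TO_1 = {"cmn": "zh"}


def to_speechmatics_lang(language: str) -> str:
    """Map a Whisper-style code to a Speechmatics code; unknown codes pass through."""
    return LANG_MAP.get(language, language)


def normalize(code: str) -> str:
    """Hypothetical stand-in for an ISO 639-3 -> 639-1 normalizing wrapper."""
    return ISO_639_3_TO_1.get(code, code)


sm_lang = to_speechmatics_lang("zh")   # code Speechmatics expects
stored = normalize(sm_lang)            # what a normalizing wrapper would keep
print(sm_lang, stored)
```

In this model the normalization undoes the mapping, which is why the diff writes `sm_lang` back to `stt._stt_options.language` after construction. Because that attribute is private, the override may need revisiting when the plugin is upgraded.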