docs: add comprehensive Speechmatics STT integration notes

Document all findings from the integration process directly in the
source code for future reference:

1. Language code mapping: Speechmatics uses ISO 639-3 "cmn" for
   Mandarin, but LiveKit LanguageCode auto-normalizes it to "zh".
   Must override stt._stt_options.language after construction.

2. Turn detection modes (critical):
   - EXTERNAL: unusable. LiveKit never sends FlushSentinel, only
     pushes silence frames, so FINAL_TRANSCRIPT never arrives
   - ADAPTIVE: unusable. The client-side Silero VAD conflicts with
     LiveKit's own VAD and produces zero transcription output
   - SMART_TURN: correct choice. Server-side intelligent turn
     detection, auto-emits FINAL_TRANSCRIPT, fully compatible

3. Speaker diarization: is_active flag distinguishes primary speaker
   from TTS echo, solving the "speaker confusion" problem

4. Docker deployment: SPEECHMATICS_API_KEY in .env, watch for
   COPY layer cache when rebuilding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
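
The EXTERNAL failure described in item 2 comes down to a consumer waiting for a final transcript that never arrives. The following is a minimal sketch of that wait, not LiveKit's actual code: the `(kind, text)` event tuples, `wait_for_final`, and `run_demo` are hypothetical names, and the short timeout stands in for the 2-second one mentioned in the notes.

```python
import asyncio
from typing import List, Optional, Tuple

# (kind, text); a hypothetical stand-in for real STT events.
Event = Tuple[str, str]


async def wait_for_final(events: "asyncio.Queue[Event]", timeout: float) -> Optional[str]:
    """Return the first FINAL_TRANSCRIPT text, or None if none arrives in time."""

    async def next_final() -> str:
        while True:
            kind, text = await events.get()
            if kind == "FINAL_TRANSCRIPT":
                return text
            # Interim results are skipped; they never end the turn.

    try:
        return await asyncio.wait_for(next_final(), timeout)
    except asyncio.TimeoutError:
        # No final transcript arrived: the turn is never completed.
        return None


async def run_demo(incoming: List[Event], timeout: float = 0.05) -> Optional[str]:
    # The queue is created inside the running event loop for portability.
    events: "asyncio.Queue[Event]" = asyncio.Queue()
    for event in incoming:
        events.put_nowait(event)
    return await wait_for_final(events, timeout)
```

Feeding only interim events, as the notes report for EXTERNAL mode, makes `run_demo` return `None` after the timeout; appending a `FINAL_TRANSCRIPT` event, as SMART_TURN is reported to emit, returns its text.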
parent f30aa414dd
commit e8a3e07116
@@ -6,6 +6,44 @@ Mandarin recognition with speaker diarization support.

 The SPEECHMATICS_API_KEY environment variable is read automatically
 by the livekit-plugins-speechmatics package.
+
+===========================================================================
+Integration notes (2026-03-03)
+===========================================================================
+
+1. Language code mapping
+   - Speechmatics uses the ISO 639-3 code "cmn" for Mandarin Chinese
+   - LiveKit's LanguageCode class automatically normalizes "cmn" to the ISO 639-1 code "zh"
+     (see livekit/agents/_language_data.py: ISO_639_3_TO_1["cmn"] = "zh")
+   - The Speechmatics API rejects "zh" with the error "lang pack [zh] is not supported"
+   - Fix: after constructing the STT, manually override stt._stt_options.language = "cmn"
+
+2. Turn detection mode selection (critical!)
+   How the three modes actually behave under the LiveKit framework:
+
+   - EXTERNAL: the client must call client.finalize() manually to produce a FINAL_TRANSCRIPT.
+     However, after its VAD detects end of speech, the LiveKit agents framework (v1.4.4) does not call
+     stream.flush() (no FlushSentinel is sent); it pushes silence frames and waits for a FINAL event.
+     Result: only INTERIM_TRANSCRIPT events, never a FINAL; the framework times out after 2 seconds and the user gets no reply.
+
+   - ADAPTIVE: uses the Silero VAD built into the Speechmatics SDK for client-side turn detection.
+     LiveKit runs its own Silero VAD as well, and the two VADs conflict.
+     Result: zero transcription output, complete silence.
+
+   - SMART_TURN (recommended): the Speechmatics server performs intelligent turn detection,
+     deciding from semantics and pauses whether the user has finished, and emits AddSegment (FINAL_TRANSCRIPT) on its own.
+     No client intervention is needed; fully compatible with the LiveKit framework.
+     Official docs: https://docs.speechmatics.com/integrations-and-sdks/livekit
+
+3. Speaker diarization
+   - With enable_diarization=True, each segment carries a speaker_id and an is_active flag
+   - is_active=True marks the primary speaker (the user); is_active=False marks a passive speaker (e.g. TTS echo)
+   - This solves the "speaker confusion" problem: the agent no longer treats the echo of its own TTS as user input
+
+4. Docker deployment notes
+   - SPEECHMATICS_API_KEY is set in the server's .env file, and docker-compose.yml passes it into the container
+   - After changing any file under src/, run docker compose build voice-agent (watch for the COPY layer cache;
+     if a change does not take effect, add --no-cache)
 """
 import logging

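
The diarization note in the docstring above can be illustrated with a small filter. This is a sketch under stated assumptions: `Segment` and `user_text` are hypothetical names, and the plugin's real segment objects may expose `speaker_id` and `is_active` differently.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical stand-in for a diarized transcript segment."""

    speaker_id: str
    is_active: bool
    text: str


def user_text(segments: Iterable[Segment]) -> List[str]:
    """Keep text from active speakers only, dropping passive ones such as TTS echo."""
    return [seg.text for seg in segments if seg.is_active]
```

Under this model, a segment flagged `is_active=False` (for example the agent's own TTS picked up by the microphone) never reaches the code that treats transcripts as user input.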
@@ -13,14 +51,15 @@ from livekit.plugins.speechmatics import STT, TurnDetectionMode

 logger = logging.getLogger(__name__)

-# Map Whisper language codes to Speechmatics language codes
+# Mapping from Whisper language codes to Speechmatics language codes
+# Speechmatics expects ISO 639-3 "cmn" for Mandarin rather than ISO 639-1 "zh"
 _LANG_MAP = {
-    "zh": "cmn",
-    "en": "en",
-    "ja": "ja",
-    "ko": "ko",
-    "de": "de",
-    "fr": "fr",
+    "zh": "cmn",  # Mandarin Chinese
+    "en": "en",  # English
+    "ja": "ja",  # Japanese
+    "ko": "ko",  # Korean
+    "de": "de",  # German
+    "fr": "fr",  # French
 }


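
The lookup this table feeds is a dictionary `get` with the input as its own fallback, as the next hunk shows. A self-contained sketch of that behavior follows; `to_speechmatics_lang` is a hypothetical helper name used only for illustration.

```python
from typing import Dict

_LANG_MAP: Dict[str, str] = {
    "zh": "cmn",  # Mandarin Chinese
    "en": "en",
    "ja": "ja",
    "ko": "ko",
    "de": "de",
    "fr": "fr",
}


def to_speechmatics_lang(language: str) -> str:
    """Translate a Whisper-style code, passing unknown codes through unchanged."""
    return _LANG_MAP.get(language, language)
```

Because of the fallback, a caller that already passes "cmn" gets "cmn" back, and codes missing from the table are forwarded without validation, so an unsupported code would only fail once it reaches the API.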
@@ -35,15 +74,23 @@ def create_speechmatics_stt(language: str = "cmn") -> STT:
         Configured speechmatics.STT instance with speaker diarization enabled.
     """
     sm_lang = _LANG_MAP.get(language, language)
+
     stt = STT(
         language=sm_lang,
         include_partials=True,
+        # SMART_TURN: server-side intelligent turn detection; emits FINAL_TRANSCRIPT automatically
+        # Do not use EXTERNAL (requires manual finalize) or ADAPTIVE (conflicts with LiveKit's VAD)
         turn_detection_mode=TurnDetectionMode.SMART_TURN,
+        # Speaker diarization: distinguishes user speech from TTS echo
         enable_diarization=True,
     )
-    # Workaround: LiveKit's LanguageCode normalizes ISO 639-3 "cmn" back to
-    # ISO 639-1 "zh", but Speechmatics expects "cmn". Override the internal
-    # option after construction so the raw Speechmatics code is sent.
+
+    # Bypass LiveKit LanguageCode's automatic ISO 639-3 -> ISO 639-1 normalization:
+    # LanguageCode("cmn") becomes "zh", but Speechmatics only accepts "cmn" for Mandarin
     stt._stt_options.language = sm_lang  # type: ignore[assignment]
-    logger.info("Speechmatics STT created: language=%s (input=%s), mode=SMART_TURN, diarization=True", sm_lang, language)
+
+    logger.info(
+        "Speechmatics STT created: language=%s (input=%s), mode=SMART_TURN, diarization=True",
+        sm_lang, language,
+    )
     return stt
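
Why the override has to happen after construction can be shown with stand-in classes. This is a sketch, not the plugin's code: `FakeSTT`, `Options`, and `create_stt` are hypothetical, and the one-entry `ISO_639_3_TO_1` table only models the mapping the notes cite.

```python
# Models the single mapping cited in the notes; the real table is larger.
ISO_639_3_TO_1 = {"cmn": "zh"}


class Options:
    def __init__(self, language: str) -> None:
        self.language = language


class FakeSTT:
    """Hypothetical STT whose constructor normalizes the language code."""

    def __init__(self, language: str) -> None:
        normalized = ISO_639_3_TO_1.get(language, language)
        self._stt_options = Options(normalized)


def create_stt(sm_lang: str) -> FakeSTT:
    stt = FakeSTT(language=sm_lang)
    # The constructor has already rewritten "cmn" to "zh"; restore the raw code.
    stt._stt_options.language = sm_lang
    return stt
```

Passing "cmn" to the constructor alone leaves "zh" in the options, which is the value the API reportedly rejects. The override depends on an underscore-prefixed attribute, which is private by convention, so it is worth re-checking whenever the plugin is upgraded.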