fix: increase STT silence_duration_ms to prevent choppy transcription

The default silence_duration_ms=350 is too aggressive for Chinese speech
and fragments sentences into 1-3 character chunks. Set
silence_duration_ms to 800 so the STT waits longer before finalizing a
turn, and raise the VAD threshold to 0.6. The result is complete
sentences for LLM processing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
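
To illustrate the reasoning in the message above, here is a minimal sketch of how a silence timeout affects turn segmentation. It is a toy model, not the server-side VAD implementation: the function `split_into_turns` and the sample timings are hypothetical, and real VAD also depends on the threshold and prefix padding.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms) of one burst of detected speech


def split_into_turns(speech: List[Segment], silence_duration_ms: int) -> List[Segment]:
    """Merge bursts whose separating pause is shorter than the silence timeout.

    A pause at least as long as silence_duration_ms ends the current turn.
    """
    turns: List[Segment] = []
    for start, end in speech:
        if turns and start - turns[-1][1] < silence_duration_ms:
            # Pause too short to end the turn: extend the previous turn.
            turns[-1] = (turns[-1][0], end)
        else:
            # First burst, or the pause was long enough to finalize a turn.
            turns.append((start, end))
    return turns


# Hypothetical utterance: four bursts separated by pauses of 400, 500 and 900 ms.
speech = [(0, 600), (1000, 1500), (2000, 2700), (3600, 4200)]

turns_350 = split_into_turns(speech, 350)  # every pause ends a turn
turns_800 = split_into_turns(speech, 800)  # only the 900 ms pause ends a turn
```

With the 350 ms timeout each burst becomes its own turn, while the 800 ms timeout merges the first three bursts into one, which is the effect this change aims for.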
hailin 2026-03-01 18:37:13 -08:00
parent a5c95b460a
commit 186234bae2
1 changed files with 8 additions and 0 deletions

@@ -203,6 +203,14 @@ async def entrypoint(ctx: JobContext) -> None:
             language=settings.whisper_language,
             client=_oai_client,
             use_realtime=True,
+            # Increase silence_duration_ms so Chinese speech isn't chopped
+            # into tiny fragments (default 350ms is too aggressive).
+            turn_detection={
+                "type": "server_vad",
+                "threshold": 0.6,
+                "prefix_padding_ms": 600,
+                "silence_duration_ms": 800,
+            },
         )
     else:
         stt = LocalWhisperSTT(