Commit Graph

28 Commits

Author SHA1 Message Date
hailin 7fb0d1de95 refactor: remove Speechmatics STT integration entirely, default to OpenAI
- Delete speechmatics_stt.py plugin
- Remove speechmatics branch from voice-agent entrypoint
- Remove livekit-plugins-speechmatics dependency
- Change default stt_provider to 'openai' in entity, controller, and UI
- Remove SPEECHMATICS_API_KEY from docker-compose.yml
- Remove speechmatics option from web-admin settings dropdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 04:58:38 -08:00
hailin 191ce2d6b3 fix: use FIXED mode with 1s silence trigger instead of SMART_TURN
SMART_TURN fragments continuous speech into tiny pieces, each triggering
an LLM request that aborts the previous one. FIXED mode waits for a
configurable silence duration (1.0s) before emitting FINAL_TRANSCRIPT
via the built-in END_OF_UTTERANCE handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 04:53:00 -08:00
hailin e8a3e07116 docs: add comprehensive Speechmatics STT integration notes
Document all findings from the integration process directly in the
source code for future reference:

1. Language code mapping: Speechmatics uses ISO 639-3 "cmn" for
   Mandarin, but LiveKit LanguageCode auto-normalizes it to "zh".
   Must override stt._stt_options.language after construction.

2. Turn detection modes (critical):
   - EXTERNAL: unusable — LiveKit never sends FlushSentinel, only
     pushes silence frames, so FINAL_TRANSCRIPT never arrives
   - ADAPTIVE: unusable — client-side Silero VAD conflicts with
     LiveKit's own VAD, produces zero transcription output
   - SMART_TURN: correct choice — server-side intelligent turn
     detection, auto-emits FINAL_TRANSCRIPT, fully compatible

3. Speaker diarization: is_active flag distinguishes primary speaker
   from TTS echo, solving the "speaker confusion" problem

4. Docker deployment: SPEECHMATICS_API_KEY in .env, watch for
   COPY layer cache when rebuilding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 04:47:33 -08:00
hailin f30aa414dd fix: use SMART_TURN mode per Speechmatics official recommendation
Replace EXTERNAL mode + monkey-patch hack with SMART_TURN mode.
SMART_TURN uses Speechmatics server-side turn detection that properly
emits AddSegment (FINAL_TRANSCRIPT) when the user finishes speaking.
No client-side finalize or debounce timer needed.

Ref: https://docs.speechmatics.com/integrations-and-sdks/livekit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 04:44:21 -08:00
hailin de99990c4d fix: text-based dedup to prevent duplicate FINAL_TRANSCRIPT emissions
Speechmatics re-sends identical partial segments during silence, causing
the debounce timer to fire multiple times with the same text. Each
duplicate FINAL aborts the in-flight LLM request and restarts it.

Replace time-based cooldown with text comparison: skip finalization if
the segment text matches the last finalized text. Also skip starting
new timers when partial text hasn't changed from last finalized.
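The text-comparison gate described above can be sketched in a few lines; `TranscriptDeduper` and its method names are illustrative, not the plugin's actual API:

```python
class TranscriptDeduper:
    """Skip finalization when the segment text matches the last finalized text.

    Speechmatics may re-send identical partial segments during silence; a
    time-based cooldown still lets duplicates through, but comparing the
    text itself is exact.
    """

    def __init__(self):
        self.last_final: str | None = None

    def maybe_finalize(self, segment_text: str) -> bool:
        """Return True only if this text should be emitted as FINAL_TRANSCRIPT."""
        text = segment_text.strip()
        if not text or text == self.last_final:
            return False  # duplicate (or empty) -> do not abort/restart the LLM
        self.last_final = text
        return True
```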

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 04:40:00 -08:00
hailin 3b0119fe09 fix: reduce STT latency, add cooldown dedup, enable diarization
- Reduce debounce delay from 700ms to 400ms for faster response
- Add 1.5s cooldown after emitting FINAL to prevent duplicate triggers
  that cause LLM abort/retry cycles
- Enable speaker diarization (enable_diarization=True)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 03:20:12 -08:00
hailin 8ac1884ab4 fix: use debounce timer to auto-finalize Speechmatics partial transcripts
The LiveKit framework never sends FlushSentinel to the STT stream.
Instead it pushes silence frames and waits for FINAL_TRANSCRIPT events.
In EXTERNAL turn-detection mode, Speechmatics only emits partials.

New approach: each partial transcript restarts a 700ms debounce timer.
When partials stop (user stops speaking), the timer fires and promotes
the last partial to FINAL_TRANSCRIPT, unblocking the pipeline.
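A minimal sketch of that debounce approach, using illustrative names (`PartialDebouncer`, `on_partial`) rather than the real plugin internals:

```python
import asyncio

class PartialDebouncer:
    """Promote the last partial transcript to FINAL after a quiet period.

    Each partial restarts the timer; when no new partial arrives within
    `delay` seconds, `on_final` is called with the last text.
    """

    def __init__(self, delay: float, on_final):
        self.delay = delay
        self.on_final = on_final
        self._last_text = ""
        self._task = None

    def on_partial(self, text: str) -> None:
        self._last_text = text
        if self._task is not None:
            self._task.cancel()  # restart the debounce window
        self._task = asyncio.ensure_future(self._fire())

    async def _fire(self) -> None:
        await asyncio.sleep(self.delay)
        self.on_final(self._last_text)

async def demo() -> list:
    finals = []
    d = PartialDebouncer(0.05, finals.append)
    d.on_partial("turn the")
    await asyncio.sleep(0.01)        # next partial arrives before the window closes
    d.on_partial("turn the lights off")
    await asyncio.sleep(0.2)         # silence: timer fires, last partial promoted
    return finals
```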

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 03:08:17 -08:00
hailin de3eccafd0 debug: add verbose logging to Speechmatics monkey-patch
Trace _patched_process_audio lifecycle and FlushSentinel handling
to diagnose why final transcripts are not being promoted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 02:50:04 -08:00
hailin 1431dc0c83 fix: directly promote partial transcripts to FINAL on FlushSentinel
VoiceAgentClient.finalize() schedules an async task chain that often
loses the race against session teardown. Instead, intercept partial
segments as they arrive, stash them, and synchronously emit them as
FINAL_TRANSCRIPT when FlushSentinel fires.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 02:16:46 -08:00
hailin 73fd56f30a fix: durable monkey-patch for Speechmatics finalize on flush
Move the SpeechStream._process_audio patch from container runtime
into our own source code so it survives Docker rebuilds. The patch
adds client.finalize() on FlushSentinel so EXTERNAL mode produces
final transcripts when LiveKit's VAD detects end of speech.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 02:00:42 -08:00
hailin 6707c5048d fix: use EXTERNAL mode + patch plugin to finalize on flush
EXTERNAL mode produces partial transcripts but livekit-plugins-speechmatics
does not call finalize() when receiving a flush sentinel from the framework.
A runtime monkey-patch on the plugin's SpeechStream._process_audio adds the
missing finalize() call so final transcripts are generated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 01:58:25 -08:00
hailin 8f951ad31c fix: use turn_detection=stt for Speechmatics per official docs
Speechmatics handles end-of-utterance natively via its Voice Agent
API (ADAPTIVE mode). Use turn_detection="stt" on AgentSession so
LiveKit delegates turn boundaries to the STT engine instead of
conflicting with its own VAD-based turn detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 01:44:10 -08:00
hailin db4e70e30c fix: use EXTERNAL turn detection for Speechmatics in LiveKit pipeline
ADAPTIVE mode enables a second client-side Silero VAD inside the
Speechmatics SDK that conflicts with LiveKit's own VAD pipeline,
causing no transcription to be returned. EXTERNAL mode delegates
turn detection to LiveKit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 01:31:33 -08:00
hailin 9daf0e3b4f fix: bypass LanguageCode normalization that maps cmn back to zh
LiveKit's LanguageCode class normalizes ISO 639-3 codes to ISO 639-1
(cmn → zh), but Speechmatics API expects "cmn" not "zh". Override
the internal _stt_options.language after construction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 01:04:20 -08:00
hailin 7292ac6ca6 fix: use cmn instead of cmn_en for Speechmatics Voice Agent API
The cmn_en bilingual code is not supported by the Voice Agent API and causes a timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 00:19:50 -08:00
hailin 17ff9d3ce0 fix: use Speechmatics cmn_en bilingual model for Chinese-English mixed speech
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 23:57:26 -08:00
hailin 1d43943110 fix: correct Speechmatics STT language mapping and parameter name
- Map Whisper language codes (zh→cmn, en→en, etc.) to Speechmatics codes
- Fix parameter name: enable_partials → include_partials
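The mapping might look like this sketch; only zh → cmn is confirmed by the commits here, and the other entries are illustrative assumptions:

```python
# Map Whisper-style language codes to Speechmatics codes. Only zh -> cmn
# is confirmed above; the remaining entries are placeholders for illustration.
WHISPER_TO_SPEECHMATICS = {
    "zh": "cmn",  # Mandarin uses the ISO 639-3 code
    "en": "en",
    "ja": "ja",
    "ko": "ko",
}

def to_speechmatics_language(code: str) -> str:
    """Fall back to the input code when no mapping is known."""
    return WHISPER_TO_SPEECHMATICS.get(code, code)
```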

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 23:56:37 -08:00
hailin f9c47de04b feat: add STT provider switching (OpenAI ↔ Speechmatics) in settings
- Add VoiceConfig entity/repo/service/controller in agent-service
  for per-tenant STT provider persistence (default: speechmatics)
- Add Speechmatics STT plugin in voice-agent with livekit-plugins-speechmatics
- Modify voice-agent entrypoint for 3-way STT selection:
  metadata > agent-service config > env var fallback
- Add "Voice" section in web-admin settings page with STT provider dropdown
- Add i18n translations (en/zh) for voice settings
- Add SPEECHMATICS_API_KEY env var in docker-compose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 22:13:18 -08:00
hailin ce63ece340 feat: add mixed-mode input (text + images + files) during voice calls
Enable users to send text messages, images, and files to the Agent
while an active voice call is in progress. This addresses the case
where spoken instructions are unclear or screenshots/documents need
to be shared for analysis.

## Architecture

Data flows through LiveKit data channel (not direct HTTP):
  Flutter → publishData(topic='text_inject') → voice-agent
  → llm.inject_text_message() → POST /api/v1/agent/tasks (same session)
  → collect streamed response → session.say() → TTS playback

This preserves the constraint that voice-agent owns the agent-service
sessionId — Flutter never contacts agent-service directly.

## Flutter UI (agent_call_page.dart)
- Add keyboard toggle button to active call controls (4-button row)
- Collapsible text input area with attachment picker (+) and send button
- Attachment support: gallery multi-select, camera, file picker
  (images max 1024x1024 quality 80%, PDF supported, max 5 attachments)
- Horizontal scrolling attachment preview with delete buttons
- 200KB payload size check before LiveKit data channel send
- Layout adapts: Spacer flex 1/3 toggle, reduced bottom padding
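The 200KB guard can be mirrored in Python for illustration; the real check lives in Dart in agent_call_page.dart, and the payload shape shown here is an assumption:

```python
import json

MAX_DATA_PAYLOAD = 200 * 1024  # 200KB budget before the data-channel send

def payload_fits(text: str, attachments: list) -> bool:
    """Return True if the serialized message fits the data-channel budget."""
    payload = json.dumps(
        {"text": text, "attachments": attachments},
        ensure_ascii=False,
    ).encode("utf-8")
    return len(payload) <= MAX_DATA_PAYLOAD
```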

## voice-agent (agent.py)
- Register data_received event listener after session.start()
- Filter for topic='text_inject', parse JSON payload
- Call llm.inject_text_message(text, attachments) and TTS via session.say()
- Use asyncio.ensure_future() wrapper for async handler (matches
  existing disconnect handler pattern for sync EventEmitter)
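The sync-wrapper pattern can be sketched with a stand-in emitter; `SyncEmitter` here is a hypothetical stub, not LiveKit's actual EventEmitter:

```python
import asyncio

class SyncEmitter:
    """Minimal stand-in for a synchronous EventEmitter."""

    def __init__(self):
        self._handlers = {}

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event, *args):
        for handler in self._handlers.get(event, []):
            handler(*args)  # handlers must be plain sync callables

async def demo() -> list:
    received = []

    async def handle_text_inject(payload):
        received.append(payload["text"])  # would call llm.inject_text_message(...)

    def on_data_received(payload):
        if payload.get("topic") != "text_inject":
            return  # ignore unrelated data-channel topics
        # Sync wrapper: schedule the async handler on the running loop.
        asyncio.ensure_future(handle_text_inject(payload))

    room = SyncEmitter()
    room.on("data_received", on_data_received)
    room.emit("data_received", {"topic": "text_inject", "text": "hi"})
    room.emit("data_received", {"topic": "other", "text": "ignored"})
    await asyncio.sleep(0)  # yield so the scheduled task runs
    return received
```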

## AgentServiceLLM (agent_llm.py)
- New inject_text_message(text, attachments) method on AgentServiceLLM
- Reuses same _agent_session_id for conversation context continuity
- WS+HTTP streaming: connect, pre-subscribe, POST /tasks with
  attachments field, collect full text response, return string
- _injecting flag prevents concurrent _do_stream from clearing
  session ID on abort errors while inject is in progress
- Same systemPrompt/voiceMode/engineType as voice pipeline

No agent-service changes required — attachments already supported
end-to-end (JSONB storage → multimodal content blocks → Claude).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 05:38:04 -08:00
hailin 02aaf40bb2 fix: move voice instructions to systemPrompt, keep prompt clean
Previously, voice mode wrapped every user message with 【语音对话模式】
("voice conversation mode") instructions, polluting conversation_messages
history with repeated instructions on every turn. Now:

- systemPrompt carries voice-mode instructions (set once, not per-message)
- prompt contains only the clean user text (identical to text chat pattern)
- Conversation history stays clean for multi-turn context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 03:24:50 -08:00
hailin da17488389 feat: voice mode event filtering — skip tool/thinking events for Agent SDK
1. Remove on_enter greeting entirely (no more race condition)
2. voice-agent sends voiceMode: true when engine_type is claude_agent_sdk
3. AgentController.runTaskStream() filters thinking, tool_use, tool_result
   events in voice mode — only text, completed, error reach the client
4. Detailed logging: each event logged with [FILTERED-voice] tag when skipped

Claude API mode is completely unaffected (voiceMode defaults to false).
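The voice-mode filter reduces to a set-membership test; this Python mirror is for illustration only (the real logic lives in AgentController.runTaskStream):

```python
# Event types that should not reach the client in voice mode; only text,
# completed, and error pass through.
VOICE_SKIP = {"thinking", "tool_use", "tool_result"}

def filter_events(events: list, voice_mode: bool) -> list:
    if not voice_mode:
        return events  # text chat / Claude API mode is unaffected
    return [e for e in events if e["type"] not in VOICE_SKIP]
```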

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 02:56:41 -08:00
hailin 7c9fabd891 fix: avoid Agent SDK race on greeting + clear session on abort
1. Change on_enter greeting from generate_reply() to session.say() with
   a static message — avoids spawning an Agent SDK task just for a greeting,
   which caused a race condition when the user speaks before it completes.

2. Clear agent session ID when receiving abort/exit errors so the next
   task starts a fresh session instead of trying to resume a dead process.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 02:22:52 -08:00
hailin a78e2cd923 chore: add detailed engine type logging for verification
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 02:18:29 -08:00
hailin 59a3e60b82 feat: add engine type selection (Agent SDK / Claude API) for voice calls
Full-stack implementation allowing users to choose between Claude Agent SDK
(default, with tool approval, skill injection, session resume) and Claude API
(direct, lower latency) in Flutter settings. Agent SDK mode wraps prompts with
voice-conversation instructions for concise spoken Chinese output.

Data flow: Flutter Settings → SharedPreferences → POST /livekit/token →
RoomAgentDispatch metadata → voice-agent → AgentServiceLLM(engine_type)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 02:11:51 -08:00
hailin e66c187353 fix: improve voice pipeline robustness for poor network conditions
Flutter (agent_call_page.dart):
- Add ConnectOptions with 15s timeouts for connection/peerConnection/iceRestart
- Add RoomReconnectingEvent/RoomAttemptReconnectEvent/RoomReconnectedEvent
  listeners with a "网络重连中" ("reconnecting...") UI indicator during reconnection
- Add TimeoutException detection in _friendlyError()

voice-agent (agent.py):
- Wrap entrypoint() in try-except with full traceback logging
- Register room disconnect listener to close httpx clients (instead of
  finally block, since session.start() returns while session runs in bg)
- Add asyncio import for ensure_future cleanup

voice-agent LLM proxy (agent_llm.py):
- Add retry with exponential backoff (max 2 retries, 1s/3s delays) for
  network errors (ConnectError/ConnectTimeout/OSError) and WS InvalidStatusCode
- Extract _do_stream() method for single-attempt logic
- Add WebSocket connection params: open_timeout=10, ping_interval=20,
  ping_timeout=10 for keepalive and faster dead-connection detection
- Use granular httpx.Timeout(connect=10, read=30, write=10, pool=10)
- Increase WS recv timeout from 5s to 30s to reduce unnecessary loops
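The retry loop can be sketched as follows; `with_retries` is an illustrative name, and the real code also treats httpx.ConnectError/ConnectTimeout and the WS InvalidStatusCode as retryable:

```python
import asyncio

RETRY_DELAYS = [1.0, 3.0]  # seconds to wait before retry 1 and retry 2

async def with_retries(attempt_fn, delays=RETRY_DELAYS, retryable=(OSError,)):
    """Run attempt_fn, retrying on network-style errors with fixed backoff."""
    last_exc = None
    for i in range(len(delays) + 1):
        try:
            return await attempt_fn()
        except retryable as exc:
            last_exc = exc
            if i < len(delays):
                await asyncio.sleep(delays[i])  # back off before the next attempt
    raise last_exc
```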

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 23:34:55 -08:00
hailin e302891f16 fix: disable SSL verify for self-signed OpenAI proxy + handle no-user-msg
- Pass httpx.AsyncClient(verify=False) to OpenAI STT/TTS to support
  self-signed certificate on OPENAI_BASE_URL proxy
- Handle generate_reply calls with no user message by falling back to
  system/developer instructions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:39:49 -08:00
hailin 2112445191 fix: voice-agent crash — add room I/O options and filter AgentConfigUpdate
- Add room_input_options/room_output_options to session.start() so agent
  binds audio I/O and stays in the room
- Add wait_for_participant() before starting session
- Filter AgentConfigUpdate items in agent_llm.py (no 'role' attribute)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:08:07 -08:00
hailin 94a14b3104 feat: migrate voice call from WebSocket/PCM to LiveKit WebRTC
Real-time voice conversation architecture migration: WebSocket → LiveKit WebRTC

## Background
The original voice-call architecture transported raw PCM over a FastAPI
WebSocket, with the pipeline running serially (VAD → batch STT → Agent →
sentence buffering → batch TTS), yielding a first-audio latency of about
6 seconds. After migrating to the LiveKit Agents framework, WebRTC
transport plus pipeline parallelism is expected to cut latency to 1.5-2 seconds.

## Architecture
Flutter App ←── WebRTC (Opus/UDP) ──→ LiveKit Server ←──→ Voice Agent
  livekit_client                    (self-hosted, Go)    (Python, LiveKit Agents SDK)
                                                          ├─ VAD (Silero)
                                                          ├─ STT (faster-whisper / OpenAI)
                                                          ├─ LLM (custom plugin → agent-service)
                                                          └─ TTS (Kokoro / OpenAI)

Key design: the LLM does not call the Claude API directly; a custom
plugin proxies to the existing agent-service, preserving Tool Use,
conversation history, tenant isolation, and related capabilities.

## New services

### voice-agent (packages/services/voice-agent/)
LiveKit Agent Worker, containing:
- agent.py: entrypoint; prewarm() preloads models, entrypoint() orchestrates the session
- plugins/agent_llm.py: custom LLM plugin proxying the agent-service API
  - POST /api/v1/agent/tasks creates a task
  - WS /ws/agent subscribes to streamed events (stream_event)
  - reuses session_id across turns to keep conversation context
- plugins/whisper_stt.py: local faster-whisper STT (batch recognition)
- plugins/kokoro_tts.py: local Kokoro-82M TTS (24kHz PCM)
- config.py: pydantic-settings configuration

### LiveKit Server (deploy/docker/)
- livekit.yaml: signaling port 7880, RTC TCP 7881, UDP 50000-50200
- docker-compose.yml: adds livekit-server + voice-agent containers

### LiveKit token endpoint
- voice-service/src/api/livekit_token.py:
  POST /api/v1/voice/livekit/token
  generates a Room JWT and embeds auth_header into the AgentDispatch metadata

## Flutter client changes
- agent_call_page.dart: simplified from ~814 lines to ~380 lines
  - Removed: WebSocketChannel, AudioRecorder, PcmPlayer, manual heartbeat/reconnect
  - Now uses: Room.connect(), setMicrophoneEnabled(true), LiveKit event listeners
  - Waveform animation now driven by participant.audioLevel
- pubspec.yaml: add livekit_client: ^2.3.0
- app_config.dart: add a livekitUrl field
- api_endpoints.dart: add the livekitToken endpoint

## Configuration (environment variables)
- STT_PROVIDER: local (default, faster-whisper) / openai
- TTS_PROVIDER: local (default, Kokoro) / openai
- WHISPER_MODEL: base (default) / small / medium / large
- WHISPER_LANGUAGE: zh (default)
- KOKORO_VOICE: zf_xiaoxiao (default)
- DEVICE: cpu (default) / cuda
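The variables above can be read into a config object; this dataclass sketch stands in for the service's pydantic-settings version without the dependency:

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str):
    # Read the variable at instantiation time so tests can override it.
    return field(default_factory=lambda: os.getenv(name, default))

@dataclass
class VoiceAgentConfig:
    """Defaults mirror the environment variables documented above."""
    stt_provider: str = _env("STT_PROVIDER", "local")
    tts_provider: str = _env("TTS_PROVIDER", "local")
    whisper_model: str = _env("WHISPER_MODEL", "base")
    whisper_language: str = _env("WHISPER_LANGUAGE", "zh")
    kokoro_voice: str = _env("KOKORO_VOICE", "zf_xiaoxiao")
    device: str = _env("DEVICE", "cpu")
```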

## Unchanged
- agent-service: untouched; voice-agent calls it through the existing API
- voice-service core: pipeline/STT/TTS/VAD retained (Twilio fallback)
- Kong gateway: existing routes unchanged
- database: no schema changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 08:55:33 -08:00