Log timestamps, content, and event details at each pipeline stage
to help diagnose voice-agent integration issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The engine stream could emit text events before the voice pipeline
subscribed, causing all text to be lost. Now we connect and subscribe
first, then POST the task.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Voice calls now use the same agent task + WS subscription flow as the
chat UI, enabling tool use and command execution during voice sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Pipecat's WebsocketServerTransport creates its own WebSocket
server on (host,port) and expects FrameProcessor subclasses. Our code was
passing a FastAPI WebSocket object as 'host' and using plain STT/TTS/VAD
service classes that aren't FrameProcessors. The pipeline crashed immediately
when receiving audio, causing "disconnects when speaking".
Changes:
- **base_pipeline.py**: Complete rewrite — replaced Pipecat Pipeline with
direct async loop: WebSocket → VAD → STT → Claude LLM → TTS → WebSocket.
Supports barge-in (interrupt TTS when user speaks), audio chunking, and
24kHz→16kHz TTS resampling.
- **session_router.py**: Pass WebSocket directly to pipeline instead of
wrapping in AppTransport.
- **app_transport.py**: Deprecated (no longer needed).
- **kokoro_service.py**: Fix misaki compatibility (MutableToken→MToken
rename), use correct Chinese voice 'zf_xiaoxiao', handle torch tensors.
- **main.py**: Apply misaki monkey-patch before importing kokoro.
- **settings.py**: Change default TTS voice from 'zh_female_1' (non-existent)
to 'zf_xiaoxiao' (valid Kokoro-82M Chinese female voice).
- **requirements.txt**: Remove pipecat-ai dependency, pin kokoro==0.3.5 +
misaki==0.7.17, add Chinese NLP deps (pypinyin, cn2an, jieba, ordered-set).
- **agent_call_page.dart**: Wrap each cleanup step in try/catch to ensure
Navigator.pop() always executes after call ends. Add 3s timeout on session
delete request.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Required by FastAPI for form/file upload parsing. Missing dependency
may cause import errors and container restart loops.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses reliability gaps in the real-time voice WebSocket connection
between Flutter client and Python voice-service backend.
Backend (voice-service):
- Heartbeat: new _heartbeat_sender coroutine sends JSON ping text frames
every 15s alongside the Pipecat pipeline; failed send = dead connection
- Session preservation: on WebSocket disconnect, sessions are now marked
"disconnected" with a timestamp instead of being deleted, allowing
reconnection within a configurable TTL (default 60s)
- Reconnect endpoint: POST /sessions/{id}/reconnect verifies the session
is alive and in "disconnected" state, returns fresh websocket_url
- Reconnect-aware WS handler: detects "disconnected" sessions, cancels
stale pipeline tasks, creates a new pipeline, sends "session.resumed"
- Background cleanup: asyncio loop every 30s removes sessions that have
been disconnected longer than session_ttl
- Structured event protocol: text frames = JSON control messages
(ping/pong/session.resumed/session.ended/error), binary = PCM audio
- New settings: session_ttl (60s), heartbeat_interval (15s),
heartbeat_timeout (45s)
Flutter (agent_call_page.dart):
- Heartbeat monitoring: tracks last server ping timestamp, triggers
reconnect if no ping received in 45s (3 missed intervals)
- Auto-reconnect: exponential backoff (1s→2s→4s→8s→16s), max 5 attempts;
calls /reconnect endpoint to verify session, rebuilds WebSocket,
resets audio buffer, restarts heartbeat
- Reconnecting UI: yellow warning banner "重新连接中... (N/5)" with
spinner overlay during reconnection attempts
- WebSocket data routing: _onWsData distinguishes String (JSON control)
from binary (audio) frames, handles ping/session.resumed/session.ended
- User-initiated disconnect guard: _userEndedCall flag prevents reconnect
attempts when user intentionally hangs up
- session_id field compatibility: supports session_id/sessionId/id
Flutter (pcm_player.dart):
- Jitter buffer: queues incoming PCM chunks, starts playback only after
accumulating 4800 bytes (150ms at 16kHz 16-bit mono) to smooth out
network timing variance
- reset() method: clears buffer on reconnect to discard stale audio
- Buffer underrun handling: re-enters buffering phase if queue empties
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace traditional on-device speech_to_text with a modern pipeline:
- Record audio via `record` package with hardware noise suppression
- Apply GTCRN neural denoising (sherpa-onnx, ICASSP 2024, 48K params)
- Trim silence, POST to backend /voice/transcribe (faster-whisper)
Changes:
- Add /transcribe endpoint to voice-service for audio file upload
- Add SpeechEnhancer wrapper for sherpa-onnx GTCRN model (523KB)
- Rewrite chat_page.dart voice input: record → denoise → transcribe
- Keep NoiseReducer.trimSilence for silence removal only
- Upgrade record to v6.2.0, add sherpa_onnx, path_provider
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Model downloads (Whisper, Kokoro, Silero VAD) are synchronous blocking
calls that prevent uvicorn from completing startup and responding to
healthchecks. Move all model loading to a daemon thread so the server
starts immediately.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap model loading in try/except so server starts even if models fail
- Fix device env var mapping (unified 'device' field instead of 'whisper_device')
- Default Whisper model to 'base' instead of 'large-v3' (3GB) for CPU deployment
- Increase healthcheck start_period to 120s for model download time
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Dockerfile.service: fix entry point path (dist/services/{name}/src/main)
due to tsconfig paths widening rootDir during compilation
- Kong config: remove unsupported ws/wss protocols (WebSocket works
automatically over http/https in Kong 3.7)
- voice-service: fix pipecat import path for v0.0.30 API
(pipecat.transports.network.websocket_server with lowercase class names)
- voice-service: add openai dependency required by pipecat anthropic service
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
faster-whisper 1.0.0 depends on av==11.* which has no prebuilt wheels
and fails to compile. Version 1.2.1 uses av 12+ with prebuilt wheels.
Also removed unnecessary FFmpeg dev libraries from Dockerfile.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PyAV (av==11, dep of faster-whisper) requires pkg-config and
FFmpeg development headers to compile from source.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server is on HK network, no need for China mirrors. Added
build-essential for compiling native Python packages (kokoro, etc).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
web-admin npm ci was timing out on the server. Added npmmirror.com
for npm and tsinghua mirror for pip to resolve network issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>