Commit Graph

15 Commits

Author SHA1 Message Date
hailin 5be7f9c078 fix: resample OpenAI TTS output from 24kHz to 16kHz WAV
OpenAI TTS returns 24kHz audio which Android MediaPlayer can't play
via FlutterSound's pcm16WAV codec. Request raw PCM and resample to
16kHz before wrapping in WAV header, matching the local TTS format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 05:38:39 -08:00
hailin 4456550393 feat: lazy-load local TTS/STT models on first request
Local /synthesize and /transcribe endpoints now auto-load Kokoro/Whisper
models on first call instead of returning 503 when not pre-loaded at
startup. This allows switching between Local and OpenAI providers in the
Flutter test page without requiring server restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 04:38:49 -08:00
hailin c02c2a9a11 feat: add OpenAI TTS/STT provider support in voice pipeline
- Add STT_PROVIDER/TTS_PROVIDER config (local or openai) in settings
- Pipeline uses OpenAI API for STT/TTS when provider is "openai"
- Skip loading local models (Kokoro/faster-whisper) when using OpenAI
- VAD (Silero) always loads for speech detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 09:27:38 -08:00
hailin f8f0d17820 fix: disable SSL verification for OpenAI proxy with self-signed cert
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 08:59:06 -08:00
hailin d43baed3a5 feat: add OpenAI TTS/STT API endpoints for comparison testing
- Add openai package to voice-service requirements
- Add /api/v1/test/tts/synthesize-openai (tts-1/tts-1-hd/gpt-4o-mini-tts)
- Add /api/v1/test/stt/transcribe-openai (gpt-4o-transcribe/whisper-1)
- Add OPENAI_API_KEY and OPENAI_BASE_URL env vars to voice-service
- Flutter test page: SegmentedButton to toggle Local/OpenAI provider
- All endpoints maintain same response format for easy comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 07:20:03 -08:00
hailin 0bd050c80f feat: add STT test and round-trip test to voice test page
- STT: record from mic or upload audio file → faster-whisper transcription
- Round-trip: record → STT → TTS → playback (full pipeline test)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 05:08:00 -08:00
hailin 0aa20cbc73 feat: add temporary TTS test page at /api/v1/test/tts
Browser-accessible page to test text-to-speech synthesis without
going through the full voice pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 05:06:02 -08:00
hailin abf5e29419 feat: route voice pipeline through agent-service instead of direct LLM
Voice calls now use the same agent task + WS subscription flow as the
chat UI, enabling tool use and command execution during voice sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:47:31 -08:00
hailin 7afbd54fce fix: rewrite voice pipeline for direct WebSocket I/O, fix TTS and navigation
Root cause: Pipecat's WebsocketServerTransport creates its own WebSocket
server on (host,port) and expects FrameProcessor subclasses. Our code was
passing a FastAPI WebSocket object as 'host' and using plain STT/TTS/VAD
service classes that aren't FrameProcessors. The pipeline crashed immediately
when receiving audio, causing "disconnects when speaking".

Changes:
- **base_pipeline.py**: Complete rewrite — replaced Pipecat Pipeline with
  direct async loop: WebSocket → VAD → STT → Claude LLM → TTS → WebSocket.
  Supports barge-in (interrupt TTS when user speaks), audio chunking, and
  24kHz→16kHz TTS resampling.
- **session_router.py**: Pass WebSocket directly to pipeline instead of
  wrapping in AppTransport.
- **app_transport.py**: Deprecated (no longer needed).
- **kokoro_service.py**: Fix misaki compatibility (MutableToken→MToken
  rename), use correct Chinese voice 'zf_xiaoxiao', handle torch tensors.
- **main.py**: Apply misaki monkey-patch before importing kokoro.
- **settings.py**: Change default TTS voice from 'zh_female_1' (non-existent)
  to 'zf_xiaoxiao' (valid Kokoro-82M Chinese female voice).
- **requirements.txt**: Remove pipecat-ai dependency, pin kokoro==0.3.5 +
  misaki==0.7.17, add Chinese NLP deps (pypinyin, cn2an, jieba, ordered-set).
- **agent_call_page.dart**: Wrap each cleanup step in try/catch to ensure
  Navigator.pop() always executes after call ends. Add 3s timeout on session
  delete request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 23:34:35 -08:00
hailin 6cd53e713c fix: bypass JWT for voice WebSocket route (fixes 401 on WS upgrade)
根因:Kong 日志显示 voice WebSocket 连接被 JWT 插件返回 401,
因为 WebSocket RFC 6455 不支持自定义 header,Flutter 的
WebSocketChannel.connect 无法携带 Authorization header。

修复策略(业界标准做法):
1. Kong: 将 voice-service 的 JWT 从 service 级别改为 route
   级别,仅在 voice-api 和 twilio-webhook 路由启用 JWT,
   voice-ws 路由免除(session 创建已通过 JWT 验证,
   session_id 本身作为认证凭据)
2. 后端: session_router 返回的 websocket_url 改为
   /ws/voice/{session_id}(匹配 Kong voice-ws 路由路径)
3. FastAPI: 在 app 级别增加 /ws/voice/{session_id} 顶级
   WebSocket 路由,委托给 session_router 的 handler

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 21:30:11 -08:00
hailin a6cd3c20d9 feat: add WebSocket robustness to voice call (heartbeat, reconnect, jitter buffer)
Addresses reliability gaps in the real-time voice WebSocket connection
between Flutter client and Python voice-service backend.

Backend (voice-service):
- Heartbeat: new _heartbeat_sender coroutine sends JSON ping text frames
  every 15s alongside the Pipecat pipeline; failed send = dead connection
- Session preservation: on WebSocket disconnect, sessions are now marked
  "disconnected" with a timestamp instead of being deleted, allowing
  reconnection within a configurable TTL (default 60s)
- Reconnect endpoint: POST /sessions/{id}/reconnect verifies the session
  is alive and in "disconnected" state, returns fresh websocket_url
- Reconnect-aware WS handler: detects "disconnected" sessions, cancels
  stale pipeline tasks, creates a new pipeline, sends "session.resumed"
- Background cleanup: asyncio loop every 30s removes sessions that have
  been disconnected longer than session_ttl
- Structured event protocol: text frames = JSON control messages
  (ping/pong/session.resumed/session.ended/error), binary = PCM audio
- New settings: session_ttl (60s), heartbeat_interval (15s),
  heartbeat_timeout (45s)

Flutter (agent_call_page.dart):
- Heartbeat monitoring: tracks last server ping timestamp, triggers
  reconnect if no ping received in 45s (3 missed intervals)
- Auto-reconnect: exponential backoff (1s→2s→4s→8s→16s), max 5 attempts;
  calls /reconnect endpoint to verify session, rebuilds WebSocket,
  resets audio buffer, restarts heartbeat
- Reconnecting UI: yellow warning banner "重新连接中... (N/5)" with
  spinner overlay during reconnection attempts
- WebSocket data routing: _onWsData distinguishes String (JSON control)
  from binary (audio) frames, handles ping/session.resumed/session.ended
- User-initiated disconnect guard: _userEndedCall flag prevents reconnect
  attempts when user intentionally hangs up
- session_id field compatibility: supports session_id/sessionId/id

Flutter (pcm_player.dart):
- Jitter buffer: queues incoming PCM chunks, starts playback only after
  accumulating 4800 bytes (150ms at 16kHz 16-bit mono) to smooth out
  network timing variance
- reset() method: clears buffer on reconnect to discard stale audio
- Buffer underrun handling: re-enters buffering phase if queue empties

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 07:32:19 -08:00
hailin a568558585 feat: replace speech_to_text with GTCRN ML noise reduction + backend STT
Replace traditional on-device speech_to_text with a modern pipeline:
- Record audio via `record` package with hardware noise suppression
- Apply GTCRN neural denoising (sherpa-onnx, ICASSP 2024, 48K params)
- Trim silence, POST to backend /voice/transcribe (faster-whisper)

Changes:
- Add /transcribe endpoint to voice-service for audio file upload
- Add SpeechEnhancer wrapper for sherpa-onnx GTCRN model (523KB)
- Rewrite chat_page.dart voice input: record → denoise → transcribe
- Keep NoiseReducer.trimSilence for silence removal only
- Upgrade record to v6.2.0, add sherpa_onnx, path_provider

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 07:59:15 -08:00
hailin a06b489a1e fix: load voice models in background thread to unblock startup
Model downloads (Whisper, Kokoro, Silero VAD) are synchronous blocking
calls that prevent uvicorn from completing startup and responding to
healthchecks. Move all model loading to a daemon thread so the server
starts immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 00:26:06 -08:00
hailin 3702fa3f52 fix: make voice-service startup graceful and fix device config
- Wrap model loading in try/except so server starts even if models fail
- Fix device env var mapping (unified 'device' field instead of 'whisper_device')
- Default Whisper model to 'base' instead of 'large-v3' (3GB) for CPU deployment
- Increase healthcheck start_period to 120s for model download time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 00:20:12 -08:00
hailin 00f8801d51 Initial commit: IT0 AI-powered server cluster operations platform
Full-stack monorepo with DDD + Clean Architecture:
- Backend: 7 NestJS microservices + 5 shared libraries (TypeScript)
- Mobile: Flutter app with Riverpod (Dart)
- Web Admin: Next.js dashboard with Zustand + React Query
- Voice: Python voice service (STT/TTS/VAD)
- Infra: Docker Compose, K8s manifests, Turborepo build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 22:54:37 -08:00