Document all findings from the integration process directly in the
source code for future reference:
1. Language code mapping: Speechmatics uses the ISO 639-3 code "cmn" for
   Mandarin, but LiveKit's LanguageCode auto-normalizes it to "zh".
   stt._stt_options.language must be overridden after construction.
2. Turn detection modes (critical):
   - EXTERNAL: unusable. LiveKit never sends FlushSentinel and only
     pushes silence frames, so FINAL_TRANSCRIPT never arrives.
   - ADAPTIVE: unusable. The client-side Silero VAD conflicts with
     LiveKit's own VAD and produces zero transcription output.
   - SMART_TURN: correct choice. Server-side intelligent turn
     detection auto-emits FINAL_TRANSCRIPT and is fully compatible.
3. Speaker diarization: the is_active flag distinguishes the primary
   speaker from TTS echo, solving the "speaker confusion" problem.
4. Docker deployment: set SPEECHMATICS_API_KEY in .env, and watch for a
   stale cached COPY layer when rebuilding the image.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace EXTERNAL mode + monkey-patch hack with SMART_TURN mode.
SMART_TURN uses Speechmatics server-side turn detection that properly
emits AddSegment (FINAL_TRANSCRIPT) when the user finishes speaking.
No client-side finalize or debounce timer needed.
Ref: https://docs.speechmatics.com/integrations-and-sdks/livekit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Speechmatics re-sends identical partial segments during silence, causing
the debounce timer to fire multiple times with the same text. Each
duplicate FINAL aborts the in-flight LLM request and restarts it.
Replace the time-based cooldown with text comparison: skip finalization
if the segment text matches the last finalized text. Also skip starting
new timers when the partial text is unchanged from the last finalized text.
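
A minimal sketch of the text-comparison guard described above. The class
and method names are hypothetical and are not taken from the plugin or
from agent code; only the idea (compare against the last finalized text)
comes from this commit.

```python
class FinalDeduper:
    """Suppress repeated finalization of identical segment text."""

    def __init__(self) -> None:
        self._last_final = None  # text of the most recently emitted FINAL

    def should_start_timer(self, partial_text: str) -> bool:
        # A re-sent partial that matches the last FINAL carries no new speech,
        # so there is no reason to arm the debounce timer for it.
        text = partial_text.strip()
        return bool(text) and text != self._last_final

    def finalize(self, text: str):
        # Return the text to emit as FINAL, or None when it is a duplicate.
        text = text.strip()
        if not text or text == self._last_final:
            return None
        self._last_final = text
        return text
```

Comparing text rather than timestamps means a long silence cannot let the
same segment through a second time, which a fixed cooldown window could.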
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce debounce delay from 700ms to 400ms for faster response
- Add 1.5s cooldown after emitting FINAL to prevent duplicate triggers
that cause LLM abort/retry cycles
- Enable speaker diarization (enable_diarization=True)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LiveKit framework never sends FlushSentinel to the STT stream.
Instead it pushes silence frames and waits for FINAL_TRANSCRIPT events.
In EXTERNAL turn-detection mode, Speechmatics only emits partials.
New approach: each partial transcript restarts a 700ms debounce timer.
When partials stop (user stops speaking), the timer fires and promotes
the last partial to FINAL_TRANSCRIPT, unblocking the pipeline.
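
The debounce approach can be sketched with plain asyncio as follows. This
is an illustration under assumptions: PartialDebouncer and its callback
are hypothetical names, and the demo uses a short delay instead of 700ms
so it finishes quickly.

```python
import asyncio


class PartialDebouncer:
    """Promote the last partial to a final once partials stop arriving."""

    def __init__(self, delay: float, on_final) -> None:
        self._delay = delay
        self._on_final = on_final  # callback that receives the promoted text
        self._timer = None
        self._last_partial = ""

    def on_partial(self, text: str) -> None:
        # Every partial restarts the timer, so it only fires after silence.
        self._last_partial = text
        if self._timer is not None:
            self._timer.cancel()
        loop = asyncio.get_running_loop()
        self._timer = loop.call_later(self._delay, self._fire)

    def _fire(self) -> None:
        self._timer = None
        if self._last_partial:
            self._on_final(self._last_partial)
            self._last_partial = ""


async def demo() -> list:
    finals = []
    debouncer = PartialDebouncer(delay=0.05, on_final=finals.append)
    for text in ("turn", "turn on", "turn on the light"):
        debouncer.on_partial(text)
        await asyncio.sleep(0.01)  # partials arrive faster than the delay
    await asyncio.sleep(0.1)       # silence: the timer fires once
    return finals
```

Running `asyncio.run(demo())` yields a single final containing the last
partial, because the two earlier timers are cancelled before they fire.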
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Trace _patched_process_audio lifecycle and FlushSentinel handling
to diagnose why final transcripts are not being promoted.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VoiceAgentClient.finalize() schedules an async task chain that often
loses the race against session teardown. Instead, intercept partial
segments as they arrive, stash them, and synchronously emit them as
FINAL_TRANSCRIPT when FlushSentinel fires.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the SpeechStream._process_audio patch from container runtime
into our own source code so it survives Docker rebuilds. The patch
adds client.finalize() on FlushSentinel so EXTERNAL mode produces
final transcripts when LiveKit's VAD detects end of speech.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EXTERNAL mode produces partial transcripts but livekit-plugins-speechmatics
does not call finalize() when receiving a flush sentinel from the framework.
A runtime monkey-patch on the plugin's SpeechStream._process_audio adds the
missing finalize() call so final transcripts are generated.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Speechmatics handles end-of-utterance natively via its Voice Agent
API (ADAPTIVE mode). Use turn_detection="stt" on AgentSession so
LiveKit delegates turn boundaries to the STT engine instead of
conflicting with its own VAD-based turn detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ADAPTIVE mode enables a second client-side Silero VAD inside the
Speechmatics SDK that conflicts with LiveKit's own VAD pipeline,
causing no transcription to be returned. EXTERNAL mode delegates
turn detection to LiveKit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LiveKit's LanguageCode class normalizes ISO 639-3 codes to ISO 639-1
(cmn → zh), but Speechmatics API expects "cmn" not "zh". Override
the internal _stt_options.language after construction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Problem:
- Text input area caused BOTTOM OVERFLOWED BY 135 PIXELS when keyboard opened
- Input bar overlapped with call control buttons
- Sent messages were not displayed on screen (only SnackBar feedback)
Solution — split into two distinct layouts:
1. Call Mode (default):
- Full-screen call UI: avatar, waveform, duration, large control buttons
- Keyboard button in controls toggles to chat mode
- No text input elements — clean voice-only interface
2. Chat Mode (tap keyboard button):
- Compact call header: green status dot + "iAgent" + duration + inline
mute/end/speaker/collapse controls
- Scrollable message list (Expanded widget — properly handles keyboard)
- User messages: right-aligned blue bubbles with attachment thumbnails
- Agent responses: left-aligned gray bubbles with robot avatar
- Input bar at bottom: attachment picker + text field + send button
Message display:
- User-sent text/attachments tracked in _messages list, shown as bubbles
- Agent responses sent back via LiveKit data channel (topic='text_reply')
from voice-agent → Flutter, displayed as assistant bubbles
- Auto-scroll to latest message
Voice-agent change (agent.py):
- After session.say(response), publish response text back to Flutter via
ctx.room.local_participant.publish_data() with topic='text_reply'
- Flutter listens for DataReceivedEvent to display agent responses
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable users to send text messages, images, and files to the Agent
while an active voice call is in progress. This addresses the case
where spoken instructions are unclear or screenshots/documents need
to be shared for analysis.
## Architecture
Data flows through LiveKit data channel (not direct HTTP):
Flutter → publishData(topic='text_inject') → voice-agent
→ llm.inject_text_message() → POST /api/v1/agent/tasks (same session)
→ collect streamed response → session.say() → TTS playback
This preserves the constraint that voice-agent owns the agent-service
sessionId — Flutter never contacts agent-service directly.
## Flutter UI (agent_call_page.dart)
- Add keyboard toggle button to active call controls (4-button row)
- Collapsible text input area with attachment picker (+) and send button
- Attachment support: gallery multi-select, camera, file picker
(images max 1024x1024 quality 80%, PDF supported, max 5 attachments)
- Horizontal scrolling attachment preview with delete buttons
- 200KB payload size check before LiveKit data channel send
- Layout adapts: Spacer flex 1/3 toggle, reduced bottom padding
## voice-agent (agent.py)
- Register data_received event listener after session.start()
- Filter for topic='text_inject', parse JSON payload
- Call llm.inject_text_message(text, attachments) and TTS via session.say()
- Use asyncio.ensure_future() wrapper for async handler (matches
existing disconnect handler pattern for sync EventEmitter)
## AgentServiceLLM (agent_llm.py)
- New inject_text_message(text, attachments) method on AgentServiceLLM
- Reuses same _agent_session_id for conversation context continuity
- WS+HTTP streaming: connect, pre-subscribe, POST /tasks with
attachments field, collect full text response, return string
- _injecting flag prevents concurrent _do_stream from clearing
session ID on abort errors while inject is in progress
- Same systemPrompt/voiceMode/engineType as voice pipeline
No agent-service changes required — attachments already supported
end-to-end (JSONB storage → multimodal content blocks → Claude).
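
A hedged sketch of the voice-agent side filtering step (topic check plus
JSON parse). The payload field names "text" and "attachments", the helper
name, and the byte-based 200KB limit are assumptions for illustration;
the actual handler in agent.py may differ.

```python
import json

MAX_PAYLOAD_BYTES = 200 * 1024  # assumed to mirror the Flutter-side 200KB check


def parse_text_inject(topic: str, data: bytes):
    """Return (text, attachments) for a text_inject packet, else None."""
    if topic != "text_inject" or len(data) > MAX_PAYLOAD_BYTES:
        return None
    try:
        payload = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # ignore malformed packets instead of crashing the handler
    if not isinstance(payload, dict):
        return None
    text = payload.get("text", "")
    attachments = payload.get("attachments") or []
    if not text and not attachments:
        return None  # nothing to inject
    return text, attachments
```

Returning None for anything unexpected keeps the synchronous event
callback cheap; only valid packets would be handed to the async inject
path.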
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, voice mode wrapped every user message with 【语音对话模式】
("voice conversation mode") instructions, polluting the
conversation_messages history with repeated instructions on every turn. Now:
- systemPrompt carries voice-mode instructions (set once, not per-message)
- prompt contains only the clean user text (identical to text chat pattern)
- Conversation history stays clean for multi-turn context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Remove on_enter greeting entirely (no more race condition)
2. voice-agent sends voiceMode: true when engine_type is claude_agent_sdk
3. AgentController.runTaskStream() filters thinking, tool_use, tool_result
events in voice mode — only text, completed, error reach the client
4. Detailed logging: each event logged with [FILTERED-voice] tag when skipped
Claude API mode is completely unaffected (voiceMode defaults to false).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Change on_enter greeting from generate_reply() to session.say() with
a static message — avoids spawning an Agent SDK task just for a greeting,
which caused a race condition when the user speaks before it completes.
2. Clear agent session ID when receiving abort/exit errors so the next
task starts a fresh session instead of trying to resume a dead process.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full-stack implementation allowing users to choose between Claude Agent SDK
(default, with tool approval, skill injection, session resume) and Claude API
(direct, lower latency) in Flutter settings. Agent SDK mode wraps prompts with
voice-conversation instructions for concise spoken Chinese output.
Data flow: Flutter Settings → SharedPreferences → POST /livekit/token →
RoomAgentDispatch metadata → voice-agent → AgentServiceLLM(engine_type)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Flutter (agent_call_page.dart):
- Add ConnectOptions with 15s timeouts for connection/peerConnection/iceRestart
- Add RoomReconnectingEvent/RoomAttemptReconnectEvent/RoomReconnectedEvent
listeners with a "网络重连中" ("Network reconnecting") UI indicator during reconnection
- Add TimeoutException detection in _friendlyError()
voice-agent (agent.py):
- Wrap entrypoint() in try-except with full traceback logging
- Register room disconnect listener to close httpx clients (instead of
finally block, since session.start() returns while session runs in bg)
- Add asyncio import for ensure_future cleanup
voice-agent LLM proxy (agent_llm.py):
- Add retry with exponential backoff (max 2 retries, 1s/3s delays) for
network errors (ConnectError/ConnectTimeout/OSError) and WS InvalidStatusCode
- Extract _do_stream() method for single-attempt logic
- Add WebSocket connection params: open_timeout=10, ping_interval=20,
ping_timeout=10 for keepalive and faster dead-connection detection
- Use granular httpx.Timeout(connect=10, read=30, write=10, pool=10)
- Increase WS recv timeout from 5s to 30s to reduce unnecessary loops
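
The retry policy (one attempt plus at most two retries, with 1s and 3s
delays) can be sketched as below. The function name and the injectable
sleep parameter are hypothetical. The default retryable tuple covers only
OSError; the httpx and websockets exception types named above would need
to be added to it.

```python
import asyncio

RETRY_DELAYS = (1.0, 3.0)  # seconds before retry 1 and retry 2


async def call_with_retry(attempt, retryable=(OSError,), delays=RETRY_DELAYS,
                          sleep=asyncio.sleep):
    """Await attempt(); on a retryable error, wait and try again.

    Makes len(delays) + 1 attempts in total and re-raises the last error.
    """
    for delay in (*delays, None):
        try:
            return await attempt()
        except retryable:
            if delay is None:
                raise  # retries exhausted
            await sleep(delay)
```

Keeping the single-attempt logic in its own coroutine (as _do_stream()
does) is what makes a wrapper like this possible: the wrapper only needs
something awaitable that either returns or raises.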
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default silence_duration_ms=350 is too aggressive for Chinese speech,
causing sentences to be fragmented into 1-3 character chunks. Increase
to 800ms and raise VAD threshold to 0.6 so the STT waits longer before
finalizing a turn, producing complete sentences for LLM processing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The OpenAI Realtime STT uses aiohttp WebSocket connections (not httpx),
so the existing httpx verify=False fix does not apply. LiveKit's
http_context creates aiohttp.TCPConnector without ssl=False, causing
SSL certificate verification errors when OPENAI_BASE_URL points to a
proxy with a self-signed certificate.
Monkey-patch http_context._new_session_ctx to inject ssl=False into the
aiohttp connector, fixing the "CERTIFICATE_VERIFY_FAILED" error for
Realtime STT WebSocket connections.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add user-configurable TTS voice and tone style settings that flow from
the Flutter app through the backend to the voice-agent at call time.
## Flutter App (it0_app)
### Domain Layer
- app_settings.dart: Add `ttsVoice` (default: 'coral') and `ttsStyle`
(default: '') fields to AppSettings entity with copyWith support
### Data Layer
- settings_datasource.dart: Add SharedPreferences keys
`settings_tts_voice` and `settings_tts_style` for local persistence
in loadSettings(), saveSettings(), and clearSettings()
### Presentation Layer
- settings_providers.dart: Add `setTtsVoice()` and `setTtsStyle()`
methods to SettingsNotifier for Riverpod state management
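- settings_page.dart: Add "语音" ("Voice") settings group between Notifications
and Security groups with:
- Voice picker: 13 OpenAI voices with gender/style labels
(e.g. "女 · 温暖" (female, warm), "男 · 沉稳" (male, steady), "中性" (neutral))
in a BottomSheet
- Style picker: 5 presets (专业干练/温柔耐心/轻松活泼/严肃正式/科幻AI, i.e.
professional and efficient / gentle and patient / relaxed and lively /
serious and formal / sci-fi AI) as ChoiceChips + custom text input field +
reset button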
### Call Flow
- agent_call_page.dart: Send `tts_voice` and `tts_style` in the POST
body when requesting a LiveKit token at call initiation
## Backend
### voice-service (Python/FastAPI)
- livekit_token.py: Accept optional `tts_voice` and `tts_style` via
Pydantic TokenRequest body model; embed them in RoomAgentDispatch
metadata JSON alongside auth_header (backward compatible)
### voice-agent (Python/LiveKit Agents)
- agent.py: Extract `tts_voice` and `tts_style` from ctx.job.metadata;
use them when creating openai_plugin.TTS() — user-selected voice
overrides config default, user-selected style overrides default
instructions. Falls back to config defaults when not provided.
## Data Flow
Flutter Settings → SharedPreferences → POST /livekit/token body →
voice-service embeds in RoomAgentDispatch metadata →
voice-agent reads from ctx.job.metadata → TTS creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch from tts-1 to gpt-4o-mini-tts for lower latency and better quality
- Change voice from alloy to coral
- Add Chinese speech instructions for natural tone control
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from batch STT (gpt-4o-transcribe via /audio/transcriptions)
to streaming Realtime API (WebSocket). This eliminates the ~2s batch
upload+process latency per utterance.
Also updated nginx proxy on 67.223.119.33 to support WebSocket upgrade
for /v1/realtime endpoint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pass httpx.AsyncClient(verify=False) to OpenAI STT/TTS to support
self-signed certificate on OPENAI_BASE_URL proxy
- Handle generate_reply calls with no user message by falling back to
system/developer instructions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In livekit-agents v1.x @server.rtc_session() pattern, ctx.room is not
yet connected when entrypoint is called. session.start() handles room
connection internally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add room_input_options/room_output_options to session.start() so agent
binds audio I/O and stays in the room
- Add wait_for_participant() before starting session
- Filter AgentConfigUpdate items in agent_llm.py (no 'role' attribute)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace deprecated WorkerOptions(entrypoint_fnc=...) with AgentServer() +
@server.rtc_session() decorator. Use server.setup_fnc for prewarm. Remove
manual ctx.connect() and ctx.wait_for_participant() calls that prevented
the pipeline from properly wiring up VAD→STT→LLM→TTS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RoomInputOptions is deprecated in livekit-agents 1.4.x. Switch to
RoomOptions with explicit audio_input/audio_output enabled.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LiveKit passes RoomAgentDispatch metadata through as job.metadata
(protobuf field), not via a separate agent_dispatch object. Also
use room_io.RoomInputOptions for participant targeting (livekit-agents 1.x).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
livekit-agents 1.x removed the 'participant' parameter from
AgentSession.start(). Use room_input_options with participant_identity
instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The livekit package is the client SDK and doesn't include the server-side
API module. Switch to livekit-api which provides AccessToken, VideoGrants,
RoomAgentDispatch, and RoomConfiguration needed for token generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Upgrade websockets from ==12.0 to >=13.0 (openai[realtime] requires >=13)
- Install torch CPU-only build separately in Dockerfile to avoid ~2GB CUDA download
- Remove torch from requirements.txt (installed via --index-url cpu wheel)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude API supports PDFs up to 32MB; base64 encoding adds ~33% overhead,
so the largest encoded document is about 43MB. A 50mb body limit covers the
maximum single-document upload case.
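
The sizing can be checked with a few lines of arithmetic. This sketch
assumes the limit string "50mb" is read as 50 * 1024 * 1024 bytes and
ignores the small JSON envelope around the base64 string.

```python
MAX_PDF_BYTES = 32 * 1024 * 1024     # largest PDF accepted, per the note above
BODY_LIMIT_BYTES = 50 * 1024 * 1024  # assumed byte value of the "50mb" limit


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    # Every 3 input bytes become 4 output characters, rounded up.
    return 4 * ((raw_bytes + 2) // 3)


encoded_bytes = base64_size(MAX_PDF_BYTES)
headroom_bytes = BODY_LIMIT_BYTES - encoded_bytes
```

Here encoded_bytes comes to 44,739,244 (about 42.7MB), which leaves
roughly 7MB of headroom for the rest of the request body.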
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PDF files were incorrectly wrapped as type:'image' content blocks,
causing Claude API to reject them as "Invalid image data".
- conversation-context.service: check mediaType for application/pdf,
use type:'document' block (Anthropic native PDF support) instead
- claude-agent-sdk-engine: detect both 'image' and 'document' blocks
when deciding to build multimodal SDK prompt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The direct `import * as express from 'express'` caused a
MODULE_NOT_FOUND error in the Docker production image since express
is only available as a transitive dependency via @nestjs/platform-express.
Use NestExpressApplication.useBodyParser() which is the official NestJS API.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- SDK engine now constructs AsyncIterable<SDKUserMessage> with image
content blocks when attachments are present in conversationHistory,
using the SDK's native multimodal prompt format
- CLI engine logs a warning when images are detected, since the `-p`
flag only accepts text (upstream Claude CLI limitation)
- Both SDK and API engines now fully support multimodal image input
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two major features in this commit:
1. Streaming Markdown Rendering Optimization
- Replace deprecated flutter_markdown with gpt_markdown (active, AI-optimized)
- Real-time markdown rendering during streaming (was showing raw syntax)
- Solid block cursor (█) instead of AnimationController blink
- 80ms token throttle buffer reducing rebuilds from per-token to ~12.5/sec
- RepaintBoundary isolation for markdown widget repaints
- StreamTextWidget simplified from StatefulWidget to StatelessWidget
2. Multimodal Image Input (camera + gallery + display)
- Flutter: image_picker for gallery/camera, base64 encoding, attachment
preview strip with delete, thumbnails in sent messages
- Data layer: List<String>? → List<Map<String, dynamic>>? for structured
attachment payloads through datasource/repository/usecase
- ChatAttachment model with base64Data, mediaType, fileName
- ChatMessage entity + ChatMessageModel both support attachments field
- Backend DTO, Entity (JSONB), Controller, ConversationContextService
all extended to receive, store, and reconstruct Anthropic image
content blocks in loadContext()
- Claude API engine skips duplicate user message when history already
ends with multimodal content blocks
- NestJS body parser limit raised to 10MB for base64 image payloads
- Android CAMERA permission added to manifest
- Image.memory uses cacheWidth/cacheHeight for memory efficiency
- Max 5 images per message enforced in UI
Data flow:
ImagePicker → base64Encode → ChatAttachment → POST body →
DB (JSONB) → loadContext → Anthropic image content blocks → Claude API
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 rounds of systematic audit identified and fixed 14 bugs across
backend controller and Flutter client:
## Backend (agent.controller.ts)
Security & Tenant Isolation:
- Add @TenantId + ForbiddenException check to cancelTask, injectMessage,
approveCommand — all 4 write endpoints now enforce tenant isolation
- Add tenantId check on session reuse in executeTask to prevent
cross-tenant session hijacking
Architecture & Correctness:
- Extract shared runTaskStream() from inline fire-and-forget block,
used by both executeTask and injectMessage to reduce duplication
- Use session.engineType (not getActiveEngine()) in cancelTask,
injectMessage, approveCommand — fixes wrong-engine-cancel when
global engine config is switched after task creation
- Add concurrent task prevention: executeTask checks for existing
RUNNING task on same session and cancels it before starting new one
- Add runningTasks Map to track task promises, awaitTaskCleanup()
helper with 3s timeout for inject to wait for partial text save
- captureSdkSessionId() captures SDK session ID into metadata
without DB save (callers persist), preventing fire-and-forget race
Cancel/Reject Improvements:
- cancelTask: idempotent (returns early if already CANCELLED/COMPLETED),
session stays 'active' (was 'cancelled'), emits cancelled WS event
- approveCommand reject: session stays 'active' (was 'cancelled'),
now emits cancelled WS event so Flutter stream listeners clean up
- approveCommand approved: collect text events and save assistant
response to conversation history on completion (was missing)
Minor:
- task.result! non-null assertion → task.result ?? 'Unknown error'
- Add findRunningBySessionId() to TaskRepository
## Flutter
API Contract Fix:
- approveCommand: route changed from /api/v1/ops/approvals/:id/approve
to /api/v1/agent/tasks/:id/approve with {approved: true} body
- rejectCommand: route changed from /api/v1/ops/approvals/:id/reject
to /api/v1/agent/tasks/:id/approve with {approved: false} body
Resource Management:
- ChatNotifier.dispose() now disconnects WebSocket to prevent
connection leak when navigating away from chat
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend (agent-engine.port.ts):
- Add `cancelled` event type: emitted when a task is cancelled (user-initiated
or injection), so Flutter can close the old stream cleanly
- Add `task_info` event type: emitted after inject to pass the new taskId to
the client, enabling cancel/re-inject on the replacement task
Flutter (features/chat/):
- ChatState: track current `taskId` alongside `sessionId`; clear on completion
or error
- Handle `TaskInfoEvent`: update taskId in state when server issues a new task
- Handle `CancelledEvent`: treat as stream termination (agentStatus → idle)
- MessageType.interrupted: new UI node (warning style) for mid-stream cancels
- _inject(): send text as an inject request while streaming; backend cancels
the current task and starts a new one with the injected message
- Input area: during streaming, hint changes to "追加指令..." ("Add
instruction..."), Enter key calls _inject() instead of _send(), and both
inject-send + stop buttons are shown
- isAwaitingApproval kept separate from isStreaming so approval flow is not
blocked by inject mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously AgentSkillService wrote skills to public.agent_skills (TypeORM
entity with tenantId column filter), while ClaudeAgentSdkEngine read from
it0_t_{tenantId}.skills (per-tenant schema). The two tables were never
connected, so any skill added via the CRUD API was invisible to the agent.
This fix:
- Rewrites AgentSkillService to use DataSource + raw SQL against the
per-tenant schema it0_t_{tenantId}.skills
- Maps API fields: script→content, enabled→is_active
- Removes AgentSkillRepository and AgentSkill entity from module (no longer needed)
- CRUD API response shape is unchanged (fields mapped back to script/enabled)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Load active skills from the tenant's schema `skills` table and append
them to the system prompt before passing to the Claude Agent SDK. This
closes the gap where skills existed in the DB but were never surfaced
to the agent during task execution.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The createCredential method was missing the tenantId assignment,
causing a NOT NULL constraint violation on the credentials table.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server side (session_router.py):
- /reconnect now accepts sessions in "active" state (not just "disconnected")
- When client reconnects to an active session, the old WebSocket/pipeline is
automatically replaced when the new WebSocket connects
- Only truly terminal states (e.g. "ended") return 409
Flutter side (agent_call_page.dart):
- Distinguish terminal errors (404 session gone, 409 ended) from transient
errors (network timeout, server unreachable) in reconnect loop
- Terminal errors break immediately instead of wasting retry attempts
- Extract _connectWebSocket() helper for cleaner reconnect flow
- Add DioException handling for proper HTTP status code inspection
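
The terminal versus transient distinction, sketched in Python for
consistency with the server code (the client itself is Dart). The
function and exception names are hypothetical; only the 404/409 split
comes from this commit.

```python
TERMINAL_STATUSES = {404, 409}  # session gone / session ended


class TerminalReconnectError(Exception):
    """Raised when retrying cannot succeed."""


def reconnect_with_retries(try_reconnect, max_attempts: int = 3):
    """Call try_reconnect() until it succeeds or retrying is pointless.

    try_reconnect() returns an HTTP status code, or raises OSError on a
    network failure. Returns the attempt number that succeeded, or None
    if every attempt hit a transient error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = try_reconnect()
        except OSError:
            continue  # transient: timeout, server unreachable
        if status == 200:
            return attempt
        if status in TERMINAL_STATUSES:
            raise TerminalReconnectError(status)  # stop immediately
    return None
```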
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OpenAI TTS returns 24kHz audio which Android MediaPlayer can't play
via FlutterSound's pcm16WAV codec. Request raw PCM and resample to
16kHz before wrapping in WAV header, matching the local TTS format.
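
A standard-library sketch of the resample-then-wrap step. It assumes mono
16-bit little-endian PCM and uses plain linear interpolation with no
anti-aliasing filter; the function names are hypothetical and the actual
service may use a different resampler.

```python
import io
import struct
import wave


def resample_pcm16(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Linear-interpolation resample of mono 16-bit little-endian PCM."""
    n = len(pcm) // 2
    if n == 0:
        return b""
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    out_len = n * dst_rate // src_rate
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate  # fractional position in the source
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        out.append(int(round(samples[left] * (1 - frac) + samples[right] * frac)))
    return struct.pack(f"<{len(out)}h", *out)


def wrap_wav(pcm: bytes, rate: int = 16000) -> bytes:
    """Prepend a WAV header to mono 16-bit PCM."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Usage would be `wrap_wav(resample_pcm16(raw_24k_pcm))`, producing bytes
with the same 16kHz WAV layout the commit says the local TTS path uses.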
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Local /synthesize and /transcribe endpoints now auto-load Kokoro/Whisper
models on first call instead of returning 503 when not pre-loaded at
startup. This allows switching between Local and OpenAI providers in the
Flutter test page without requiring server restart.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace prompt-prefix workaround with SDK's native resume mechanism.
Each tenant gets isolated HOME directory (/data/claude-tenants/{tenantId})
to prevent cross-tenant session file mixing. SDK session IDs are persisted
in session.metadata for cross-request resume support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement DB-based conversation message storage (engine-agnostic) that
works across both Claude API and Agent SDK engines. Add ChatGPT/Claude-style
conversation history drawer in Flutter with date-grouped session list,
session switching, and new chat functionality.
Backend: entity, repository, context service, migration 004, session/message
API endpoints. Flutter: ConversationDrawer, sessionId flow from backend
response via SessionInfoEvent, session list/switch/delete support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add STT_PROVIDER/TTS_PROVIDER config (local or openai) in settings
- Pipeline uses OpenAI API for STT/TTS when provider is "openai"
- Skip loading local models (Kokoro/faster-whisper) when using OpenAI
- VAD (Silero) always loads for speech detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add openai package to voice-service requirements
- Add /api/v1/test/tts/synthesize-openai (tts-1/tts-1-hd/gpt-4o-mini-tts)
- Add /api/v1/test/stt/transcribe-openai (gpt-4o-transcribe/whisper-1)
- Add OPENAI_API_KEY and OPENAI_BASE_URL env vars to voice-service
- Flutter test page: SegmentedButton to toggle Local/OpenAI provider
- All endpoints maintain same response format for easy comparison
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Claude API engine now uses streaming API (messages.stream) for real-time
text delta output instead of waiting for full response
- Agent controller accepts optional engineType body parameter to allow
callers (e.g. voice pipeline) to select a specific engine
- Fix voice_test_page.dart compilation error: replace audioplayers (not
installed) with flutter_sound (already in pubspec.yaml)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- STT: record from mic or upload audio file → faster-whisper transcription
- Round-trip: record → STT → TTS → playback (full pipeline test)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Browser-accessible page to test text-to-speech synthesis without
going through the full voice pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the first punctuation mark appeared before _MIN_SENTENCE_LEN chars,
the regex search always found that same mark first and the loop broke
out on it, permanently blocking all subsequent sentence splits. Fix by
advancing search_start past short matches instead of breaking out of
the loop.
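
A sketch of the corrected loop. _MIN_SENTENCE_LEN and search_start are
the names used above; the value 4, the punctuation set, and the function
name are assumptions for illustration.

```python
import re

_MIN_SENTENCE_LEN = 4                      # assumed value
_SENTENCE_END = re.compile(r"[。！？.!?]")  # assumed punctuation set


def split_sentences(buffer: str):
    """Return (complete_sentences, remaining_buffer)."""
    sentences = []
    search_start = 0
    while True:
        match = _SENTENCE_END.search(buffer, search_start)
        if match is None:
            break
        end = match.end()
        if end < _MIN_SENTENCE_LEN:
            # Too short to speak on its own: keep scanning past this mark
            # instead of stopping, so later marks can still split.
            search_start = end
            continue
        sentences.append(buffer[:end])
        buffer = buffer[end:]
        search_start = 0
    return sentences, buffer
```

With the input "Ok. This is longer. And mo", the early "Ok." is merged
into the following sentence and the unfinished tail stays in the buffer.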
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add finished guard so that once a task reaches completed/error terminal
state, subsequent events don't flip the status back.
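
A minimal sketch of such a guard. The class name and the mapping from
event type to status are assumptions; only the finished flag and the
completed/error terminal states come from this commit.

```python
TERMINAL_STATES = {"completed", "error"}


class TaskState:
    def __init__(self) -> None:
        self.status = "running"
        self.finished = False

    def apply(self, event_type: str) -> str:
        if self.finished:
            return self.status  # late events cannot reopen a finished task
        if event_type in TERMINAL_STATES:
            self.status = event_type
            self.finished = True
        else:
            self.status = "running"
        return self.status
```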
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SDK sends text both via stream_event deltas (token-level) and assistant
message (complete block). Track hasStreamedText flag per session to skip
duplicate text extraction from assistant messages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace batch TTS (wait for full response) with streaming approach:
- _agent_generate → _agent_stream async generator (yield text chunks)
- _process_speech accumulates tokens, splits on sentence boundaries
- Each sentence is TTS'd and sent immediately while more tokens arrive
- First audio plays within ~1s of agent response vs waiting for full text
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root causes found:
1. SDK engine only emitted 'completed' without 'text' events because
mapSdkMessage skipped text blocks in 'assistant' messages (assumed
stream_event deltas would provide them, but SDK didn't send deltas)
2. Voice pipeline read evt_data.data.content but engine events are flat
(evt_data.content) — so even if text arrived, it was never extracted
Fixes:
- Extract text/thinking blocks from assistant messages in SDK engine
- Fix voice pipeline to read content directly from evt_data, not nested
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log every SDK message type, event emission, and stream lifecycle
to diagnose why text events are missing in voice-agent flow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log timestamps, content, and event details at each pipeline stage
to help diagnose voice-agent integration issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Buffer stream events when no WS clients are subscribed yet, then replay
them when a client subscribes. This eliminates the race condition where
events are lost between task creation and WS subscription.
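
The buffer-and-replay idea, sketched in Python rather than the gateway's
own language. The class and method names are hypothetical, and a real
implementation would probably also bound or expire the buffer.

```python
from collections import defaultdict


class StreamEventBuffer:
    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # session_id -> callbacks
        self._pending = defaultdict(list)      # session_id -> buffered events

    def emit(self, session_id: str, event) -> None:
        subscribers = self._subscribers.get(session_id)
        if not subscribers:
            # Nobody is listening yet: hold the event for later replay.
            self._pending[session_id].append(event)
            return
        for deliver in subscribers:
            deliver(event)

    def subscribe(self, session_id: str, deliver) -> None:
        self._subscribers[session_id].append(deliver)
        # Replay, in order, anything emitted before this subscription.
        for event in self._pending.pop(session_id, []):
            deliver(event)
```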
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The engine stream could emit text events before the voice pipeline
subscribed, causing all text to be lost. Now we connect and subscribe
first, then POST the task.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Voice calls now use the same agent task + WS subscription flow as the
chat UI, enabling tool use and command execution during voice sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Pipecat's WebsocketServerTransport creates its own WebSocket
server on (host,port) and expects FrameProcessor subclasses. Our code was
passing a FastAPI WebSocket object as 'host' and using plain STT/TTS/VAD
service classes that aren't FrameProcessors. The pipeline crashed immediately
when receiving audio, causing "disconnects when speaking".
Changes:
- **base_pipeline.py**: Complete rewrite — replaced Pipecat Pipeline with
direct async loop: WebSocket → VAD → STT → Claude LLM → TTS → WebSocket.
Supports barge-in (interrupt TTS when user speaks), audio chunking, and
24kHz→16kHz TTS resampling.
- **session_router.py**: Pass WebSocket directly to pipeline instead of
wrapping in AppTransport.
- **app_transport.py**: Deprecated (no longer needed).
- **kokoro_service.py**: Fix misaki compatibility (MutableToken→MToken
rename), use correct Chinese voice 'zf_xiaoxiao', handle torch tensors.
- **main.py**: Apply misaki monkey-patch before importing kokoro.
- **settings.py**: Change default TTS voice from 'zh_female_1' (non-existent)
to 'zf_xiaoxiao' (valid Kokoro-82M Chinese female voice).
- **requirements.txt**: Remove pipecat-ai dependency, pin kokoro==0.3.5 +
misaki==0.7.17, add Chinese NLP deps (pypinyin, cn2an, jieba, ordered-set).
- **agent_call_page.dart**: Wrap each cleanup step in try/catch to ensure
Navigator.pop() always executes after call ends. Add 3s timeout on session
delete request.
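
The 24kHz→16kHz TTS resampling listed under base_pipeline.py can be
sketched as follows. This is an illustrative linear-interpolation
resampler, not the code in the repository; the function name
resample_linear is hypothetical, and there is no anti-aliasing filter,
which a production resampler would normally include:

```python
from typing import List


def resample_linear(samples: List[int],
                    src_rate: int = 24000,
                    dst_rate: int = 16000) -> List[int]:
    """Resample 16-bit PCM samples by linear interpolation."""
    if not samples:
        return []
    out_len = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate  # 1.5 source samples per output sample
    out: List[int] = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring source samples.
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


resampled = resample_linear([0, 10, 20, 30, 40, 50])
```

Six input samples at 24kHz become four output samples at 16kHz, since
the ratio is 3:2.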
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend:
- Add includePartialMessages: true to SDK query options
- Handle stream_event/content_block_delta for real-time text streaming
- Skip text/thinking blocks from complete assistant messages (already
streamed via deltas) to avoid duplication
- Change default result summary to empty string
Flutter:
- Only show CompletedEvent summary when no assistant text was streamed
(prevents duplicate message bubble)
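
The de-duplication rule above can be sketched as below. The event
shapes are simplified assumptions for illustration and do not mirror
the SDK's exact message schema; render_events is a hypothetical helper:

```python
from typing import Dict, List


def render_events(events: List[Dict]) -> List[str]:
    """Return the text a client would display, without duplicates."""
    shown: List[str] = []
    streamed_text = False
    for ev in events:
        if ev["type"] == "content_block_delta":
            # Real-time text arrives here first.
            shown.append(ev["text"])
            streamed_text = True
        elif ev["type"] == "assistant":
            for block in ev["blocks"]:
                if block["type"] in ("text", "thinking"):
                    continue  # already streamed via deltas
                shown.append("[" + block["type"] + "]")
        elif ev["type"] == "completed":
            # Fall back to the summary only when nothing was streamed.
            if not streamed_text and ev.get("summary"):
                shown.append(ev["summary"])
    return shown


shown = render_events([
    {"type": "content_block_delta", "text": "Hel"},
    {"type": "content_block_delta", "text": "lo"},
    {"type": "assistant", "blocks": [{"type": "text", "text": "Hello"},
                                     {"type": "tool_use"}]},
    {"type": "completed", "summary": "Hello"},
])
```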
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Socket.IO requires its own handshake protocol (EIO=4) which Kong cannot
proxy as a plain WebSocket upgrade, causing 502 Bad Gateway. Switch to
@nestjs/platform-ws (WsAdapter) with manual session room tracking so
Flutter's IOWebSocketChannel can connect directly.
Also add ws/wss protocols to Kong WebSocket routes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Required by FastAPI for form/file upload parsing. Missing dependency
may cause import errors and container restart loops.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TenantAwareRepository.getRepository() was calling createQueryRunner()
without ever releasing it, causing database connection pool exhaustion.
This caused ops-service (and eventually other services) to hang on
all API requests once the pool filled up.
Replaced getRepository() with a withRepository() pattern that wraps
operations in try/finally so the QueryRunner is always released.
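
A language-neutral sketch of the acquire/try/finally/release pattern,
written in Python for illustration. FakeQueryRunner and
with_repository are hypothetical stand-ins, not the TypeORM API or the
actual TenantAwareRepository code:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeQueryRunner:
    """Stand-in for a pooled connection handle."""

    def __init__(self) -> None:
        self.released = False

    def release(self) -> None:
        self.released = True


def with_repository(create_runner: Callable[[], FakeQueryRunner],
                    operation: Callable[[FakeQueryRunner], T]) -> T:
    runner = create_runner()
    try:
        return operation(runner)
    finally:
        # Runs on success and on error, so the pool slot is never leaked.
        runner.release()


runners = []


def create_runner() -> FakeQueryRunner:
    runner = FakeQueryRunner()
    runners.append(runner)
    return runner


def failing_operation(runner: FakeQueryRunner) -> int:
    raise RuntimeError("query failed")


ok_result = with_repository(create_runner, lambda runner: 42)
try:
    with_repository(create_runner, failing_operation)
except RuntimeError:
    pass  # the error still propagates to the caller
```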
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses reliability gaps in the real-time voice WebSocket connection
between Flutter client and Python voice-service backend.
Backend (voice-service):
- Heartbeat: new _heartbeat_sender coroutine sends JSON ping text frames
every 15s alongside the Pipecat pipeline; failed send = dead connection
- Session preservation: on WebSocket disconnect, sessions are now marked
"disconnected" with a timestamp instead of being deleted, allowing
reconnection within a configurable TTL (default 60s)
- Reconnect endpoint: POST /sessions/{id}/reconnect verifies the session
is alive and in "disconnected" state, returns fresh websocket_url
- Reconnect-aware WS handler: detects "disconnected" sessions, cancels
stale pipeline tasks, creates a new pipeline, sends "session.resumed"
- Background cleanup: asyncio loop every 30s removes sessions that have
been disconnected longer than session_ttl
- Structured event protocol: text frames = JSON control messages
(ping/pong/session.resumed/session.ended/error), binary = PCM audio
- New settings: session_ttl (60s), heartbeat_interval (15s),
heartbeat_timeout (45s)
Flutter (agent_call_page.dart):
- Heartbeat monitoring: tracks last server ping timestamp, triggers
reconnect if no ping received in 45s (3 missed intervals)
- Auto-reconnect: exponential backoff (1s→2s→4s→8s→16s), max 5 attempts;
calls /reconnect endpoint to verify session, rebuilds WebSocket,
resets audio buffer, restarts heartbeat
- Reconnecting UI: yellow warning banner "重新连接中... (N/5)"
("Reconnecting... (N/5)") with spinner overlay during reconnection
attempts
- WebSocket data routing: _onWsData distinguishes String (JSON control)
from binary (audio) frames, handles ping/session.resumed/session.ended
- User-initiated disconnect guard: _userEndedCall flag prevents reconnect
attempts when user intentionally hangs up
- session_id field compatibility: supports session_id/sessionId/id
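
The reconnect timing above can be sketched as follows. The helper
names (backoff_delays, heartbeat_expired) are hypothetical; only the
numbers (1s base, doubling, 5 attempts, 45s timeout, 15s interval)
come from this commit:

```python
from typing import List


def backoff_delays(base: float = 1.0,
                   factor: float = 2.0,
                   max_attempts: int = 5) -> List[float]:
    """Delay in seconds before each reconnect attempt."""
    return [base * factor ** attempt for attempt in range(max_attempts)]


def heartbeat_expired(last_ping: float,
                      now: float,
                      timeout: float = 45.0) -> bool:
    """True when no server ping arrived within the timeout window."""
    return now - last_ping > timeout


delays = backoff_delays()
missed_intervals = 45.0 / 15.0  # heartbeat_timeout / heartbeat_interval
```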
Flutter (pcm_player.dart):
- Jitter buffer: queues incoming PCM chunks, starts playback only after
accumulating 4800 bytes (150ms at 16kHz 16-bit mono) to smooth out
network timing variance
- reset() method: clears buffer on reconnect to discard stale audio
- Buffer underrun handling: re-enters buffering phase if queue empties
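
A rough sketch of the jitter buffer behaviour, in Python rather than
Dart. JitterBuffer and its methods are hypothetical names; the real
pcm_player.dart may structure this differently. The 4800-byte
threshold is from this commit: 4800 / (16000 samples/s * 2 bytes) =
0.15s.

```python
from collections import deque
from typing import Deque, Optional


class JitterBuffer:
    def __init__(self, threshold_bytes: int = 4800) -> None:
        self._threshold = threshold_bytes
        self._queue: Deque[bytes] = deque()
        self._buffered = 0
        self.playing = False

    def push(self, chunk: bytes) -> None:
        self._queue.append(chunk)
        self._buffered += len(chunk)
        # Start playback only once enough audio has accumulated.
        if not self.playing and self._buffered >= self._threshold:
            self.playing = True

    def pop(self) -> Optional[bytes]:
        if not self.playing:
            return None
        if not self._queue:
            # Underrun: go back to the buffering phase.
            self.playing = False
            return None
        chunk = self._queue.popleft()
        self._buffered -= len(chunk)
        return chunk

    def reset(self) -> None:
        # Discard stale audio, e.g. after a reconnect.
        self._queue.clear()
        self._buffered = 0
        self.playing = False


threshold_seconds = 4800 / (16000 * 2)
```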
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For security, the SDK blocks bypassPermissions when running as root.
Add non-root 'appuser' to Dockerfile.service and update volume
mounts to use /home/appuser/.claude paths.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bypassPermissions blocked by SDK when running as root
- Switch to acceptEdits with canUseTool for programmatic control
- Mount .claude.json config file into container
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In a Docker container without TTY, permissionMode 'default' blocks
waiting for interactive permission prompts. Switch to bypassPermissions
with canUseTool callback for programmatic risk-based access control.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tsc with module=commonjs compiles `await import()` down to require(),
which breaks ESM-only packages. Use the Function('return import()')
workaround to preserve native dynamic import at runtime.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Mount ~/.claude/ into agent-service container for OAuth token access
- Switch default engine to claude_agent_sdk
- Remove ANTHROPIC_API_KEY from env in subscription mode so SDK uses OAuth
- Keep API key mode for per-tenant billing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Follow iConsulting pattern: set NODE_TLS_REJECT_UNAUTHORIZED=0 (which
disables TLS certificate verification) when ANTHROPIC_BASE_URL is
configured, enabling connection through the proxy at 67.223.119.33,
which uses a self-signed certificate.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change AGENT_ENGINE_TYPE from claude_code_cli to claude_api in docker-compose
- Add ANTHROPIC_BASE_URL env var support to claude-api-engine
- Add ANTHROPIC_BASE_URL to agent-service environment in docker-compose
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add restart: unless-stopped to all 12 Docker services
- Add process.on(unhandledRejection/uncaughtException) handlers to the
main.ts of all 7 services
- Fix handleEventTrigger, which used the tenantId UUID as the schema
name instead of looking up the tenant slug
- Wrap Redis event subscription callbacks in try/catch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace traditional on-device speech_to_text with a modern pipeline:
- Record audio via `record` package with hardware noise suppression
- Apply GTCRN neural denoising (sherpa-onnx, ICASSP 2024, 48K params)
- Trim silence, POST to backend /voice/transcribe (faster-whisper)
Changes:
- Add /transcribe endpoint to voice-service for audio file upload
- Add SpeechEnhancer wrapper for sherpa-onnx GTCRN model (523KB)
- Rewrite chat_page.dart voice input: record → denoise → transcribe
- Keep NoiseReducer.trimSilence for silence removal only
- Upgrade record to v6.2.0, add sherpa_onnx, path_provider
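
For reference, the silence-trimming step can be sketched as below, in
Python rather than Dart. This is an assumed amplitude-threshold
approach; the actual NoiseReducer.trimSilence logic and its threshold
are not shown in this commit, and trim_silence and the value 500 are
hypothetical:

```python
from typing import List


def trim_silence(samples: List[int], threshold: int = 500) -> List[int]:
    """Drop leading and trailing 16-bit samples quieter than threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    # Quiet samples between loud ones are kept on purpose.
    return samples[start:end]


trimmed = trim_silence([0, 10, 900, -1200, 20, 800, 5, 0])
```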
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend:
- Enhanced register endpoint to accept companyName for self-service
tenant creation with schema provisioning and admin user setup
- Added TenantInvite entity with token-based invitation system
- Added invite CRUD endpoints to TenantController (create/list/revoke)
- Added public endpoints for invite validation and acceptance
Frontend:
- Created registration page with optional organization name field
- Created invitation acceptance page at /invite/[token]
- Added invite management UI to tenant detail page
- Updated login page with link to registration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>