Enable users to send text messages, images, and files to the Agent
while an active voice call is in progress. This addresses the case
where spoken instructions are unclear or screenshots/documents need
to be shared for analysis.
## Architecture
Data flows through LiveKit data channel (not direct HTTP):
Flutter → publishData(topic='text_inject') → voice-agent
→ llm.inject_text_message() → POST /api/v1/agent/tasks (same session)
→ collect streamed response → session.say() → TTS playback
This preserves the constraint that voice-agent owns the agent-service
sessionId — Flutter never contacts agent-service directly.
## Flutter UI (agent_call_page.dart)
- Add keyboard toggle button to active call controls (4-button row)
- Collapsible text input area with attachment picker (+) and send button
- Attachment support: gallery multi-select, camera, file picker
(images max 1024x1024 quality 80%, PDF supported, max 5 attachments)
- Horizontal scrolling attachment preview with delete buttons
- 200KB payload size check before LiveKit data channel send
- Layout adapts: Spacer flex 1/3 toggle, reduced bottom padding
## voice-agent (agent.py)
- Register data_received event listener after session.start()
- Filter for topic='text_inject', parse JSON payload
- Call llm.inject_text_message(text, attachments) and TTS via session.say()
- Use asyncio.ensure_future() wrapper for async handler (matches
existing disconnect handler pattern for sync EventEmitter)
## AgentServiceLLM (agent_llm.py)
- New inject_text_message(text, attachments) method on AgentServiceLLM
- Reuses same _agent_session_id for conversation context continuity
- WS+HTTP streaming: connect, pre-subscribe, POST /tasks with
attachments field, collect full text response, return string
- _injecting flag prevents concurrent _do_stream from clearing
session ID on abort errors while inject is in progress
- Same systemPrompt/voiceMode/engineType as voice pipeline
No agent-service changes required — attachments already supported
end-to-end (JSONB storage → multimodal content blocks → Claude).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, voice mode wrapped every user message with 【语音对话模式】
instructions, polluting conversation_messages history with repeated
instructions on every turn. Now:
- systemPrompt carries voice-mode instructions (set once, not per-message)
- prompt contains only the clean user text (identical to text chat pattern)
- Conversation history stays clean for multi-turn context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Remove on_enter greeting entirely (no more race condition)
2. voice-agent sends voiceMode: true when engine_type is claude_agent_sdk
3. AgentController.runTaskStream() filters thinking, tool_use, tool_result
events in voice mode — only text, completed, error reach the client
4. Detailed logging: each event logged with [FILTERED-voice] tag when skipped
Claude API mode is completely unaffected (voiceMode defaults to false).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Change on_enter greeting from generate_reply() to session.say() with
a static message — avoids spawning an Agent SDK task just for a greeting,
which caused a race condition when the user speaks before it completes.
2. Clear agent session ID when receiving abort/exit errors so the next
task starts a fresh session instead of trying to resume a dead process.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full-stack implementation allowing users to choose between Claude Agent SDK
(default, with tool approval, skill injection, session resume) and Claude API
(direct, lower latency) in Flutter settings. Agent SDK mode wraps prompts with
voice-conversation instructions for concise spoken Chinese output.
Data flow: Flutter Settings → SharedPreferences → POST /livekit/token →
RoomAgentDispatch metadata → voice-agent → AgentServiceLLM(engine_type)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Flutter (agent_call_page.dart):
- Add ConnectOptions with 15s timeouts for connection/peerConnection/iceRestart
- Add RoomReconnectingEvent/RoomAttemptReconnectEvent/RoomReconnectedEvent
listeners with "网络重连中" UI indicator during reconnection
- Add TimeoutException detection in _friendlyError()
voice-agent (agent.py):
- Wrap entrypoint() in try-except with full traceback logging
- Register room disconnect listener to close httpx clients (instead of
finally block, since session.start() returns while session runs in bg)
- Add asyncio import for ensure_future cleanup
voice-agent LLM proxy (agent_llm.py):
- Add retry with exponential backoff (max 2 retries, 1s/3s delays) for
network errors (ConnectError/ConnectTimeout/OSError) and WS InvalidStatusCode
- Extract _do_stream() method for single-attempt logic
- Add WebSocket connection params: open_timeout=10, ping_interval=20,
ping_timeout=10 for keepalive and faster dead-connection detection
- Use granular httpx.Timeout(connect=10, read=30, write=10, pool=10)
- Increase WS recv timeout from 5s to 30s to reduce unnecessary loops
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pass httpx.AsyncClient(verify=False) to OpenAI STT/TTS to support
self-signed certificate on OPENAI_BASE_URL proxy
- Handle generate_reply calls with no user message by falling back to
system/developer instructions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add room_input_options/room_output_options to session.start() so agent
binds audio I/O and stays in the room
- Add wait_for_participant() before starting session
- Filter AgentConfigUpdate items in agent_llm.py (no 'role' attribute)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>