- agent-instance.controller: POST :id/heartbeat — bridge calls this every 60s;
auto-transitions status from deploying→running when gateway is confirmed connected
- system-prompt-builder: teach iAgent about OpenClaw deployment capability:
create/list/stop/remove instance API endpoints, when to trigger deployment,
and what to tell users about channel connectivity (Telegram/WhatsApp etc.)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Backend: GET /api/v1/auth/my-org returns tenant info + member list
- Backend: GET /api/v1/auth/my-org/invites lists pending invites
- Backend: POST /api/v1/auth/my-org/invite creates invite link
- Frontend: /my-org page with member list and invite creation
- Frontend: add '用户管理' (user management) to tenant sidebar
- Frontend: add '套餐' (plans) to tenant billing section
- Frontend: admin layout initializes tenant store (fixes '租户:未选择', i.e. "Tenant: not selected")
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- auth-service: add SmsService (Aliyun SMS) + RedisProvider for OTP storage
- POST /api/v1/auth/sms/send — send OTP (rate limited 1/min per phone)
- POST /api/v1/auth/sms/verify — verify OTP only
- POST /api/v1/auth/login/otp — passwordless login with phone + OTP
- register endpoint now requires smsCode when registering with phone
- Web Admin register page: add OTP input + 60s countdown button for phone mode
- Flutter login page: add 验证码登录 (OTP login) tab with phone + OTP flow
- SMS enabled via ALIYUN_ACCESS_KEY_ID/SECRET + SMS_ENABLED=true env vars
- Falls back to mock mode (logs code) when env vars not set
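The one-send-per-minute limit above could be sketched as follows. This is a minimal in-memory illustration only: the commit stores OTPs in Redis and does not say how the limit is enforced, so `OtpThrottle`, `try_send` and the injectable clock are hypothetical names.

```python
import time
from typing import Callable, Dict


class OtpThrottle:
    """Allow at most one OTP send per phone per window (dict stands in for Redis)."""

    def __init__(self, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    def try_send(self, phone: str) -> bool:
        """Return True and record the send if the phone is outside its window."""
        now = self._clock()
        last = self._last_sent.get(phone)
        if last is not None and now - last < self._window:
            return False  # still rate limited
        self._last_sent[phone] = now
        return True
```

With Redis, the same effect is usually achieved with a per-phone key that expires after the window.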
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously GET /api/v1/billing/subscription threw 404 for tenants with no
subscription, causing React Query error state on the Plans and Overview pages.
Now returns a graceful default response so the UI renders without errors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Systematically add platform_admin and platform_super_admin to all
controllers that were restricted to 'admin' only:
- audit-service: queryLogs, exportLogs
- inventory-service: decryptCredential
- auth-service: RoleController, PermissionController
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SettingsController was restricted to 'admin' only, blocking platform_admin
from the dashboard settings page (403 on general/api-keys/theme/account).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member/invite endpoints were restricted to 'admin' role only, blocking
platform_admin from accessing them on the tenant detail page (403).
Added platform_admin and platform_super_admin to all six endpoints.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- listMembers was returning { data, total } but the frontend expects TenantMember[]
directly, causing a "members.map is not a function" crash on the detail page.
- updateMember now also syncs role changes to public.users so the new role
takes effect the next time the user logs in (JWT is generated from public.users).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TenantController invite endpoints (list/create/revoke) were passing the
tenant UUID from the URL param directly to AuthService methods that
expect a slug, causing 404 on every invite operation. Now resolves
tenant via findTenantOrFail() first and passes slug.
- removeMember now also deletes from public.users so removed members
can no longer log in.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, acceptInvite only wrote to the tenant schema, causing invited
users to be invisible to the login() flow which queries public.users for
cross-tenant email/phone lookup. Now inserts into both public.users and
the tenant schema within the same transaction, matching registerWithNewTenant behavior.
Also tightens duplicate check to cross-tenant uniqueness (public.users)
instead of per-tenant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DELETE /api/v1/admin/tenants/:id now accepts platform_admin role
- Fix cascade cleanup to use tenant slug (not UUID) for users/invites/api_keys
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- createInvite: findOneBy({ slug }) instead of { id } since JWT tenantId is slug
- getMemberCount: use SET LOCAL + transaction to prevent pool search_path leak
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Change SET search_path to SET LOCAL in tenant schema template (002)
so it reverts on COMMIT and doesn't contaminate the connection pool
- Add RESET search_path before queryRunner.release() as defensive measure
- Add ALTER TABLE public.tenants admin_email DROP NOT NULL to migration 007
to sync the direct server change back to source
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Flutter: language='auto' omits the language field → backend receives none
- Backend: no language field → passes undefined to STT service
- STT service: language=undefined → omits language param from Whisper request
- Whisper auto-detects language per utterance when no hint is provided
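A minimal sketch of the "omit the field when no language is chosen" rule, assuming the request's text fields are assembled as a plain dict. The function name and the `whisper-1` model value used in the example are illustrative, not taken from the service code.

```python
from typing import Dict, Optional


def build_transcription_fields(model: str, language: Optional[str]) -> Dict[str, str]:
    """Build the text fields of a transcription request, omitting `language` when unset."""
    fields = {"model": model}
    # None, empty and 'auto' all mean "no hint": leave the field out entirely.
    # Handling 'auto' here is a defensive assumption; per the commit the client already omits it.
    if language and language != "auto":
        fields["language"] = language
    return fields
```

For example, `build_transcription_fields("whisper-1", None)` yields only the `model` field, while passing `"zh"` adds a `language` entry.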
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Node 18 native fetch (undici) ignores https.Agent, causing "fetch failed" errors
on the self-signed proxy at 67.223.119.33:8443. Switch to https.request
with rejectUnauthorized: false, which works reliably.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OPENAI_BASE_URL=https://67.223.119.33:8443/v1 already includes /v1,
so the URL was being built as .../v1/v1/audio/transcriptions.
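One way to guard against the doubled prefix is to normalise the join, sketched here in Python. `join_api_url` is a hypothetical helper, not the function used in the STT service.

```python
def join_api_url(base_url: str, path: str) -> str:
    """Join a base URL that may already end in /v1 with an endpoint path."""
    base = base_url.rstrip("/")
    suffix = path.lstrip("/")
    # Drop a leading "v1/" from the path when the base already carries it.
    if base.endswith("/v1") and suffix.startswith("v1/"):
        suffix = suffix[len("v1/"):]
    return f"{base}/{suffix}"
```

The helper accepts the path with or without the `/v1` prefix, so either configuration style produces a single `/v1` segment.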
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
multer was only transitively available; pnpm strict mode blocks it.
Also adds @types/multer for TypeScript compilation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add POST /api/v1/agent/transcribe endpoint (STT only, no agent trigger)
- Add transcribeAudio() to chat datasource and provider
- VoiceMicButton now fills the text input field with transcript;
user reviews and sends manually
- Add OPENAI_API_KEY/OPENAI_BASE_URL to agent-service in docker-compose
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three coordinated fixes to make in-app APK download work end-to-end:
1. version-service/main.ts: serve uploaded files as static assets via
NestExpressApplication.useStaticAssets('/data/versions', prefix:
'/downloads/versions'), so GET /downloads/versions/{platform}/{file}
returns the actual APK stored in the Docker volume.
2. kong.yml: add /downloads/versions route to Kong so requests from
the Flutter app can reach version-service through the API gateway.
Previously only /api/v1/versions and /api/app/version were routed;
the download URL returned by the check endpoint was unreachable (404).
3. download_manager.dart: skip SHA-256 verification when sha256Expected
is empty string. The check endpoint always returns sha256:"" because
version-service doesn't store file hashes. The previous code compared
actual_hash == "" which always failed, causing the downloaded file to
be deleted after a successful download.
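The hash check in fix 3 amounts to the following logic, shown as a Python sketch rather than the actual Dart code. `verify_download` is an illustrative name.

```python
import hashlib


def verify_download(data: bytes, sha256_expected: str) -> bool:
    """Return True if the downloaded file should be kept."""
    if not sha256_expected:
        # The server published no hash: skip verification instead of failing.
        return True
    actual = hashlib.sha256(data).hexdigest()
    return actual == sha256_expected.lower()
```

Skipping verification keeps the download usable but removes the integrity check, so it only makes sense until the server starts storing hashes.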
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add /api/app/version route to Kong declarative config so that the
Flutter app's GET /api/app/version/check?platform=&current_version_code=
request can reach version-service through the API gateway.
Previously only /api/v1/versions was routed; the public check endpoint
served by AppVersionCheckController was unreachable (Kong returned
"no Route matched with those values").
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flutter VersionChecker was calling GET /api/app/version/check but this
endpoint didn't exist — only the admin CRUD /api/v1/versions was there.
New: AppVersionCheckController (@Controller('api/app/version'))
GET /api/app/version/check?platform=android&current_version_code=N
- Finds latest enabled version for the platform (highest buildNumber)
- Returns { needUpdate: false } when already up to date
- Returns full VersionInfo payload when update is available
Response fields match Flutter VersionInfo.fromJson exactly:
needUpdate, version, versionCode, downloadUrl, fileSize,
fileSizeFriendly (computed), sha256 (empty — not stored),
forceUpdate, updateLog, releaseDate
Also: AppVersionRepository.findLatestEnabled(platform) — queries all
enabled versions for platform, picks the one with the highest buildNumber
(parsed as int, robust against varchar storage).
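The selection rule can be illustrated in Python as below. The dict keys `platform`, `isEnabled` and `buildNumber` are assumed field names for this sketch, and the real repository queries the database rather than a list.

```python
from typing import Dict, List, Optional


def find_latest_enabled(versions: List[Dict], platform: str) -> Optional[Dict]:
    """Pick the enabled version with the highest numeric buildNumber for a platform."""
    candidates = [
        v for v in versions
        if v["platform"] == platform and v["isEnabled"]
    ]
    if not candidates:
        return None
    # buildNumber may be stored as text, so compare as integers ("10" > "9").
    return max(candidates, key=lambda v: int(v["buildNumber"]))
```

Parsing to int matters because a plain string comparison would rank "9" above "10".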
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New endpoint: POST /api/v1/agent/sessions/:sessionId/voice-message
- Accepts multipart/form-data audio file (any format Whisper supports)
- Transcribes via OpenAI Whisper API (routed through existing proxy)
- If a task is currently running in the session → hard-interrupts it first
(same cancel+inject pattern as text inject, triggered by voice command)
- Otherwise → starts a fresh task with the transcript
- Returns { sessionId, taskId, transcript } so client can subscribe to WS stream
This enables WhatsApp-style push-to-talk and doubles as an async voice
interrupt into any active agent workflow, bypassing the need for speaker
diarization (whoever presses record owns the message).
New files:
infrastructure/stt/openai-stt.service.ts — OpenAI Whisper client,
manually builds multipart/form-data, supports self-signed proxy cert
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements a two-level abort controller design to support real-time
interruption when the user speaks while the agent is still responding:
sessionAbortController (session-scoped)
- Created once when startSession() is called
- Fired only by terminateSession() (user hangs up)
- Propagated into each turn via addEventListener
turnAbort (per-turn, stored as handle.currentTurnAbort)
- Created fresh at the start of each executeTurn() call
- Stored on the VoiceSessionHandle so injectMessage() can abort it
- When a new inject arrives while a turn is running, injectMessage()
calls turnAbort.abort() BEFORE enqueuing the new message
Interruption flow:
1. User speaks mid-response → LiveKit stops TTS playback (client-side)
2. STT utterance → POST voice/inject → injectMessage() fires
3. handle.currentTurnAbort.abort() called → sets aborted flag
4. for-await loop checks turnAbort.signal.aborted on next SDK event → break
5. catch block NOT reached (break ≠ exception) → no error event emitted
6. finally block saves partial text with "[中断]" ("[interrupted]") suffix to history
7. New message dequeued → fresh executeTurn() starts immediately
Why no "Agent error" message plays to the user:
- break exits the for-await loop silently, not via exception
- The catch block's error-event emission is guarded by err?.name !== 'AbortError'
AND requires an actual exception; a plain break never enters catch
- Empty or partial responses are filtered by `if response:` in agent.py
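The "break is not an exception" behaviour in steps 4 to 6 can be shown with a small Python analogue. This is a sketch only: `TurnAbort` stands in for the per-turn AbortController, and `execute_turn` is a simplified synchronous version of the real async loop.

```python
from typing import Iterable, List


class TurnAbort:
    """Tiny stand-in for an abort signal: a flag that another caller can set."""

    def __init__(self) -> None:
        self.aborted = False

    def abort(self) -> None:
        self.aborted = True


def execute_turn(events: Iterable[str], turn_abort: TurnAbort,
                 history: List[str], errors: List[str]) -> None:
    """Consume streamed text chunks; stop quietly if the turn is aborted."""
    partial = ""
    interrupted = False
    try:
        for chunk in events:
            if turn_abort.aborted:
                interrupted = True
                break  # leaves the loop without raising
            partial += chunk
    except Exception as err:  # only reached on a real failure
        errors.append(str(err))
    finally:
        # Runs on both normal completion and interruption.
        if partial:
            history.append(partial + ("[中断]" if interrupted else ""))
```

An interrupted turn therefore leaves a marked partial entry in `history` and nothing in `errors`.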
Also update module-level JSDoc with full architecture explanation covering
the long-lived run loop design, two-level abort hierarchy, tenant context
injection pattern, and SDK session resume across turns.
Update agent.py module docstring to document voice session lifecycle and
interruption flow for future maintainers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the per-turn POST /tasks approach for voice calls with a
long-lived agent run loop tied to the call lifecycle:
agent-service:
- Add AsyncQueue<T> utility for blocking message relay
- Add VoiceSessionManager: spawns one background run loop per voice call,
accepts injected messages, terminates cleanly on hangup
- Add VoiceSessionController with 3 endpoints:
POST /api/v1/agent/sessions/voice/start (call start)
POST /api/v1/agent/sessions/:id/voice/inject (each speech turn)
DELETE /api/v1/agent/sessions/:id/voice (user hung up)
- Register VoiceSessionManager + VoiceSessionController in agent.module.ts
voice-agent:
- AgentServiceLLM: add start_voice_session(), terminate_voice_session(),
inject_text_message() (voice/inject-aware), _do_inject_voice()
- AgentServiceLLMStream._run(): use voice/inject path when voice session
is active; fall back to per-task POST for text-chat / non-SDK engines
- entrypoint(): call start_voice_session() after session.start();
register _on_room_disconnect that calls terminate_voice_session()
so the agent is always killed when the user hangs up
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues fixed:
1. agent.controller.ts — on the FIRST task of each session, write title+voiceMode
into session.metadata so the client can display a meaningful conversation title:
- Text sessions: metadata.title = first 40 chars of user prompt
- Voice sessions: metadata.title = '' + metadata.voiceMode = true
(Flutter renders these as '语音对话 M/D HH:mm', i.e. "Voice conversation" plus date and time)
titleSet flag prevents overwriting the title on subsequent turns of the same session.
2. session.controller.ts — listSessions() now returns a DTO instead of the raw entity.
systemPrompt is an internal engine instruction and is explicitly excluded from the
response. The client receives { id, status, engineType, metadata, createdAt, updatedAt }.
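Both fixes reduce to small pure functions, sketched here in Python under the assumption that sessions are plain dicts. `first_task_metadata` and `to_session_dto` are illustrative names, not the controller's actual methods.

```python
from typing import Any, Dict

TITLE_MAX_CHARS = 40
PUBLIC_FIELDS = ("id", "status", "engineType", "metadata", "createdAt", "updatedAt")


def first_task_metadata(prompt: str, voice_mode: bool) -> Dict[str, Any]:
    """Metadata written on the first task of a session."""
    if voice_mode:
        # Voice sessions get an empty title; the client renders its own label.
        return {"title": "", "voiceMode": True}
    return {"title": prompt[:TITLE_MAX_CHARS]}


def to_session_dto(session: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only whitelisted fields so systemPrompt never leaves the server."""
    return {key: session.get(key) for key in PUBLIC_FIELDS}
```

A whitelist is used instead of deleting `systemPrompt` so that fields added to the entity later stay private by default.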
The billing-service tsconfig.json was missing the TypeScript path aliases
required for the workspace build (turbo builds shared packages first, then
resolves @it0/* via paths). Without these, nest build fails with
'Cannot find module @it0/database'.
Also disables overly strict checks (strictNullChecks, strictPropertyInitialization,
useUnknownInCatchVariables) to match the lenient settings used by other services.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive fix of 124 TS errors across the billing-service:
Entity fixes:
- invoice.entity.ts: add InvoiceStatus/InvoiceCurrency const objects,
rename fields to match DB schema (subtotalCents, taxCents, totalCents,
amountDueCents), add OneToMany items relation
- invoice-item.entity.ts: add InvoiceItemType const object, add column
name mappings and currency field
- payment.entity.ts: add PaymentStatus const, rename amount→amountCents
with column name mapping, add paidAt field
- subscription.entity.ts: add SubscriptionStatus const object
- usage-aggregate.entity.ts: rename periodYear/Month→year/month to match
DB columns, add periodStart/periodEnd fields
- payment-method.entity.ts: add displayName, expiresAt, updatedAt fields
Port/Provider fixes:
- payment-provider.port.ts: make PaymentProviderType a const object (not
just a type), add PaymentSessionRequest alias, rename WebhookEvent with
correct field shape (type vs eventType), make providerPaymentId optional
- All 4 providers: replace PaymentSessionRequest→CreatePaymentParams,
fix amountCents→amount, remove sessionId from PaymentSession return,
add confirmPayment() stub, fix Stripe API version to '2023-10-16'
Use case fixes:
- aggregate-usage.use-case.ts: replace 'redis' with 'ioredis' (workspace
standard); rewrite using ioredis xreadgroup API
- change/check/generate use cases: fix Plan field names
(monthlyPriceCentsUsd, includedTokens, overageRateCentsPerMTokenUsd)
- generate-monthly-invoice: fix SubscriptionStatus/InvoiceCurrency as
values (now const objects)
- handle-payment-webhook: fix WebhookResult import, result.type usage,
payment.paidAt
Controller/Repository fixes:
- plan.controller.ts, plan.repository.ts: fix Plan field names
- webhook.controller.ts: remove express import, use any for req type
- invoice-generator.service.ts: fix overageAmountCents→overageCentsUsd,
monthlyPriceCny→monthlyPriceFenCny, includedTokensPerMonth→includedTokens
Dependencies:
- billing-service/package.json: replace redis with ioredis dependency
- pnpm-lock.yaml: regenerated after ioredis addition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 005-create-billing-tables.sql: replace all `it0_shared.tenants` with
`public.tenants` and all `tenant_id VARCHAR(20)` with `tenant_id UUID`
to match the actual server DB schema (public schema, UUID primary key)
- packages/shared/testing src/test-utils.ts: add new quota fields
(maxServers, maxUsers, maxStandingOrders, maxAgentTokensPerMonth) to
TEST_TENANT mock to satisfy the extended TenantInfo interface, fixing
the @it0/testing TypeScript build error
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The entrypoint.sh expects dist/services/${SERVICE_NAME}/src/main, but
nest build with inline TypeORM config produces dist/main directly.
Using DatabaseModule from @it0/database forces tsc to emit the nested
path structure (since it references shared packages), matching the
entrypoint path convention used by all other services.
Also gains SnakeNamingStrategy and autoLoadEntities from the shared module.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
voice-agent agent.py:
- Module docstring explains lk.agent.state lifecycle
(initializing → listening → thinking → speaking)
- Explains how RoomIO publishes state as participant attribute
- Documents BackgroundAudioPlayer with all available built-in clips
Flutter agent_call_page.dart:
- Documents _agentState field and all possible values
- Documents ParticipantAttributesChanged listener with UI mapping
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Import from livekit.agents.voice.background_audio submodule directly,
as it's not re-exported from livekit.agents.voice.__init__.py.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- voice-agent: enable BackgroundAudioPlayer with keyboard typing sound
during LLM thinking state (auto-plays when agent enters "thinking",
stops when "speaking" starts)
- Flutter: monitor lk.agent.state participant attribute from LiveKit
agent, show pulsing dots animation + "思考中..." ("Thinking...") text when thinking,
avatar border changes to warning color with pulsing glow ring
- Both call mode and chat mode headers show thinking state
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detailed record of why livekit-plugins-speechmatics was removed:
- EXTERNAL: no FINAL_TRANSCRIPT (framework never sends FlushSentinel)
- ADAPTIVE: zero output (dual Silero VAD conflict)
- SMART_TURN: fragments Chinese speech into tiny pieces
- FIXED: finalize() async race condition with session teardown
All tested on 2026-03-03, none viable with LiveKit agents v1.4.4.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SMART_TURN fragments continuous speech into tiny pieces, each triggering
an LLM request that aborts the previous one. FIXED mode waits for a
configurable silence duration (1.0s) before emitting FINAL_TRANSCRIPT
via the built-in END_OF_UTTERANCE handler.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document all findings from the integration process directly in the
source code for future reference:
1. Language code mapping: Speechmatics uses ISO 639-3 "cmn" for
Mandarin, but LiveKit LanguageCode auto-normalizes it to "zh".
Must override stt._stt_options.language after construction.
2. Turn detection modes (critical):
- EXTERNAL: unusable — LiveKit never sends FlushSentinel, only
pushes silence frames, so FINAL_TRANSCRIPT never arrives
- ADAPTIVE: unusable — client-side Silero VAD conflicts with
LiveKit's own VAD, produces zero transcription output
- SMART_TURN: correct choice — server-side intelligent turn
detection, auto-emits FINAL_TRANSCRIPT, fully compatible
3. Speaker diarization: is_active flag distinguishes primary speaker
from TTS echo, solving the "speaker confusion" problem
4. Docker deployment: SPEECHMATICS_API_KEY in .env, watch for
COPY layer cache when rebuilding
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace EXTERNAL mode + monkey-patch hack with SMART_TURN mode.
SMART_TURN uses Speechmatics server-side turn detection that properly
emits AddSegment (FINAL_TRANSCRIPT) when the user finishes speaking.
No client-side finalize or debounce timer needed.
Ref: https://docs.speechmatics.com/integrations-and-sdks/livekit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Speechmatics re-sends identical partial segments during silence, causing
the debounce timer to fire multiple times with the same text. Each
duplicate FINAL aborts the in-flight LLM request and restarts it.
Replace time-based cooldown with text comparison: skip finalization if
the segment text matches the last finalized text. Also skip starting
new timers when partial text hasn't changed from last finalized.
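A minimal sketch of the text-comparison guard, assuming it is kept as a small helper object next to the debounce timer. `FinalDeduplicator` and its method names are hypothetical.

```python
from typing import Optional


class FinalDeduplicator:
    """Remember the last finalized text and reject exact repeats."""

    def __init__(self) -> None:
        self._last_final: Optional[str] = None

    def should_start_timer(self, partial_text: str) -> bool:
        """Skip the debounce timer when the partial equals the last finalized text."""
        return partial_text != self._last_final

    def finalize(self, text: str) -> bool:
        """Return True if text should be emitted as FINAL, False for a repeat."""
        if text == self._last_final:
            return False
        self._last_final = text
        return True
```

Unlike a time-based cooldown, this never suppresses a genuinely new utterance that happens to arrive soon after the previous one.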
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce debounce delay from 700ms to 400ms for faster response
- Add 1.5s cooldown after emitting FINAL to prevent duplicate triggers
that cause LLM abort/retry cycles
- Enable speaker diarization (enable_diarization=True)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LiveKit framework never sends FlushSentinel to the STT stream.
Instead it pushes silence frames and waits for FINAL_TRANSCRIPT events.
In EXTERNAL turn-detection mode, Speechmatics only emits partials.
New approach: each partial transcript restarts a 700ms debounce timer.
When partials stop (user stops speaking), the timer fires and promotes
the last partial to FINAL_TRANSCRIPT, unblocking the pipeline.
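The debounce idea can be sketched with asyncio as follows. `PartialDebouncer` and the `on_final` callback are hypothetical names; the real patch hooks into the plugin's speech stream and emits a FINAL_TRANSCRIPT event instead of calling a plain callback.

```python
import asyncio
from typing import Callable, Optional


class PartialDebouncer:
    """Promote the latest partial once no newer partial arrives within `delay` seconds."""

    def __init__(self, delay: float, on_final: Callable[[str], None]) -> None:
        self._delay = delay
        self._on_final = on_final
        self._timer: Optional[asyncio.Task] = None

    def on_partial(self, text: str) -> None:
        # Each partial cancels the pending timer and starts a new one.
        # Must be called from within a running event loop.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire_later(text))

    async def _fire_later(self, text: str) -> None:
        await asyncio.sleep(self._delay)
        self._on_final(text)
```

With `delay=0.7` this matches the 700ms window described above; only the last partial of a burst is promoted.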
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Trace _patched_process_audio lifecycle and FlushSentinel handling
to diagnose why final transcripts are not being promoted.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VoiceAgentClient.finalize() schedules an async task chain that often
loses the race against session teardown. Instead, intercept partial
segments as they arrive, stash them, and synchronously emit them as
FINAL_TRANSCRIPT when FlushSentinel fires.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the SpeechStream._process_audio patch from container runtime
into our own source code so it survives Docker rebuilds. The patch
adds client.finalize() on FlushSentinel so EXTERNAL mode produces
final transcripts when LiveKit's VAD detects end of speech.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EXTERNAL mode produces partial transcripts but livekit-plugins-speechmatics
does not call finalize() when receiving a flush sentinel from the framework.
A runtime monkey-patch on the plugin's SpeechStream._process_audio adds the
missing finalize() call so final transcripts are generated.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Speechmatics handles end-of-utterance natively via its Voice Agent
API (ADAPTIVE mode). Use turn_detection="stt" on AgentSession so
LiveKit delegates turn boundaries to the STT engine instead of
conflicting with its own VAD-based turn detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ADAPTIVE mode enables a second client-side Silero VAD inside the
Speechmatics SDK that conflicts with LiveKit's own VAD pipeline,
causing no transcription to be returned. EXTERNAL mode delegates
turn detection to LiveKit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LiveKit's LanguageCode class normalizes ISO 639-3 codes to ISO 639-1
(cmn → zh), but Speechmatics API expects "cmn" not "zh". Override
the internal _stt_options.language after construction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Problem:
- Text input area caused BOTTOM OVERFLOWED BY 135 PIXELS when keyboard opened
- Input bar overlapped with call control buttons
- Sent messages were not displayed on screen (only SnackBar feedback)
Solution — split into two distinct layouts:
1. Call Mode (default):
- Full-screen call UI: avatar, waveform, duration, large control buttons
- Keyboard button in controls toggles to chat mode
- No text input elements — clean voice-only interface
2. Chat Mode (tap keyboard button):
- Compact call header: green status dot + "iAgent" + duration + inline
mute/end/speaker/collapse controls
- Scrollable message list (Expanded widget — properly handles keyboard)
- User messages: right-aligned blue bubbles with attachment thumbnails
- Agent responses: left-aligned gray bubbles with robot avatar
- Input bar at bottom: attachment picker + text field + send button
Message display:
- User-sent text/attachments tracked in _messages list, shown as bubbles
- Agent responses sent back via LiveKit data channel (topic='text_reply')
from voice-agent → Flutter, displayed as assistant bubbles
- Auto-scroll to latest message
Voice-agent change (agent.py):
- After session.say(response), publish response text back to Flutter via
ctx.room.local_participant.publish_data() with topic='text_reply'
- Flutter listens for DataReceivedEvent to display agent responses
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable users to send text messages, images, and files to the Agent
while an active voice call is in progress. This addresses the case
where spoken instructions are unclear or screenshots/documents need
to be shared for analysis.
## Architecture
Data flows through LiveKit data channel (not direct HTTP):
Flutter → publishData(topic='text_inject') → voice-agent
→ llm.inject_text_message() → POST /api/v1/agent/tasks (same session)
→ collect streamed response → session.say() → TTS playback
This preserves the constraint that voice-agent owns the agent-service
sessionId — Flutter never contacts agent-service directly.
## Flutter UI (agent_call_page.dart)
- Add keyboard toggle button to active call controls (4-button row)
- Collapsible text input area with attachment picker (+) and send button
- Attachment support: gallery multi-select, camera, file picker
(images max 1024x1024 quality 80%, PDF supported, max 5 attachments)
- Horizontal scrolling attachment preview with delete buttons
- 200KB payload size check before LiveKit data channel send
- Layout adapts: Spacer flex 1/3 toggle, reduced bottom padding
## voice-agent (agent.py)
- Register data_received event listener after session.start()
- Filter for topic='text_inject', parse JSON payload
- Call llm.inject_text_message(text, attachments) and TTS via session.say()
- Use asyncio.ensure_future() wrapper for async handler (matches
existing disconnect handler pattern for sync EventEmitter)
## AgentServiceLLM (agent_llm.py)
- New inject_text_message(text, attachments) method on AgentServiceLLM
- Reuses same _agent_session_id for conversation context continuity
- WS+HTTP streaming: connect, pre-subscribe, POST /tasks with
attachments field, collect full text response, return string
- _injecting flag prevents concurrent _do_stream from clearing
session ID on abort errors while inject is in progress
- Same systemPrompt/voiceMode/engineType as voice pipeline
No agent-service changes required — attachments already supported
end-to-end (JSONB storage → multimodal content blocks → Claude).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, voice mode wrapped every user message with 【语音对话模式】 ("voice conversation mode")
instructions, polluting conversation_messages history with repeated
instructions on every turn. Now:
- systemPrompt carries voice-mode instructions (set once, not per-message)
- prompt contains only the clean user text (identical to text chat pattern)
- Conversation history stays clean for multi-turn context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Remove on_enter greeting entirely (no more race condition)
2. voice-agent sends voiceMode: true when engine_type is claude_agent_sdk
3. AgentController.runTaskStream() filters thinking, tool_use, tool_result
events in voice mode — only text, completed, error reach the client
4. Detailed logging: each event logged with [FILTERED-voice] tag when skipped
Claude API mode is completely unaffected (voiceMode defaults to false).
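The voice-mode filter in item 3 can be illustrated with a short Python generator. Events are assumed to be dicts with a `type` key, and `filter_events` is an illustrative name rather than the controller's method.

```python
from typing import Dict, Iterable, Iterator

# Event types that are hidden from the client in voice mode.
VOICE_HIDDEN_TYPES = {"thinking", "tool_use", "tool_result"}


def filter_events(events: Iterable[Dict], voice_mode: bool = False) -> Iterator[Dict]:
    """Yield events for the client, dropping internal ones when voice_mode is on."""
    for event in events:
        if voice_mode and event["type"] in VOICE_HIDDEN_TYPES:
            continue  # the real code logs these with a [FILTERED-voice] tag
        yield event
```

Because `voice_mode` defaults to False, callers that never set it see the unfiltered stream.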
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Change on_enter greeting from generate_reply() to session.say() with
a static message — avoids spawning an Agent SDK task just for a greeting,
which caused a race condition when the user speaks before it completes.
2. Clear agent session ID when receiving abort/exit errors so the next
task starts a fresh session instead of trying to resume a dead process.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full-stack implementation allowing users to choose between Claude Agent SDK
(default, with tool approval, skill injection, session resume) and Claude API
(direct, lower latency) in Flutter settings. Agent SDK mode wraps prompts with
voice-conversation instructions for concise spoken Chinese output.
Data flow: Flutter Settings → SharedPreferences → POST /livekit/token →
RoomAgentDispatch metadata → voice-agent → AgentServiceLLM(engine_type)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Flutter (agent_call_page.dart):
- Add ConnectOptions with 15s timeouts for connection/peerConnection/iceRestart
- Add RoomReconnectingEvent/RoomAttemptReconnectEvent/RoomReconnectedEvent
listeners with "网络重连中" ("network reconnecting") UI indicator during reconnection
- Add TimeoutException detection in _friendlyError()
voice-agent (agent.py):
- Wrap entrypoint() in try-except with full traceback logging
- Register room disconnect listener to close httpx clients (instead of
finally block, since session.start() returns while session runs in bg)
- Add asyncio import for ensure_future cleanup
voice-agent LLM proxy (agent_llm.py):
- Add retry with exponential backoff (max 2 retries, 1s/3s delays) for
network errors (ConnectError/ConnectTimeout/OSError) and WS InvalidStatusCode
- Extract _do_stream() method for single-attempt logic
- Add WebSocket connection params: open_timeout=10, ping_interval=20,
ping_timeout=10 for keepalive and faster dead-connection detection
- Use granular httpx.Timeout(connect=10, read=30, write=10, pool=10)
- Increase WS recv timeout from 5s to 30s to reduce unnecessary loops
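The retry policy described above (two retries with 1s and 3s delays) might look like this in outline. This is a sketch, not the code in agent_llm.py: `with_retries` is a hypothetical helper, and the built-in `ConnectionError`/`OSError` stand in for the httpx and websockets exception types.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def with_retries(
    attempt: Callable[[], Awaitable[T]],
    delays: Sequence[float] = (1.0, 3.0),
    retry_on: tuple = (ConnectionError, OSError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `attempt`; on a retryable error wait for the next delay and try again."""
    for delay in delays:
        try:
            return await attempt()
        except retry_on:
            await sleep(delay)
    # Final attempt: let any error propagate to the caller.
    return await attempt()
```

The `sleep` parameter is injectable so the policy can be tested without real waiting.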
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default silence_duration_ms=350 is too aggressive for Chinese speech,
causing sentences to be fragmented into 1-3 character chunks. Increase
to 800ms and raise VAD threshold to 0.6 so the STT waits longer before
finalizing a turn, producing complete sentences for LLM processing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The OpenAI Realtime STT uses aiohttp WebSocket connections (not httpx),
so the existing httpx verify=False fix does not apply. LiveKit's
http_context creates aiohttp.TCPConnector without ssl=False, causing
SSL certificate verification errors when OPENAI_BASE_URL points to a
proxy with a self-signed certificate.
Monkey-patch http_context._new_session_ctx to inject ssl=False into the
aiohttp connector, fixing the "CERTIFICATE_VERIFY_FAILED" error for
Realtime STT WebSocket connections.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add user-configurable TTS voice and tone style settings that flow from
the Flutter app through the backend to the voice-agent at call time.
## Flutter App (it0_app)
### Domain Layer
- app_settings.dart: Add `ttsVoice` (default: 'coral') and `ttsStyle`
(default: '') fields to AppSettings entity with copyWith support
### Data Layer
- settings_datasource.dart: Add SharedPreferences keys
`settings_tts_voice` and `settings_tts_style` for local persistence
in loadSettings(), saveSettings(), and clearSettings()
### Presentation Layer
- settings_providers.dart: Add `setTtsVoice()` and `setTtsStyle()`
methods to SettingsNotifier for Riverpod state management
- settings_page.dart: Add "语音" (Voice) settings group between Notifications
and Security groups with:
- Voice picker: 13 OpenAI voices with gender/style labels
(e.g. "女 · 温暖" (female, warm), "男 · 沉稳" (male, steady), "中性" (neutral)) in a BottomSheet
- Style picker: 5 presets (专业干练/温柔耐心/轻松活泼/严肃正式/科幻AI, i.e.
professional/gentle and patient/lively/serious and formal/sci-fi AI)
as ChoiceChips + custom text input field + reset button
### Call Flow
- agent_call_page.dart: Send `tts_voice` and `tts_style` in the POST
body when requesting a LiveKit token at call initiation
## Backend
### voice-service (Python/FastAPI)
- livekit_token.py: Accept optional `tts_voice` and `tts_style` via
Pydantic TokenRequest body model; embed them in RoomAgentDispatch
metadata JSON alongside auth_header (backward compatible)
### voice-agent (Python/LiveKit Agents)
- agent.py: Extract `tts_voice` and `tts_style` from ctx.job.metadata;
use them when creating openai_plugin.TTS() — user-selected voice
overrides config default, user-selected style overrides default
instructions. Falls back to config defaults when not provided.
## Data Flow
Flutter Settings → SharedPreferences → POST /livekit/token body →
voice-service embeds in RoomAgentDispatch metadata →
voice-agent reads from ctx.job.metadata → TTS creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch from tts-1 to gpt-4o-mini-tts for lower latency and better quality
- Change voice from alloy to coral
- Add Chinese speech instructions for natural tone control
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from batch STT (gpt-4o-transcribe via /audio/transcriptions)
to streaming Realtime API (WebSocket). This eliminates the ~2s batch
upload+process latency per utterance.
Also updated nginx proxy on 67.223.119.33 to support WebSocket upgrade
for /v1/realtime endpoint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pass httpx.AsyncClient(verify=False) to OpenAI STT/TTS to support
self-signed certificate on OPENAI_BASE_URL proxy
- Handle generate_reply calls with no user message by falling back to
system/developer instructions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In livekit-agents v1.x @server.rtc_session() pattern, ctx.room is not
yet connected when entrypoint is called. session.start() handles room
connection internally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add room_input_options/room_output_options to session.start() so agent
binds audio I/O and stays in the room
- Add wait_for_participant() before starting session
- Filter AgentConfigUpdate items in agent_llm.py (no 'role' attribute)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace deprecated WorkerOptions(entrypoint_fnc=...) with AgentServer() +
@server.rtc_session() decorator. Use server.setup_fnc for prewarm. Remove
manual ctx.connect() and ctx.wait_for_participant() calls that prevented
the pipeline from properly wiring up VAD→STT→LLM→TTS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RoomInputOptions is deprecated in livekit-agents 1.4.x. Switch to
RoomOptions with explicit audio_input/audio_output enabled.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LiveKit passes RoomAgentDispatch metadata through as job.metadata
(protobuf field), not via a separate agent_dispatch object. Also
use room_io.RoomInputOptions for participant targeting (livekit-agents 1.x).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
livekit-agents 1.x removed the 'participant' parameter from
AgentSession.start(). Use room_input_options with participant_identity
instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The livekit package is the client SDK and doesn't include the server-side
API module. Switch to livekit-api which provides AccessToken, VideoGrants,
RoomAgentDispatch, and RoomConfiguration needed for token generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Upgrade websockets from ==12.0 to >=13.0 (openai[realtime] requires >=13)
- Install torch CPU-only build separately in Dockerfile to avoid ~2GB CUDA download
- Remove torch from requirements.txt (installed via --index-url cpu wheel)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude API supports up to 32MB PDFs; base64 encoding adds ~33% overhead.
50mb body limit covers the maximum single-document upload case.
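The sizing argument can be checked with a few lines of arithmetic. The sketch assumes "MB" means MiB (1024 * 1024 bytes) and ignores the small JSON envelope around the base64 payload.

```python
import math

MIB = 1024 * 1024


def base64_encoded_size(raw_bytes: int) -> int:
    """Size in bytes of padded base64 output for raw_bytes of input."""
    return 4 * math.ceil(raw_bytes / 3)


encoded = base64_encoded_size(32 * MIB)
print(round(encoded / MIB, 2))  # 42.67, comfortably below a 50 MiB body limit
```

A 32MB document therefore needs roughly 43MB on the wire, which leaves headroom for the rest of the request body.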
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PDF files were incorrectly wrapped as type:'image' content blocks,
causing Claude API to reject them as "Invalid image data".
- conversation-context.service: check mediaType for application/pdf,
use type:'document' block (Anthropic native PDF support) instead
- claude-agent-sdk-engine: detect both 'image' and 'document' blocks
when deciding to build multimodal SDK prompt
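The block-type selection can be sketched as below. The dict shape follows the Anthropic Messages API base64 source format; `build_attachment_block` is an illustrative helper name, not the service's method.

```python
from typing import Dict


def build_attachment_block(media_type: str, base64_data: str) -> Dict:
    """Wrap an attachment as a 'document' block for PDFs, 'image' otherwise."""
    block_type = "document" if media_type == "application/pdf" else "image"
    return {
        "type": block_type,
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64_data,
        },
    }
```

Any consumer that checks for multimodal content then has to look for both block types, as the second bullet above notes.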
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>