feat(voice): add per-turn interrupt support to VoiceSessionManager

Implements a two-level abort controller design to support real-time
interruption when the user speaks while the agent is still responding:

  sessionAbortController (session-scoped)
    - Created once when startSession() is called
    - Fired only by terminateSession() (user hangs up)
    - Propagated into each turn via addEventListener

  turnAbort (per-turn, stored as handle.currentTurnAbort)
    - Created fresh at the start of each executeTurn() call
    - Stored on the VoiceSessionHandle so injectMessage() can abort it
    - When a new inject arrives while a turn is running, injectMessage()
      calls turnAbort.abort() BEFORE enqueuing the new message

Interruption flow:
  1. User speaks mid-response → LiveKit stops TTS playback (client-side)
  2. STT utterance → POST voice/inject → injectMessage() fires
  3. handle.currentTurnAbort.abort() called → sets aborted flag
  4. for-await loop checks turnAbort.signal.aborted on next SDK event → break
  5. catch block NOT reached (break ≠ exception) → no error event emitted
  6. finally block saves partial text with "[中断]" suffix to history
  7. New message dequeued → fresh executeTurn() starts immediately

Why no "Agent error" message plays to the user:
  - break exits the for-await loop silently, not via exception
  - The catch block's error-event emission is guarded by err?.name !== 'AbortError'
    AND requires an actual exception; a plain break never enters catch
  - Empty or partial responses are filtered by `if response:` in agent.py

Also update module-level JSDoc with full architecture explanation covering
the long-lived run loop design, two-level abort hierarchy, tenant context
injection pattern, and SDK session resume across turns.

Update agent.py module docstring to document voice session lifecycle and
interruption flow for future maintainers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
hailin 2026-03-04 04:25:57 -08:00
parent 635cca18fa
commit d097c64c81
2 changed files with 120 additions and 17 deletions

View File

@ -1,22 +1,68 @@
/**
* VoiceSessionManager
*
* Manages long-lived agent run loops for voice calls.
* Manages long-lived Claude Agent SDK run loops for voice calls.
*
* Lifecycle:
* startSession(sessionId) spawn background run loop, ready for messages
* injectMessage(sessionId) enqueue speech turn; loop processes sequentially
* terminateSession(sessionId) send poison-pill + abort; loop exits cleanly
* Architecture overview
* ---------------------
* Text chat uses a stateless per-turn model: each user message becomes an
* independent POST /tasks request that starts a new SDK process. Voice calls
* need a different model because:
* 1. Calls can last minutes; spawning a new process per utterance is too slow.
* 2. The SDK supports native session resume (sdkSessionId), letting it carry
* tool state and conversation context across turns without re-sending history.
* 3. The agent must be explicitly shut down when the user hangs up, not just
* left running until it times out.
*
* Run-loop design (per voice session):
* while not terminated:
* message queue.dequeue() blocks between speech turns
* executeTask(message, resume) one SDK turn, streams events via gateway
* capture new sdkSessionId for next turn's native resume
* This service implements the alternative model:
* One background run loop per active voice call (Node.js async, not a thread).
* An AsyncQueue<string | null> bridges HTTP inject requests to the loop.
* The loop blocks on queue.dequeue() between turns (zero CPU while idle).
* On each turn it calls ClaudeAgentSdkEngine.executeTask() and streams the
* resulting events to the WebSocket gateway (picked up by voice-agent TTS).
* The SDK session ID returned after each turn is saved to AgentSession.metadata
* so the NEXT turn's executeTask() call can resume from where the previous
* turn left off (native SDK resume, no re-sending of conversation history).
*
* This replaces the per-turn "POST /tasks" model used by text chat.
* The SDK session is kept alive across turns via the `resume` option,
* and the run loop is explicitly terminated when the user hangs up.
* Lifecycle
* ---------
* startSession(sessionId) create queue + AbortController, start loop
* injectMessage(sessionId) enqueue speech turn; if a turn is running,
* abort it first (per-turn interrupt support)
* terminateSession(sessionId) abort session + drain queue + enqueue null
* sentinel; wait 5 s for loop to exit
*
* Interruption model (per-turn AbortController)
* ---------------------------------------------
* Two levels of abort exist:
*
* sessionAbortController session-scoped; fired only on terminateSession().
* Propagates into each turn via an event listener.
*
* turnAbort (per turn) created fresh for each executeTurn() call.
* Stored as handle.currentTurnAbort so that
* injectMessage() can abort the RUNNING turn before
* enqueuing the new message.
*
* When the user interrupts (speaks while the agent is responding):
* 1. voice-agent LiveKit framework stops TTS playback immediately.
* 2. voice-agent calls POST /:sessionId/voice/inject with new utterance.
* 3. injectMessage() sees handle.currentTurnAbort !== null aborts it.
* 4. The for-await loop in executeTurn() checks turnAbort.signal.aborted
* on the NEXT received SDK event breaks silently (no error emitted).
* 5. The new message is enqueued; the loop dequeues it and starts a fresh turn.
*
* Because `break` does not throw, the catch block's error-event emission is
* never triggered by an interrupt the user hears no "agent error" message.
* Any partial assistant text accumulated before the break is saved to
* conversation history with a "[中断]" suffix for context continuity.
*
* Tenant context
* --------------
* The run loop is a background Promise, outside any HTTP request context.
* TenantContextService.run() wraps each executeTurn() call to inject the
* tenant's AsyncLocalStorage context (schema name, quotas, etc.) the same
* pattern used by the standing-order executor in ops-service.
*/
import { Injectable, Logger } from '@nestjs/common';
import { AsyncQueue } from '../../infrastructure/voice/async-queue';
@ -39,8 +85,10 @@ const TERMINATE: null = null;
interface VoiceSessionHandle {
/** Message queue: string = user speech turn; null = terminate signal. */
queue: AsyncQueue<string | null>;
/** Allows aborting the currently-running SDK executeTask call. */
/** Aborts the entire run loop (used on session terminate). */
abortController: AbortController;
/** Aborts the currently-executing SDK turn only (replaced each turn). */
currentTurnAbort: AbortController | null;
/** Tenant who owns this voice session. */
tenantId: string;
/** Background run-loop promise (resolved when loop exits). */
@ -82,6 +130,7 @@ export class VoiceSessionManager {
const handle: VoiceSessionHandle = {
queue,
abortController,
currentTurnAbort: null,
tenantId,
runLoop: this.runLoop(sessionId, tenantId, queue, abortController),
};
@ -97,6 +146,15 @@ export class VoiceSessionManager {
async injectMessage(sessionId: string, message: string): Promise<boolean> {
const handle = this.sessions.get(sessionId);
if (!handle) return false;
// If a turn is currently running, abort it immediately so the new message
// can be processed without waiting for the old SDK call to finish.
if (handle.currentTurnAbort) {
this.logger.log(`[VoiceSession ${sessionId}] Interrupting current turn for new message`);
handle.currentTurnAbort.abort();
handle.currentTurnAbort = null;
}
handle.queue.enqueue(message);
this.logger.log(`[VoiceSession ${sessionId}] Injected: "${message.slice(0, 80)}"`);
return true;
@ -202,8 +260,17 @@ export class VoiceSessionManager {
sessionId: string,
tenantId: string,
message: string,
abortController: AbortController,
sessionAbortController: AbortController,
): Promise<void> {
// Create a per-turn abort controller so this turn can be interrupted
// independently when the user speaks again mid-response.
const turnAbort = new AbortController();
const handle = this.sessions.get(sessionId);
if (handle) handle.currentTurnAbort = turnAbort;
// Combine session-level abort with turn-level abort: if either fires, abort the turn.
const onSessionAbort = () => turnAbort.abort();
sessionAbortController.signal.addEventListener('abort', onSessionAbort, { once: true });
const session = await this.sessionRepository.findById(sessionId);
if (!session) {
this.logger.error(`[VoiceSession ${sessionId}] Session not found in DB — cannot execute turn`);
@ -264,8 +331,8 @@ export class VoiceSessionManager {
});
for await (const event of stream) {
// Exit early if the voice session was terminated mid-turn
if (abortController.signal.aborted) break;
// Exit early if this turn was interrupted (user spoke again) or session terminated
if (turnAbort.signal.aborted) break;
if (!voiceFilteredTypes.has(event.type)) {
this.gateway.emitStreamEvent(sessionId, event);
@ -322,6 +389,14 @@ export class VoiceSessionManager {
});
}
} finally {
// Remove the session-abort listener to avoid memory leaks
sessionAbortController.signal.removeEventListener('abort', onSessionAbort);
// Clear the per-turn abort ref on the handle (if it still points to this turn)
if (handle && handle.currentTurnAbort === turnAbort) {
handle.currentTurnAbort = null;
}
// If aborted mid-turn, save any partial text accumulated before the abort
if (!finished && textParts.length > 0) {
await this.contextService

View File

@ -4,6 +4,34 @@ IT0 Voice Agent — LiveKit Agents v1.x entry point.
Uses the official AgentServer + @server.rtc_session() pattern.
Pipeline: VAD STT LLM (via agent-service) TTS.
Voice Session Lifecycle (long-lived agent run loop)
----------------------------------------------------
Each voice call maps to ONE long-lived agent session in agent-service,
instead of spawning a new process for every speech turn.
Call starts POST /api/v1/agent/sessions/voice/start
agent-service creates an AgentSession, starts a background
run loop, and returns a sessionId.
User speaks LiveKit STT AgentServiceLLM._run()
POST /:sessionId/voice/inject
agent-service enqueues the utterance; run loop picks it up,
calls Claude Agent SDK, streams events back via WebSocket.
User hangs up room "disconnected" event _on_room_disconnect()
DELETE /:sessionId/voice
agent-service aborts the run loop and marks session completed.
Interruption (mid-turn abort)
------------------------------
When the user speaks while the agent is still responding:
1. LiveKit framework stops TTS playback immediately (client-side).
2. STT produces the new utterance voice/inject is called.
3. agent-service detects a turn is already running aborts it (per-turn
AbortController) enqueues the new message.
4. The SDK loop breaks silently; no error message is emitted to TTS.
5. The new turn starts, producing the response to the interrupting utterance.
Agent State & Thinking Indicator
---------------------------------
LiveKit AgentSession (v1.4.3+) automatically publishes the participant