42 Commits

Author SHA1 Message Date
hailin fb236de6e4 fix: set LiveKit node_ip to China IP for domestic WebRTC connectivity
LiveKit's use_external_ip auto-detected 154.84.135.121 (overseas) via
STUN, causing WebRTC ICE candidates to use an unreachable IP for
domestic mobile clients. Explicitly set node_ip to 14.215.128.96.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 21:51:17 -08:00
hailin 8a48e92970 fix: use domain names for API access, China IP for LiveKit
Flutter app now uses https://it0api.szaiai.com (nginx reverse proxy)
instead of direct IP:port. LiveKit URL uses China IP 14.215.128.96
for lower latency from domestic mobile clients.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 21:44:25 -08:00
hailin 7fb0168dc5 fix: keep voice-service on bridge networking to avoid port conflict
iconsulting-llm-gateway already occupies port 3008 on the host.
voice-service exposes only a single TCP port (so docker-proxy overhead is not a concern),
so bridge networking with 13008:3008 mapping is sufficient.
Only livekit-server and voice-agent need host mode (UDP port ranges).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 20:23:13 -08:00
hailin 68ee2516d5 fix: use host networking for voice services to eliminate docker-proxy overhead
Bridge mode created 600+ docker-proxy processes for LiveKit's UDP port-range
mappings (30000-30100, 50000-50200). Switch livekit-server, voice-agent, and
voice-service to network_mode: host for zero-overhead networking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 19:58:32 -08:00
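The 600+ figure in the commit above is consistent with simple port arithmetic. A minimal sketch, assuming Docker starts one docker-proxy process per published port and per host address family (IPv4 and IPv6); that doubling is an assumption about the host's Docker setup, not something the commit states, and the function names are illustrative:

```python
# Port ranges taken from the commit message; both bounds are inclusive.
UDP_RANGES = [(30000, 30100), (50000, 50200)]

# Assumption: each port is published on both IPv4 and IPv6,
# so Docker starts two docker-proxy processes per mapped port.
ADDRESS_FAMILIES = 2


def published_ports(ranges):
    """Count the ports covered by a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


def proxy_processes(ranges, families=ADDRESS_FAMILIES):
    """Estimate docker-proxy processes for bridge-mode port mappings."""
    return published_ports(ranges) * families


if __name__ == "__main__":
    print(published_ports(UDP_RANGES))  # 302
    print(proxy_processes(UDP_RANGES))  # 604
```

With network_mode: host no ports are published, so under the same assumption the estimate drops to zero.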
hailin 2dc361f7a0 chore: update docker-compose TTS defaults to gpt-4o-mini-tts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 08:44:17 -08:00
hailin cf60b8733f fix: expose TURN relay ports for NAT traversal
Limit TURN relay range to 30000-30100 and expose via docker-compose.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 11:39:50 -08:00
hailin 2f0cb13ecb fix: enable built-in TURN server for NAT traversal
Subscriber transport was timing out on DTLS handshake for clients
behind complex NAT (VPN/symmetric NAT). Enable LiveKit's built-in
TURN server on UDP port 3478.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 11:37:21 -08:00
hailin 2ce0e7cdd4 fix: use external LiveKit URL in voice-service config
The livekit_ws_url returned in token response needs to be the external
server address, not the internal Docker network name, so Flutter clients
can connect directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 10:00:26 -08:00
hailin 94a14b3104 feat: migrate voice call from WebSocket/PCM to LiveKit WebRTC
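Real-time voice conversation architecture migration: WebSocket → LiveKit WebRTC

## Background
The original voice-call architecture streamed raw PCM over a FastAPI WebSocket
and ran the pipeline serially (VAD → batch STT → Agent → sentence accumulation
→ batch TTS), with a time-to-first-audio of about 6 seconds. After migrating to
the LiveKit Agents framework, WebRTC transport plus pipelined parallelism is
expected to cut that latency to 1.5-2 seconds.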

## Architecture
Flutter App ←── WebRTC (Opus/UDP) ──→ LiveKit Server ←──→ Voice Agent
  livekit_client                      (self-hosted, Go)   (Python, LiveKit Agents SDK)
                                                          ├─ VAD (Silero)
                                                          ├─ STT (faster-whisper / OpenAI)
                                                          ├─ LLM (custom plugin → agent-service)
                                                          └─ TTS (Kokoro / OpenAI)

Key design: the LLM does not call the Claude API directly. Instead, a custom
plugin proxies to the existing agent-service, preserving Tool Use, session
history, tenant isolation, and related capabilities.

## New services

### voice-agent (packages/services/voice-agent/)
LiveKit Agent Worker, containing:
- agent.py: entry point; prewarm() preloads models, entrypoint() orchestrates the session
- plugins/agent_llm.py: custom LLM plugin that proxies the agent-service API
  - POST /api/v1/agent/tasks creates a task
  - WS /ws/agent subscribes to streaming events (stream_event)
  - Reuses session_id across turns to keep conversation context
- plugins/whisper_stt.py: local faster-whisper STT (batch recognition)
- plugins/kokoro_tts.py: local Kokoro-82M TTS (24kHz PCM)
- config.py: pydantic-settings configuration
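
The session-reuse behaviour described for plugins/agent_llm.py can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the class name `AgentSessionProxy`, the injected `create_task` callable, and the payload and response field names (`prompt`, `sessionId`, `taskId`) are all hypothetical stand-ins for the real POST /api/v1/agent/tasks call, and the WebSocket event subscription is left out.

```python
from typing import Callable, Optional


class AgentSessionProxy:
    """Forwards each user turn to agent-service and reuses the session ID.

    `create_task` is a hypothetical stand-in for the HTTP call to
    POST /api/v1/agent/tasks; it takes a payload dict and returns a dict.
    """

    def __init__(self, create_task: Callable[[dict], dict]) -> None:
        self._create_task = create_task
        self._session_id: Optional[str] = None

    def send_turn(self, text: str) -> dict:
        payload = {"prompt": text}  # field name is an assumption
        if self._session_id is not None:
            # Later turns carry the session ID so context is preserved.
            payload["sessionId"] = self._session_id
        response = self._create_task(payload)
        # Remember the session ID returned by the service, if any.
        self._session_id = response.get("sessionId", self._session_id)
        return response


if __name__ == "__main__":
    seen = []

    def fake_create_task(payload: dict) -> dict:
        seen.append(dict(payload))
        return {"sessionId": "s-1", "taskId": f"t-{len(seen)}"}

    proxy = AgentSessionProxy(fake_create_task)
    proxy.send_turn("check disk usage")
    proxy.send_turn("and memory?")
    print(seen[1])  # the second payload carries the session ID from turn one
```

Injecting the transport keeps the sketch runnable without a network; a real plugin would replace `fake_create_task` with the HTTP request.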

### LiveKit Server (deploy/docker/)
- livekit.yaml: signaling port 7880, RTC TCP 7881, UDP 50000-50200
- docker-compose.yml: adds livekit-server + voice-agent containers

### LiveKit token endpoint
- voice-service/src/api/livekit_token.py:
  POST /api/v1/voice/livekit/token
  Generates a Room JWT and embeds auth_header in the AgentDispatch metadata

## Flutter client changes
- agent_call_page.dart: simplified from ~814 lines to ~380 lines
  - Replaced: WebSocketChannel, AudioRecorder, PcmPlayer, manual heartbeat/reconnect
  - Now uses: Room.connect(), setMicrophoneEnabled(true), LiveKit event listeners
  - Waveform animation now driven by participant.audioLevel
- pubspec.yaml: adds livekit_client: ^2.3.0
- app_config.dart: adds livekitUrl field
- api_endpoints.dart: adds livekitToken endpoint

## Configuration (environment variables)
- STT_PROVIDER: local (default, faster-whisper) / openai
- TTS_PROVIDER: local (default, Kokoro) / openai
- WHISPER_MODEL: base (default) / small / medium / large
- WHISPER_LANGUAGE: zh (default)
- KOKORO_VOICE: zf_xiaoxiao (default)
- DEVICE: cpu (default) / cuda
## Unchanged
- agent-service: untouched; voice-agent calls it through the existing API
- voice-service core: pipeline/STT/TTS/VAD retained (as a Twilio fallback)
- Kong gateway: existing routes unchanged
- Database: no schema changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 08:55:33 -08:00
hailin 3ed20cdf08 refactor: clean up agent SSH setup after fixing host-local routing
- Remove iproute2/NET_ADMIN (no longer needed)
- Remove ip route hack from entrypoint.sh
- rwa-colocation-2 server record updated to use Docker gateway IP
  since 14.215.128.96 is a host-local NIC on the IT0 server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 18:11:44 -08:00
hailin ae7d9251ec fix: add route for host-local IP (14.215.128.96) in agent container
14.215.128.96 is bound to a host NIC (enp5s0) and unreachable from
Docker bridge via default NAT. Add NET_ADMIN + ip route via gateway.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 18:05:30 -08:00
hailin 0dea3f82bc fix: mount correct SSH key (rwadurian_ed25519) in agent-service
The IT0 server has its own id_ed25519 which differs from the local
key that's authorized on RWADurian servers. Use a dedicated key file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 13:05:01 -08:00
hailin f0ad6e09e6 fix: move entrypoint.sh to project root (deploy/ is in .dockerignore)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 12:14:31 -08:00
hailin bad7f4802d fix: use root entrypoint to copy SSH key then drop to appuser
The bind-mounted SSH key is owned by host uid (1000/node) but the
service runs as appuser (uid 1001). Use su-exec in entrypoint.sh
to copy the key as root, fix ownership, then drop privileges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 12:13:55 -08:00
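The copy, fix-ownership, drop-privileges sequence from the commit above can be sketched in Python. The real entrypoint.sh is a shell script using su-exec, so this is only an analogue under stated assumptions: the function names are hypothetical, and the privilege drop only works when the process starts as root.

```python
import os
import shutil
from pathlib import Path
from typing import List, Optional


def install_key(src: Path, dest: Path, uid: Optional[int] = None,
                gid: Optional[int] = None) -> None:
    """Copy a bind-mounted key to `dest` with owner-only permissions."""
    dest.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # private keys should be readable by the owner only
    if uid is not None and gid is not None:
        # Changing ownership requires root, which the entrypoint has.
        os.chown(dest.parent, uid, gid)
        os.chown(dest, uid, gid)


def drop_privileges_and_exec(uid: int, gid: int, argv: List[str]) -> None:
    """Switch to the unprivileged user, then replace this process."""
    os.setgroups([])
    os.setgid(gid)  # group first: after setuid the process may not change it
    os.setuid(uid)
    os.execvp(argv[0], argv)
```

In the commit's setup the source would be the read-only mount (/tmp/host-ssh-key in the earlier commit) and the target user appuser (uid 1001).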
hailin 329916e1f6 fix: correct SSH key permissions in agent-service container
Mount host key to /tmp/host-ssh-key (read-only), then copy to
appuser's .ssh directory with correct ownership at container start.
Fixes "Permission denied" due to uid mismatch on bind mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 12:00:02 -08:00
hailin 795e8a11c5 feat: enable SSH access from agent-service container
- Add openssh-client to Dockerfile.service (alpine)
- Create .ssh directory with correct permissions for appuser
- Mount host SSH key into agent-service container (read-only)

This allows the Agent SDK to SSH into managed servers using the Bash tool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 11:55:54 -08:00
hailin cc0f06e2be feat: SDK engine native resume with per-tenant HOME isolation
Replace prompt-prefix workaround with SDK's native resume mechanism.
Each tenant gets isolated HOME directory (/data/claude-tenants/{tenantId})
to prevent cross-tenant session file mixing. SDK session IDs are persisted
in session.metadata for cross-request resume support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 02:27:38 -08:00
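The per-tenant HOME isolation described in the commit above comes down to deriving a directory from the tenant ID and handing it to the SDK process as HOME. A minimal sketch, assuming tenant IDs consist of letters, digits, hyphens, and underscores; the helper names and that ID format are assumptions, and the service itself is not written in Python:

```python
import os
import re
from pathlib import Path
from typing import Mapping, Optional

TENANT_ROOT = Path("/data/claude-tenants")  # path from the commit message
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]+")  # assumed ID format


def tenant_home(tenant_id: str) -> Path:
    """Return the isolated HOME directory for one tenant."""
    if not _TENANT_ID.fullmatch(tenant_id):
        # Reject values such as "../other" that could escape the root.
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return TENANT_ROOT / tenant_id


def tenant_env(tenant_id: str,
               base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment and point HOME at the tenant's directory."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOME"] = str(tenant_home(tenant_id))
    return env


if __name__ == "__main__":
    print(tenant_env("tenant-a", {"HOME": "/home/appuser"})["HOME"])
```

Validating the ID before building the path is what keeps one tenant's session files out of another tenant's directory.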
hailin c02c2a9a11 feat: add OpenAI TTS/STT provider support in voice pipeline
- Add STT_PROVIDER/TTS_PROVIDER config (local or openai) in settings
- Pipeline uses OpenAI API for STT/TTS when provider is "openai"
- Skip loading local models (Kokoro/faster-whisper) when using OpenAI
- VAD (Silero) always loads for speech detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 09:27:38 -08:00
hailin d43baed3a5 feat: add OpenAI TTS/STT API endpoints for comparison testing
- Add openai package to voice-service requirements
- Add /api/v1/test/tts/synthesize-openai (tts-1/tts-1-hd/gpt-4o-mini-tts)
- Add /api/v1/test/stt/transcribe-openai (gpt-4o-transcribe/whisper-1)
- Add OPENAI_API_KEY and OPENAI_BASE_URL env vars to voice-service
- Flutter test page: SegmentedButton to toggle Local/OpenAI provider
- All endpoints maintain same response format for easy comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 07:20:03 -08:00
hailin 7ac753ada4 fix: add ANTHROPIC_BASE_URL to agent-service for proxy access
The agent-service was missing the ANTHROPIC_BASE_URL environment variable,
causing the Claude Agent SDK to call api.anthropic.com directly instead of
going through the proxy at 67.223.119.33, resulting in 403 Forbidden errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 04:49:27 -08:00
hailin 6876ec569b fix: remove ANTHROPIC_API_KEY from agent-service to use subscription mode
Default to OAuth subscription billing via ~/.claude/.credentials.json
instead of consuming API key credits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 03:43:09 -08:00
hailin 82d12a5ff5 feat: mount voice model cache volumes to avoid re-downloading on restart
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 02:28:28 -08:00
hailin abf5e29419 feat: route voice pipeline through agent-service instead of direct LLM
Voice calls now use the same agent task + WS subscription flow as the
chat UI, enabling tool use and command execution during voice sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:47:31 -08:00
hailin d4391eef97 fix: run services as non-root user for SDK bypassPermissions
SDK blocks bypassPermissions when running as root for security.
Add non-root 'appuser' to Dockerfile.service and update volume
mounts to use /home/appuser/.claude paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 06:41:10 -08:00
hailin 04a18a7899 fix: use acceptEdits mode and mount .claude.json for SDK
- bypassPermissions blocked by SDK when running as root
- Switch to acceptEdits with canUseTool for programmatic control
- Mount .claude.json config file into container

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 06:37:31 -08:00
hailin 3a6f9d9447 fix: mount .claude directory as read-write for SDK debug logs
SDK writes debug logs to ~/.claude/debug/ at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 06:21:31 -08:00
hailin b963b7d4da feat: enable SDK subscription mode with OAuth credentials mount
- Mount ~/.claude/ into agent-service container for OAuth token access
- Switch default engine to claude_agent_sdk
- Remove ANTHROPIC_API_KEY from env in subscription mode so SDK uses OAuth
- Keep API key mode for per-tenant billing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 06:14:45 -08:00
hailin 810dcd7def feat: switch default engine to claude_api with base URL support
- Change AGENT_ENGINE_TYPE from claude_code_cli to claude_api in docker-compose
- Add ANTHROPIC_BASE_URL env var support to claude-api-engine
- Add ANTHROPIC_BASE_URL to agent-service environment in docker-compose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 05:45:08 -08:00
hailin 9a1ecf10ec fix: add restart policy, global error handlers, and fix tenant schema bug
- Add restart: unless-stopped to all 12 Docker services
- Add process.on('unhandledRejection') and process.on('uncaughtException') handlers to the main.ts of all 7 services
- Fix handleEventTrigger, which used the tenantId UUID as the schema name instead of looking up the tenant slug
- Wrap Redis event subscription callbacks in try/catch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 05:30:34 -08:00
hailin 48e47975ca fix: configure Kong JWT auth flow with consumer credentials
- Add kid claim to auth-service JWT for Kong validation
- Add Kong consumer with JWT credential (shared secret via env)
- Add agent-config route to Kong for /api/v1/agent-config
- Kong Dockerfile uses entrypoint script to inject JWT_SECRET at runtime
- Fix frontend login path (/auth/login → /api/v1/auth/login)
- Extract tenantId from JWT on login and store as current_tenant
- Add auth guard in admin layout (redirect to /login if no token)
- Pass JWT_SECRET env var to Kong container in docker-compose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:20:06 -08:00
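The kid claim and shared secret in the commit above can be illustrated with a standard-library HS256 token. This is a sketch, assuming the gateway looks up the consumer credential by the kid claim and verifies the signature with the shared secret; the claim values and function names are hypothetical, and expiry checks are omitted.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _signature(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode("utf-8"),
                      signing_input.encode("ascii"),
                      hashlib.sha256).digest()
    return _b64url(digest)


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a compact JWT: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_signature(signing_input, secret)}"


def verify_hs256(token: str, secret: str) -> dict:
    """Check the signature and return the claims (no expiry check)."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(_signature(signing_input, secret), signature):
        raise ValueError("bad signature")
    payload_b64 = signing_input.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


if __name__ == "__main__":
    # Claim values are placeholders, not the project's real identifiers.
    token = sign_hs256({"kid": "example-key", "tenantId": "t1"}, "dev-secret")
    print(verify_hs256(token, "dev-secret")["kid"])
```

The issuer and the gateway must agree on both the kid value and the secret, which is why the commit passes JWT_SECRET to the Kong container as well.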
hailin e5dcfa6113 feat: configure it0.szaiai.com and it0api.szaiai.com domains
- Update Kong CORS origins to allow it0.szaiai.com
- Update WebSocket URL to wss://it0api.szaiai.com
- Fix proxy route to read API_BASE_URL at request time
  (was being inlined at build time by Next.js standalone)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 22:54:17 -08:00
hailin 67d5a13c0c fix: set compose project name to 'it0' for consistent image naming
Changes image names from docker-{service} to it0-{service}.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 02:57:42 -08:00
hailin 259838ae88 fix: set HOSTNAME=0.0.0.0 for Next.js standalone to bind all interfaces
Next.js standalone server binds to container hostname by default,
making it unreachable from 127.0.0.1 for healthchecks and from
Docker port forwarding.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 02:52:37 -08:00
hailin 83da374bbb fix: use 127.0.0.1 in web-admin healthcheck to avoid IPv6 resolution
Node.js 18 resolves 'localhost' to ::1 (IPv6) but Next.js standalone
only binds to 0.0.0.0 (IPv4), causing Connection Refused.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 02:49:51 -08:00
hailin 3702fa3f52 fix: make voice-service startup graceful and fix device config
- Wrap model loading in try/except so server starts even if models fail
- Fix device env var mapping (unified 'device' field instead of 'whisper_device')
- Default Whisper model to 'base' instead of 'large-v3' (3GB) for CPU deployment
- Increase healthcheck start_period to 120s for model download time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 00:20:12 -08:00
hailin d0447fb69f fix: use node/python HTTP healthchecks instead of wget
wget exits with an error on a 404, but the services are healthy (they
just have no root endpoint). Use node http.get for NestJS services
(accepting any non-5xx response) and python urllib for voice-service.

Also upgrade api-gateway depends_on to service_healthy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 00:13:47 -08:00
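The "any non-5xx response counts as healthy" rule from the commit above can be sketched with urllib, which the commit names for voice-service. The function names and the timeout are assumptions, not the project's actual healthcheck command.

```python
import urllib.error
import urllib.request


def status_is_healthy(status: int) -> bool:
    """Treat any non-5xx response as proof that the server is up."""
    return status < 500


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service at `url` answers with a non-5xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return status_is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; a 404 still means the service answered.
        return status_is_healthy(err.code)
    except OSError:
        # Connection refused, DNS failure, or timeout: the service is not up.
        return False
```

A container healthcheck would call `probe` with the service's own address and exit with status 0 or 1 depending on the result.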
hailin e7ae82e51d feat: add healthcheck to all services in docker-compose
NestJS services use wget to check API endpoints.
voice-service uses curl to check FastAPI /docs endpoint.
web-admin uses wget to check Next.js root.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 00:10:38 -08:00
hailin 4db373b03f . 2026-02-19 20:37:19 +08:00
hailin e875cd49bb fix: resolve Kong image tag and port conflicts for shared server
- Change Kong base image from kong:3.7-alpine (non-existent) to kong:3.7
- Remap all host ports to avoid conflicts with existing iconsulting services:
  - Backend services: 13001-13008 (was 3001-3008)
  - Web admin: 13000 (was 3000)
  - API gateway: 18000/18001 (was 8000/8001)
  - PostgreSQL: 15432 (was 5432)
  - Redis: 16379 (was 6379)
- Add container_name with it0- prefix to all services
- Update deploy.sh health check ports to match new mappings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 04:36:23 -08:00
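Every remapped host port in the commit above is the container port plus 10000. A short sketch of that rule; the constant and function names are illustrative, and the port lists are copied from the commit message:

```python
HOST_PORT_OFFSET = 10000  # every remapped port in the commit is shifted by this

# Container ports named in the commit message.
CONTAINER_PORTS = {
    "backend": list(range(3001, 3009)),  # 3001-3008
    "web-admin": [3000],
    "api-gateway": [8000, 8001],
    "postgres": [5432],
    "redis": [6379],
}


def host_port(container_port: int) -> int:
    """Map a container port to its host port on the shared server."""
    return container_port + HOST_PORT_OFFSET


def compose_mappings(ports) -> list:
    """Render docker-compose style "host:container" port strings."""
    return [f"{host_port(p)}:{p}" for p in ports]


if __name__ == "__main__":
    for service, ports in CONTAINER_PORTS.items():
        print(service, compose_mappings(ports))
```

The rule also reproduces the 13008:3008 mapping that a later commit keeps for voice-service.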
hailin 9120f4927e fix: add Dockerfiles and fix docker-compose build configuration
- Add shared Dockerfile.service for all 7 NestJS microservices using
  multi-stage build with pnpm workspace support
- Add Dockerfile for web-admin (Next.js standalone output)
- Add .dockerignore files for root and web-admin
- Fix docker-compose.yml: use monorepo root as build context with
  SERVICE_NAME build arg instead of per-service Dockerfiles
- Fix postgres/redis missing network config (services couldn't reach them)
- Use .env variables for DB credentials instead of hardcoded values
- Add JWT_REFRESH_SECRET and REDIS_URL to services that were missing them
- Add DB init script volume mount for postgres
- Remove deprecated version: '3.8' from all compose files
- Add output: 'standalone' to next.config.js for optimized Docker builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 04:31:23 -08:00
hailin e761b65b6e feat: add deployment scripts with SSL support for production
Backend deploy script (deploy/docker/deploy.sh):
- install: auto-generate .env with secure secrets (JWT, DB passwords, vault keys)
- up/down/restart: manage all services (infra + app + gateway)
- build/build-no-cache: Docker image management
- status/health: health checks for all 9 services + infrastructure
- migrate: TypeORM migration commands (run/generate/revert/schema-sync)
- infra-*: standalone infrastructure management (PostgreSQL + Redis)
- voice-*: voice service with GPU support (docker-compose.voice.yml overlay)
- start-svc/stop-svc/rebuild-svc: individual service operations
- ssl-init: obtain Let's Encrypt certificates for both domains independently
- ssl-up/ssl-down: start/stop with Nginx SSL reverse proxy
- ssl-renew/ssl-status: certificate renewal and status checks

Web Admin deploy script (it0-web-admin/deploy.sh):
- build/start/stop/restart/logs/status/clean commands
- auto-generates Dockerfile (Next.js multi-stage standalone build)
- auto-generates docker-compose.yml
- configurable API domain (default: it0api.szaiai.com)

SSL / Nginx configuration:
- nginx.conf: reverse proxy for both domains with HTTP->HTTPS redirect
  - it0api.szaiai.com -> api-gateway:8000 (with WebSocket support)
  - it0.szaiai.com -> web-admin:3000 (with Next.js HMR support)
- nginx-init.conf: HTTP-only config for initial ACME challenge verification
- ssl-params.conf: TLS 1.2/1.3, HSTS, security headers (Mozilla Intermediate)
- docker-compose.ssl.yml: Nginx + Certbot overlay with auto-renewal (12h cycle)

Domain plan:
- https://it0api.szaiai.com — API endpoint (backend services)
- https://it0.szaiai.com — Web Admin dashboard (frontend)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 17:44:27 -08:00
hailin 00f8801d51 Initial commit: IT0 AI-powered server cluster operations platform
Full-stack monorepo with DDD + Clean Architecture:
- Backend: 7 NestJS microservices + 5 shared libraries (TypeScript)
- Mobile: Flutter app with Riverpod (Dart)
- Web Admin: Next.js dashboard with Zustand + React Query
- Voice: Python voice service (STT/TTS/VAD)
- Infra: Docker Compose, K8s manifests, Turborepo build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 22:54:37 -08:00