Bug 1 — Watchdog doesn't track followers:
lastPollAt = Date.now() moved before leader check. All poll()
invocations update the timestamp, so if a follower's loop dies
the watchdog fires after WATCHDOG_THRESHOLD_MS and restarts it.
Bug 2 — Non-atomic GET + DEL in cross-instance recovery:
Replaced GET + DEL with atomic GETDEL (Redis 6.2+, ioredis v5).
Two instances can no longer both recover the same callback reply.
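
The effect of the atomic claim can be sketched as follows. This is a minimal illustration, not the service's code: `PendingStore`, `InMemoryPendingStore` and `claimPendingCallback` are hypothetical names, and the in-memory class only stands in for Redis so the sketch runs without a server. With ioredis v5 the corresponding call would be `redis.getdel(key)`.

```typescript
// Minimal store interface (an assumption for this sketch).
interface PendingStore {
  getdel(key: string): Promise<string | null>;
}

// In-memory stand-in for Redis. The read and the delete happen in the same
// synchronous step, which is the property GETDEL provides on the server.
class InMemoryPendingStore implements PendingStore {
  private data = new Map<string, string>();

  put(key: string, value: string): void {
    this.data.set(key, value);
  }

  async getdel(key: string): Promise<string | null> {
    const value = this.data.get(key) ?? null;
    this.data.delete(key);
    return value;
  }
}

// Hypothetical recovery helper: resolves to the user to reply to, or null
// when another instance has already claimed this callback.
async function claimPendingCallback(
  store: PendingStore,
  msgId: string,
): Promise<string | null> {
  return store.getdel(`wecom:pending:${msgId}`);
}
```

When two instances race on the same msgId, only the first caller receives the stored value; the second receives null and can skip sending.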
Bug 3 — Binding codes stored in per-process memory:
generateBindingCode() now async; stores in Redis:
wecom:bindcode:{CODE} → instanceId (TTL 15min)
wecom:bindcode:inst:{instId} → CODE (reverse lookup)
resolveBindCode() uses GETDEL atomically, then deletes reverse key.
Falls back to in-memory Map when Redis is unavailable.
Old code for same instance is revoked on regenerate.
handleMessage updated: resolveBindCode() replaces Map.get();
a message matching the 6-char hex pattern but with no stored code now returns an expired-code hint.
Controller wecomGenerateBindCode now awaits generateBindingCode().
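
A rough sketch of the key layout described above. The function signatures, the `BindCodeStore` interface, the injected `makeCode` generator and the TTL on the reverse key are all assumptions made for illustration; only the key names and the 15-minute TTL come from this message. The in-memory store ignores TTLs and exists only to make the sketch runnable.

```typescript
// Minimal async key-value interface; method names loosely mirror Redis
// commands (SET with EX, GET, GETDEL, DEL) and are assumptions.
interface BindCodeStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | null>;
  getdel(key: string): Promise<string | null>;
  del(key: string): Promise<void>;
}

const BIND_CODE_TTL_S = 15 * 60;
const codeKey = (code: string): string => `wecom:bindcode:${code}`;
const instKey = (instanceId: string): string => `wecom:bindcode:inst:${instanceId}`;

// Issues a new code and revokes the previous code of the same instance.
// makeCode is injected; a real generator should use a secure random source.
async function generateBindingCode(
  store: BindCodeStore,
  instanceId: string,
  makeCode: () => string,
): Promise<string> {
  const previous = await store.get(instKey(instanceId));
  if (previous !== null) await store.del(codeKey(previous));
  const code = makeCode();
  await store.set(codeKey(code), instanceId, BIND_CODE_TTL_S);
  await store.set(instKey(instanceId), code, BIND_CODE_TTL_S); // reverse-key TTL is assumed
  return code;
}

// Consumes a code at most once, then removes the reverse key.
async function resolveBindCode(
  store: BindCodeStore,
  code: string,
): Promise<string | null> {
  const instanceId = await store.getdel(codeKey(code));
  if (instanceId !== null) await store.del(instKey(instanceId));
  return instanceId;
}

// In-memory stand-in for Redis (no TTL enforcement).
class InMemoryBindCodeStore implements BindCodeStore {
  private data = new Map<string, string>();
  async set(key: string, value: string, _ttlSeconds: number): Promise<void> {
    this.data.set(key, value);
  }
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async getdel(key: string): Promise<string | null> {
    const value = this.data.get(key) ?? null;
    this.data.delete(key);
    return value;
  }
  async del(key: string): Promise<void> {
    this.data.delete(key);
  }
}
```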
Bug 4 — enter_session events not deduplicated:
handleEnterSession now receives msgId from the event.
redisDedup(msgId) called before sending welcome message — prevents
duplicate welcomes on WeCom retransmission or cursor reset.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace in-memory dedup Map with Redis SET NX EX:
- Key: wecom:dedup:{msgId}, TTL=600s (auto-expires, no manual cleanup)
- SET NX returns 'OK' on first write (process), null on duplicate (skip)
- Shared across all agent-service instances — no inter-process duplicates
- Fails open (return true) if Redis is unavailable — avoids silent drops
- Removed dedup Map and its periodicCleanup loop
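
The dedup decision above can be sketched like this. `DedupStore`, `setNxEx`, `shouldProcess` and the in-memory store are hypothetical names for illustration; with ioredis the underlying call would be `redis.set(key, "1", "EX", 600, "NX")`, which resolves to `"OK"` or `null`.

```typescript
// Mirrors Redis SET key value EX ttl NX: "OK" on first write, null if the
// key already exists. The interface itself is an assumption.
interface DedupStore {
  setNxEx(key: string, value: string, ttlSeconds: number): Promise<"OK" | null>;
}

const DEDUP_TTL_S = 600;

// Resolves true when the message should be processed.
async function shouldProcess(store: DedupStore, msgId: string): Promise<boolean> {
  try {
    const result = await store.setNxEx(`wecom:dedup:${msgId}`, "1", DEDUP_TTL_S);
    return result === "OK";
  } catch {
    return true; // fail open: a store outage must not silently drop messages
  }
}

// In-memory stand-in for Redis so the sketch runs without a server.
class InMemoryDedupStore implements DedupStore {
  private expiresAt = new Map<string, number>();

  async setNxEx(key: string, _value: string, ttlSeconds: number): Promise<"OK" | null> {
    const now = Date.now();
    const existing = this.expiresAt.get(key);
    if (existing !== undefined && existing > now) return null; // duplicate
    this.expiresAt.set(key, now + ttlSeconds * 1000);
    return "OK";
  }
}
```

Failing open trades a possible duplicate reply during a Redis outage for never losing a user message.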
The WeCom router now has the following robustness measures in place:
cursor persistence, token mutex, distributed leader lease (fail-closed),
exponential backoff, watchdog, send retry, Redis dedup, Redis cross-instance
callback recovery, health endpoint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 1 — Observability (health endpoint):
WecomRouterService.getStatus() returns { enabled, isLeader, lastPollAt,
staleSinceMs, consecutiveErrors, pendingCallbacks, queuedUsers }.
GET /api/v1/agent/channels/wecom/health exposes it.
Fix 2 — Leader lease fail-closed:
tryClaimLeaderLease() catch now returns false instead of true.
DB failure → skip poll, preventing multi-master on DB outage.
isLeader flag tracked for health status.
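
A minimal sketch of the fail-closed behaviour, under the assumption that the lease claim is an async call that rejects when the database is unreachable. `ClaimLease`, `claimLeaseFailClosed` and `pollIfLeader` are hypothetical names, not the service's actual API.

```typescript
// Hypothetical signature: resolves true when this instance holds the lease,
// rejects when the database cannot be reached.
type ClaimLease = () => Promise<boolean>;

async function claimLeaseFailClosed(claim: ClaimLease): Promise<boolean> {
  try {
    return await claim();
  } catch {
    return false; // fail closed: without a confirmed lease, do not act as leader
  }
}

// Runs one poll cycle only when the lease was positively confirmed.
async function pollIfLeader(
  claim: ClaimLease,
  poll: () => Promise<void>,
): Promise<"polled" | "skipped"> {
  const isLeader = await claimLeaseFailClosed(claim);
  if (!isLeader) return "skipped";
  await poll();
  return "polled";
}
```

Returning true from the catch block would let every instance poll during a DB outage, which is the multi-master case this fix avoids.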
Fix 3 — Cross-instance callback recovery via Redis:
routeToAgent() stores wecom:pending:{msgId} → externalUserId in Redis
with 200s TTL before waiting for the bridge callback.
resolveCallbackReply() is now async:
Fast path — local pendingCallbacks (same instance, 99% case)
Recovery — Redis GET → send reply directly to WeChat user
onModuleDestroy() cleans up Redis keys on graceful shutdown.
wecom/bridge-callback handler updated to await resolveCallbackReply.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four additional robustness fixes:
1. **Token refresh mutex** — tokenRefreshPromise deduplicates concurrent
refresh calls. All callers share one in-flight HTTP request instead
of each firing their own, eliminating the race condition.
2. **Distributed leader lease** — service_state table used for a
TTL-based leader election (LEADER_LEASE_TTL_S=90s). Only one
agent-service instance polls at a time; others skip until the lease
expires. Lease auto-released on graceful shutdown.
3. **Exponential backoff** — consecutive poll errors increment a counter;
next delay = min(10s × 2^(n-1), 5min). Prevents log spam and
reduces load during sustained WeCom API outages. Counter resets on
any successful poll.
4. **Watchdog timer** — setInterval every 2min checks lastPollAt.
If poll loop has been silent for >5min, clears the timer and
reschedules immediately, recovering from any silent crash.
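
Items 1, 3 and 4 can be sketched as below. This is an illustration under assumptions: `TokenRefresher`, `backoffDelayMs` and `isPollLoopStale` are hypothetical names, error counts below 1 are treated as 1, and the 5-minute watchdog threshold is taken from the text above.

```typescript
// Item 1: concurrent callers share one in-flight refresh.
class TokenRefresher {
  private tokenRefreshPromise: Promise<string> | null = null;
  private readonly fetchToken: () => Promise<string>;

  constructor(fetchToken: () => Promise<string>) {
    this.fetchToken = fetchToken;
  }

  refresh(): Promise<string> {
    let inFlight = this.tokenRefreshPromise;
    if (inFlight === null) {
      inFlight = this.fetchToken().finally(() => {
        this.tokenRefreshPromise = null; // allow a new refresh once this one settles
      });
      this.tokenRefreshPromise = inFlight;
    }
    return inFlight;
  }
}

// Item 3: delay = min(10s * 2^(n-1), 5min) for n consecutive errors.
const BASE_DELAY_MS = 10_000;
const MAX_DELAY_MS = 5 * 60_000;

function backoffDelayMs(consecutiveErrors: number): number {
  const n = Math.max(1, consecutiveErrors); // assumption: clamp to at least 1
  return Math.min(BASE_DELAY_MS * 2 ** (n - 1), MAX_DELAY_MS);
}

// Item 4: the watchdog treats the loop as dead after more than 5 minutes of silence.
const WATCHDOG_THRESHOLD_MS = 5 * 60_000;

function isPollLoopStale(lastPollAt: number, now: number): boolean {
  return now - lastPollAt > WATCHDOG_THRESHOLD_MS;
}
```

With these constants the delay sequence is 10s, 20s, 40s, 80s, 160s, and 300s from the sixth consecutive error onward.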
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three robustness fixes for the WeCom Customer Service router:
1. **Cursor persistence** — sync_msg cursor now stored in
public.service_state (auto-created via CREATE TABLE IF NOT EXISTS).
Survives service restarts; no more duplicate message processing.
2. **send_msg retry** — sendChunkWithRetry() retries once after 2s
on any API error (non-zero errcode or network failure). Replies
that would be lost to transient WeCom API errors are now recovered.
3. **enter_session welcome** — WeCom fires an enter_session event
(origin=0, msgtype=event) when a user opens the chat for the
first time. Now handled: bound users get a welcome-back message,
unbound users get step-by-step onboarding instructions.
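
The retry-once behaviour from item 2 might look roughly like this. `sendWithOneRetry`, `SendFn` and `SendResult` are hypothetical names; the send function is injected so the sketch does not depend on any HTTP client, and the delay is a parameter so it can be shortened in tests.

```typescript
// Shape of a WeCom-style API response: errcode 0 means success.
interface SendResult {
  errcode: number;
  errmsg?: string;
}

type SendFn = () => Promise<SendResult>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Tries once, waits, then tries one more time. Resolves true on success.
async function sendWithOneRetry(send: SendFn, retryDelayMs = 2000): Promise<boolean> {
  for (let attempt = 1; attempt <= 2; attempt++) {
    try {
      const result = await send();
      if (result.errcode === 0) return true;
    } catch {
      // Network failure: handled the same way as a non-zero errcode.
    }
    if (attempt === 1) await sleep(retryDelayMs);
  }
  return false;
}
```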
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Print ✓/✗ for each platform (FCM/HMS/MI/OPPO/VIVO) so missing credentials
are immediately visible in container logs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /instances returned all tenant instances for admin accounts,
causing cross-user agent visibility. Changed to
GET /instances/user/:userId so each user only sees their own agents.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After verifying that the OpenClaw gateway's chat.send WebSocket RPC
accepts an 'attachments' array (confirmed from openclaw/openclaw source
and documentation), implement end-to-end image/file attachment support
for instance chat:
Bridge (openclaw-client.ts):
- chatSendAndWait() now accepts optional `attachments[]` parameter
- Passes attachments to chat.send RPC only when non-empty
Bridge (index.ts):
- /task-async accepts `attachments[]` from request body
- Forwards to chatSendAndWait unchanged
Backend (agent.controller.ts):
- executeInstanceTask() accepts IT0 attachment format
{ base64Data, mediaType, fileName? }
- Converts to OpenClaw format { name, mimeType, media: "data:..." }
- Saves attachments to conversation history via contextService
- Forwards to bridge via bridgeAttachments spread
Flutter (agent_instance_chat_remote_datasource.dart):
- createTask() now includes attachments in POST body when present
Flutter (chat_page.dart):
- Reverted Fix 5 (disabled button) — attachment button fully enabled
in instance mode since the bridge now supports it
Attachment format (OpenClaw wire):
{ name: string, mimeType: string, media: "data:<mime>;base64,<data>" }
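
The conversion between the two formats above can be sketched as follows. The interface and function names are hypothetical, and the fallback name used when `fileName` is absent is an assumption; the field names and the data-URL layout are the ones listed in this message.

```typescript
// IT0 request format, as described above.
interface It0Attachment {
  base64Data: string;
  mediaType: string;
  fileName?: string;
}

// OpenClaw wire format, as described above.
interface OpenClawAttachment {
  name: string;
  mimeType: string;
  media: string; // "data:<mime>;base64,<data>"
}

function toOpenClawAttachment(attachment: It0Attachment, index = 0): OpenClawAttachment {
  return {
    // Fallback name is an assumption for attachments without a fileName.
    name: attachment.fileName ?? `attachment-${index + 1}`,
    mimeType: attachment.mediaType,
    media: `data:${attachment.mediaType};base64,${attachment.base64Data}`,
  };
}
```

A list converts with `attachments.map(toOpenClawAttachment)`, since `map` passes the element index as the second argument.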
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 2 — Callback timeout wiring:
- Store callbackTimer in pendingCallbackTimers Map after creation
- handleOpenClawAppCallback clears the timer immediately on arrival,
preventing spurious "timeout" errors when the bridge replies in time
Fix 3 — Provider scope isolation:
- Override agentStatusProvider and robotStateProvider in child ProviderScope
so the robot avatar/FAB reflects the instance chat state, not iAgent's
Fix 4 — Voice routing to OpenClaw:
- AgentInstanceChatDatasource.sendVoiceMessage() now calls transcribeAudio()
then routes the transcript through instance-specific createTask() endpoint,
ensuring voice messages reach the user's OpenClaw agent, not iAgent
Fix 5 — Attachment UI in instance mode:
- Attachment button shown as disabled (onPressed: null) with explanatory
tooltip ("附件功能暂不支持智能体对话", i.e. "Attachments are not yet supported in agent chats") when agentName != null
- Prevents misleading UX where attachments appear to work but are silently
dropped by the OpenClaw bridge
Fix 6 — DB schema template:
- Add agent_instance_id UUID NULL to agent_sessions table in migration 002
(tenant schema template) so new tenants get the column from creation
- Add covering index idx_agent_sessions_instance for efficient instance queries
All TypeScript and Flutter analyze checks pass clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Operators now only see their own instances (same as regular users).
Admin role retains superuser view. Orphaned running instances were
reassigned to hailin via DB update.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /api/v1/agent/instances was returning all instances regardless of user.
Now decodes JWT: non-admin users only see their own instances; admins see all.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
req.user is never populated in agent-service (Kong verifies JWT, no Passport strategy).
This caused userId to always be undefined → system prompt had no 'Current User ID' →
Claude used tenant slug 'shenzhengj' as userId → DB error 'invalid input syntax for
type uuid'.
Fix: decode the JWT payload from the Authorization header (no signature verification
needed, since Kong has already verified it) to extract sub (the user UUID) for both
AgentController and VoiceSessionController.
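
A minimal sketch of decoding the payload without verifying the signature. `decodeJwtPayload` and `JwtPayload` are hypothetical names. Skipping verification is only acceptable because, as stated above, the gateway has already verified the token before the request reaches the service.

```typescript
interface JwtPayload {
  sub?: string;
  [claim: string]: unknown;
}

// Extracts the payload from "Bearer <header>.<payload>.<signature>".
// Does NOT verify the signature; relies on an upstream gateway having done so.
function decodeJwtPayload(authorizationHeader: string | undefined): JwtPayload | null {
  if (!authorizationHeader) return null;
  const token = authorizationHeader.replace(/^Bearer\s+/i, "");
  const parts = token.split(".");
  const payloadPart = parts[1];
  if (parts.length !== 3 || !payloadPart) return null;
  try {
    // base64url -> base64, then restore padding.
    const base64 = payloadPart.replace(/-/g, "+").replace(/_/g, "/");
    const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
    const bytes = Uint8Array.from(atob(padded), (ch) => ch.charCodeAt(0));
    const parsed: unknown = JSON.parse(new TextDecoder().decode(bytes));
    if (typeof parsed !== "object" || parsed === null) return null;
    return parsed as JwtPayload;
  } catch {
    return null; // malformed base64 or JSON
  }
}
```

A controller could then read `decodeJwtPayload(req.headers.authorization)?.sub` as the user UUID, assuming the header is exposed under that name.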
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JWT payload was missing 'name' field — phone-invited users showed
empty name after app restart (session restore from JWT).
Also added phone fallback in Flutter _decodeUserFromJwt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phone-invited users register with phone+password.
Changed identifier field from email-only to email/phone,
removed @ validation so phone numbers pass through.
Backend already auto-detects email vs phone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phone-invited users are mobile App users, not web admin users.
After accepting a phone invitation, display App download QR + APK link
instead of redirecting to /dashboard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Feishu @larksuiteoapi/node-sdk uses message_type, not msg_type (which is the DingTalk field name).
This caused all incoming messages to be treated as non-text, returning
'我目前只能处理文字消息' ("I can currently only handle text messages") for every message.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The _showOAuthBottomSheet title/subtitle were hardcoded to 钉钉 (DingTalk). It now detects
the channel from the URL (feishu.cn → 飞书 (Feishu), else → 钉钉) and shows the correct text
and button color (#3370FF for Feishu, #1677FF for DingTalk).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add POST /sessions/:sessionId/feishu/oauth-trigger endpoint (mirrors DingTalk)
which emits oauth_prompt WS event so Flutter opens the Feishu authorization
page automatically instead of asking the user to enter a bind code
- Update SystemPromptBuilder: voice sessions now use the Feishu OAuth trigger
endpoint; text sessions still use the code-based flow as fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add EmailService (nodemailer/SMTP) with invite email HTML template
- createInvite() now fires email notification after saving (fire-and-forget)
- my-org page: add App download QR code + invite link QR code panels
- Install react-qr-code in web-admin, nodemailer in auth-service
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add CRITICAL note and clear IF/ELSE branching so Claude never calls
DingTalk endpoints for Feishu binding or vice versa.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Changes
### openclaw-bridge: POST /skill-inject
- New endpoint writes SKILL.md to ~/.openclaw/skills/{name}/ inside the container volume
- OpenClaw gateway file watcher picks it up within 250ms (no restart needed)
- Optionally calls sessions.delete RPC after write so the next user message starts
a fresh session that loads the new skill directory immediately (zero-downtime)
- Path traversal guard on skill name (rejects names with / .. \)
- OPENCLAW_HOME env var configurable (default: /home/node/.openclaw)
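
The path traversal guard can be sketched as a small predicate. `isSafeSkillName` is a hypothetical name, and rejecting empty names is an added assumption; the three rejected sequences are the ones listed above.

```typescript
// Rejects skill names that could escape ~/.openclaw/skills/{name}/.
function isSafeSkillName(name: string): boolean {
  if (name.length === 0) return false; // assumption: empty names are invalid
  return !(name.includes("/") || name.includes("\\") || name.includes(".."));
}
```

Checking the raw name before joining it into a path keeps the guard independent of platform-specific path normalization.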
### agent-service: POST /api/v1/agent/instances/:id/skills
- New endpoint in AgentInstanceController proxies skill injection requests to the
instance's bridge (http://{serverHost}:{hostPort}/skill-inject)
- Guards: instance must be 'running', serverHost/hostPort must be set, content ≤ 100KB
- iAgent calls this internally (localhost:3002) via Python urllib — no Kong auth needed
- sessionKey format for DingTalk users: "agent:main:dt-{dingTalkUserId}"
### agent-service: remove dead SkillManagerService
- Deleted skill-manager.service.ts (file-system .md loader, never called by anything)
- Removed from agent.module.ts provider list
- The live skill path is ClaudeAgentSdkEngine.loadTenantSkills() which reads directly
from the DB (it0_t_{tenantId}.skills) at task-execution time
### agent-service: clean up SystemPromptBuilder
- Removed unused skills?: string[] from SystemPromptContext (was never populated)
- Added clarifying comment: SDK engine handles skill injection, not this builder
## DB
- Inserted iAgent meta-skill "为小龙虾安装技能" ("Install a skill for the little crayfish") into it0_t_default.skills
(id: 79ac23ed-78c2-4d5f-8652-a99cf5185b61)
- Content instructs iAgent to: query user instances → generate SKILL.md → call
POST /api/v1/agent/instances/:id/skills via Python urllib heredoc
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bridge: tag isTimeout=true in timeout callbacks for semantic error routing
- Agent-service: show "⏳ 还在努力想呢" ("still thinking hard") progress batchSend after 25s silence
- Agent-service: queue position feedback ("前面还有 N 条", i.e. "N messages ahead of yours") via sessionWebhook
- Agent-service: buildErrorReply() maps timeout/disconnect/abort to distinct msgs
- Agent-service: instance status hints (stopped/starting/error) with action guidance
- Agent-service: all user-facing strings rewritten for conversational, friendly tone
- Agent-channel: pass isTimeout from bridge callback through to resolveCallbackReply
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add RUN step to create /app/openclaw/docs/reference/templates symlink
at image build time. Previously only done as post-deploy SSH step,
leaving re-created containers broken until next full redeploy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of "Bridge call failed" errors: bridge /task endpoint defaults
to 25s agent reply timeout, but LLM calls through the iConsulting gateway
can take 30-60s. Fix: pass timeoutSeconds=55 explicitly in POST body.
Also add batchSend fallback in routeToAgent: if the sessionWebhook has
expired by the time the LLM replies (user sent a message, LLM took >30s,
webhook window closed), the reply is now sent via proactive batchSend
using senderStaffId instead of being silently dropped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the voice agent triggers DingTalk OAuth, the user leaves the app
to authorize in DingTalk/browser, causing the LiveKit participant to
disconnect. The voice-agent then calls DELETE /voice to terminate the
session — but the user intends to return after completing OAuth.
Fix: mark the session as "oauth_pending" in VoiceSessionController when
oauth-trigger fires. If terminateVoiceSession is called while the flag
is active (10-min grace), suppress the terminate and return 200 OK so
the voice-agent exits cleanly. The session stays alive; when the user
returns to the voice screen, voice/start + inject auto-resume it.
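
The grace-period check described above could be modelled like this. `OAuthPendingTracker` and its methods are hypothetical names; the sketch covers only the flag and the 10-minute window, not the controller or the resume path.

```typescript
const OAUTH_GRACE_MS = 10 * 60_000;

// Tracks sessions that left the app for OAuth and should survive a terminate call.
class OAuthPendingTracker {
  private pendingUntil = new Map<string, number>();

  // Called when oauth-trigger fires for a session.
  markPending(sessionId: string, now: number = Date.now()): void {
    this.pendingUntil.set(sessionId, now + OAUTH_GRACE_MS);
  }

  // True while the grace window is open: the terminate request should be
  // acknowledged but not executed.
  shouldSuppressTerminate(sessionId: string, now: number = Date.now()): boolean {
    const until = this.pendingUntil.get(sessionId);
    if (until === undefined) return false;
    if (now > until) {
      this.pendingUntil.delete(sessionId); // window elapsed, terminate normally
      return false;
    }
    return true;
  }
}
```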
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>