Fix 1 — Observability (health endpoint):
WecomRouterService.getStatus() returns { enabled, isLeader, lastPollAt,
staleSinceMs, consecutiveErrors, pendingCallbacks, queuedUsers }.
GET /api/v1/agent/channels/wecom/health exposes it.
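As a rough illustration of the status payload, here is a minimal sketch. The field names come from this message; the field types, the RouterState shape, and the rule that staleSinceMs is "now minus lastPollAt" are assumptions for illustration, not the actual implementation.

```typescript
// Field names are from the commit message; the types are assumed.
interface WecomRouterStatus {
  enabled: boolean;
  isLeader: boolean;
  lastPollAt: number | null; // assumed: epoch ms of the last poll
  staleSinceMs: number | null; // assumed: ms elapsed since lastPollAt
  consecutiveErrors: number;
  pendingCallbacks: number;
  queuedUsers: number;
}

// Hypothetical internal state that the status is derived from.
interface RouterState {
  enabled: boolean;
  isLeader: boolean;
  lastPollAt: number | null;
  consecutiveErrors: number;
  pendingCallbacks: Map<string, unknown>;
  userQueues: Map<string, unknown[]>;
}

// Pure function so the health endpoint can serialize the result directly.
function getStatus(state: RouterState, now: number): WecomRouterStatus {
  return {
    enabled: state.enabled,
    isLeader: state.isLeader,
    lastPollAt: state.lastPollAt,
    staleSinceMs: state.lastPollAt === null ? null : now - state.lastPollAt,
    consecutiveErrors: state.consecutiveErrors,
    pendingCallbacks: state.pendingCallbacks.size,
    queuedUsers: state.userQueues.size,
  };
}
```

Reporting counts rather than the maps themselves keeps message ids and user ids out of the health response.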
Fix 2 — Leader lease fail-closed:
The catch block in tryClaimLeaderLease() now returns false instead of true.
On DB failure the instance skips the poll, so a DB outage can no longer
produce multiple leaders.
isLeader flag tracked for health status.
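The fail-closed behaviour can be sketched as follows. LeaderGate, LeaseClaimer, and pollOnce are hypothetical names standing in for the real service and its DB query; only tryClaimLeaderLease() and isLeader appear in this change.

```typescript
// Hypothetical stand-in for the DB call that tries to take or renew the lease.
type LeaseClaimer = () => Promise<boolean>;

class LeaderGate {
  isLeader = false;
  private readonly claim: LeaseClaimer;

  constructor(claim: LeaseClaimer) {
    this.claim = claim;
  }

  async tryClaimLeaderLease(): Promise<boolean> {
    try {
      this.isLeader = await this.claim();
    } catch {
      // Fail closed: if the DB cannot confirm the lease, assume we do not hold it.
      this.isLeader = false;
    }
    return this.isLeader;
  }

  // Runs the poll only when the lease was claimed; returns whether it ran.
  async pollOnce(poll: () => Promise<void>): Promise<boolean> {
    if (!(await this.tryClaimLeaderLease())) {
      return false;
    }
    await poll();
    return true;
  }
}
```

The trade-off is that no instance polls while the DB is unreachable, which is preferable to every instance polling at once.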
Fix 3 — Cross-instance callback recovery via Redis:
routeToAgent() stores wecom:pending:{msgId} → externalUserId in Redis
with 200s TTL before waiting for the bridge callback.
resolveCallbackReply() is now async and tries two paths in order:
Fast path: local pendingCallbacks (same instance, the common case)
Recovery: Redis GET, then send the reply directly to the WeChat user
onModuleDestroy() cleans up Redis keys on graceful shutdown.
wecom/bridge-callback handler updated to await resolveCallbackReply.
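A minimal sketch of the two-path lookup, using an in-memory map in place of Redis so it runs standalone. KvStore, InMemoryKv, CallbackRouter, SendFn, and the return labels are hypothetical; the key format and the 200s TTL are from this message. Deleting the key once a reply has been handled is an assumption.

```typescript
// Hypothetical minimal key-value interface standing in for the Redis client.
interface KvStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | null>;
  del(key: string): Promise<void>;
}

// In-memory stand-in; it accepts a TTL but does not expire keys.
class InMemoryKv implements KvStore {
  private readonly data = new Map<string, string>();
  async set(key: string, value: string, _ttlSeconds: number): Promise<void> {
    this.data.set(key, value);
  }
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async del(key: string): Promise<void> {
    this.data.delete(key);
  }
}

const PENDING_TTL_SECONDS = 200;
const pendingKey = (msgId: string): string => `wecom:pending:${msgId}`;

type SendFn = (externalUserId: string, reply: string) => Promise<void>;
type ResolveResult = "local" | "recovered" | "unknown";

class CallbackRouter {
  readonly pendingCallbacks = new Map<string, (reply: string) => void>();
  private readonly kv: KvStore;
  private readonly send: SendFn;

  constructor(kv: KvStore, send: SendFn) {
    this.kv = kv;
    this.send = send;
  }

  // Record the pending message in the shared store before waiting locally.
  async register(
    msgId: string,
    externalUserId: string,
    onReply: (reply: string) => void,
  ): Promise<void> {
    await this.kv.set(pendingKey(msgId), externalUserId, PENDING_TTL_SECONDS);
    this.pendingCallbacks.set(msgId, onReply);
  }

  async resolveCallbackReply(msgId: string, reply: string): Promise<ResolveResult> {
    // Fast path: the waiting request lives on this instance.
    const local = this.pendingCallbacks.get(msgId);
    if (local) {
      this.pendingCallbacks.delete(msgId);
      await this.kv.del(pendingKey(msgId)); // assumed cleanup
      local(reply);
      return "local";
    }
    // Recovery: another instance registered it, so look up the user and send directly.
    const externalUserId = await this.kv.get(pendingKey(msgId));
    if (externalUserId === null) {
      return "unknown";
    }
    await this.kv.del(pendingKey(msgId)); // assumed cleanup
    await this.send(externalUserId, reply);
    return "recovered";
  }
}
```

Because the lookup awaits the store, any caller (such as the bridge-callback handler) has to await resolveCallbackReply for errors and ordering to be handled correctly.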
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>