Four additional robustness fixes:
1. **Token refresh mutex**: tokenRefreshPromise deduplicates concurrent
refresh calls. All callers share one in-flight HTTP request instead
of each firing its own, eliminating the race between overlapping
token refreshes.
2. **Distributed leader lease**: the service_state table is used for
TTL-based leader election (LEADER_LEASE_TTL_S=90, i.e. 90 s). Only one
agent-service instance polls at a time; the others skip polling until
the lease expires. The lease is released automatically on graceful
shutdown.
3. **Exponential backoff**: each consecutive poll error increments a
counter n, and the next delay is min(10s × 2^(n-1), 5min). This
prevents log spam and reduces load during sustained WeCom API outages.
The counter resets on any successful poll.
4. **Watchdog timer**: a setInterval fires every 2min and checks
lastPollAt. If the poll loop has been silent for more than 5min, it
clears the pending timer and reschedules the poll immediately,
recovering from a silent crash of the loop.
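A minimal TypeScript sketch of the deduplication in item 1. The fetchToken callback (standing in for the real HTTP request) and the AccessToken shape are hypothetical, not the actual client. The first caller that finds the cache empty or expired stores the pending promise in tokenRefreshPromise; later callers await that same promise, and the slot is cleared once it settles so a future expiry triggers a new request.

```typescript
// Hypothetical token shape; the real WeCom client may store more fields.
interface AccessToken {
  value: string;
  expiresAt: number; // epoch milliseconds
}

class TokenManager {
  private token: AccessToken | null = null;
  // Shared in-flight refresh; null when no refresh is running.
  private tokenRefreshPromise: Promise<AccessToken> | null = null;
  private readonly fetchToken: () => Promise<AccessToken>;

  // fetchToken is an assumed callback that performs the actual HTTP request.
  constructor(fetchToken: () => Promise<AccessToken>) {
    this.fetchToken = fetchToken;
  }

  async getToken(): Promise<string> {
    if (this.token !== null && this.token.expiresAt > Date.now()) {
      return this.token.value; // cached and still valid
    }
    const fresh = await this.refresh();
    return fresh.value;
  }

  private refresh(): Promise<AccessToken> {
    // A refresh is already running: join it instead of starting another request.
    if (this.tokenRefreshPromise !== null) {
      return this.tokenRefreshPromise;
    }
    this.tokenRefreshPromise = this.fetchToken().then(
      (t) => {
        this.token = t;
        this.tokenRefreshPromise = null; // allow a new refresh after the next expiry
        return t;
      },
      (err: unknown) => {
        this.tokenRefreshPromise = null; // do not cache the failure; next caller retries
        throw err;
      },
    );
    return this.tokenRefreshPromise;
  }
}
```

Clearing the slot in the failure handler as well as the success handler matters in this sketch: if a rejected promise stayed cached, every later caller would fail without ever retrying the request.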
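The lease in item 2 can be sketched as a compare-and-set on a single record. The LeaseRow fields, the LEASE_KEY value, pollIfLeader, and the in-memory store are all assumptions standing in for the service_state table; against a real database, tryAcquire would need to be one atomic statement (for example a conditional upsert) so that two instances cannot both win. In this sketch, re-acquiring while already holding the lease extends it, so the leader keeps the lease as long as it renews within the 90 s TTL.

```typescript
const LEADER_LEASE_TTL_S = 90;
const LEASE_KEY = "poll_leader"; // hypothetical key in service_state

// Hypothetical shape of the lease record.
interface LeaseRow {
  holder: string; // instance id of the current leader
  expiresAt: number; // epoch milliseconds
}

interface LeaseStore {
  // Must be atomic in a real store: take the lease if it is free, expired, or already ours.
  tryAcquire(key: string, holder: string, now: number, ttlMs: number): boolean;
  // Release only if we still hold it (used on graceful shutdown).
  release(key: string, holder: string): void;
}

// In-memory stand-in for the service_state table, for illustration only.
class InMemoryLeaseStore implements LeaseStore {
  private rows = new Map<string, LeaseRow>();

  tryAcquire(key: string, holder: string, now: number, ttlMs: number): boolean {
    const row = this.rows.get(key);
    if (row !== undefined && row.holder !== holder && row.expiresAt > now) {
      return false; // another instance holds a live lease
    }
    this.rows.set(key, { holder, expiresAt: now + ttlMs }); // acquire or renew
    return true;
  }

  release(key: string, holder: string): void {
    const row = this.rows.get(key);
    if (row !== undefined && row.holder === holder) {
      this.rows.delete(key);
    }
  }
}

// One poll cycle: only the lease holder polls; every other instance skips.
function pollIfLeader(store: LeaseStore, instanceId: string, now: number, poll: () => void): boolean {
  if (!store.tryAcquire(LEASE_KEY, instanceId, now, LEADER_LEASE_TTL_S * 1000)) {
    return false;
  }
  poll();
  return true;
}
```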
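With the constants from item 3, the error delays run 10s, 20s, 40s, 80s, 160s, and then 300s from the sixth consecutive error on, since 10s × 2^5 = 320s exceeds the 5min cap. A sketch of the counter follows; the names errorDelayMs and PollBackoff are hypothetical.

```typescript
const BASE_ERROR_DELAY_MS = 10_000; // 10 s
const MAX_ERROR_DELAY_MS = 5 * 60_000; // 5 min

// Delay after the n-th consecutive error (n >= 1): min(10s * 2^(n-1), 5min).
function errorDelayMs(n: number): number {
  return Math.min(BASE_ERROR_DELAY_MS * 2 ** (n - 1), MAX_ERROR_DELAY_MS);
}

class PollBackoff {
  private consecutiveErrors = 0;

  // Call after a failed poll; returns how long to wait before the next attempt.
  onError(): number {
    this.consecutiveErrors += 1;
    return errorDelayMs(this.consecutiveErrors);
  }

  // Any successful poll resets the streak.
  onSuccess(): void {
    this.consecutiveErrors = 0;
  }
}
```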
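A sketch of how the watchdog in item 4 might sit next to a setTimeout-driven poll loop. PollLoop, runPoll, intervalMs, startWatchdog, and the choice to record lastPollAt when each tick starts are assumptions for illustration; only the 2min check interval, the 5min threshold, and the clear-and-reschedule behavior come from the description above.

```typescript
const WATCHDOG_INTERVAL_MS = 2 * 60_000; // check every 2 min
const STALL_THRESHOLD_MS = 5 * 60_000; // restart if silent for more than 5 min

class PollLoop {
  lastPollAt: number = Date.now(); // heartbeat read by the watchdog
  private timer: ReturnType<typeof setTimeout> | null = null;
  private stopped = false;
  private readonly runPoll: () => Promise<void>;
  private readonly intervalMs: number;

  constructor(runPoll: () => Promise<void>, intervalMs: number) {
    this.runPoll = runPoll;
    this.intervalMs = intervalMs;
  }

  // Replace any pending timer, so at most one tick is ever queued.
  schedule(delayMs: number): void {
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = null;
    if (this.stopped) return;
    this.timer = setTimeout(() => void this.tick(), delayMs);
  }

  private async tick(): Promise<void> {
    this.timer = null;
    this.lastPollAt = Date.now();
    try {
      await this.runPoll();
    } catch (err) {
      console.error("poll failed", err);
    }
    this.schedule(this.intervalMs);
  }

  // Called by the watchdog: clear the timer and reschedule immediately if stalled.
  checkStalled(now: number = Date.now()): boolean {
    if (now - this.lastPollAt <= STALL_THRESHOLD_MS) return false;
    this.schedule(0);
    return true;
  }

  stop(): void {
    this.stopped = true;
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = null;
  }
}

function startWatchdog(loop: PollLoop): ReturnType<typeof setInterval> {
  return setInterval(() => loop.checkStalled(), WATCHDOG_INTERVAL_MS);
}
```

In this sketch, a caller would start the loop with loop.schedule(0) and then call startWatchdog(loop); on graceful shutdown, clearInterval on the returned handle plus loop.stop() halts both.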
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>