7.9 KiB
Reconnection Event Stream Bug Analysis
Problem Summary
After network disconnection and reconnection, the event stream subscription is NOT restored, causing:
- No
session_startedevent received - Keygen never starts
- Messages pile up forever (539 pending)
Root Cause
The Bug Chain
1. Network disconnects
↓
2. subscribeSessionEvents Flow closes
↓
3. awaitClose block executes (GrpcClient.kt:925)
eventStreamSubscribed.set(false) ← FLAG CLEARED
↓
4. Network reconnects successfully
↓
5. reSubscribeStreams() called (GrpcClient.kt:202)
↓
6. Line 506 checks:
val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
↓
7. eventStreamSubscribed.get() returns FALSE ❌
activeMessageSubscription is also NULL ❌
↓
8. needsResubscribe = false
↓
9. Callback NEVER invoked
↓
10. Event stream NEVER restored
Code Evidence
GrpcClient.kt - Where Flag is Set/Cleared
Line 844 - Flag set when subscribing:
fun subscribeSessionEvents(partyId: String): Flow<SessionEventData> = callbackFlow {
eventStreamSubscribed.set(true) ← Set to TRUE
eventStreamPartyId = partyId
...
}
Line 925 - Flag cleared when Flow closes (THE BUG):
awaitClose {
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
eventStreamSubscribed.set(false) ← Set to FALSE on disconnect
eventStreamPartyId = null
}
Line 506 - Reconnection check (FAILS because flag is false):
private fun reSubscribeStreams() {
val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
// ↑ Returns FALSE after disconnect
if (needsResubscribe) { ← This condition is FALSE
Log.d(TAG, "Triggering stream re-subscription callback")
...
onReconnectedCallback?.invoke() ← NEVER REACHED
}
}
Log Evidence
Normal Reconnection (16:28:26) - WORKS ✅
16:28:26.082 D/GrpcClient: Connected successfully
16:28:26.086 D/GrpcClient: Re-registering party: 7c72c28f...
16:28:26.130 D/GrpcClient: Party registered: 7c72c28f...
16:28:26.130 D/GrpcClient: Triggering stream re-subscription callback ← Present!
16:28:26.130 D/GrpcClient: - Event stream: true, partyId: 7c72c28f...
16:28:26.130 D/TssRepository: gRPC reconnected, restoring streams... ← Present!
16:28:26.130 D/TssRepository: Restoring session event subscription ← Present!
Problem Reconnection (16:29:47) - FAILS ❌
16:29:47.090 D/GrpcClient: Connected successfully
16:29:47.093 D/GrpcClient: Re-registering party: 7c72c28f...
16:29:47.146 D/GrpcClient: Party registered: 7c72c28f...
[MISSING]: "Triggering stream re-subscription callback" ← NOT PRESENT!
[MISSING]: "gRPC reconnected, restoring streams..." ← NOT PRESENT!
[MISSING]: "Restoring session event subscription" ← NOT PRESENT!
Result:
16:30:47.198 W/GrpcClient: Has 539 pending messages - may have missed events
16:31:17.237 W/GrpcClient: Has 539 pending messages - may have missed events
Why First Reconnection Worked
Looking at the timeline:
16:27:53 - App started, event subscription started
16:28:26 - First reconnect (1 minute later)
Event subscription was STILL ACTIVE
eventStreamSubscribed = true ✅
16:29:15 - Network disconnect (49 seconds later)
Flow closed → eventStreamSubscribed set to FALSE ❌
16:29:47 - Second reconnect
eventStreamSubscribed = false ❌
Callback NOT invoked ❌
Key Insight: The first reconnection worked because the event stream Flow hadn't closed yet. The second reconnection failed because the Flow had closed and cleared the flag.
The Design Flaw
The current design has a state tracking inconsistency:
// When to subscribe?
eventStreamSubscribed = true // "I am currently subscribed"
// When to unsubscribe?
eventStreamSubscribed = false // "I am no longer subscribed"
// When to re-subscribe?
if (eventStreamSubscribed) { ... } // ❌ WRONG - flag is already false!
Problem: The flag tracks "am I currently subscribed?" but reconnection logic needs to know "should I re-subscribe?". These are two different concepts.
Solution Options
Option 1: Don't Clear Flag in awaitClose (Simple)
awaitClose {
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
// DON'T clear the flag - keep it for reconnection
// eventStreamSubscribed.set(false) ← REMOVE THIS
// eventStreamPartyId = null ← REMOVE THIS
}
Pros: Minimal change, preserves intent to re-subscribe Cons: Flag no longer accurately reflects current state
Option 2: Add Separate "Should Restore" Flag (Better)
// Two separate flags:
private val eventStreamSubscribed = AtomicBoolean(false) // Current state
private val shouldRestoreEventStream = AtomicBoolean(false) // Intent to restore
// When subscribing:
eventStreamSubscribed.set(true)
shouldRestoreEventStream.set(true) // Remember to restore
// In awaitClose:
eventStreamSubscribed.set(false) // No longer subscribed
// Keep shouldRestoreEventStream = true // But should restore on reconnect
// In reSubscribeStreams:
val needsResubscribe = shouldRestoreEventStream.get() || activeMessageSubscription != null
Pros: Clear separation of concerns, accurate state tracking Cons: More code, requires careful handling of clear conditions
Option 3: Store Last Subscription State (Most Robust)
// Store full subscription state for recovery
private data class StreamState(
val eventStreamPartyId: String?,
val messageSessionId: String?,
val messagePartyId: String?
)
private val lastStreamState = AtomicReference<StreamState>(null)
// On subscribe, save state
// On reconnect, restore from saved state
Pros: Can restore exact previous state, handles complex scenarios Cons: Most complex implementation
Recommended Fix
Use Option 1 (simplest) with Option 2 concept (clearer intent):
- Don't clear
eventStreamSubscribedinawaitClose - Only clear it when user explicitly unsubscribes or app shuts down
- This preserves the "I was subscribed, so re-subscribe on reconnect" behavior
Alternative: Add explicit unsubscribe call only when intentionally stopping (not on disconnect).
Files to Modify
GrpcClient.kt
Line 923-927 - Remove flag clearing in awaitClose:
awaitClose {
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
// Keep flags for reconnection - don't clear here
// Only clear on explicit unsubscribe or disconnect
}
Line 933-936 - Keep explicit unsubscribe as-is:
fun unsubscribeSessionEvents() {
eventStreamSubscribed.set(false)
eventStreamPartyId = null
}
Testing Checklist
After fix:
- Start app, subscribe to events
- Simulate network disconnect (airplane mode)
- Verify log shows: "Triggering stream re-subscription callback"
- Verify log shows: "gRPC reconnected, restoring streams..."
- Verify log shows: "Restoring session event subscription"
- Verify pending messages start decreasing
- Test 2-of-3 keygen succeeds after reconnection
Why This Wasn't Caught Before
- Timing-dependent: Only fails if Flow closes before reconnect
- Works in most cases: Quick reconnects (< 1 minute) often succeed before Flow timeout
- No explicit test: Didn't test scenario of "disconnect → wait for Flow to close → reconnect"
- Silent failure: No error logged, just missing callback invocation
Conclusion
The safeLaunch optimization did NOT cause this bug. The bug exists because:
awaitCloseclearseventStreamSubscribedon disconnect- Reconnection logic relies on this flag to decide if callback should be invoked
- After disconnect, flag is false, so callback is never invoked
Fix: Don't clear the subscription intent flag on temporary disconnection.