rwadurian/backend/mpc-system/services/service-party-android/RECONNECTION_BUG_ANALYSIS.md

252 lines
7.9 KiB
Markdown

# Reconnection Event Stream Bug Analysis
## Problem Summary
After network disconnection and reconnection, the event stream subscription is NOT restored, causing:
- No `session_started` event received
- Keygen never starts
- Messages pile up forever (539 pending)
## Root Cause
### The Bug Chain
```
1. Network disconnects
2. subscribeSessionEvents Flow closes
3. awaitClose block executes (GrpcClient.kt:925)
eventStreamSubscribed.set(false) ← FLAG CLEARED
4. Network reconnects successfully
5. reSubscribeStreams() called (GrpcClient.kt:202)
6. Line 506 checks:
val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
7. eventStreamSubscribed.get() returns FALSE ❌
activeMessageSubscription is also NULL ❌
8. needsResubscribe = false
9. Callback NEVER invoked
10. Event stream NEVER restored
```
## Code Evidence
### GrpcClient.kt - Where Flag is Set/Cleared
**Line 844** - Flag set when subscribing:
```kotlin
fun subscribeSessionEvents(partyId: String): Flow<SessionEventData> = callbackFlow {
eventStreamSubscribed.set(true) Set to TRUE
eventStreamPartyId = partyId
...
}
```
**Line 925** - Flag cleared when Flow closes (THE BUG):
```kotlin
awaitClose {
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
eventStreamSubscribed.set(false) Set to FALSE on disconnect
eventStreamPartyId = null
}
```
**Line 506** - Reconnection check (FAILS because flag is false):
```kotlin
private fun reSubscribeStreams() {
val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
// ↑ Returns FALSE after disconnect
if (needsResubscribe) { This condition is FALSE
Log.d(TAG, "Triggering stream re-subscription callback")
...
onReconnectedCallback?.invoke() NEVER REACHED
}
}
```
## Log Evidence
### Normal Reconnection (16:28:26) - WORKS ✅
```
16:28:26.082 D/GrpcClient: Connected successfully
16:28:26.086 D/GrpcClient: Re-registering party: 7c72c28f...
16:28:26.130 D/GrpcClient: Party registered: 7c72c28f...
16:28:26.130 D/GrpcClient: Triggering stream re-subscription callback ← Present!
16:28:26.130 D/GrpcClient: - Event stream: true, partyId: 7c72c28f...
16:28:26.130 D/TssRepository: gRPC reconnected, restoring streams... ← Present!
16:28:26.130 D/TssRepository: Restoring session event subscription ← Present!
```
### Problem Reconnection (16:29:47) - FAILS ❌
```
16:29:47.090 D/GrpcClient: Connected successfully
16:29:47.093 D/GrpcClient: Re-registering party: 7c72c28f...
16:29:47.146 D/GrpcClient: Party registered: 7c72c28f...
[MISSING]: "Triggering stream re-subscription callback" ← NOT PRESENT!
[MISSING]: "gRPC reconnected, restoring streams..." ← NOT PRESENT!
[MISSING]: "Restoring session event subscription" ← NOT PRESENT!
Result:
16:30:47.198 W/GrpcClient: Has 539 pending messages - may have missed events
16:31:17.237 W/GrpcClient: Has 539 pending messages - may have missed events
```
## Why First Reconnection Worked
Looking at the timeline:
```
16:27:53 - App started, event subscription started
16:28:26 - First reconnect (1 minute later)
Event subscription was STILL ACTIVE
eventStreamSubscribed = true ✅
16:29:15 - Network disconnect (49 seconds later)
Flow closed → eventStreamSubscribed set to FALSE ❌
16:29:47 - Second reconnect
eventStreamSubscribed = false ❌
Callback NOT invoked ❌
```
**Key Insight**: The first reconnection worked because the event stream Flow hadn't closed yet. The second reconnection failed because the Flow had closed and cleared the flag.
## The Design Flaw
The current design has a **state tracking inconsistency**:
```kotlin
// When to subscribe?
eventStreamSubscribed = true // "I am currently subscribed"
// When to unsubscribe?
eventStreamSubscribed = false // "I am no longer subscribed"
// When to re-subscribe?
if (eventStreamSubscribed) { ... } // ❌ WRONG - flag is already false!
```
**Problem**: The flag tracks "am I currently subscribed?" but reconnection logic needs to know "should I re-subscribe?". These are two different concepts.
## Solution Options
### Option 1: Don't Clear Flag in awaitClose (Simple)
```kotlin
awaitClose {
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
// DON'T clear the flag - keep it for reconnection
// eventStreamSubscribed.set(false) ← REMOVE THIS
// eventStreamPartyId = null ← REMOVE THIS
}
```
**Pros**: Minimal change, preserves intent to re-subscribe
**Cons**: Flag no longer accurately reflects current state
### Option 2: Add Separate "Should Restore" Flag (Better)
```kotlin
// Two separate flags:
private val eventStreamSubscribed = AtomicBoolean(false) // Current state
private val shouldRestoreEventStream = AtomicBoolean(false) // Intent to restore
// When subscribing:
eventStreamSubscribed.set(true)
shouldRestoreEventStream.set(true) // Remember to restore
// In awaitClose:
eventStreamSubscribed.set(false) // No longer subscribed
// Keep shouldRestoreEventStream = true // But should restore on reconnect
// In reSubscribeStreams:
val needsResubscribe = shouldRestoreEventStream.get() || activeMessageSubscription != null
```
**Pros**: Clear separation of concerns, accurate state tracking
**Cons**: More code, requires careful handling of clear conditions
### Option 3: Store Last Subscription State (Most Robust)
```kotlin
// Store full subscription state for recovery
private data class StreamState(
val eventStreamPartyId: String?,
val messageSessionId: String?,
val messagePartyId: String?
)
private val lastStreamState = AtomicReference<StreamState>(null)
// On subscribe, save state
// On reconnect, restore from saved state
```
**Pros**: Can restore exact previous state, handles complex scenarios
**Cons**: Most complex implementation
## Recommended Fix
**Use Option 1 (simplest) with Option 2 concept (clearer intent)**:
1. Don't clear `eventStreamSubscribed` in `awaitClose`
2. Only clear it when user explicitly unsubscribes or app shuts down
3. This preserves the "I was subscribed, so re-subscribe on reconnect" behavior
**Alternative**: Add explicit unsubscribe call only when intentionally stopping (not on disconnect).
## Files to Modify
### GrpcClient.kt
**Line 923-927** - Remove flag clearing in awaitClose:
```kotlin
awaitClose {
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
// Keep flags for reconnection - don't clear here
// Only clear on explicit unsubscribe or disconnect
}
```
**Line 933-936** - Keep explicit unsubscribe as-is:
```kotlin
fun unsubscribeSessionEvents() {
eventStreamSubscribed.set(false)
eventStreamPartyId = null
}
```
## Testing Checklist
After fix:
- [ ] Start app, subscribe to events
- [ ] Simulate network disconnect (airplane mode)
- [ ] Verify log shows: "Triggering stream re-subscription callback"
- [ ] Verify log shows: "gRPC reconnected, restoring streams..."
- [ ] Verify log shows: "Restoring session event subscription"
- [ ] Verify pending messages start decreasing
- [ ] Test 2-of-3 keygen succeeds after reconnection
## Why This Wasn't Caught Before
1. **Timing-dependent**: Only fails if Flow closes before reconnect
2. **Works in most cases**: Quick reconnects (< 1 minute) often succeed before Flow timeout
3. **No explicit test**: Didn't test scenario of "disconnect wait for Flow to close reconnect"
4. **Silent failure**: No error logged, just missing callback invocation
## Conclusion
The safeLaunch optimization did NOT cause this bug. The bug exists because:
1. `awaitClose` clears `eventStreamSubscribed` on disconnect
2. Reconnection logic relies on this flag to decide if callback should be invoked
3. After disconnect, flag is false, so callback is never invoked
**Fix**: Don't clear the subscription intent flag on temporary disconnection.