rwadurian/backend/mpc-system/services/service-party-android/RECONNECTION_BUG_ANALYSIS.md

7.9 KiB

Reconnection Event Stream Bug Analysis

Problem Summary

After network disconnection and reconnection, the event stream subscription is NOT restored, causing:

  • No session_started event received
  • Keygen never starts
  • Messages pile up forever (539 pending)

Root Cause

The Bug Chain

1. Network disconnects
   ↓
2. subscribeSessionEvents Flow closes
   ↓
3. awaitClose block executes (GrpcClient.kt:925)
   eventStreamSubscribed.set(false)  ← FLAG CLEARED
   ↓
4. Network reconnects successfully
   ↓
5. reSubscribeStreams() called (GrpcClient.kt:202)
   ↓
6. Line 506 checks:
   val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
   ↓
7. eventStreamSubscribed.get() returns FALSE ❌
   activeMessageSubscription is also NULL ❌
   ↓
8. needsResubscribe = false
   ↓
9. Callback NEVER invoked
   ↓
10. Event stream NEVER restored

Code Evidence

GrpcClient.kt - Where Flag is Set/Cleared

Line 844 - Flag set when subscribing:

fun subscribeSessionEvents(partyId: String): Flow<SessionEventData> = callbackFlow {
    eventStreamSubscribed.set(true)   Set to TRUE
    eventStreamPartyId = partyId
    ...
}

Line 925 - Flag cleared when Flow closes (THE BUG):

awaitClose {
    Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
    eventStreamSubscribed.set(false)   Set to FALSE on disconnect
    eventStreamPartyId = null
}

Line 506 - Reconnection check (FAILS because flag is false):

private fun reSubscribeStreams() {
    val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
    //                     ↑ Returns FALSE after disconnect

    if (needsResubscribe) {   This condition is FALSE
        Log.d(TAG, "Triggering stream re-subscription callback")
        ...
        onReconnectedCallback?.invoke()   NEVER REACHED
    }
}

Log Evidence

Normal Reconnection (16:28:26) - WORKS

16:28:26.082 D/GrpcClient: Connected successfully
16:28:26.086 D/GrpcClient: Re-registering party: 7c72c28f...
16:28:26.130 D/GrpcClient: Party registered: 7c72c28f...
16:28:26.130 D/GrpcClient: Triggering stream re-subscription callback  ← Present!
16:28:26.130 D/GrpcClient:   - Event stream: true, partyId: 7c72c28f...
16:28:26.130 D/TssRepository: gRPC reconnected, restoring streams...  ← Present!
16:28:26.130 D/TssRepository: Restoring session event subscription     ← Present!

Problem Reconnection (16:29:47) - FAILS

16:29:47.090 D/GrpcClient: Connected successfully
16:29:47.093 D/GrpcClient: Re-registering party: 7c72c28f...
16:29:47.146 D/GrpcClient: Party registered: 7c72c28f...
[MISSING]: "Triggering stream re-subscription callback"  ← NOT PRESENT!
[MISSING]: "gRPC reconnected, restoring streams..."      ← NOT PRESENT!
[MISSING]: "Restoring session event subscription"        ← NOT PRESENT!

Result:
16:30:47.198 W/GrpcClient: Has 539 pending messages - may have missed events
16:31:17.237 W/GrpcClient: Has 539 pending messages - may have missed events

Why First Reconnection Worked

Looking at the timeline:

16:27:53 - App started, event subscription started
16:28:26 - First reconnect (1 minute later)
           Event subscription was STILL ACTIVE
           eventStreamSubscribed = true ✅

16:29:15 - Network disconnect (49 seconds later)
           Flow closed → eventStreamSubscribed set to FALSE ❌

16:29:47 - Second reconnect
           eventStreamSubscribed = false ❌
           Callback NOT invoked ❌

Key Insight: The first reconnection worked because the event stream Flow hadn't closed yet. The second reconnection failed because the Flow had closed and cleared the flag.

The Design Flaw

The current design has a state tracking inconsistency:

// When to subscribe?
eventStreamSubscribed = true   // "I am currently subscribed"

// When to unsubscribe?
eventStreamSubscribed = false  // "I am no longer subscribed"

// When to re-subscribe?
if (eventStreamSubscribed) { ... }  // ❌ WRONG - flag is already false!

Problem: The flag tracks "am I currently subscribed?" but reconnection logic needs to know "should I re-subscribe?". These are two different concepts.

Solution Options

Option 1: Don't Clear Flag in awaitClose (Simple)

awaitClose {
    Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
    // DON'T clear the flag - keep it for reconnection
    // eventStreamSubscribed.set(false)  ← REMOVE THIS
    // eventStreamPartyId = null         ← REMOVE THIS
}

Pros: Minimal change, preserves intent to re-subscribe Cons: Flag no longer accurately reflects current state

Option 2: Add Separate "Should Restore" Flag (Better)

// Two separate flags:
private val eventStreamSubscribed = AtomicBoolean(false)      // Current state
private val shouldRestoreEventStream = AtomicBoolean(false)   // Intent to restore

// When subscribing:
eventStreamSubscribed.set(true)
shouldRestoreEventStream.set(true)  // Remember to restore

// In awaitClose:
eventStreamSubscribed.set(false)    // No longer subscribed
// Keep shouldRestoreEventStream = true  // But should restore on reconnect

// In reSubscribeStreams:
val needsResubscribe = shouldRestoreEventStream.get() || activeMessageSubscription != null

Pros: Clear separation of concerns, accurate state tracking Cons: More code, requires careful handling of clear conditions

Option 3: Store Last Subscription State (Most Robust)

// Store full subscription state for recovery
private data class StreamState(
    val eventStreamPartyId: String?,
    val messageSessionId: String?,
    val messagePartyId: String?
)

private val lastStreamState = AtomicReference<StreamState>(null)

// On subscribe, save state
// On reconnect, restore from saved state

Pros: Can restore exact previous state, handles complex scenarios Cons: Most complex implementation

Use Option 1 (simplest) with Option 2 concept (clearer intent):

  1. Don't clear eventStreamSubscribed in awaitClose
  2. Only clear it when user explicitly unsubscribes or app shuts down
  3. This preserves the "I was subscribed, so re-subscribe on reconnect" behavior

Alternative: Add explicit unsubscribe call only when intentionally stopping (not on disconnect).

Files to Modify

GrpcClient.kt

Line 923-927 - Remove flag clearing in awaitClose:

awaitClose {
    Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
    // Keep flags for reconnection - don't clear here
    // Only clear on explicit unsubscribe or disconnect
}

Line 933-936 - Keep explicit unsubscribe as-is:

fun unsubscribeSessionEvents() {
    eventStreamSubscribed.set(false)
    eventStreamPartyId = null
}

Testing Checklist

After fix:

  • Start app, subscribe to events
  • Simulate network disconnect (airplane mode)
  • Verify log shows: "Triggering stream re-subscription callback"
  • Verify log shows: "gRPC reconnected, restoring streams..."
  • Verify log shows: "Restoring session event subscription"
  • Verify pending messages start decreasing
  • Test 2-of-3 keygen succeeds after reconnection

Why This Wasn't Caught Before

  1. Timing-dependent: Only fails if Flow closes before reconnect
  2. Works in most cases: Quick reconnects (< 1 minute) often succeed before Flow timeout
  3. No explicit test: Didn't test scenario of "disconnect → wait for Flow to close → reconnect"
  4. Silent failure: No error logged, just missing callback invocation

Conclusion

The safeLaunch optimization did NOT cause this bug. The bug exists because:

  1. awaitClose clears eventStreamSubscribed on disconnect
  2. Reconnection logic relies on this flag to decide if callback should be invoked
  3. After disconnect, flag is false, so callback is never invoked

Fix: Don't clear the subscription intent flag on temporary disconnection.