252 lines
7.9 KiB
Markdown
252 lines
7.9 KiB
Markdown
# Reconnection Event Stream Bug Analysis
|
|
|
|
## Problem Summary
|
|
|
|
After network disconnection and reconnection, the event stream subscription is NOT restored, causing:
|
|
- No `session_started` event received
|
|
- Keygen never starts
|
|
- Messages pile up forever (539 pending)
|
|
|
|
## Root Cause
|
|
|
|
### The Bug Chain
|
|
|
|
```
|
|
1. Network disconnects
|
|
↓
|
|
2. subscribeSessionEvents Flow closes
|
|
↓
|
|
3. awaitClose block executes (GrpcClient.kt:925)
|
|
eventStreamSubscribed.set(false) ← FLAG CLEARED
|
|
↓
|
|
4. Network reconnects successfully
|
|
↓
|
|
5. reSubscribeStreams() called (GrpcClient.kt:202)
|
|
↓
|
|
6. Line 506 checks:
|
|
val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
|
|
↓
|
|
7. eventStreamSubscribed.get() returns FALSE ❌
|
|
activeMessageSubscription is also NULL ❌
|
|
↓
|
|
8. needsResubscribe = false
|
|
↓
|
|
9. Callback NEVER invoked
|
|
↓
|
|
10. Event stream NEVER restored
|
|
```
|
|
|
|
## Code Evidence
|
|
|
|
### GrpcClient.kt - Where Flag is Set/Cleared
|
|
|
|
**Line 844** - Flag set when subscribing:
|
|
```kotlin
|
|
fun subscribeSessionEvents(partyId: String): Flow<SessionEventData> = callbackFlow {
|
|
eventStreamSubscribed.set(true) ← Set to TRUE
|
|
eventStreamPartyId = partyId
|
|
...
|
|
}
|
|
```
|
|
|
|
**Line 925** - Flag cleared when Flow closes (THE BUG):
|
|
```kotlin
|
|
awaitClose {
|
|
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
|
|
eventStreamSubscribed.set(false) ← Set to FALSE on disconnect
|
|
eventStreamPartyId = null
|
|
}
|
|
```
|
|
|
|
**Line 506** - Reconnection check (FAILS because flag is false):
|
|
```kotlin
|
|
private fun reSubscribeStreams() {
|
|
val needsResubscribe = eventStreamSubscribed.get() || activeMessageSubscription != null
|
|
// ↑ Returns FALSE after disconnect
|
|
|
|
if (needsResubscribe) { ← This condition is FALSE
|
|
Log.d(TAG, "Triggering stream re-subscription callback")
|
|
...
|
|
onReconnectedCallback?.invoke() ← NEVER REACHED
|
|
}
|
|
}
|
|
```
|
|
|
|
## Log Evidence
|
|
|
|
### Normal Reconnection (16:28:26) - WORKS ✅
|
|
```
|
|
16:28:26.082 D/GrpcClient: Connected successfully
|
|
16:28:26.086 D/GrpcClient: Re-registering party: 7c72c28f...
|
|
16:28:26.130 D/GrpcClient: Party registered: 7c72c28f...
|
|
16:28:26.130 D/GrpcClient: Triggering stream re-subscription callback ← Present!
|
|
16:28:26.130 D/GrpcClient: - Event stream: true, partyId: 7c72c28f...
|
|
16:28:26.130 D/TssRepository: gRPC reconnected, restoring streams... ← Present!
|
|
16:28:26.130 D/TssRepository: Restoring session event subscription ← Present!
|
|
```
|
|
|
|
### Problem Reconnection (16:29:47) - FAILS ❌
|
|
```
|
|
16:29:47.090 D/GrpcClient: Connected successfully
|
|
16:29:47.093 D/GrpcClient: Re-registering party: 7c72c28f...
|
|
16:29:47.146 D/GrpcClient: Party registered: 7c72c28f...
|
|
[MISSING]: "Triggering stream re-subscription callback" ← NOT PRESENT!
|
|
[MISSING]: "gRPC reconnected, restoring streams..." ← NOT PRESENT!
|
|
[MISSING]: "Restoring session event subscription" ← NOT PRESENT!
|
|
|
|
Result:
|
|
16:30:47.198 W/GrpcClient: Has 539 pending messages - may have missed events
|
|
16:31:17.237 W/GrpcClient: Has 539 pending messages - may have missed events
|
|
```
|
|
|
|
## Why First Reconnection Worked
|
|
|
|
Looking at the timeline:
|
|
```
|
|
16:27:53 - App started, event subscription started
|
|
16:28:26 - First reconnect (1 minute later)
|
|
Event subscription was STILL ACTIVE
|
|
eventStreamSubscribed = true ✅
|
|
|
|
16:29:15 - Network disconnect (49 seconds later)
|
|
Flow closed → eventStreamSubscribed set to FALSE ❌
|
|
|
|
16:29:47 - Second reconnect
|
|
eventStreamSubscribed = false ❌
|
|
Callback NOT invoked ❌
|
|
```
|
|
|
|
**Key Insight**: The first reconnection worked because the event stream Flow hadn't closed yet. The second reconnection failed because the Flow had closed and cleared the flag.
|
|
|
|
## The Design Flaw
|
|
|
|
The current design has a **state tracking inconsistency**:
|
|
|
|
```kotlin
|
|
// When to subscribe?
|
|
eventStreamSubscribed = true // "I am currently subscribed"
|
|
|
|
// When to unsubscribe?
|
|
eventStreamSubscribed = false // "I am no longer subscribed"
|
|
|
|
// When to re-subscribe?
|
|
if (eventStreamSubscribed) { ... } // ❌ WRONG - flag is already false!
|
|
```
|
|
|
|
**Problem**: The flag tracks "am I currently subscribed?" but reconnection logic needs to know "should I re-subscribe?". These are two different concepts.
|
|
|
|
## Solution Options
|
|
|
|
### Option 1: Don't Clear Flag in awaitClose (Simple)
|
|
|
|
```kotlin
|
|
awaitClose {
|
|
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
|
|
// DON'T clear the flag - keep it for reconnection
|
|
// eventStreamSubscribed.set(false) ← REMOVE THIS
|
|
// eventStreamPartyId = null ← REMOVE THIS
|
|
}
|
|
```
|
|
|
|
**Pros**: Minimal change, preserves intent to re-subscribe
|
|
**Cons**: Flag no longer accurately reflects current state
|
|
|
|
### Option 2: Add Separate "Should Restore" Flag (Better)
|
|
|
|
```kotlin
|
|
// Two separate flags:
|
|
private val eventStreamSubscribed = AtomicBoolean(false) // Current state
|
|
private val shouldRestoreEventStream = AtomicBoolean(false) // Intent to restore
|
|
|
|
// When subscribing:
|
|
eventStreamSubscribed.set(true)
|
|
shouldRestoreEventStream.set(true) // Remember to restore
|
|
|
|
// In awaitClose:
|
|
eventStreamSubscribed.set(false) // No longer subscribed
|
|
// Keep shouldRestoreEventStream = true // But should restore on reconnect
|
|
|
|
// In reSubscribeStreams:
|
|
val needsResubscribe = shouldRestoreEventStream.get() || activeMessageSubscription != null
|
|
```
|
|
|
|
**Pros**: Clear separation of concerns, accurate state tracking
|
|
**Cons**: More code, requires careful handling of clear conditions
|
|
|
|
### Option 3: Store Last Subscription State (Most Robust)
|
|
|
|
```kotlin
|
|
// Store full subscription state for recovery
|
|
private data class StreamState(
|
|
val eventStreamPartyId: String?,
|
|
val messageSessionId: String?,
|
|
val messagePartyId: String?
|
|
)
|
|
|
|
private val lastStreamState = AtomicReference<StreamState>(null)
|
|
|
|
// On subscribe, save state
|
|
// On reconnect, restore from saved state
|
|
```
|
|
|
|
**Pros**: Can restore exact previous state, handles complex scenarios
|
|
**Cons**: Most complex implementation
|
|
|
|
## Recommended Fix
|
|
|
|
**Use Option 1 (simplest) with Option 2 concept (clearer intent)**:
|
|
|
|
1. Don't clear `eventStreamSubscribed` in `awaitClose`
|
|
2. Only clear it when user explicitly unsubscribes or app shuts down
|
|
3. This preserves the "I was subscribed, so re-subscribe on reconnect" behavior
|
|
|
|
**Alternative**: Add explicit unsubscribe call only when intentionally stopping (not on disconnect).
|
|
|
|
## Files to Modify
|
|
|
|
### GrpcClient.kt
|
|
|
|
**Line 923-927** - Remove flag clearing in awaitClose:
|
|
```kotlin
|
|
awaitClose {
|
|
Log.d(TAG, "subscribeSessionEvents: Flow closed for partyId=$partyId")
|
|
// Keep flags for reconnection - don't clear here
|
|
// Only clear on explicit unsubscribe or disconnect
|
|
}
|
|
```
|
|
|
|
**Line 933-936** - Keep explicit unsubscribe as-is:
|
|
```kotlin
|
|
fun unsubscribeSessionEvents() {
|
|
eventStreamSubscribed.set(false)
|
|
eventStreamPartyId = null
|
|
}
|
|
```
|
|
|
|
## Testing Checklist
|
|
|
|
After fix:
|
|
- [ ] Start app, subscribe to events
|
|
- [ ] Simulate network disconnect (airplane mode)
|
|
- [ ] Verify log shows: "Triggering stream re-subscription callback"
|
|
- [ ] Verify log shows: "gRPC reconnected, restoring streams..."
|
|
- [ ] Verify log shows: "Restoring session event subscription"
|
|
- [ ] Verify pending messages start decreasing
|
|
- [ ] Test 2-of-3 keygen succeeds after reconnection
|
|
|
|
## Why This Wasn't Caught Before
|
|
|
|
1. **Timing-dependent**: Only fails if Flow closes before reconnect
|
|
2. **Works in most cases**: Quick reconnects (< 1 minute) often succeed before Flow timeout
|
|
3. **No explicit test**: Didn't test scenario of "disconnect → wait for Flow to close → reconnect"
|
|
4. **Silent failure**: No error logged, just missing callback invocation
|
|
|
|
## Conclusion
|
|
|
|
The safeLaunch optimization did NOT cause this bug. The bug exists because:
|
|
1. `awaitClose` clears `eventStreamSubscribed` on disconnect
|
|
2. Reconnection logic relies on this flag to decide if callback should be invoked
|
|
3. After disconnect, flag is false, so callback is never invoked
|
|
|
|
**Fix**: Don't clear the subscription intent flag on temporary disconnection.
|