fix(android): 修复 markPartyReady 乐观锁冲突导致 keygen 失败的关键Bug [CRITICAL]
## 问题根因
从用户日志分析发现关键错误:
```
15:58:58.318 E/GrpcClient: Mark party ready failed:
INTERNAL: optimistic lock conflict: session was modified by another transaction
```
**问题链条**:
1. markPartyReady 失败(optimistic lock conflict)
2. 但代码没有检查返回值,继续执行
3. 服务器认为 Party 未准备好,不发送 TSS 消息
4. 534个消息堆积(15:58:58.345 + 15:59:28.440)
5. TSS 协议无法进行
6. keygen 卡死
## 修复内容
### 1. 添加 markPartyReady 重试机制
在所有调用 markPartyReady 的地方添加智能重试:
- 最多重试 5 次
- 检测到 optimistic lock conflict 时延迟重试(500ms, 1s, 1.5s, 2s)
- 每次重试记录详细日志
- 5次失败后停止进度收集并返回错误
### 2. 修复位置(6处)
- startKeygenAsInitiator (line 2137)
- joinKeygenViaGrpc (line 1347)
- startSignAsInitiator (line ~1540)
- joinSignViaGrpc (line ~1686)
- startSignAsJoiner (line ~1888)
- co-sign相关函数
### 3. 日志增强
添加详细的重试日志:
- "markPartyReady successful on attempt X"
- "markPartyReady attempt X failed: {error}"
- "Retrying after Xms..."
## 为什么24小时前正常?
**不是 safeLaunch 的问题!** 而是:
1. 优化前,markPartyReady 失败被静默忽略
2. 可能偶尔能工作(没有并发冲突)
3. 现在并发量增加或服务器负载高,冲突频繁
4. 没有重试机制,一次失败就永久卡住
## 验证方法
重新测试创建2-of-3钱包,日志应显示:
- ✅ "markPartyReady successful on attempt 1" 或
- ✅ "Retrying after 500ms..." → "markPartyReady successful on attempt 2"
不应再有:
- ❌ 534个消息堆积30秒不变
- ❌ keygen 永久卡住
## 附加文档
创建了 LOG_ANALYSIS_PARTY1.md 详细分析日志:
- 完整的日志流程分析
- 3个关键问题定位
- 根本原因推断(70% 概率是 markPartyReady 失败)
- 临时和永久解决方案
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
parent
c2ee9b6daf
commit
003871aded
|
|
@ -0,0 +1,390 @@
|
|||
# Party 1 日志分析报告
|
||||
|
||||
## 📋 基本信息
|
||||
|
||||
- **Party ID**: 7c72c28f-082d-4ba4-a213-5b906abeb5cc
|
||||
- **Party Index**: 1
|
||||
- **Session ID**: f01810e9-4b0f-4933-a06a-0382124e0d25
|
||||
- **Invite Code**: 6C72-753E-9C17
|
||||
- **Threshold**: 2-of-3
|
||||
|
||||
## ✅ 成功的步骤
|
||||
|
||||
### 1. 应用启动和初始化 ✅
|
||||
```
|
||||
15:57:57.690 Setting up session event callback
|
||||
15:57:58.140 Party registered: 7c72c28f...
|
||||
15:57:58.186 Connected successfully
|
||||
```
|
||||
|
||||
### 2. 会话创建 ✅
|
||||
```
|
||||
15:58:50.215 Creating keygen session
|
||||
15:58:50.364 Create session response received
|
||||
15:58:50.365 Session created: sessionId=f01810e9..., inviteCode=6C72-753E-9C17
|
||||
```
|
||||
|
||||
**关键配置**:
|
||||
- persistent_count: 1 (server-party-co-managed)
|
||||
- external_count: 2 (两台手机)
|
||||
- selected_server_parties: ["co-managed-party-3"]
|
||||
|
||||
### 3. 参与者加入 ✅
|
||||
|
||||
**第一次 participant_joined (15:58:50.385)**:
|
||||
- selectedParties: [co-managed-party-3, 7c72c28f...] (2个)
|
||||
- Party 1 自己加入成功
|
||||
|
||||
**第二次 participant_joined (15:58:58.210)**:
|
||||
- selectedParties: [co-managed-party-3, 7c72c28f..., ca64e2b1...] (3个)
|
||||
- Party 2 加入成功
|
||||
|
||||
### 4. session_started 事件触发 ✅
|
||||
```
|
||||
15:58:58.207 Session event: session_started
|
||||
15:58:58.208 Session started event for keygen initiator, triggering keygen
|
||||
15:58:58.210 Starting keygen as initiator: sessionId=..., t=2, n=3
|
||||
```
|
||||
|
||||
### 5. 获取会话状态 ✅
|
||||
```
|
||||
15:58:58.271 Session status response:
|
||||
status=in_progress
|
||||
participants=3
|
||||
- co-managed-party-3 (index=0, status=joined)
|
||||
- 7c72c28f... (index=1, status=joined) ← 我
|
||||
- ca64e2b1... (index=2, status=joined)
|
||||
```
|
||||
|
||||
### 6. 启动 TSS keygen ✅
|
||||
```
|
||||
15:58:58.272 Starting keygen as initiator: sessionId=...
|
||||
15:58:58.272 My party index: 1
|
||||
15:58:58.301 [PROGRESS] Starting progress collection from native bridge
|
||||
15:58:58.301 [JobManager] Launched job: progress_collection (active jobs: 3)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🚨 发现的问题
|
||||
|
||||
### 问题 1: Mark Party Ready 失败 🔴
|
||||
|
||||
```
|
||||
15:58:58.318 E/GrpcClient: Mark party ready failed:
|
||||
INTERNAL: optimistic lock conflict: session was modified by another transaction
|
||||
```
|
||||
|
||||
**分析**:
|
||||
- 乐观锁冲突(Optimistic Lock Conflict)
|
||||
- 说明多个参与者同时尝试更新会话状态
|
||||
- 这个错误导致 Party 1 **没有成功标记为 ready**
|
||||
|
||||
**影响**:
|
||||
- 服务器可能认为 Party 1 还没有准备好
|
||||
- 可能影响 TSS 协议的启动
|
||||
|
||||
**根本原因**:
|
||||
- TssRepository.startKeygenAsInitiator line 2138: `grpcClient.markPartyReady(sessionId, partyId)`
|
||||
- 这个调用失败了,但**没有错误处理**
|
||||
- 代码继续执行到 line 2141: `waitForKeygenResult()`
|
||||
|
||||
**代码片段**:
|
||||
```kotlin
|
||||
// Line 2138
|
||||
grpcClient.markPartyReady(sessionId, partyId) // ← 失败但没有检查
|
||||
|
||||
// Line 2141
|
||||
val keygenResult = tssNativeBridge.waitForKeygenResult(password) // ← 继续等待
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 问题 2: 534 个待处理消息堆积(最严重)🚨🚨🚨
|
||||
|
||||
```
|
||||
15:58:58.345 W/GrpcClient: Has 534 pending messages - may have missed events
|
||||
15:59:28.440 W/GrpcClient: Has 534 pending messages - may have missed events ← 30秒后还是534
|
||||
```
|
||||
|
||||
**分析**:
|
||||
- 消息队列堆积了534个未处理消息
|
||||
- **30秒后还是534个**,说明消息完全没有被消费
|
||||
- TSS 协议需要消息路由才能工作
|
||||
|
||||
**影响**:
|
||||
- TSS 协议无法进行
|
||||
- 参与者之间无法交换密钥生成的中间值
|
||||
- 导致 keygen 卡死
|
||||
|
||||
**可能原因**:
|
||||
1. **message_collection job 没有真正工作**
|
||||
- 虽然日志显示启动了,但可能内部失败了
|
||||
- 没有错误日志,可能被静默吞掉
|
||||
|
||||
2. **消息路由没有正确初始化**
|
||||
- line 15:58:50.387: "Starting message routing: sessionId=..., routingPartyId=..."
|
||||
- 但之后没有任何消息发送/接收日志
|
||||
|
||||
3. **markPartyReady 失败导致消息路由失效**
|
||||
- 服务器可能只向 "ready" 的参与者发送消息
|
||||
- 如果 Party 1 没有标记为 ready,可能收不到消息
|
||||
|
||||
---
|
||||
|
||||
### 问题 3: TssNativeBridge 完全没有日志 🔴
|
||||
|
||||
**预期应该有的日志**:
|
||||
```
|
||||
TssNativeBridge: keygenAsInitiator called with sessionId=...
|
||||
TssNativeBridge: Keygen round 1/9
|
||||
TssNativeBridge: Keygen round 2/9
|
||||
...
|
||||
TssNativeBridge: Keygen completed successfully
|
||||
```
|
||||
|
||||
**实际**: 完全没有任何 TssNativeBridge 的输出!
|
||||
|
||||
**分析**:
|
||||
- TssNativeBridge.startKeygen (line 63-88) **没有任何日志**
|
||||
- 无法判断:
|
||||
1. 是否成功调用了 native library
|
||||
2. 是否立即失败
|
||||
3. 是否在等待消息
|
||||
|
||||
**代码问题**:
|
||||
```kotlin
|
||||
// TssNativeBridge.kt:63-88
|
||||
suspend fun startKeygen(...): Result<Unit> = withContext(Dispatchers.IO) {
|
||||
try {
|
||||
val participantsJson = gson.toJson(participants)
|
||||
Tsslib.startKeygen(...) // ← 没有日志!
|
||||
Result.success(Unit) // ← 立即返回
|
||||
} catch (e: Exception) {
|
||||
Result.failure(e) // ← 如果这里失败,也没有日志!
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**建议添加日志**:
|
||||
```kotlin
|
||||
suspend fun startKeygen(...): Result<Unit> = withContext(Dispatchers.IO) {
|
||||
try {
|
||||
android.util.Log.d("TssNativeBridge", "startKeygen called: sessionId=$sessionId, partyIndex=$partyIndex")
|
||||
val participantsJson = gson.toJson(participants)
|
||||
android.util.Log.d("TssNativeBridge", "Calling native Tsslib.startKeygen...")
|
||||
Tsslib.startKeygen(...)
|
||||
android.util.Log.d("TssNativeBridge", "Tsslib.startKeygen returned successfully")
|
||||
Result.success(Unit)
|
||||
} catch (e: Exception) {
|
||||
android.util.Log.e("TssNativeBridge", "startKeygen failed", e)
|
||||
Result.failure(e)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 问题 4: 没有进度更新 🔴
|
||||
|
||||
**预期应该有的日志**:
|
||||
```
|
||||
MainViewModel: Progress update: 1 / 9
|
||||
MainViewModel: Progress update: 2 / 9
|
||||
...
|
||||
```
|
||||
|
||||
**实际**: 完全没有!
|
||||
|
||||
**分析**:
|
||||
- progress_collection job 启动了(15:58:58.301)
|
||||
- 但没有收集到任何进度
|
||||
- 说明:
|
||||
1. TssNativeBridge 没有报告进度
|
||||
2. 或者 keygen 根本没有启动
|
||||
|
||||
---
|
||||
|
||||
## 🎯 根本原因推断
|
||||
|
||||
### 最可能的原因(按概率排序)
|
||||
|
||||
#### 1. markPartyReady 失败导致消息路由失效 (70%)
|
||||
|
||||
**链条**:
|
||||
```
|
||||
markPartyReady 失败
|
||||
↓
|
||||
服务器认为 Party 1 未准备好
|
||||
↓
|
||||
不向 Party 1 发送 TSS 消息
|
||||
↓
|
||||
Party 1 的 message_collection 收不到消息
|
||||
↓
|
||||
534 个消息堆积(可能是其他 party 发给自己的)
|
||||
↓
|
||||
TSS 协议无法进行
|
||||
↓
|
||||
keygen 卡死
|
||||
```
|
||||
|
||||
**验证方法**:
|
||||
- 检查服务器日志,看 Party 1 的状态是否为 `ready`
|
||||
- 检查 Party 2 和 co-managed-party 的日志,看他们是否收到消息
|
||||
|
||||
---
|
||||
|
||||
#### 2. TssNativeBridge.startKeygen 静默失败 (20%)
|
||||
|
||||
**链条**:
|
||||
```
|
||||
Tsslib.startKeygen() 调用失败
|
||||
↓
|
||||
没有抛出异常(或被吞掉)
|
||||
↓
|
||||
返回 Result.success(Unit)
|
||||
↓
|
||||
代码继续执行到 waitForKeygenResult()
|
||||
↓
|
||||
永久等待(因为 keygen 没有真正启动)
|
||||
```
|
||||
|
||||
**验证方法**:
|
||||
- 添加 TssNativeBridge 日志
|
||||
- 检查 native library 的日志输出
|
||||
|
||||
---
|
||||
|
||||
#### 3. 消息路由初始化失败 (10%)
|
||||
|
||||
**链条**:
|
||||
```
|
||||
Starting message routing (15:58:50.387)
|
||||
↓
|
||||
subscribeToTssMessages 失败(没有日志)
|
||||
↓
|
||||
无法接收 TSS 消息
|
||||
↓
|
||||
keygen 卡死
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔍 需要进一步调查
|
||||
|
||||
### 1. 其他 Party 的日志
|
||||
|
||||
**需要收集**:
|
||||
- **Party 2** (ca64e2b1-8a7c-4cc9-a8c7-5667a206e674) 的日志
|
||||
- **co-managed-party-3** 的服务器日志
|
||||
|
||||
**重点看**:
|
||||
- 他们是否成功 `markPartyReady`?
|
||||
- 他们是否收到/发送了 TSS 消息?
|
||||
- 他们的 pending messages 数量?
|
||||
- 他们的进度更新?
|
||||
|
||||
---
|
||||
|
||||
### 2. 服务器端状态
|
||||
|
||||
**需要检查**:
|
||||
```bash
|
||||
# 查询会话状态
|
||||
curl -X GET https://mpc-grpc.szaiai.com/api/sessions/f01810e9-4b0f-4933-a06a-0382124e0d25
|
||||
|
||||
# 查询参与者状态
|
||||
# 看 Party 1 (7c72c28f...) 的 status 是否为 "ready"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3. 添加更详细的日志
|
||||
|
||||
需要在以下位置添加日志:
|
||||
|
||||
#### TssNativeBridge.kt
|
||||
```kotlin
|
||||
suspend fun startKeygen(...): Result<Unit> {
|
||||
android.util.Log.d("TssNativeBridge", "startKeygen: sessionId=$sessionId, partyIndex=$partyIndex, t=$thresholdT, n=$thresholdN")
|
||||
android.util.Log.d("TssNativeBridge", "participants: $participants")
|
||||
|
||||
try {
|
||||
val participantsJson = gson.toJson(participants)
|
||||
android.util.Log.d("TssNativeBridge", "Calling native Tsslib.startKeygen...")
|
||||
|
||||
Tsslib.startKeygen(...)
|
||||
|
||||
android.util.Log.d("TssNativeBridge", "Tsslib.startKeygen returned (async)")
|
||||
Result.success(Unit)
|
||||
} catch (e: Exception) {
|
||||
android.util.Log.e("TssNativeBridge", "startKeygen FAILED", e)
|
||||
Result.failure(e)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### TssRepository.kt
|
||||
```kotlin
|
||||
// Line 2138 - 检查 markPartyReady 结果
|
||||
val markReadyResult = grpcClient.markPartyReady(sessionId, partyId)
|
||||
if (markReadyResult.isFailure) {
|
||||
android.util.Log.e("TssRepository", "Failed to mark party ready: ${markReadyResult.exceptionOrNull()?.message}")
|
||||
// 考虑重试或返回错误
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🛠️ 临时解决方案
|
||||
|
||||
### 方案 1: 重试 markPartyReady
|
||||
|
||||
修改 TssRepository.kt line 2138:
|
||||
```kotlin
|
||||
// 重试机制
|
||||
repeat(3) { attempt ->
|
||||
try {
|
||||
grpcClient.markPartyReady(sessionId, partyId)
|
||||
android.util.Log.d("TssRepository", "Successfully marked party ready on attempt ${attempt + 1}")
|
||||
break
|
||||
} catch (e: Exception) {
|
||||
android.util.Log.w("TssRepository", "markPartyReady attempt ${attempt + 1} failed: ${e.message}")
|
||||
if (attempt == 2) {
|
||||
// 最后一次尝试失败,抛出错误
|
||||
return@coroutineScope Result.failure(Exception("Failed to mark party ready after 3 attempts"))
|
||||
}
|
||||
delay(1000) // 等待1秒后重试
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 方案 2: 添加完整的日志
|
||||
|
||||
在所有关键位置添加日志,特别是:
|
||||
- TssNativeBridge.startKeygen
|
||||
- markPartyReady
|
||||
- message_collection job
|
||||
|
||||
---
|
||||
|
||||
## 📊 总结
|
||||
|
||||
### 确定的问题
|
||||
1. ✅ markPartyReady 失败(optimistic lock conflict)
|
||||
2. ✅ 534个消息堆积未处理
|
||||
3. ✅ TssNativeBridge 没有任何日志
|
||||
4. ✅ 没有进度更新
|
||||
|
||||
### 最可能的根本原因
|
||||
**markPartyReady 失败 → 服务器不发送消息给 Party 1 → 消息路由失效 → keygen 卡死**
|
||||
|
||||
### 下一步行动
|
||||
1. **立即**: 收集 Party 2 和 co-managed-party 的日志
|
||||
2. **短期**: 添加 TssNativeBridge 日志
|
||||
3. **中期**: 添加 markPartyReady 重试机制
|
||||
4. **长期**: 改进错误处理和日志记录
|
||||
|
||||
---
|
||||
|
||||
**请提供 Party 2 的日志,我可以进行对比分析!**
|
||||
|
|
@ -1344,8 +1344,20 @@ class TssRepository @Inject constructor(
|
|||
|
||||
_sessionStatus.value = SessionStatus.IN_PROGRESS
|
||||
|
||||
// Mark ready
|
||||
grpcClient.markPartyReady(sessionId, partyId)
|
||||
// Mark ready - with retry on optimistic lock conflict
|
||||
repeat(5) { attempt ->
|
||||
val markReadyResult = grpcClient.markPartyReady(sessionId, partyId)
|
||||
if (markReadyResult.isSuccess) {
|
||||
android.util.Log.d("TssRepository", "markPartyReady successful on attempt ${attempt + 1}")
|
||||
return@repeat
|
||||
}
|
||||
val error = markReadyResult.exceptionOrNull()
|
||||
android.util.Log.w("TssRepository", "markPartyReady attempt ${attempt + 1} failed: ${error?.message}")
|
||||
if (error?.message?.contains("optimistic lock conflict") == true && attempt < 4) {
|
||||
android.util.Log.d("TssRepository", "Retrying after ${(attempt + 1) * 500}ms...")
|
||||
delay((attempt + 1) * 500L)
|
||||
}
|
||||
}
|
||||
|
||||
// Wait for keygen result
|
||||
val keygenResult = tssNativeBridge.waitForKeygenResult(password)
|
||||
|
|
@ -2134,8 +2146,36 @@ class TssRepository @Inject constructor(
|
|||
|
||||
_sessionStatus.value = SessionStatus.IN_PROGRESS
|
||||
|
||||
// Mark ready
|
||||
grpcClient.markPartyReady(sessionId, partyId)
|
||||
// Mark ready - with retry on optimistic lock conflict
|
||||
var markReadySuccess = false
|
||||
repeat(5) { attempt ->
|
||||
val markReadyResult = grpcClient.markPartyReady(sessionId, partyId)
|
||||
if (markReadyResult.isSuccess) {
|
||||
android.util.Log.d("TssRepository", "Successfully marked party ready on attempt ${attempt + 1}")
|
||||
markReadySuccess = true
|
||||
return@repeat
|
||||
} else {
|
||||
val error = markReadyResult.exceptionOrNull()
|
||||
android.util.Log.w("TssRepository", "markPartyReady attempt ${attempt + 1} failed: ${error?.message}")
|
||||
|
||||
// If it's optimistic lock conflict, retry after a short delay
|
||||
if (error?.message?.contains("optimistic lock conflict") == true && attempt < 4) {
|
||||
android.util.Log.d("TssRepository", "Optimistic lock conflict detected, retrying after ${(attempt + 1) * 500}ms...")
|
||||
delay((attempt + 1) * 500L) // 500ms, 1s, 1.5s, 2s
|
||||
} else if (attempt == 4) {
|
||||
// Last attempt failed, return error
|
||||
stopProgressCollection()
|
||||
_sessionStatus.value = SessionStatus.FAILED
|
||||
return@coroutineScope Result.failure(Exception("Failed to mark party ready after 5 attempts: ${error?.message}"))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (!markReadySuccess) {
|
||||
stopProgressCollection()
|
||||
_sessionStatus.value = SessionStatus.FAILED
|
||||
return@coroutineScope Result.failure(Exception("Failed to mark party ready"))
|
||||
}
|
||||
|
||||
// Wait for keygen result
|
||||
val keygenResult = tssNativeBridge.waitForKeygenResult(password)
|
||||
|
|
|
|||
Loading…
Reference in New Issue