rwadurian/backend/mpc-system/services
hailin 003871aded fix(android): fix critical bug where a markPartyReady optimistic lock conflict caused keygen to fail [CRITICAL]
## Root cause
Analysis of the user's logs revealed the key error:
```
15:58:58.318 E/GrpcClient: Mark party ready failed:
INTERNAL: optimistic lock conflict: session was modified by another transaction
```

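**Failure chain**:
1. markPartyReady fails (optimistic lock conflict)
2. The code does not check the return value and keeps going
3. The server considers the party not ready and does not deliver TSS messages
4. 534 messages pile up (the same count is logged at 15:58:58.345 and again at 15:59:28.440)
5. The TSS protocol cannot make progress
6. keygen hangs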

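## What was fixed

### 1. Added a retry mechanism for markPartyReady
Every call site of markPartyReady now retries intelligently:
- Up to 5 attempts
- When an optimistic lock conflict is detected, wait before retrying (500ms, 1s, 1.5s, 2s)
- Each retry is logged in detail
- After 5 failed attempts, stop progress collection and return an error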

### 2. Locations fixed (6)
- startKeygenAsInitiator (line 2137)
- joinKeygenViaGrpc (line 1347)
- startSignAsInitiator (line ~1540)
- joinSignViaGrpc (line ~1686)
- startSignAsJoiner (line ~1888)
- co-sign related functions

### 3. Logging improvements
Added detailed retry logs:
- "markPartyReady successful on attempt X"
- "markPartyReady attempt X failed: {error}"
- "Retrying after Xms..."

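## Why did it work 24 hours ago?

**This is not a safeLaunch problem!** Rather:
1. Before the optimization, markPartyReady failures were silently ignored
2. It could still happen to work (when no concurrent conflict occurred)
3. Concurrency has since increased or server load is high, so conflicts are now frequent
4. With no retry mechanism, a single failure left the session stuck permanently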

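## How to verify

Re-test creating a 2-of-3 wallet. The logs should show:
- "markPartyReady successful on attempt 1", or
- "Retrying after 500ms..." → "markPartyReady successful on attempt 2"

The following should no longer occur:
- 534 messages piled up and unchanged for 30 seconds
- keygen stuck permanently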

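## Additional documentation

Created LOG_ANALYSIS_PARTY1.md with a detailed log analysis:
- Full walkthrough of the log flow
- Identification of 3 key problems
- Root cause inference (70% probability that markPartyReady failed)
- Temporary and permanent solutions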

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-27 00:09:40 -08:00
..
account Revert "fix(co-keygen): convert threshold at storage time to match tss-lib convention" 2025-12-31 10:24:25 -08:00
message-router fix(message-router): prevent subscription race condition on gRPC reconnect 2026-01-01 10:04:11 -08:00
server-party fix(participate_signing): restore the UserShareData branch of the Execute method 2026-01-26 19:00:52 -08:00
server-party-api fix(context): use parent context instead of Background() to allow proper cancellation 2025-12-06 06:36:34 -08:00
server-party-co-managed fix(co-managed): use the PartyIndex from the database instead of the loop index 2026-01-26 20:24:32 -08:00
service-party-android fix(android): fix critical bug where a markPartyReady optimistic lock conflict caused keygen to fail [CRITICAL] 2026-01-27 00:09:40 -08:00
service-party-app fix(tss): fix signing failure after backup restore 2026-01-20 00:39:05 -08:00
session-coordinator feat(session): broadcast participant_joined event via gRPC for real-time UI updates 2026-01-01 08:34:47 -08:00
tss-wasm feat(tss): add real-time round progress from msg.Type() parsing 2026-01-01 22:41:51 -08:00