rwadurian/backend/mpc-system
hailin 6dda30c528 fix(android): implement a unified Job manager to prevent coroutine leaks [P0-3]
[Architectural safety fix: prevent coroutine leaks and memory leaks]

## Background

TssRepository previously held 4 independent Job variables:
- messageCollectionJob: message routing task
- sessionEventJob: session event subscription task
- sessionStatusPollingJob: session status polling task
- progressCollectionJob: progress collection task

Each Job had to be cancelled manually, which made coroutine leaks easy to introduce:
1. A Job is forgotten when the Activity is destroyed → the background coroutine keeps running → memory leak → OOM
2. The old Job is still running when the connection is quickly restarted → multiple Jobs run in parallel → resource contention
3. A Job is left running on an exception path → zombie coroutine → memory accumulation

## Fix

### 1. Create a unified JobManager class
```kotlin
private inner class JobManager {
    private val jobs = mutableMapOf<String, Job>()

    fun launch(name: String, block: suspend CoroutineScope.() -> Unit): Job {
        jobs[name]?.cancel()  // automatically cancel any old Job with the same name
        val job = repositoryScope.launch(block = block)
        jobs[name] = job
        return job
    }

    fun cancel(name: String) {
        jobs.remove(name)?.cancel()
    }

    fun isActive(name: String): Boolean = jobs[name]?.isActive == true

    fun cancelAll() {  // one call cleans up every Job
        jobs.values.forEach { it.cancel() }
        jobs.clear()
    }
}
```

### 2. Define Job name constants
```kotlin
companion object {
    const val JOB_MESSAGE_COLLECTION = "message_collection"
    const val JOB_SESSION_EVENT = "session_event"
    const val JOB_SESSION_STATUS_POLLING = "session_status_polling"
    const val JOB_PROGRESS_COLLECTION = "progress_collection"
}
```

### 3. Migrate all Job call sites

**Starting a Job:**
```kotlin
// BEFORE:
messageCollectionJob?.cancel()
messageCollectionJob = repositoryScope.launch { ... }

// AFTER:
jobManager.launch(JOB_MESSAGE_COLLECTION) { ... }
// The old Job is cancelled automatically; no manual cancel needed
```

**Cancelling a Job:**
```kotlin
// BEFORE:
messageCollectionJob?.cancel()

// AFTER:
jobManager.cancel(JOB_MESSAGE_COLLECTION)
```

**Checking Job status:**
```kotlin
// BEFORE:
if (messageCollectionJob == null || messageCollectionJob?.isActive != true)

// AFTER:
if (!jobManager.isActive(JOB_MESSAGE_COLLECTION))
```

**Cancelling all Jobs:**
```kotlin
// BEFORE (every Job must be cancelled manually; easy to miss one):
fun cleanup() {
    messageCollectionJob?.cancel()
    sessionEventJob?.cancel()
    sessionStatusPollingJob?.cancel()
    progressCollectionJob?.cancel()  // if this one is missed → memory leak
    repositoryScope.cancel()
}

// AFTER (one call, nothing gets missed):
fun cleanup() {
    jobManager.cancelAll()
    repositoryScope.cancel()
}
```

## Crash Scenarios Fixed

### Scenario 1: Activity rapidly destroyed and recreated
- **Problem**: if a Job was not cancelled when the Activity was destroyed, its background coroutine kept holding Activity/Context references
- **Consequence**: memory leak; OOM crash after repeated recreation
- **Fix**: JobManager.cancelAll() guarantees every Job is cancelled

### Scenario 2: Resource contention on network reconnect
- **Problem**: reconnect() after disconnect() started a new Job while the old one was still running
- **Consequence**: multiple messageCollectionJob instances ran in parallel, messages were processed twice, and state became inconsistent
- **Fix**: JobManager.launch() automatically cancels the old Job with the same name

### Scenario 3: Jobs not cleaned up on exception paths
- **Problem**: when an exception was thrown inside try-catch, the cleanup logic was skipped
- **Consequence**: zombie coroutines accumulated and memory grew steadily
- **Fix**: JobManager manages Jobs centrally; even if part of the cleanup fails, cancelAll() still cancels everything

## Scope of Changes

### Modified functions (11 total):
1. disconnect() - uses jobManager.cancelAll()
2. cleanup() - uses jobManager.cancelAll()
3. startSessionEventSubscription() - uses jobManager.launch(JOB_SESSION_EVENT)
4. ensureSessionEventSubscriptionActive() - uses jobManager.isActive(JOB_SESSION_EVENT)
5. startProgressCollection() - uses jobManager.launch(JOB_PROGRESS_COLLECTION)
6. stopProgressCollection() - uses jobManager.cancel(JOB_PROGRESS_COLLECTION)
7. startSessionStatusPolling() - uses jobManager.launch(JOB_SESSION_STATUS_POLLING)
8. stopSessionStatusPolling() - uses jobManager.cancel(JOB_SESSION_STATUS_POLLING)
9. startMessageRouting() - uses jobManager.launch(JOB_MESSAGE_COLLECTION)
10. cancelSession() - uses jobManager.cancel() to cancel several Jobs
11. post-completion cleanup for signing/keygen flows (multiple sites) - uses jobManager.cancel(JOB_MESSAGE_COLLECTION)

### Removed variables:
- messageCollectionJob: Job?
- sessionEventJob: Job?
- sessionStatusPollingJob: Job?
- progressCollectionJob: Job?

### Added code:
- JobManager inner class (110 lines, with detailed comments)
- 4 Job name constants

## Test Verification

Build status: BUILD SUCCESSFUL in 2m 10s
- No compile errors
- Warnings only (unused parameters), no functional impact

## Follow-up Suggestions

Possible further improvements (see the sketch below):
1. Add Job timeout detection (to avoid permanently running zombie coroutines)
2. Add a Job exception-handling callback (unified error handling)
3. Add Job start/cancel logging (already implemented in JobManager)
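
A rough sketch of how suggestions 1 and 2 could be combined in a future launch variant. This is hypothetical code, not part of this commit, and assumes the same kotlinx.coroutines APIs JobManager already uses:

```kotlin
// Hypothetical JobManager extension (not in this commit): cancels a Job that
// exceeds a deadline and funnels all failures into a single onError callback.
fun launchWithTimeout(
    name: String,
    timeoutMs: Long,
    onError: (name: String, cause: Throwable) -> Unit,
    block: suspend CoroutineScope.() -> Unit
): Job {
    jobs[name]?.cancel()
    val job = repositoryScope.launch {
        try {
            withTimeout(timeoutMs) { block() }
        } catch (e: TimeoutCancellationException) {
            onError(name, e)  // zombie-coroutine guard tripped
        } catch (e: CancellationException) {
            throw e           // normal cancellation: propagate as usual
        } catch (e: Throwable) {
            onError(name, e)  // unified error handling
        }
    }
    jobs[name] = job
    return job
}
```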

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-26 21:38:03 -08:00
.claude refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
api fix(mpc-system): GetSessionStatus returns the actual threshold_n and threshold_t 2025-12-29 11:59:53 -08:00
docs feat(mpc-system): add event sourcing for session tracking 2025-12-05 23:31:04 -08:00
k8s feat(mpc-system): implement party role labels with strict persistent-only default 2025-12-05 07:08:59 -08:00
migrations fix(migration): make database migration scripts idempotent for repeated execution 2025-12-28 05:26:38 -08:00
pkg fix(tss): convert threshold to tss-lib format (threshold-1) in all keygen and signing 2025-12-31 12:19:58 -08:00
scripts fix: convert deploy.sh CRLF to LF and add executable permission 2025-12-07 07:01:13 -08:00
services fix(android): implement a unified Job manager to prevent coroutine leaks [P0-3] 2026-01-26 21:38:03 -08:00
tests refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
.env.example docs(config): update .env.example files for production deployment 2025-12-07 04:55:21 -08:00
.env.party.example feat(mpc-system): add signing parties configuration and delegate signing support 2025-12-05 22:47:55 -08:00
.env.prod.example feat(mpc-system): add signing parties configuration and delegate signing support 2025-12-05 22:47:55 -08:00
.gitignore refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
DELEGATE_PARTY_GUIDE.md feat(mpc-system): implement delegate party for hybrid custody 2025-12-05 09:07:46 -08:00
MPC-Distributed-Signature-System-Complete-Spec.md refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
MPC_INTEGRATION_GUIDE.md refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
Makefile refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
PARTY_ROLE_VERIFICATION_REPORT.md refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
README.md feat(mpc-system): implement Kubernetes-based dynamic party pool architecture 2025-12-05 06:12:49 -08:00
TEST_REPORT.md refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
VERIFICATION_REPORT.md refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
config.example.yaml refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
deploy.sh feat(mpc-system): add server-party-co-managed for co_managed_keygen sessions 2025-12-29 23:54:45 -08:00
docker-compose.party.yml chore(docker): add timezone configuration for mpc-system, api-gateway, and infrastructure 2025-12-23 18:35:09 -08:00
docker-compose.prod.yml chore(docker): add timezone configuration for mpc-system, api-gateway, and infrastructure 2025-12-23 18:35:09 -08:00
docker-compose.yml feat(mpc-system): add server-party-co-managed for co_managed_keygen sessions 2025-12-29 23:54:45 -08:00
get-docker.sh refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
go.mod feat(mpc-system): implement party-driven architecture with SessionEvent broadcasting 2025-12-05 08:44:05 -08:00
go.sum feat(mpc-system): implement party-driven architecture with SessionEvent broadcasting 2025-12-05 08:44:05 -08:00
test_create_session.go feat: add keygen_session_id to signing session flow 2025-12-06 08:39:40 -08:00
test_real_scenario.sh refactor(mpc-system): migrate to party-driven architecture with PartyID-based routing 2025-12-05 08:11:28 -08:00
test_signing.go test: update signing test username 2025-12-06 10:54:22 -08:00
test_signing_parties_api.go fix: update test username for signing parties API test 2025-12-06 10:29:30 -08:00

README.md

MPC System Deployment Guide

A Multi-Party Computation (MPC) system implementing a secure threshold signature scheme (TSS) for the RWADurian project.

Overview

The MPC system implements a 2-of-3 threshold signature scheme where:

  • Server parties from a dynamically scalable pool hold key shares
  • At least 2 parties are required to generate signatures (configurable threshold)
  • User shares are generated dynamically and returned to the calling service
  • All shares are encrypted using AES-256-GCM

Key Features

  • Threshold Cryptography: Configurable N-of-M TSS for enhanced security
  • Dynamic Party Pool: Kubernetes-based service discovery for automatic party scaling
  • Distributed Architecture: Services communicate via gRPC and WebSocket
  • Secure Storage: AES-256-GCM encryption for all stored shares
  • API Authentication: API key and IP-based access control
  • Session Management: Coordinated multi-party computation sessions
  • MPC Protocol Compliance: DeviceInfo optional, aligning with international MPC standards

Architecture

┌────────────────────────────────────────────────────────────────┐
│                         MPC System                              │
│                                                                 │
│  ┌──────────────────┐        ┌──────────────────┐              │
│  │ Account Service  │        │ Server Party API │              │
│  │  (Port 4000)     │        │  (Port 8083)     │              │
│  │ External API     │        │ User Share Gen   │              │
│  └────────┬─────────┘        └────────┬─────────┘              │
│           │                           │                        │
│           ▼                           ▼                        │
│  ┌──────────────────┐        ┌──────────────────┐              │
│  │   Session        │◄──────►│ Message Router   │              │
│  │   Coordinator    │        │  (Port 8082)     │              │
│  │  (Port 8081)     │        │  WebSocket       │              │
│  └────────┬─────────┘        └────────┬─────────┘              │
│           │                           │                        │
│           ▼                           ▼                        │
│  ┌────────────────────────────────────────────┐                │
│  │   Server Party Pool (Dynamically Scalable) │                │
│  │   ┌──────────┐ ┌──────────┐ ┌──────────┐  │                │
│  │   │ Party 1  │ │ Party 2  │ │ Party 3  │  │  K8s Discovery │
│  │   │  (TSS)   │ │  (TSS)   │ │  (TSS)   │  │  Auto-selected │
│  │   └──────────┘ └──────────┘ └──────────┘  │  from pool     │
│  │   ┌──────────┐     ... can scale up/down  │                │
│  │   │ Party N  │                             │                │
│  │   └──────────┘                             │                │
│  └────────────────────────────────────────────┘                │
│                                                                 │
│  ┌────────────────────────────────────────────┐                │
│  │         Infrastructure Services            │                │
│  │  PostgreSQL  │  Redis  │  RabbitMQ         │                │
│  └────────────────────────────────────────────┘                │
└────────────────────────────────────────────────────────────────┘
                           │
                           │ Network Access
                           ▼
              ┌──────────────────────────┐
              │   Backend Services       │
              │   mpc-service (caller)   │
              └──────────────────────────┘

Deployment Options

This system supports two deployment modes:

Option 1: Docker Compose (Development/Simple Deployment)

  • Quick setup for development or simple production environments
  • Fixed 3 server parties (hardcoded IDs)
  • See instructions below in "Quick Start"

Option 2: Kubernetes (Production/Scalable Deployment)

  • Dynamic party pool with service discovery
  • Horizontally scalable server parties
  • Recommended for production environments
  • See k8s/README.md for detailed instructions

Quick Start (Docker Compose)

Prerequisites

  • Docker (version 20.10+)
  • Docker Compose (version 2.0+)
  • Network Access from backend services
  • Ports Available: 4000, 8081, 8082, 8083

1. Initial Setup

cd backend/mpc-system

# Create environment configuration
cp .env.example .env

# Edit configuration for your environment
nano .env

2. Configure Environment

Edit .env and update the following REQUIRED values:

# Database password (REQUIRED)
POSTGRES_PASSWORD=your_secure_postgres_password

# RabbitMQ password (REQUIRED)
RABBITMQ_PASSWORD=your_secure_rabbitmq_password

# JWT secret key (REQUIRED, min 32 chars)
JWT_SECRET_KEY=your_jwt_secret_key_at_least_32_characters

# Master encryption key (REQUIRED, exactly 64 hex chars)
# WARNING: If you lose this, encrypted shares cannot be recovered!
# Generate with `openssl rand -hex 32` and paste the literal value here;
# .env parsers do not expand $(...) command substitution.
CRYPTO_MASTER_KEY=<64_hex_chars_from_openssl_rand>

# API key for server-to-server auth (REQUIRED)
# Must match the MPC_API_KEY in your backend mpc-service config
MPC_API_KEY=your_api_key_matching_mpc_service

# Allowed IPs (REQUIRED - update to actual backend server IP!)
ALLOWED_IPS=192.168.1.111

3. Deploy Services

# Start all services
./deploy.sh up

# Check status
./deploy.sh status

# View logs
./deploy.sh logs

4. Verify Deployment

# Health check
./deploy.sh health

# Test API
./deploy.sh test-api

Configuration

All configuration is managed through .env file. See .env.example for complete documentation.

Critical Environment Variables

Variable           Description                  Required  Example
POSTGRES_PASSWORD  Database password            Yes       openssl rand -base64 32
RABBITMQ_PASSWORD  Message broker password      Yes       openssl rand -base64 32
JWT_SECRET_KEY     JWT signing key (≥32 chars)  Yes       openssl rand -base64 48
CRYPTO_MASTER_KEY  AES-256 key (64 hex chars)   Yes       openssl rand -hex 32
MPC_API_KEY        API authentication key       Yes       openssl rand -base64 48
ALLOWED_IPS        Comma-separated allowed IPs  Yes       192.168.1.111,192.168.1.112
ENVIRONMENT        Environment name             No        production (default)
REDIS_PASSWORD     Redis password               No        leave empty for internal network

Generating Secure Keys

# PostgreSQL & RabbitMQ passwords
openssl rand -base64 32

# JWT Secret Key
openssl rand -base64 48

# Master Encryption Key (MUST be exactly 64 hex characters)
openssl rand -hex 32

# API Key
openssl rand -base64 48

Configuration Checklist

Before deploying to production:

  • Change all default passwords
  • Generate secure CRYPTO_MASTER_KEY and back it up securely
  • Set MPC_API_KEY to match backend mpc-service configuration
  • Update ALLOWED_IPS to actual backend server IP(s)
  • Backup .env file to secure location (NOT in git!)

Deployment Commands

Basic Operations

./deploy.sh up          # Start all services
./deploy.sh down        # Stop all services
./deploy.sh restart     # Restart all services
./deploy.sh logs [svc]  # View logs (all or specific service)
./deploy.sh status      # Show service status
./deploy.sh health      # Health check all services

Build Commands

./deploy.sh build            # Build Docker images
./deploy.sh build-no-cache   # Rebuild without cache

Service Management

# Infrastructure only
./deploy.sh infra up    # Start postgres, redis, rabbitmq
./deploy.sh infra down  # Stop infrastructure

# MPC services only
./deploy.sh mpc up      # Start MPC services
./deploy.sh mpc down    # Stop MPC services
./deploy.sh mpc restart # Restart MPC services

Debugging

./deploy.sh logs-tail [service]  # Last 100 log lines
./deploy.sh shell [service]      # Open shell in container
./deploy.sh test-api             # Test Account Service API

Cleanup

# WARNING: This removes all data!
./deploy.sh clean

Services

External Services (Exposed Ports)

Service              Port  Protocol        Purpose
account-service      4000  HTTP            Main API for backend integration
session-coordinator  8081  HTTP/gRPC       Session coordination
message-router       8082  WebSocket/gRPC  Message routing
server-party-api     8083  HTTP            User share generation

Internal Services

Service Purpose
server-party-1/2/3 TSS parties (Docker Compose mode - fixed IDs)
server-party-pool TSS party pool (Kubernetes mode - dynamic scaling)
postgres Database for session/account data
redis Cache and temporary data
rabbitmq Message broker for inter-service communication

Note: In Kubernetes mode, server parties are discovered dynamically using K8s service discovery. Parties can be scaled up/down without service interruption.

Service Dependencies

Infrastructure Services (postgres, redis, rabbitmq)
    ↓
Session Coordinator & Message Router
    ↓
Server Parties (1, 2, 3) & Server Party API
    ↓
Account Service (external API)

Security

Access Control

  1. IP Whitelisting: Only IPs in ALLOWED_IPS can access the API
  2. API Key Authentication: Requires a valid MPC_API_KEY header (see the sketch after this list)
  3. Network Isolation: Services communicate within Docker network
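
The two request-level checks can be thought of as a single gate. The following is a minimal illustrative sketch; the real middleware lives in the Go services, and all names here are hypothetical:

import java.security.MessageDigest

// Hypothetical request gate combining checks 1 and 2 above; constant-time
// comparison via MessageDigest.isEqual avoids leaking the key through timing.
fun isRequestAllowed(
    remoteIp: String,
    presentedKey: String?,
    allowedIps: Set<String>,  // parsed from ALLOWED_IPS
    expectedKey: String       // MPC_API_KEY
): Boolean {
    if (remoteIp !in allowedIps || presentedKey == null) return false
    return MessageDigest.isEqual(presentedKey.toByteArray(), expectedKey.toByteArray())
}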

Data Protection

  1. Encryption at Rest: All shares encrypted with AES-256-GCM (illustrated below)
  2. Master Key: CRYPTO_MASTER_KEY must be securely stored and backed up
  3. Secure Transport: Use HTTPS/TLS for external communication
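
For illustration, AES-256-GCM encryption at rest follows the standard pattern below. This is a sketch using the JVM's JCA API; the actual share-encryption code is in the Go services, and the function name is hypothetical:

import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// Sketch of AES-256-GCM encryption at rest: random 96-bit nonce per message,
// 128-bit auth tag, nonce prepended to the ciphertext for storage.
fun encryptShare(masterKey: ByteArray, plaintext: ByteArray): ByteArray {
    require(masterKey.size == 32) { "CRYPTO_MASTER_KEY must decode to 32 bytes" }
    val nonce = ByteArray(12).also { SecureRandom().nextBytes(it) }
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, SecretKeySpec(masterKey, "AES"), GCMParameterSpec(128, nonce))
    return nonce + cipher.doFinal(plaintext)  // doFinal appends the GCM tag
}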

Best Practices

  • Never commit .env to version control
  • Backup CRYPTO_MASTER_KEY to multiple secure locations
  • Rotate API keys regularly
  • Use strong passwords (min 32 chars)
  • Restrict database ports (don't expose to internet)
  • Monitor failed authentication attempts
  • Enable audit logging

Key Backup

# Backup master key (CRITICAL!)
echo "CRYPTO_MASTER_KEY=$(grep CRYPTO_MASTER_KEY .env | cut -d= -f2)" > master_key.backup

# Store securely (encrypted USB, password manager, vault)
# NEVER store in plaintext on the server

Troubleshooting

Services won't start

# Check logs
./deploy.sh logs

# Check specific service
./deploy.sh logs postgres

# Common issues:
# 1. Ports already in use
# 2. .env file missing or misconfigured
# 3. Database initialization failed

Database connection errors

# Check postgres health
docker compose ps postgres

# View postgres logs
./deploy.sh logs postgres

# Restart infrastructure
./deploy.sh infra down
./deploy.sh infra up

API returns 403 Forbidden

# Check ALLOWED_IPS configuration
grep ALLOWED_IPS .env

# Verify caller's IP is in the list
# Update .env and restart:
./deploy.sh restart

API returns 401 Unauthorized

# Verify MPC_API_KEY matches between:
# 1. This system's .env
# 2. Backend mpc-service configuration

# Check API key
grep MPC_API_KEY .env

# Restart after updating
./deploy.sh restart

Keygen or signing fails

# Check all server parties are healthy
./deploy.sh health

# View server party logs
./deploy.sh logs server-party-1
./deploy.sh logs server-party-2
./deploy.sh logs server-party-3

# Check message router
./deploy.sh logs message-router

# Restart MPC services
./deploy.sh mpc restart

Lost master encryption key

CRITICAL: If CRYPTO_MASTER_KEY is lost, encrypted shares cannot be recovered!

Prevention:

  • Backup key immediately after generation
  • Store in multiple secure locations
  • Use enterprise key management system in production

Production Deployment

Pre-Deployment Checklist

  • Generate all secure keys and passwords
  • Backup CRYPTO_MASTER_KEY to secure locations
  • Configure ALLOWED_IPS for actual backend server
  • Sync MPC_API_KEY with backend mpc-service
  • Set up database backups
  • Configure log aggregation
  • Set up monitoring and alerts
  • Document recovery procedures
  • Test disaster recovery

Deployment Steps

Step 1: Prepare Environment

# On MPC server
git clone <repo> /opt/rwadurian
cd /opt/rwadurian/backend/mpc-system

# Configure environment
cp .env.example .env
nano .env  # Set all required values

# Generate and backup keys
openssl rand -hex 32 > master_key.txt
# Copy to secure storage, then delete:
# rm master_key.txt

Step 2: Deploy Services

# Build images
./deploy.sh build

# Start services
./deploy.sh up

# Verify all healthy
./deploy.sh health

Step 3: Configure Firewall

# Allow backend server to access MPC ports
sudo ufw allow from <BACKEND_IP> to any port 4000
sudo ufw allow from <BACKEND_IP> to any port 8081
sudo ufw allow from <BACKEND_IP> to any port 8082
sudo ufw allow from <BACKEND_IP> to any port 8083

# Deny all other external access
sudo ufw default deny incoming
sudo ufw enable

Step 4: Test Integration

# From backend server, test API access
curl -H "X-API-Key: YOUR_MPC_API_KEY" \
  http://<MPC_SERVER_IP>:4000/health

Monitoring

Monitor these metrics (a polling sketch follows the list):

  • Service health status
  • API request rate and latency
  • Failed authentication attempts
  • Database connection pool usage
  • RabbitMQ queue depths
  • Key generation/signing success rates
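
As a minimal starting point for health monitoring, a script could poll each exposed port. This is a hypothetical Kotlin sketch: the ports come from the Services table, and it assumes every service exposes a /health endpoint the way account-service does:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Duration

// Hypothetical health poller: reports which exposed services answer /health.
fun main() {
    val client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build()
    val services = mapOf(
        "account-service" to 4000,
        "session-coordinator" to 8081,
        "message-router" to 8082,
        "server-party-api" to 8083,
    )
    for ((name, port) in services) {
        val status = runCatching {
            val req = HttpRequest.newBuilder(URI.create("http://localhost:$port/health")).GET().build()
            client.send(req, HttpResponse.BodyHandlers.discarding()).statusCode()
        }.getOrNull()
        println("$name: ${if (status == 200) "healthy" else "UNHEALTHY ($status)"}")
    }
}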

Backup Strategy

# Database backup (daily)
docker compose exec postgres pg_dump -U mpc_user mpc_system > backup_$(date +%Y%m%d).sql

# Configuration backup
tar -czf config_backup_$(date +%Y%m%d).tar.gz .env docker-compose.yml

# Encryption key backup (secure storage only!)

Disaster Recovery

  1. Service Failure: Restart affected service using ./deploy.sh restart
  2. Database Corruption: Restore from latest backup
  3. Key Loss: If CRYPTO_MASTER_KEY lost, all encrypted shares are unrecoverable
  4. Full System Recovery: Redeploy from backups, restore database

Performance Tuning

# docker-compose.yml - adjust resources
services:
  session-coordinator:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

API Reference

Account Service API (Port 4000)

# Health check
curl http://localhost:4000/health

# Create account (keygen)
curl -X POST http://localhost:4000/api/v1/accounts \
  -H "X-API-Key: YOUR_MPC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "user123"}'

# Sign transaction
curl -X POST http://localhost:4000/api/v1/accounts/{account_id}/sign \
  -H "X-API-Key: YOUR_MPC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "tx_hash"}'

Server Party API (Port 8083)

# Generate user share
curl -X POST http://localhost:8083/api/v1/shares/generate \
  -H "X-API-Key: YOUR_MPC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "session123"}'

Getting Help

  • Check logs: ./deploy.sh logs
  • Health check: ./deploy.sh health
  • View commands: ./deploy.sh help
  • Review .env.example for configuration options

License

Copyright © 2024 RWADurian. All rights reserved.