# Architecture Safety Fix: Prevent Coroutine and Memory Leaks
## Background
TssRepository previously held 4 independent Job variables:
- messageCollectionJob: message routing task
- sessionEventJob: session event subscription task
- sessionStatusPollingJob: session status polling task
- progressCollectionJob: progress collection task
Each Job had to be cancelled by hand, which made coroutine leaks easy to introduce:
1. A Job is forgotten when the Activity is destroyed → the background coroutine keeps running → memory leak → OOM
2. The old Job is not cancelled on a rapid reconnect → multiple Jobs run in parallel → resource contention
3. A Job is left running on an exception path → zombie coroutine → memory keeps accumulating
## Solution
### 1. Add a JobManager class to manage all Jobs centrally
```kotlin
private inner class JobManager {
    private val jobs = mutableMapOf<String, Job>()

    fun launch(name: String, block: suspend CoroutineScope.() -> Unit): Job {
        jobs[name]?.cancel() // Automatically cancel any old Job registered under the same name
        val job = repositoryScope.launch(block = block)
        jobs[name] = job
        return job
    }

    fun cancel(name: String) {
        jobs.remove(name)?.cancel()
    }

    fun isActive(name: String): Boolean = jobs[name]?.isActive == true

    fun cancelAll() { // One call cleans up every Job
        jobs.values.forEach { it.cancel() }
        jobs.clear()
    }
}
```
### 2. Define Job name constants
```kotlin
companion object {
const val JOB_MESSAGE_COLLECTION = "message_collection"
const val JOB_SESSION_EVENT = "session_event"
const val JOB_SESSION_STATUS_POLLING = "session_status_polling"
const val JOB_PROGRESS_COLLECTION = "progress_collection"
}
```
### 3. Migrate all Job usage
**Starting a Job:**
```kotlin
// BEFORE:
messageCollectionJob?.cancel()
messageCollectionJob = repositoryScope.launch { ... }
// AFTER:
jobManager.launch(JOB_MESSAGE_COLLECTION) { ... }
// The old Job with the same name is cancelled automatically; no manual cancel needed
```
**Cancelling a Job:**
```kotlin
// BEFORE:
messageCollectionJob?.cancel()
// AFTER:
jobManager.cancel(JOB_MESSAGE_COLLECTION)
```
**Checking Job status:**
```kotlin
// BEFORE:
if (messageCollectionJob == null || messageCollectionJob?.isActive != true)
// AFTER:
if (!jobManager.isActive(JOB_MESSAGE_COLLECTION))
```
**Cancelling all Jobs:**
```kotlin
// BEFORE (every Job cancelled by hand; easy to miss one):
fun cleanup() {
    messageCollectionJob?.cancel()
    sessionEventJob?.cancel()
    sessionStatusPollingJob?.cancel()
    progressCollectionJob?.cancel() // if this one is missed → memory leak
    repositoryScope.cancel()
}

// AFTER (one call, nothing to forget):
fun cleanup() {
    jobManager.cancelAll()
    repositoryScope.cancel()
}
```
## Crash Scenarios Fixed
### Scenario 1: Rapid Activity destroy/recreate
- **Original problem**: if a Job was not cancelled when the Activity was destroyed, the background coroutine kept holding Activity/Context references
- **Consequence**: memory leak; OOM crash after repeated recreation
- **Fix**: JobManager.cancelAll() ensures every Job is cancelled
### Scenario 2: Resource contention on network reconnect
- **Original problem**: reconnect() after disconnect() started a new Job while the old one was still running
- **Consequence**: multiple messageCollectionJob instances ran in parallel, messages were processed twice, and state became inconsistent
- **Fix**: JobManager.launch() automatically cancels the old Job with the same name
### Scenario 3: Jobs not cleaned up on exception paths
- **Original problem**: when an exception fired inside a try-catch, the cleanup logic was skipped
- **Consequence**: zombie coroutines accumulated and memory grew steadily
- **Fix**: Jobs are managed centrally; even if part of the cleanup fails, cancelAll() still clears everything
## Impact
### Modified functions (11 total):
1. disconnect() - uses jobManager.cancelAll()
2. cleanup() - uses jobManager.cancelAll()
3. startSessionEventSubscription() - uses jobManager.launch(JOB_SESSION_EVENT)
4. ensureSessionEventSubscriptionActive() - uses jobManager.isActive(JOB_SESSION_EVENT)
5. startProgressCollection() - uses jobManager.launch(JOB_PROGRESS_COLLECTION)
6. stopProgressCollection() - uses jobManager.cancel(JOB_PROGRESS_COLLECTION)
7. startSessionStatusPolling() - uses jobManager.launch(JOB_SESSION_STATUS_POLLING)
8. stopSessionStatusPolling() - uses jobManager.cancel(JOB_SESSION_STATUS_POLLING)
9. startMessageRouting() - uses jobManager.launch(JOB_MESSAGE_COLLECTION)
10. cancelSession() - uses jobManager.cancel() to cancel several Jobs
11. Cleanup logic that runs after signing/keygen completes (several places) - uses jobManager.cancel(JOB_MESSAGE_COLLECTION)
### Removed variables:
- messageCollectionJob: Job?
- sessionEventJob: Job?
- sessionStatusPollingJob: Job?
- progressCollectionJob: Job?
### Added code:
- JobManager inner class (110 lines, with detailed comments)
- 4 Job name constants
## Verification
Build status: ✅ BUILD SUCCESSFUL in 2m 10s
- No compilation errors
- Warnings only (unused parameters), with no functional impact
## Follow-Up Suggestions
Possible further improvements:
1. Job timeout detection (to catch zombie coroutines that never finish)
2. A Job exception-handling callback for unified error handling (a combined sketch of suggestions 1 and 2 follows below)
3. Job start/cancel logging (already implemented in JobManager)
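A minimal sketch of suggestions 1 and 2, assuming it lives inside the JobManager shown above; `launchSupervised` is a hypothetical name, not part of this change:
```kotlin
import kotlinx.coroutines.*

// Hypothetical JobManager method combining timeout detection (1) and an
// exception callback (2); `jobs` and `repositoryScope` are the fields used above.
fun launchSupervised(
    name: String,
    timeoutMs: Long? = null,
    onError: (name: String, cause: Throwable) -> Unit,
    block: suspend CoroutineScope.() -> Unit,
): Job {
    jobs[name]?.cancel() // same "cancel the old Job with the same name" contract as launch()
    val job = repositoryScope.launch {
        try {
            if (timeoutMs != null) withTimeout(timeoutMs) { block() } else block()
        } catch (e: TimeoutCancellationException) {
            onError(name, e) // zombie-coroutine guard: the Job exceeded its time budget
        } catch (e: CancellationException) {
            throw e // normal cancellation must propagate, never be swallowed
        } catch (e: Throwable) {
            onError(name, e) // unified error callback
        }
    }
    jobs[name] = job
    return job
}
```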
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# MPC System Deployment Guide
Multi-Party Computation (MPC) system for secure threshold signature scheme (TSS) implementation in the RWADurian project.
## Table of Contents
- Overview
- Architecture
- Quick Start
- Configuration
- Deployment Commands
- Services
- Security
- Troubleshooting
- Production Deployment
## Overview
The MPC system implements a 2-of-3 threshold signature scheme where:
- Server parties from a dynamically scalable pool hold key shares
- At least 2 parties are required to generate signatures (configurable threshold)
- User shares are generated dynamically and returned to the calling service
- All shares are encrypted using AES-256-GCM
### Key Features
- **Threshold Cryptography**: Configurable N-of-M TSS for enhanced security
- **Dynamic Party Pool**: Kubernetes-based service discovery for automatic party scaling
- **Distributed Architecture**: Services communicate via gRPC and WebSocket
- **Secure Storage**: AES-256-GCM encryption for all stored shares
- **API Authentication**: API key and IP-based access control
- **Session Management**: Coordinated multi-party computation sessions
- **MPC Protocol Compliance**: DeviceInfo optional, aligning with international MPC standards
## Architecture
```
┌────────────────────────────────────────────────────────────────┐
│ MPC System │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Account Service │ │ Server Party API │ │
│ │ (Port 4000) │ │ (Port 8083) │ │
│ │ External API │ │ User Share Gen │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Session │◄──────►│ Message Router │ │
│ │ Coordinator │ │ (Port 8082) │ │
│ │ (Port 8081) │ │ WebSocket │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ Server Party Pool (Dynamically Scalable) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Party 1 │ │ Party 2 │ │ Party 3 │ │ K8s Discovery │
│ │ │ (TSS) │ │ (TSS) │ │ (TSS) │ │ Auto-selected │
│ │ └──────────┘ └──────────┘ └──────────┘ │ from pool │
│ │ ┌──────────┐ ... can scale up/down │ │
│ │ │ Party N │ │ │
│ │ └──────────┘ │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Infrastructure Services │ │
│ │ PostgreSQL │ Redis │ RabbitMQ │ │
│ └────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
│
│ Network Access
▼
┌──────────────────────────┐
│ Backend Services │
│ mpc-service (caller) │
└──────────────────────────┘
```
### Deployment Options
This system supports two deployment modes:
#### Option 1: Docker Compose (Development/Simple Deployment)
- Quick setup for development or simple production environments
- Fixed 3 server parties (hardcoded IDs)
- See instructions below in "Quick Start"
#### Option 2: Kubernetes (Production/Scalable Deployment)
- Dynamic party pool with service discovery
- Horizontally scalable server parties
- Recommended for production environments
- See `k8s/README.md` for detailed instructions
## Quick Start (Docker Compose)
### Prerequisites
- Docker (version 20.10+)
- Docker Compose (version 2.0+)
- Network access from backend services
- Ports available: 4000, 8081, 8082, 8083
### 1. Initial Setup
```bash
cd backend/mpc-system

# Create environment configuration
cp .env.example .env

# Edit configuration for your environment
nano .env
```
### 2. Configure Environment
Edit `.env` and update the following REQUIRED values:
```bash
# Database password (REQUIRED)
POSTGRES_PASSWORD=your_secure_postgres_password

# RabbitMQ password (REQUIRED)
RABBITMQ_PASSWORD=your_secure_rabbitmq_password

# JWT secret key (REQUIRED, min 32 chars)
JWT_SECRET_KEY=your_jwt_secret_key_at_least_32_characters

# Master encryption key (REQUIRED, exactly 64 hex chars)
# WARNING: If you lose this, encrypted shares cannot be recovered!
CRYPTO_MASTER_KEY=$(openssl rand -hex 32)

# API key for server-to-server auth (REQUIRED)
# Must match the MPC_API_KEY in your backend mpc-service config
MPC_API_KEY=your_api_key_matching_mpc_service

# Allowed IPs (REQUIRED - update to actual backend server IP!)
ALLOWED_IPS=192.168.1.111
```
### 3. Deploy Services
```bash
# Start all services
./deploy.sh up

# Check status
./deploy.sh status

# View logs
./deploy.sh logs
```
### 4. Verify Deployment
```bash
# Health check
./deploy.sh health

# Test API
./deploy.sh test-api
```
## Configuration
All configuration is managed through the `.env` file. See `.env.example` for complete documentation.
### Critical Environment Variables
| Variable | Description | Required | Example |
|---|---|---|---|
| `POSTGRES_PASSWORD` | Database password | Yes | `openssl rand -base64 32` |
| `RABBITMQ_PASSWORD` | Message broker password | Yes | `openssl rand -base64 32` |
| `JWT_SECRET_KEY` | JWT signing key (≥32 chars) | Yes | `openssl rand -base64 48` |
| `CRYPTO_MASTER_KEY` | AES-256 key (64 hex chars) | Yes | `openssl rand -hex 32` |
| `MPC_API_KEY` | API authentication key | Yes | `openssl rand -base64 48` |
| `ALLOWED_IPS` | Comma-separated allowed IPs | Yes | `192.168.1.111,192.168.1.112` |
| `ENVIRONMENT` | Environment name | No | `production` (default) |
| `REDIS_PASSWORD` | Redis password | No | Leave empty for internal network |
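A minimal fail-fast validation sketch for these variables, written in Kotlin/JVM for illustration only (`env` and `validateConfig` are hypothetical helpers, not part of the shipped services):
```kotlin
// Hypothetical startup check enforcing the constraints from the table above.
fun env(name: String, valid: (String) -> Boolean = { it.isNotBlank() }): String {
    val value = System.getenv(name) ?: error("$name is required but not set")
    require(valid(value)) { "$name is set but invalid" }
    return value
}

fun validateConfig() {
    env("POSTGRES_PASSWORD")
    env("RABBITMQ_PASSWORD")
    env("JWT_SECRET_KEY") { it.length >= 32 }                         // at least 32 chars
    env("CRYPTO_MASTER_KEY") { it.matches(Regex("[0-9a-fA-F]{64}")) } // exactly 64 hex chars
    env("MPC_API_KEY")
    env("ALLOWED_IPS")
}
```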
### Generating Secure Keys
```bash
# PostgreSQL & RabbitMQ passwords
openssl rand -base64 32

# JWT Secret Key
openssl rand -base64 48

# Master Encryption Key (MUST be exactly 64 hex characters)
openssl rand -hex 32

# API Key
openssl rand -base64 48
```
### Configuration Checklist
Before deploying to production:
- Change all default passwords
- Generate a secure `CRYPTO_MASTER_KEY` and back it up securely
- Set `MPC_API_KEY` to match the backend mpc-service configuration
- Update `ALLOWED_IPS` to the actual backend server IP(s)
- Back up the `.env` file to a secure location (NOT in git!)
## Deployment Commands
### Basic Operations
```bash
./deploy.sh up              # Start all services
./deploy.sh down            # Stop all services
./deploy.sh restart         # Restart all services
./deploy.sh logs [svc]      # View logs (all or specific service)
./deploy.sh status          # Show service status
./deploy.sh health          # Health check all services
```
### Build Commands
```bash
./deploy.sh build           # Build Docker images
./deploy.sh build-no-cache  # Rebuild without cache
```
### Service Management
```bash
# Infrastructure only
./deploy.sh infra up        # Start postgres, redis, rabbitmq
./deploy.sh infra down      # Stop infrastructure

# MPC services only
./deploy.sh mpc up          # Start MPC services
./deploy.sh mpc down        # Stop MPC services
./deploy.sh mpc restart     # Restart MPC services
```
### Debugging
```bash
./deploy.sh logs-tail [service]  # Last 100 log lines
./deploy.sh shell [service]      # Open shell in container
./deploy.sh test-api             # Test Account Service API
```
### Cleanup
```bash
# WARNING: This removes all data!
./deploy.sh clean
```
## Services
### External Services (Exposed Ports)
| Service | Port | Protocol | Purpose |
|---|---|---|---|
| account-service | 4000 | HTTP | Main API for backend integration |
| session-coordinator | 8081 | HTTP/gRPC | Session coordination |
| message-router | 8082 | WebSocket/gRPC | Message routing |
| server-party-api | 8083 | HTTP | User share generation |
### Internal Services
| Service | Purpose |
|---|---|
| server-party-1/2/3 | TSS parties (Docker Compose mode - fixed IDs) |
| server-party-pool | TSS party pool (Kubernetes mode - dynamic scaling) |
| postgres | Database for session/account data |
| redis | Cache and temporary data |
| rabbitmq | Message broker for inter-service communication |
**Note**: In Kubernetes mode, server parties are discovered dynamically using K8s service discovery. Parties can be scaled up/down without service interruption.
### Service Dependencies
```
Infrastructure Services (postgres, redis, rabbitmq)
        ↓
Session Coordinator & Message Router
        ↓
Server Parties (1, 2, 3) & Server Party API
        ↓
Account Service (external API)
```
## Security
### Access Control
- **IP Whitelisting**: Only IPs in `ALLOWED_IPS` can access the API
- **API Key Authentication**: Requires a valid `MPC_API_KEY` header
- **Network Isolation**: Services communicate within the Docker network
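For illustration, an authenticated call from a whitelisted backend host might look like this JVM-side sketch (host, port, and key are placeholders; the `X-API-Key` header matches the curl examples in the API Reference below):
```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Sketch: the caller's IP must be in ALLOWED_IPS and the X-API-Key header
// must match MPC_API_KEY in this system's .env.
fun authenticatedHealthCheck(baseUrl: String, apiKey: String): Int {
    val request = HttpRequest.newBuilder(URI.create("$baseUrl/health"))
        .header("X-API-Key", apiKey)
        .GET()
        .build()
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    return response.statusCode() // 403 → caller IP not whitelisted, 401 → wrong or missing key
}
```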
### Data Protection
- **Encryption at Rest**: All shares encrypted with AES-256-GCM
- **Master Key**: `CRYPTO_MASTER_KEY` must be securely stored and backed up
- **Secure Transport**: Use HTTPS/TLS for external communication
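The shape of that encryption, as a Kotlin/JVM sketch for illustration only (the function name is hypothetical; this is not the services' actual Go implementation):
```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// AES-256-GCM: CRYPTO_MASTER_KEY is 64 hex chars, i.e. 32 raw key bytes.
fun encryptShare(plaintext: ByteArray, masterKeyHex: String): ByteArray {
    require(masterKeyHex.length == 64) { "master key must be exactly 64 hex chars" }
    val keyBytes = masterKeyHex.chunked(2).map { it.toInt(16).toByte() }.toByteArray()
    val nonce = ByteArray(12).also { SecureRandom().nextBytes(it) } // fresh 96-bit nonce per share
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, SecretKeySpec(keyBytes, "AES"), GCMParameterSpec(128, nonce))
    return nonce + cipher.doFinal(plaintext) // prepend nonce; output includes the GCM auth tag
}
```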
### Best Practices
- Never commit `.env` to version control
- Back up `CRYPTO_MASTER_KEY` to multiple secure locations
- Rotate API keys regularly
- Use strong passwords (min 32 chars)
- Restrict database ports (don't expose them to the internet)
- Monitor failed authentication attempts
- Enable audit logging
### Key Backup
```bash
# Backup master key (CRITICAL!)
echo "CRYPTO_MASTER_KEY=$(grep CRYPTO_MASTER_KEY .env | cut -d= -f2)" > master_key.backup

# Store securely (encrypted USB, password manager, vault)
# NEVER store in plaintext on the server
```
## Troubleshooting
### Services won't start
```bash
# Check logs
./deploy.sh logs

# Check specific service
./deploy.sh logs postgres

# Common issues:
# 1. Ports already in use
# 2. .env file missing or misconfigured
# 3. Database initialization failed
```
### Database connection errors
```bash
# Check postgres health
docker compose ps postgres

# View postgres logs
./deploy.sh logs postgres

# Restart infrastructure
./deploy.sh infra down
./deploy.sh infra up
```
### API returns 403 Forbidden
```bash
# Check ALLOWED_IPS configuration
grep ALLOWED_IPS .env

# Verify the caller's IP is in the list
# Update .env and restart:
./deploy.sh restart
```
### API returns 401 Unauthorized
```bash
# Verify MPC_API_KEY matches between:
# 1. This system's .env
# 2. Backend mpc-service configuration

# Check API key
grep MPC_API_KEY .env

# Restart after updating
./deploy.sh restart
```
### Keygen or signing fails
```bash
# Check all server parties are healthy
./deploy.sh health

# View server party logs
./deploy.sh logs server-party-1
./deploy.sh logs server-party-2
./deploy.sh logs server-party-3

# Check message router
./deploy.sh logs message-router

# Restart MPC services
./deploy.sh mpc restart
```
### Lost master encryption key
**CRITICAL**: If `CRYPTO_MASTER_KEY` is lost, encrypted shares cannot be recovered!
Prevention:
- Back up the key immediately after generation
- Store it in multiple secure locations
- Use an enterprise key management system in production
## Production Deployment
### Pre-Deployment Checklist
- Generate all secure keys and passwords
- Back up `CRYPTO_MASTER_KEY` to secure locations
- Configure `ALLOWED_IPS` for the actual backend server
- Sync `MPC_API_KEY` with the backend mpc-service
- Set up database backups
- Configure log aggregation
- Set up monitoring and alerts
- Document recovery procedures
- Test disaster recovery
### Deployment Steps
#### Step 1: Prepare Environment
```bash
# On MPC server
git clone <repo> /opt/rwadurian
cd /opt/rwadurian/backend/mpc-system

# Configure environment
cp .env.example .env
nano .env  # Set all required values

# Generate and backup keys
openssl rand -hex 32 > master_key.txt
# Copy to secure storage, then delete:
# rm master_key.txt
```
#### Step 2: Deploy Services
```bash
# Build images
./deploy.sh build

# Start services
./deploy.sh up

# Verify all healthy
./deploy.sh health
```
#### Step 3: Configure Firewall
```bash
# Allow backend server to access MPC ports
sudo ufw allow from <BACKEND_IP> to any port 4000
sudo ufw allow from <BACKEND_IP> to any port 8081
sudo ufw allow from <BACKEND_IP> to any port 8082
sudo ufw allow from <BACKEND_IP> to any port 8083

# Deny all other external access
sudo ufw default deny incoming
sudo ufw enable
```
#### Step 4: Test Integration
```bash
# From backend server, test API access
curl -H "X-API-Key: YOUR_MPC_API_KEY" \
  http://<MPC_SERVER_IP>:4000/health
```
### Monitoring
Monitor these metrics:
- Service health status
- API request rate and latency
- Failed authentication attempts
- Database connection pool usage
- RabbitMQ queue depths
- Key generation/signing success rates
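A minimal health-polling sketch in Kotlin (ports are taken from the Services table; only the account-service `/health` endpoint is shown explicitly in this guide, so the same path on the other services is an assumption):
```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val services = mapOf(
        "account-service" to 4000,
        "session-coordinator" to 8081,
        "message-router" to 8082,
        "server-party-api" to 8083,
    )
    val client = HttpClient.newHttpClient()
    for ((name, port) in services) {
        val status = runCatching {
            val req = HttpRequest.newBuilder(URI.create("http://localhost:$port/health")).GET().build()
            client.send(req, HttpResponse.BodyHandlers.discarding()).statusCode()
        }.getOrDefault(-1) // -1 = unreachable
        println("$name: ${if (status == 200) "healthy" else "unhealthy ($status)"}")
    }
}
```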
### Backup Strategy
```bash
# Database backup (daily)
docker compose exec postgres pg_dump -U mpc_user mpc_system > backup_$(date +%Y%m%d).sql

# Configuration backup
tar -czf config_backup_$(date +%Y%m%d).tar.gz .env kong.yml

# Encryption key backup (secure storage only!)
```
### Disaster Recovery
- **Service Failure**: Restart the affected service with `./deploy.sh restart`
- **Database Corruption**: Restore from the latest backup
- **Key Loss**: If `CRYPTO_MASTER_KEY` is lost, all encrypted shares are unrecoverable
- **Full System Recovery**: Redeploy from backups, restore the database
### Performance Tuning
```yaml
# docker-compose.yml - adjust resources
services:
  session-coordinator:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
```
## API Reference
### Account Service API (Port 4000)
```bash
# Health check
curl http://localhost:4000/health

# Create account (keygen)
curl -X POST http://localhost:4000/api/v1/accounts \
  -H "X-API-Key: YOUR_MPC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "user123"}'

# Sign transaction
curl -X POST http://localhost:4000/api/v1/accounts/{account_id}/sign \
  -H "X-API-Key: YOUR_MPC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "tx_hash"}'
```
### Server Party API (Port 8083)
```bash
# Generate user share
curl -X POST http://localhost:8083/api/v1/shares/generate \
  -H "X-API-Key: YOUR_MPC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "session123"}'
```
## Getting Help
- Check logs: `./deploy.sh logs`
- Health check: `./deploy.sh health`
- View commands: `./deploy.sh help`
- Review `.env.example` for configuration options
## License
Copyright © 2024 RWADurian. All rights reserved.