# Leaderboard Service Deployment Guide
## 1. Deployment Overview
This document describes the deployment architecture, configuration, and operational procedures for the Leaderboard Service.
### 1.1 Deployment Architecture
```
┌─────────────────────────────────────────────────┐
│                  Load Balancer                  │
│               (Nginx / ALB / etc.)              │
└────────────────────────┬────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
  ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
  │   Service   │ │   Service   │ │   Service   │
  │  Instance 1 │ │  Instance 2 │ │  Instance N │
  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
         │               │               │
         └───────────────┼───────────────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
  ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
  │ PostgreSQL  │ │    Redis    │ │    Kafka    │
  │   Primary   │ │   Cluster   │ │   Cluster   │
  │(Replication)│ │             │ │             │
  └─────────────┘ └─────────────┘ └─────────────┘
```
### 1.2 Deployment Environments
| Environment | Purpose | Example Domain |
|------|------|----------|
| Development | Local development | localhost:3000 |
| Staging | Pre-release testing | staging-leaderboard.example.com |
| Production | Production traffic | leaderboard.example.com |
## 2. Docker Deployment
### 2.1 Dockerfile
```dockerfile
# Multi-stage build for production
FROM node:20-alpine AS builder
WORKDIR /app
# Install OpenSSL for Prisma
RUN apk add --no-cache openssl
# Copy package files
COPY package*.json ./
COPY prisma ./prisma/
# Install dependencies
RUN npm ci
# Generate Prisma client
RUN npx prisma generate
# Copy source code
COPY . .
# Build the application
RUN npm run build
# Production stage
FROM node:20-alpine AS production
WORKDIR /app
# Install OpenSSL for Prisma
RUN apk add --no-cache openssl
# Copy package files and install production dependencies
COPY package*.json ./
RUN npm ci --only=production
# Copy Prisma files and generate client
COPY prisma ./prisma/
RUN npx prisma generate
# Copy built application
COPY --from=builder /app/dist ./dist
# Create non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nestjs -u 1001
USER nestjs
# Expose port
EXPOSE 3000
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
# Start the application
CMD ["node", "dist/main"]
```
### 2.2 Docker Compose Production Configuration
```yaml
# docker-compose.prod.yml
version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    image: leaderboard-service:${VERSION:-latest}
    container_name: leaderboard-service
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
      DATABASE_URL: ${DATABASE_URL}
      REDIS_HOST: ${REDIS_HOST}
      REDIS_PORT: ${REDIS_PORT}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      KAFKA_BROKERS: ${KAFKA_BROKERS}
      JWT_SECRET: ${JWT_SECRET}
      JWT_EXPIRES_IN: ${JWT_EXPIRES_IN}
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    networks:
      - leaderboard-network

networks:
  leaderboard-network:
    driver: bridge
```
### 2.3 Building and Pushing Images
```bash
# Build the image
docker build -t leaderboard-service:1.0.0 .

# Tag the image
docker tag leaderboard-service:1.0.0 registry.example.com/leaderboard-service:1.0.0
docker tag leaderboard-service:1.0.0 registry.example.com/leaderboard-service:latest

# Push to the image registry
docker push registry.example.com/leaderboard-service:1.0.0
docker push registry.example.com/leaderboard-service:latest
```
## 3. Kubernetes Deployment
### 3.1 Deployment
```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard-service
  labels:
    app: leaderboard-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: leaderboard-service
  template:
    metadata:
      labels:
        app: leaderboard-service
    spec:
      containers:
        - name: leaderboard-service
          image: registry.example.com/leaderboard-service:1.0.0
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: database-url
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: redis-host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: redis-port
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: leaderboard-service
                topologyKey: kubernetes.io/hostname
```
### 3.2 Service
```yaml
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: leaderboard-service
  labels:
    app: leaderboard-service
spec:
  type: ClusterIP
  selector:
    app: leaderboard-service
  ports:
    - name: http
      port: 80
      targetPort: 3000
      protocol: TCP
```
### 3.3 Ingress
```yaml
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: leaderboard-service-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
  tls:
    - hosts:
        - leaderboard.example.com
      secretName: leaderboard-tls
  rules:
    - host: leaderboard.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: leaderboard-service
                port:
                  number: 80
```
### 3.4 ConfigMap
```yaml
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: leaderboard-config
data:
  redis-host: "redis-master.redis.svc.cluster.local"
  redis-port: "6379"
  kafka-brokers: "kafka-0.kafka.svc.cluster.local:9092,kafka-1.kafka.svc.cluster.local:9092"
  log-level: "info"
```
### 3.5 Secrets
```yaml
# k8s/secrets.yaml (example only — do not commit real secrets in plain text)
apiVersion: v1
kind: Secret
metadata:
  name: leaderboard-secrets
type: Opaque
stringData:
  database-url: "postgresql://user:password@host:5432/leaderboard_db"
  jwt-secret: "your-production-jwt-secret"
  redis-password: "your-redis-password"
```
### 3.6 HPA (Horizontal Pod Autoscaler)
```yaml
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: leaderboard-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: leaderboard-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
## 4. Environment Configuration
### 4.1 Production Environment Variables
```env
# Application
NODE_ENV=production
PORT=3000

# Database
DATABASE_URL=postgresql://user:password@db-host:5432/leaderboard_db?connection_limit=20

# Redis
REDIS_HOST=redis-host
REDIS_PORT=6379
REDIS_PASSWORD=your-redis-password

# Kafka
KAFKA_BROKERS=kafka-1:9092,kafka-2:9092,kafka-3:9092
KAFKA_GROUP_ID=leaderboard-service-group
KAFKA_CLIENT_ID=leaderboard-service-prod

# JWT
JWT_SECRET=your-production-jwt-secret-at-least-32-chars
JWT_EXPIRES_IN=7d

# External services
REFERRAL_SERVICE_URL=http://referral-service:3000
IDENTITY_SERVICE_URL=http://identity-service:3000

# Logging
LOG_LEVEL=info
LOG_FORMAT=json

# Performance
DISPLAY_LIMIT_DEFAULT=30
REFRESH_INTERVAL_MINUTES=5
CACHE_TTL_SECONDS=300
```
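It can help to fail fast when a required variable is missing rather than letting the service crash mid-startup. A minimal pre-flight sketch, assuming the variable names above and a bash entrypoint (the script name is illustrative, not part of this repo):

```bash
#!/usr/bin/env bash
# preflight-env.sh -- abort startup if a required variable is unset or empty.
set -euo pipefail

required_vars=(DATABASE_URL REDIS_HOST REDIS_PORT KAFKA_BROKERS JWT_SECRET)

for var in "${required_vars[@]}"; do
  # ${!var} is bash indirect expansion: the value of the variable named by $var
  if [ -z "${!var:-}" ]; then
    echo "Missing required environment variable: $var" >&2
    exit 1
  fi
done

echo "Environment looks complete; starting service."
exec node dist/main
```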
### 4.2 Database Migrations
```bash
# Apply migrations in production
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy

# Check migration status
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate status
```
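In a typical rollout, migrations are applied before the new pods start so the schema is in place when the code needs it. A hedged deploy-step sketch under that assumption (the image tag and timeout are placeholders, not values from this repo):

```bash
#!/usr/bin/env bash
# Apply migrations first; abort the rollout if they fail.
set -euo pipefail

DATABASE_URL="$PROD_DATABASE_URL" npx prisma migrate deploy

# Only roll out the new image once the schema is in place.
kubectl set image deployment/leaderboard-service \
  leaderboard-service=registry.example.com/leaderboard-service:1.0.1
kubectl rollout status deployment/leaderboard-service --timeout=180s
```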
## 5. Monitoring and Alerting
### 5.1 Health Check Endpoints
| Endpoint | Purpose | Response |
|------|------|------|
| `/health` | Liveness check | `{"status": "ok"}` |
| `/health/ready` | Readiness check | `{"status": "ok", "details": {...}}` |
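For deploy scripts and smoke tests, a small polling loop against the readiness endpoint avoids sending traffic to an instance that is not ready yet. A sketch assuming the endpoint paths in the table above (the default URL and timeout are examples):

```bash
#!/usr/bin/env bash
# wait-ready.sh -- poll /health/ready until it reports ok, or give up.
URL="${1:-http://localhost:3000/health/ready}"

for _ in $(seq 1 30); do
  # -f makes curl fail on HTTP errors; grep checks the status field in the body
  if curl -fsS "$URL" | grep -Eq '"status" *: *"ok"'; then
    echo "service is ready"
    exit 0
  fi
  sleep 2
done

echo "service not ready after 60s" >&2
exit 1
```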
### 5.2 Prometheus Metrics
```yaml
# prometheus-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: leaderboard-service
spec:
  selector:
    matchLabels:
      app: leaderboard-service
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
### 5.3 Alerting Rules
```yaml
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: leaderboard-service-alerts
spec:
  groups:
    - name: leaderboard-service
      rules:
        - alert: LeaderboardServiceDown
          expr: up{job="leaderboard-service"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Leaderboard Service is down"
            description: "Leaderboard Service has been down for more than 1 minute."
        - alert: LeaderboardServiceHighLatency
          expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="leaderboard-service"}[5m]))) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on Leaderboard Service"
            description: "95th percentile latency is above 2 seconds."
        - alert: LeaderboardServiceHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="leaderboard-service",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="leaderboard-service"}[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on Leaderboard Service"
            description: "Error rate is above 10%."
```
### 5.4 Log Collection
```yaml
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/leaderboard-service*.log
        Parser            docker
        Tag               leaderboard.*
        Refresh_Interval  5

    [OUTPUT]
        Name   es
        Match  leaderboard.*
        Host   elasticsearch
        Port   9200
        Index  leaderboard-logs
        Type   _doc
```
## 6. Operations
### 6.1 Common Commands
```bash
# Check pod status
kubectl get pods -l app=leaderboard-service

# Tail logs
kubectl logs -f deployment/leaderboard-service

# Scale up or down
kubectl scale deployment leaderboard-service --replicas=5

# Restart the service
kubectl rollout restart deployment/leaderboard-service

# Roll back
kubectl rollout undo deployment/leaderboard-service

# Check resource usage
kubectl top pods -l app=leaderboard-service
```
### 6.2 Database Maintenance
```bash
# Back up the database
pg_dump -h $DB_HOST -U $DB_USER -d leaderboard_db > backup_$(date +%Y%m%d).sql

# Restore the database
psql -h $DB_HOST -U $DB_USER -d leaderboard_db < backup_20240115.sql

# Purge expired data
psql -h $DB_HOST -U $DB_USER -d leaderboard_db -c "
DELETE FROM leaderboard_rankings
WHERE period_end_at < NOW() - INTERVAL '90 days';
"
```
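The daily full backup can be wrapped in a small rotation script so old dumps are pruned automatically, in line with the 30-day retention in section 9.1. A sketch with an assumed backup directory (the path and script name are illustrative):

```bash
#!/usr/bin/env bash
# pg-backup-rotate.sh -- daily compressed dump plus 30-day retention.
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/var/backups/leaderboard}"
mkdir -p "$BACKUP_DIR"

# Compressed daily dump
pg_dump -h "$DB_HOST" -U "$DB_USER" -d leaderboard_db \
  | gzip > "$BACKUP_DIR/backup_$(date +%Y%m%d).sql.gz"

# Remove dumps older than 30 days (see the retention policy in section 9.1)
find "$BACKUP_DIR" -name 'backup_*.sql.gz' -mtime +30 -delete
```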
### 6.3 Cache Maintenance
```bash
# Connect to Redis
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD

# Inside redis-cli: list leaderboard cache keys
# (KEYS blocks the server; prefer SCAN on large production datasets)
KEYS leaderboard:*

# Inside redis-cli: delete a specific cache entry
DEL leaderboard:DAILY:2024-01-15:rankings

# From the shell: delete all leaderboard cache keys
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD --scan --pattern 'leaderboard:*' \
  | xargs -r redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD DEL
```
## 7. Troubleshooting
### 7.1 Common Issues
| Issue | Likely Cause | Resolution |
|------|----------|----------|
| Service fails to start | Database connection failure | Check the DATABASE_URL configuration |
| Rankings not updating | Scheduled job not running | Check the scheduler logs |
| Request timeouts | Slow database queries | Check indexes and query plans |
| Cache failures | Redis connection issues | Check the Redis service status |
| Lost messages | Kafka misconfiguration | Check Kafka connectivity and topics |
### 7.2 Diagnostic Commands
```bash
# Check service reachability
curl -v http://localhost:3000/health

# Check database connectivity
kubectl exec -it deployment/leaderboard-service -- \
  npx prisma db execute --stdin <<< "SELECT 1"

# Check Redis connectivity (requires redis-cli in the container image)
kubectl exec -it deployment/leaderboard-service -- \
  redis-cli -h $REDIS_HOST ping

# Inspect recent error logs
kubectl logs deployment/leaderboard-service --since=1h | grep ERROR
```
### 7.3 Performance Diagnostics
```bash
# CPU profile
kubectl exec -it deployment/leaderboard-service -- \
  node --prof dist/main.js

# Memory analysis
kubectl exec -it deployment/leaderboard-service -- \
  node --expose-gc --inspect dist/main.js
```
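When the process is started with `--inspect`, the Node.js inspector listens on port 9229 inside the pod by default; forwarding that port locally lets Chrome DevTools or another debugger attach. A hedged one-liner (9229 is the Node default, not something configured in this repo):

```bash
# Forward the Node.js inspector port from the pod to localhost,
# then attach via chrome://inspect or a compatible debugger.
kubectl port-forward deployment/leaderboard-service 9229:9229
```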
## 8. Security Hardening
### 8.1 Network Policies
```yaml
# k8s/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: leaderboard-service-network-policy
spec:
  podSelector:
    matchLabels:
      app: leaderboard-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: redis
      ports:
        - protocol: TCP
          port: 6379
```
### 8.2 Security Checklist
- [ ] All sensitive values are stored as Secrets
- [ ] The database uses strong passwords and SSL connections
- [ ] Redis has password authentication enabled
- [ ] The JWT secret is sufficiently long and random
- [ ] Containers run as a non-root user
- [ ] Network policies restrict traffic
- [ ] Dependencies and base images are updated regularly
- [ ] Audit logging is enabled
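A few items on the checklist above can be spot-checked directly against the cluster. A sketch using standard kubectl commands (resource names follow the manifests in section 3):

```bash
# Container should not run as root (expect a non-zero UID, e.g. 1001 from the Dockerfile)
kubectl exec deployment/leaderboard-service -- id -u

# Sensitive values should come from Secrets, not inline env values
kubectl get deployment leaderboard-service -o yaml | grep -B2 -A3 secretKeyRef

# A NetworkPolicy should exist for the service
kubectl get networkpolicy leaderboard-service-network-policy
```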
## 9. Backup and Recovery
### 9.1 Backup Strategy
| Data Type | Backup Frequency | Retention |
|----------|----------|----------|
| Database | Daily full + hourly incremental | 30 days |
| Configuration | On every change | Permanent (in Git) |
| Logs | Streamed in real time | 90 days |
### 9.2 Disaster Recovery
```bash
# 1. Restore the database
pg_restore -h $DB_HOST -U $DB_USER -d leaderboard_db latest_backup.dump

# 2. Redeploy the service
kubectl apply -f k8s/

# 3. Verify the service
curl http://leaderboard.example.com/health

# 4. Flush and rebuild the cache
redis-cli FLUSHDB
curl -X POST http://leaderboard.example.com/leaderboard/config/refresh
```
## 10. Releases
### 10.1 Release Process
```
1. Development complete
   └── Code review
       └── Merge to develop
           └── CI tests pass
               └── Merge to main
                   └── Tag the release
                       └── Build the image
                           └── Deploy to Staging
                               └── Acceptance testing
                                   └── Deploy to Production
```
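The tagging step in the flow above is what typically triggers the image build. A hedged example of that step (the version number is illustrative; see section 2.3 for the build and push commands):

```bash
# Tag the release on main and push the tag so CI can build and push the image.
git checkout main && git pull
git tag -a v1.0.0 -m "Release v1.0.0"
git push origin v1.0.0
```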
### 10.2 Blue-Green Deployment
```bash
# Deploy the new version (green)
kubectl apply -f k8s/deployment-green.yaml

# Verify the new version
curl http://leaderboard-green.internal/health

# Switch traffic
kubectl patch service leaderboard-service \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Verify
curl http://leaderboard.example.com/health

# Remove the old version (blue)
kubectl delete -f k8s/deployment-blue.yaml
```
### 10.3 Canary Releases
```yaml
# k8s/canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard-service-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: leaderboard-service
      version: canary
  template:
    metadata:
      labels:
        app: leaderboard-service
        version: canary
    spec:
      containers:
        - name: leaderboard-service
          image: registry.example.com/leaderboard-service:1.1.0-canary
```
```bash
# Gradually shift traffic to the canary
kubectl scale deployment leaderboard-service-canary --replicas=2
kubectl scale deployment leaderboard-service --replicas=8

# Watch the metrics; if healthy, continue
kubectl scale deployment leaderboard-service-canary --replicas=5
kubectl scale deployment leaderboard-service --replicas=5

# Complete the switch-over
kubectl scale deployment leaderboard-service-canary --replicas=10
kubectl scale deployment leaderboard-service --replicas=0
```