Leaderboard Service Deployment Guide
1. Deployment Overview
This document describes the deployment architecture, configuration, and operational procedures for the Leaderboard Service.
1.1 Deployment Architecture
                     ┌──────────────────────────────┐
                     │        Load Balancer         │
                     │     (Nginx / ALB / etc.)     │
                     └───────────────┬──────────────┘
                                     │
           ┌─────────────────────────┼─────────────────────────┐
           │                         │                         │
    ┌──────▼──────┐           ┌──────▼──────┐           ┌──────▼──────┐
    │   Service   │           │   Service   │           │   Service   │
    │ Instance 1  │           │ Instance 2  │           │ Instance N  │
    └──────┬──────┘           └──────┬──────┘           └──────┬──────┘
           │                         │                         │
           └─────────────────────────┼─────────────────────────┘
                                     │
           ┌─────────────────────────┼─────────────────────────┐
           │                         │                         │
    ┌──────▼──────┐           ┌──────▼──────┐           ┌──────▼──────┐
    │ PostgreSQL  │           │    Redis    │           │    Kafka    │
    │  Primary +  │           │   Cluster   │           │   Cluster   │
    │  Replica(s) │           │             │           │             │
    └─────────────┘           └─────────────┘           └─────────────┘
1.2 Deployment Environments
| Environment | Purpose | Example Domain |
|---|---|---|
| Development | Local development | localhost:3000 |
| Staging | Pre-release testing | staging-leaderboard.example.com |
| Production | Production environment | leaderboard.example.com |
2. Docker Deployment
2.1 Dockerfile
# Multi-stage build for production
FROM node:20-alpine AS builder
WORKDIR /app
# Install OpenSSL for Prisma
RUN apk add --no-cache openssl
# Copy package files
COPY package*.json ./
COPY prisma ./prisma/
# Install dependencies
RUN npm ci
# Generate Prisma client
RUN npx prisma generate
# Copy source code
COPY . .
# Build the application
RUN npm run build
# Production stage
FROM node:20-alpine AS production
WORKDIR /app
# Install OpenSSL for Prisma
RUN apk add --no-cache openssl
# Copy package files and install production dependencies
COPY package*.json ./
RUN npm ci --omit=dev
# Copy Prisma files and generate client
COPY prisma ./prisma/
RUN npx prisma generate
# Copy built application
COPY --from=builder /app/dist ./dist
# Create non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nestjs -u 1001
USER nestjs
# Expose port
EXPOSE 3000
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
# Start the application
CMD ["node", "dist/main"]
2.2 Docker Compose Production Configuration
# docker-compose.prod.yml
version: '3.8'
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    image: leaderboard-service:${VERSION:-latest}
    container_name: leaderboard-service
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
      DATABASE_URL: ${DATABASE_URL}
      REDIS_HOST: ${REDIS_HOST}
      REDIS_PORT: ${REDIS_PORT}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      KAFKA_BROKERS: ${KAFKA_BROKERS}
      JWT_SECRET: ${JWT_SECRET}
      JWT_EXPIRES_IN: ${JWT_EXPIRES_IN}
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    networks:
      - leaderboard-network
networks:
  leaderboard-network:
    driver: bridge
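One way to bring the stack up with this file, assuming the variables referenced above live in an env file such as .env.production (the file name is an assumption) and Compose v2 syntax:
# Start (or update) the service in the background
docker compose -f docker-compose.prod.yml --env-file .env.production up -d
# Follow application logs
docker compose -f docker-compose.prod.yml logs -f app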
2.3 Building and Pushing the Image
# Build the image
docker build -t leaderboard-service:1.0.0 .
# Tag the image
docker tag leaderboard-service:1.0.0 registry.example.com/leaderboard-service:1.0.0
docker tag leaderboard-service:1.0.0 registry.example.com/leaderboard-service:latest
# Push to the image registry
docker push registry.example.com/leaderboard-service:1.0.0
docker push registry.example.com/leaderboard-service:latest
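Pushing assumes the Docker client is already authenticated against the registry; if not, log in first:
# Authenticate against the private registry (prompts for credentials)
docker login registry.example.com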
3. Kubernetes Deployment
3.1 Deployment
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard-service
  labels:
    app: leaderboard-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: leaderboard-service
  template:
    metadata:
      labels:
        app: leaderboard-service
    spec:
      containers:
        - name: leaderboard-service
          image: registry.example.com/leaderboard-service:1.0.0
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: database-url
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: redis-host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: redis-port
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: redis-password
            - name: KAFKA_BROKERS
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: kafka-brokers
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: leaderboard-service
                topologyKey: kubernetes.io/hostname
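A typical apply-and-verify sequence for this manifest, assuming the manifests live under k8s/ as the file paths above suggest and the ConfigMap and Secret below have been created first:
# Apply the Deployment and wait until all replicas are updated and ready
kubectl apply -f k8s/deployment.yaml
kubectl rollout status deployment/leaderboard-service --timeout=120s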
3.2 Service
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: leaderboard-service
  labels:
    app: leaderboard-service
spec:
  type: ClusterIP
  selector:
    app: leaderboard-service
  ports:
    - name: http
      port: 80
      targetPort: 3000
      protocol: TCP
3.3 Ingress
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: leaderboard-service-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx rate limiting: 100 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
  tls:
    - hosts:
        - leaderboard.example.com
      secretName: leaderboard-tls
  rules:
    - host: leaderboard.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: leaderboard-service
                port:
                  number: 80
3.4 ConfigMap
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: leaderboard-config
data:
  redis-host: "redis-master.redis.svc.cluster.local"
  redis-port: "6379"
  kafka-brokers: "kafka-0.kafka.svc.cluster.local:9092,kafka-1.kafka.svc.cluster.local:9092"
  log-level: "info"
3.5 Secrets
# k8s/secrets.yaml (example values; encrypt or seal before using for real)
apiVersion: v1
kind: Secret
metadata:
  name: leaderboard-secrets
type: Opaque
stringData:
  database-url: "postgresql://user:password@host:5432/leaderboard_db"
  jwt-secret: "your-production-jwt-secret"
  redis-password: "your-redis-password"
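Rather than committing a populated secrets.yaml, the same Secret can be created straight from the command line (or managed with a tool such as Sealed Secrets or External Secrets); a sketch with placeholder values:
# Create or update the secret without writing values to a file in Git
kubectl create secret generic leaderboard-secrets \
  --from-literal=database-url='postgresql://user:password@host:5432/leaderboard_db' \
  --from-literal=jwt-secret="$(openssl rand -base64 48)" \
  --from-literal=redis-password='your-redis-password' \
  --dry-run=client -o yaml | kubectl apply -f -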
3.6 HPA (Horizontal Pod Autoscaler)
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: leaderboard-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: leaderboard-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
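To confirm the autoscaler is receiving metrics (it relies on metrics-server or an equivalent metrics API being installed in the cluster):
# Watch current utilization and replica counts
kubectl get hpa leaderboard-service-hpa --watch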
4. Environment Configuration
4.1 Production Environment Variables
# Application
NODE_ENV=production
PORT=3000
# Database
DATABASE_URL=postgresql://user:password@db-host:5432/leaderboard_db?connection_limit=20
# Redis
REDIS_HOST=redis-host
REDIS_PORT=6379
REDIS_PASSWORD=your-redis-password
# Kafka
KAFKA_BROKERS=kafka-1:9092,kafka-2:9092,kafka-3:9092
KAFKA_GROUP_ID=leaderboard-service-group
KAFKA_CLIENT_ID=leaderboard-service-prod
# JWT
JWT_SECRET=your-production-jwt-secret-at-least-32-chars
JWT_EXPIRES_IN=7d
# External services
REFERRAL_SERVICE_URL=http://referral-service:3000
IDENTITY_SERVICE_URL=http://identity-service:3000
# Logging
LOG_LEVEL=info
LOG_FORMAT=json
# Performance
DISPLAY_LIMIT_DEFAULT=30
REFRESH_INTERVAL_MINUTES=5
CACHE_TTL_SECONDS=300
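JWT_SECRET must be at least 32 characters; one way to generate a sufficiently random value:
# Generate a 48-byte random secret, base64-encoded
openssl rand -base64 48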
4.2 Database Migrations
# Apply pending migrations in production
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
# Check migration status
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate status
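Migrations can also be run in-cluster as a one-off Job using the service image, so application pods never run migrations themselves. A minimal sketch reusing the image and secret names from section 3; note this assumes the prisma CLI is available in the production image (or that npx can fetch it at runtime):
# k8s/migrate-job.yaml (sketch)
apiVersion: batch/v1
kind: Job
metadata:
  name: leaderboard-migrate
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/leaderboard-service:1.0.0
          command: ["npx", "prisma", "migrate", "deploy"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: database-url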
5. Monitoring and Alerting
5.1 Health Check Endpoints
| Endpoint | Purpose | Response |
|---|---|---|
| /health | Liveness check | {"status": "ok"} |
| /health/ready | Readiness check | {"status": "ok", "details": {...}} |
5.2 Prometheus Metrics
# prometheus-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: leaderboard-service
spec:
  selector:
    matchLabels:
      app: leaderboard-service
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
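The ServiceMonitor assumes the application actually exposes Prometheus metrics at /metrics (for example via a NestJS Prometheus module). To spot-check the endpoint through the Service:
# In one shell: forward the Service port locally
kubectl port-forward svc/leaderboard-service 8080:80
# In another shell: fetch the metrics endpoint
curl -s http://localhost:8080/metrics | head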
5.3 Alerting Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: leaderboard-service-alerts
spec:
  groups:
    - name: leaderboard-service
      rules:
        - alert: LeaderboardServiceDown
          expr: up{job="leaderboard-service"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Leaderboard Service is down"
            description: "Leaderboard Service has been down for more than 1 minute."
        - alert: LeaderboardServiceHighLatency
          expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="leaderboard-service"}[5m])) by (le)) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on Leaderboard Service"
            description: "95th percentile latency is above 2 seconds."
        - alert: LeaderboardServiceHighErrorRate
          expr: sum(rate(http_requests_total{job="leaderboard-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="leaderboard-service"}[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on Leaderboard Service"
            description: "More than 10% of requests are returning 5xx responses."
5.4 Log Collection
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/leaderboard-service*.log
        Parser            docker
        Tag               leaderboard.*
        Refresh_Interval  5
    [OUTPUT]
        Name   es
        Match  leaderboard.*
        Host   elasticsearch
        Port   9200
        Index  leaderboard-logs
        Type   _doc
6. Operations
6.1 Common Commands
# Check pod status
kubectl get pods -l app=leaderboard-service
# Tail logs
kubectl logs -f deployment/leaderboard-service
# Scale out or in
kubectl scale deployment leaderboard-service --replicas=5
# Restart the service
kubectl rollout restart deployment/leaderboard-service
# Roll back to the previous revision
kubectl rollout undo deployment/leaderboard-service
# Check resource usage
kubectl top pods -l app=leaderboard-service
6.2 Database Maintenance
# Back up the database
pg_dump -h $DB_HOST -U $DB_USER -d leaderboard_db > backup_$(date +%Y%m%d).sql
# Restore the database
psql -h $DB_HOST -U $DB_USER -d leaderboard_db < backup_20240115.sql
# Purge expired data
psql -h $DB_HOST -U $DB_USER -d leaderboard_db -c "
DELETE FROM leaderboard_rankings
WHERE period_end_at < NOW() - INTERVAL '90 days';
"
6.3 Cache Maintenance
# Connect to Redis
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD
# List leaderboard cache keys (KEYS blocks Redis; prefer SCAN on large production datasets)
KEYS leaderboard:*
# Delete a specific cache entry
DEL leaderboard:DAILY:2024-01-15:rankings
# Delete all leaderboard cache keys (run from the shell, not inside redis-cli)
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD --scan --pattern 'leaderboard:*' | \
  xargs -r redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD DEL
7. Troubleshooting
7.1 Common Issues
| Issue | Likely Cause | Resolution |
|---|---|---|
| Service fails to start | Database connection failure | Check the DATABASE_URL configuration |
| Rankings not updating | Scheduled job not running | Check the Scheduler logs |
| Request timeouts | Slow database queries | Check indexes and query plans |
| Cache not taking effect | Redis connection problems | Check the Redis service status |
| Lost messages | Kafka misconfiguration | Check Kafka connectivity and topics |
7.2 Diagnostic Commands
# Check service reachability
curl -v http://localhost:3000/health
# Check database connectivity
kubectl exec -it deployment/leaderboard-service -- \
  npx prisma db execute --stdin <<< "SELECT 1"
# Check Redis connectivity (the app image has no redis-cli, so use a throwaway pod)
kubectl run redis-check --rm -it --image=redis:alpine --restart=Never -- \
  redis-cli -h $REDIS_HOST -a $REDIS_PASSWORD ping
# Search recent logs for errors
kubectl logs deployment/leaderboard-service --since=1h | grep ERROR
7.3 Performance Diagnostics
# CPU / memory profiling: enable the V8 inspector on the running process (PID 1)
# instead of starting a second node process, which would conflict on port 3000
kubectl exec -it deployment/leaderboard-service -- kill -USR1 1
# Forward the inspector port and attach Chrome DevTools (chrome://inspect) or `node inspect`
kubectl port-forward deployment/leaderboard-service 9229:9229
# CPU profiles and heap snapshots can then be captured from the DevTools Profiler/Memory tabs
8. Security Hardening
8.1 Network Policy
# k8s/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: leaderboard-service-network-policy
spec:
  podSelector:
    matchLabels:
      app: leaderboard-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # DNS resolution (required once egress is restricted)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: redis
      ports:
        - protocol: TCP
          port: 6379
    # Kafka brokers
    - to:
        - namespaceSelector:
            matchLabels:
              name: kafka
      ports:
        - protocol: TCP
          port: 9092
8.2 Security Checklist
At a minimum, verify the measures configured elsewhere in this document:
- Containers run as a non-root user (nestjs, UID 1001)
- Secrets are delivered via Kubernetes Secrets and never committed to Git in plaintext
- TLS is terminated at the Ingress with certificates issued by cert-manager
- Rate limiting is enabled at the Ingress
- NetworkPolicy restricts ingress and egress to the namespaces the service actually needs
- JWT_SECRET is at least 32 characters
9. Backup and Recovery
9.1 Backup Strategy
| Data | Backup Frequency | Retention |
|---|---|---|
| Database | Daily full + hourly incremental | 30 days |
| Configuration | On every change | Permanent (Git) |
| Logs | Continuous shipping | 90 days |
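One way to automate the daily full backup from inside the cluster is a CronJob that runs pg_dump with the same database credentials; a minimal sketch, where the schedule, the postgres image tag, and the backup PVC name are assumptions to adapt:
# k8s/db-backup-cronjob.yaml (sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: leaderboard-db-backup
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16-alpine
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" > /backup/backup_$(date +%Y%m%d).sql
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: leaderboard-secrets
                      key: database-url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: leaderboard-db-backup-pvc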
9.2 Disaster Recovery
# 1. Restore the database (use psql for plain-SQL backups, pg_restore for custom-format dumps)
pg_restore -h $DB_HOST -U $DB_USER -d leaderboard_db latest_backup.dump
# 2. Redeploy the service
kubectl apply -f k8s/
# 3. Verify the service
curl http://leaderboard.example.com/health
# 4. Flush and rebuild the cache
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD FLUSHDB
curl -X POST http://leaderboard.example.com/leaderboard/config/refresh
10. Releases
10.1 Release Flow
1. Development complete
└── Code review
└── Merge to develop
└── CI tests pass
└── Merge to main
└── Tag the release
└── Build the image
└── Deploy to Staging
└── Acceptance testing
└── Deploy to Production
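The tag step is usually what triggers the CI image build; a sketch of that step (the v-prefixed tag convention is an assumption):
# Tag the release on main and push the tag to kick off the image build
git tag -a v1.0.0 -m "Release 1.0.0"
git push origin v1.0.0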
10.2 Blue-Green Deployment
# Deploy the new version (green)
kubectl apply -f k8s/deployment-green.yaml
# Verify the new version
curl http://leaderboard-green.internal/health
# Switch traffic over
kubectl patch service leaderboard-service \
  -p '{"spec":{"selector":{"version":"green"}}}'
# Verify
curl http://leaderboard.example.com/health
# Remove the old version (blue)
kubectl delete -f k8s/deployment-blue.yaml
10.3 Canary Releases
# k8s/canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard-service-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: leaderboard-service
      version: canary
  template:
    metadata:
      labels:
        app: leaderboard-service
        version: canary
    spec:
      containers:
        - name: leaderboard-service
          image: registry.example.com/leaderboard-service:1.1.0-canary
# Gradually increase canary traffic
kubectl scale deployment leaderboard-service-canary --replicas=2
kubectl scale deployment leaderboard-service --replicas=8
# Watch the metrics; continue if nothing looks wrong
kubectl scale deployment leaderboard-service-canary --replicas=5
kubectl scale deployment leaderboard-service --replicas=5
# Complete the switchover
kubectl scale deployment leaderboard-service-canary --replicas=10
kubectl scale deployment leaderboard-service --replicas=0
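Replica-count canaries only give coarse traffic splits and assume both Deployments sit behind the same Service. If finer control is needed, the nginx ingress controller also supports weighted canaries via a second Ingress carrying canary annotations; a sketch, where the 10% weight is arbitrary and leaderboard-service-canary assumes a separate Service selecting version: canary:
# k8s/canary-ingress.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: leaderboard-service-canary-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # send ~10% of traffic to the canary
spec:
  rules:
    - host: leaderboard.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: leaderboard-service-canary
                port:
                  number: 80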