Leaderboard Service Deployment Guide

1. Deployment Overview

This document describes the deployment architecture, configuration, and operational procedures for the Leaderboard Service.

1.1 Deployment Architecture

                       ┌───────────────────────────┐
                       │       Load Balancer       │
                       │   (Nginx / ALB / etc.)    │
                       └─────────────┬─────────────┘
                                     │
         ┌───────────────────────────┼───────────────────────────┐
         │                           │                           │
  ┌──────▼──────┐             ┌──────▼──────┐             ┌──────▼──────┐
  │   Service   │             │   Service   │             │   Service   │
  │  Instance 1 │             │  Instance 2 │             │  Instance N │
  └──────┬──────┘             └──────┬──────┘             └──────┬──────┘
         │                           │                           │
         └───────────────────────────┼───────────────────────────┘
                                     │
         ┌───────────────────────────┼───────────────────────────┐
         │                           │                           │
  ┌──────▼──────┐             ┌──────▼──────┐             ┌──────▼──────┐
  │ PostgreSQL  │             │    Redis    │             │    Kafka    │
  │  (Primary + │             │   Cluster   │             │   Cluster   │
  │   Replica)  │             │             │             │             │
  └─────────────┘             └─────────────┘             └─────────────┘

1.2 Deployment Environments

| Environment | Purpose             | Example Domain                  |
|-------------|---------------------|---------------------------------|
| Development | Local development   | localhost:3000                  |
| Staging     | Pre-release testing | staging-leaderboard.example.com |
| Production  | Live production     | leaderboard.example.com         |

2. Docker Deployment

2.1 Dockerfile

# Multi-stage build for production
FROM node:20-alpine AS builder

WORKDIR /app

# Install OpenSSL for Prisma
RUN apk add --no-cache openssl

# Copy package files
COPY package*.json ./
COPY prisma ./prisma/

# Install dependencies
RUN npm ci

# Generate Prisma client
RUN npx prisma generate

# Copy source code
COPY . .

# Build the application
RUN npm run build

# Production stage
FROM node:20-alpine AS production

WORKDIR /app

# Install OpenSSL for Prisma
RUN apk add --no-cache openssl

# Copy package files and install production dependencies
COPY package*.json ./
RUN npm ci --omit=dev

# Copy Prisma files and generate client
COPY prisma ./prisma/
RUN npx prisma generate

# Copy built application
COPY --from=builder /app/dist ./dist

# Create non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nestjs -u 1001
USER nestjs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

# Start the application
CMD ["node", "dist/main"]

2.2 Production Docker Compose Configuration

# docker-compose.prod.yml
version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    image: leaderboard-service:${VERSION:-latest}
    container_name: leaderboard-service
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
      DATABASE_URL: ${DATABASE_URL}
      REDIS_HOST: ${REDIS_HOST}
      REDIS_PORT: ${REDIS_PORT}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      KAFKA_BROKERS: ${KAFKA_BROKERS}
      JWT_SECRET: ${JWT_SECRET}
      JWT_EXPIRES_IN: ${JWT_EXPIRES_IN}
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    networks:
      - leaderboard-network

networks:
  leaderboard-network:
    driver: bridge
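
The compose file resolves its variables from the environment, typically an .env file next to docker-compose.prod.yml. A sketch with placeholder values only (the names mirror the environment block above; do not commit real secrets):

# .env (placeholder values)
VERSION=1.0.0
DATABASE_URL=postgresql://user:password@db-host:5432/leaderboard_db
REDIS_HOST=redis-host
REDIS_PORT=6379
REDIS_PASSWORD=change-me
KAFKA_BROKERS=kafka-1:9092,kafka-2:9092
JWT_SECRET=change-me-to-a-long-random-string
JWT_EXPIRES_IN=7d

# Start the stack with the file above
docker compose -f docker-compose.prod.yml --env-file .env up -d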

2.3 Building and Pushing the Image

# Build the image
docker build -t leaderboard-service:1.0.0 .

# Tag the image for the registry
docker tag leaderboard-service:1.0.0 registry.example.com/leaderboard-service:1.0.0
docker tag leaderboard-service:1.0.0 registry.example.com/leaderboard-service:latest

# Push to the image registry
docker push registry.example.com/leaderboard-service:1.0.0
docker push registry.example.com/leaderboard-service:latest

3. Kubernetes Deployment

3.1 Deployment

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard-service
  labels:
    app: leaderboard-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: leaderboard-service
  template:
    metadata:
      labels:
        app: leaderboard-service
    spec:
      containers:
        - name: leaderboard-service
          image: registry.example.com/leaderboard-service:1.0.0
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: database-url
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: redis-host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: leaderboard-config
                  key: redis-port
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: leaderboard-service
                topologyKey: kubernetes.io/hostname

3.2 Service

# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: leaderboard-service
  labels:
    app: leaderboard-service       # matched by the ServiceMonitor in section 5.2
spec:
  type: ClusterIP
  selector:
    app: leaderboard-service
  ports:
    - name: http                   # named port referenced by the ServiceMonitor in section 5.2
      port: 80
      targetPort: 3000
      protocol: TCP

3.3 Ingress

# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: leaderboard-service-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx rate limiting: allow at most 100 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
  tls:
    - hosts:
        - leaderboard.example.com
      secretName: leaderboard-tls
  rules:
    - host: leaderboard.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: leaderboard-service
                port:
                  number: 80

3.4 ConfigMap

# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: leaderboard-config
data:
  redis-host: "redis-master.redis.svc.cluster.local"
  redis-port: "6379"
  kafka-brokers: "kafka-0.kafka.svc.cluster.local:9092,kafka-1.kafka.svc.cluster.local:9092"
  log-level: "info"

3.5 Secrets

# k8s/secrets.yaml (example only; encrypt or manage through a secret store in real deployments)
apiVersion: v1
kind: Secret
metadata:
  name: leaderboard-secrets
type: Opaque
stringData:
  database-url: "postgresql://user:password@host:5432/leaderboard_db"
  jwt-secret: "your-production-jwt-secret"
  redis-password: "your-redis-password"
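
Rather than committing stringData to Git, the same Secret can be created from literals (or managed via a tool such as sealed-secrets or external-secrets). A minimal sketch:

kubectl create secret generic leaderboard-secrets \
  --from-literal=database-url='postgresql://user:password@host:5432/leaderboard_db' \
  --from-literal=jwt-secret='your-production-jwt-secret' \
  --from-literal=redis-password='your-redis-password'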

3.6 HPA (Horizontal Pod Autoscaler)

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: leaderboard-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: leaderboard-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
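
To confirm the autoscaler is bound to the deployment and watch current versus target utilization:

kubectl get hpa leaderboard-service-hpa --watch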

4. Environment Configuration

4.1 Production Environment Variables

# Application
NODE_ENV=production
PORT=3000

# Database
DATABASE_URL=postgresql://user:password@db-host:5432/leaderboard_db?connection_limit=20

# Redis
REDIS_HOST=redis-host
REDIS_PORT=6379
REDIS_PASSWORD=your-redis-password

# Kafka
KAFKA_BROKERS=kafka-1:9092,kafka-2:9092,kafka-3:9092
KAFKA_GROUP_ID=leaderboard-service-group
KAFKA_CLIENT_ID=leaderboard-service-prod

# JWT
JWT_SECRET=your-production-jwt-secret-at-least-32-chars
JWT_EXPIRES_IN=7d

# External services
REFERRAL_SERVICE_URL=http://referral-service:3000
IDENTITY_SERVICE_URL=http://identity-service:3000

# Logging
LOG_LEVEL=info
LOG_FORMAT=json

# Performance
DISPLAY_LIMIT_DEFAULT=30
REFRESH_INTERVAL_MINUTES=5
CACHE_TTL_SECONDS=300

4.2 Database Migrations

# Apply pending migrations to production
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy

# Check migration status
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate status
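
Migrations can also be applied in-cluster before a rollout, for example as a one-off Job that reuses the service image and the Secret from section 3.5. A sketch (the Job name and file path are suggestions):

# k8s/migrate-job.yaml (sketch)
apiVersion: batch/v1
kind: Job
metadata:
  name: leaderboard-migrate
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/leaderboard-service:1.0.0
          command: ["npx", "prisma", "migrate", "deploy"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: leaderboard-secrets
                  key: database-url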

5. Monitoring and Alerting

5.1 Health Check Endpoints

| Endpoint      | Purpose         | Response                           |
|---------------|-----------------|------------------------------------|
| /health       | Liveness check  | {"status": "ok"}                   |
| /health/ready | Readiness check | {"status": "ok", "details": {...}} |

5.2 Prometheus Metrics

# prometheus-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: leaderboard-service
spec:
  selector:
    matchLabels:
      app: leaderboard-service
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

5.3 Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: leaderboard-service-alerts
spec:
  groups:
    - name: leaderboard-service
      rules:
        - alert: LeaderboardServiceDown
          expr: up{job="leaderboard-service"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Leaderboard Service is down"
            description: "Leaderboard Service has been down for more than 1 minute."

        - alert: LeaderboardServiceHighLatency
          expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="leaderboard-service"}[5m])) by (le)) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on Leaderboard Service"
            description: "95th percentile latency is above 2 seconds."

        - alert: LeaderboardServiceHighErrorRate
          expr: sum(rate(http_requests_total{job="leaderboard-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="leaderboard-service"}[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on Leaderboard Service"
            description: "Error rate is above 10%."

5.4 Log Collection

# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/leaderboard-service*.log
        Parser            docker
        Tag               leaderboard.*
        Refresh_Interval  5

    [OUTPUT]
        Name              es
        Match             leaderboard.*
        Host              elasticsearch
        Port              9200
        Index             leaderboard-logs
        Type              _doc    

6. Operations

6.1 Common Commands

# Check pod status
kubectl get pods -l app=leaderboard-service

# Tail logs
kubectl logs -f deployment/leaderboard-service

# Scale up or down
kubectl scale deployment leaderboard-service --replicas=5

# Restart the service
kubectl rollout restart deployment/leaderboard-service

# Roll back to the previous revision
kubectl rollout undo deployment/leaderboard-service

# Check resource usage
kubectl top pods -l app=leaderboard-service
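
After a restart, rollback, or image change, the rollout can be watched until it finishes:

kubectl rollout status deployment/leaderboard-service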

6.2 Database Maintenance

# Back up the database
pg_dump -h $DB_HOST -U $DB_USER -d leaderboard_db > backup_$(date +%Y%m%d).sql

# Restore the database
psql -h $DB_HOST -U $DB_USER -d leaderboard_db < backup_20240115.sql

# Purge expired data
psql -h $DB_HOST -U $DB_USER -d leaderboard_db -c "
  DELETE FROM leaderboard_rankings
  WHERE period_end_at < NOW() - INTERVAL '90 days';
"

6.3 Cache Maintenance

# Connect to Redis
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD

# List cache keys (inside redis-cli; prefer SCAN over KEYS against large production datasets)
KEYS leaderboard:*

# Delete a specific cache entry (inside redis-cli)
DEL leaderboard:DAILY:2024-01-15:rankings

# Delete all leaderboard cache keys (run from the shell, not inside redis-cli)
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD --scan --pattern 'leaderboard:*' | \
  xargs redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD DEL

7. Troubleshooting

7.1 Common Issues

| Issue                  | Likely Cause                | Resolution                           |
|------------------------|-----------------------------|--------------------------------------|
| Service fails to start | Database connection failure | Check the DATABASE_URL configuration |
| Rankings not updating  | Scheduled job not running   | Check the Scheduler logs             |
| Request timeouts       | Slow database queries       | Check indexes and query plans        |
| Stale or missing cache | Redis connection problems   | Check the Redis service status       |
| Lost messages          | Kafka misconfiguration      | Check Kafka connectivity and topics  |

7.2 Diagnostic Commands

# Check service reachability
curl -v http://localhost:3000/health

# Check database connectivity
kubectl exec -i deployment/leaderboard-service -- \
  npx prisma db execute --stdin <<< "SELECT 1"

# Check Redis connectivity (the app image does not include redis-cli; use a throwaway pod)
kubectl run redis-check --rm -it --image=redis:alpine --restart=Never -- \
  redis-cli -h $REDIS_HOST ping

# Search recent logs for errors
kubectl logs deployment/leaderboard-service --since=1h | grep ERROR

7.3 Performance Diagnostics

# CPU profile (starts a second app instance inside the pod; use a different PORT to avoid a conflict)
kubectl exec -it deployment/leaderboard-service -- \
  sh -c "PORT=3001 node --prof dist/main.js"

# Memory analysis (heap inspection via the Node inspector)
kubectl exec -it deployment/leaderboard-service -- \
  sh -c "PORT=3001 node --expose-gc --inspect dist/main.js"
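
--prof writes an isolate-*.log file inside the pod; one way to pull it out and summarize it locally (the pod and file names below are illustrative and vary per run):

# Copy the V8 profiling log out of the pod, then post-process it locally
kubectl cp <pod-name>:/app/isolate-0xnnnn-v8.log ./isolate.log
node --prof-process ./isolate.log > profile.txt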

8. Security Hardening

8.1 Network Policies

# k8s/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: leaderboard-service-network-policy
spec:
  podSelector:
    matchLabels:
      app: leaderboard-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # DNS lookups (CoreDNS / kube-dns); without this rule the pod cannot resolve service names
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: redis
      ports:
        - protocol: TCP
          port: 6379
    # Kafka brokers (namespace label follows the same convention as database/redis above)
    - to:
        - namespaceSelector:
            matchLabels:
              name: kafka
      ports:
        - protocol: TCP
          port: 9092

8.2 Security Checklist

  • Store all sensitive values in Secrets
  • Use strong database passwords and SSL connections (see the example below)
  • Enable Redis password authentication
  • Use a JWT secret that is sufficiently long and random
  • Run containers as a non-root user
  • Enable network policies to restrict traffic
  • Update dependencies and base images regularly
  • Enable audit logging
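
For the database item above, an encrypted connection is usually just a connection-string change; sslmode is a standard PostgreSQL parameter and the value below is a placeholder:

# Require SSL on the production database connection
DATABASE_URL=postgresql://user:password@db-host:5432/leaderboard_db?sslmode=require&connection_limit=20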

9. Backup and Recovery

9.1 Backup Strategy

| Data Type     | Backup Frequency                | Retention       |
|---------------|---------------------------------|-----------------|
| Database      | Daily full + hourly incremental | 30 days         |
| Configuration | On every change                 | Permanent (Git) |
| Logs          | Real-time shipping              | 90 days         |
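
The daily full backup can be scheduled in-cluster. A sketch of a CronJob built around the official postgres client image (the schedule, PVC name, and retention handling are assumptions to adapt):

# k8s/backup-cronjob.yaml (sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: leaderboard-db-backup
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              image: postgres:16-alpine
              command:
                - /bin/sh
                - -c
                - pg_dump "$DATABASE_URL" -Fc -f /backup/leaderboard_$(date +%Y%m%d).dump
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: leaderboard-secrets
                      key: database-url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: leaderboard-backup-pvc   # assumed PVC name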

9.2 Disaster Recovery

# 1. Restore the database (assumes a custom-format dump produced with pg_dump -Fc)
pg_restore -h $DB_HOST -U $DB_USER -d leaderboard_db latest_backup.dump

# 2. Redeploy the service
kubectl apply -f k8s/

# 3. Verify the service
curl http://leaderboard.example.com/health

# 4. Flush and rebuild the cache
redis-cli -h $REDIS_HOST -p $REDIS_PORT -a $REDIS_PASSWORD FLUSHDB
curl -X POST http://leaderboard.example.com/leaderboard/config/refresh

10. Releases

10.1 Release Flow

1. Development complete
   └── Code review
       └── Merge to develop
           └── CI tests pass
               └── Merge to main
                   └── Tag the release
                       └── Build the image
                           └── Deploy to Staging
                               └── Acceptance testing
                                   └── Deploy to Production
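
The flow above maps naturally onto a tag-triggered pipeline. A minimal sketch using GitHub Actions (the CI system is not prescribed by this repo; the registry credential secret names are assumptions):

# .github/workflows/release.yml (sketch)
name: release
on:
  push:
    tags: ["v*"]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          target: production
          push: true
          tags: |
            registry.example.com/leaderboard-service:${{ github.ref_name }}
            registry.example.com/leaderboard-service:latest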

10.2 Blue-Green Deployment

# Deploy the new version (green)
kubectl apply -f k8s/deployment-green.yaml

# Verify the new version
curl http://leaderboard-green.internal/health

# Switch traffic to green
kubectl patch service leaderboard-service \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Verify through the public endpoint
curl http://leaderboard.example.com/health

# Clean up the old version (blue)
kubectl delete -f k8s/deployment-blue.yaml
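
For the selector patch above to route traffic, the blue and green Deployments must label their pods with a version value. A sketch of the labels k8s/deployment-green.yaml would need (the rest mirrors the Deployment in 3.1):

# k8s/deployment-green.yaml (label excerpt)
metadata:
  name: leaderboard-service-green
spec:
  selector:
    matchLabels:
      app: leaderboard-service
      version: green
  template:
    metadata:
      labels:
        app: leaderboard-service
        version: green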

10.3 Canary Release

# k8s/canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard-service-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: leaderboard-service
      version: canary
  template:
    metadata:
      labels:
        app: leaderboard-service
        version: canary
    spec:
      containers:
        - name: leaderboard-service
          image: registry.example.com/leaderboard-service:1.1.0-canary
          # env, probes, and resources mirror the main deployment in 3.1

# Gradually shift traffic to the canary (the shared Service selects both deployments by the app label)
kubectl scale deployment leaderboard-service-canary --replicas=2
kubectl scale deployment leaderboard-service --replicas=8

# Watch the metrics; if nothing looks abnormal, continue
kubectl scale deployment leaderboard-service-canary --replicas=5
kubectl scale deployment leaderboard-service --replicas=5

# Complete the switchover
kubectl scale deployment leaderboard-service-canary --replicas=10
kubectl scale deployment leaderboard-service --replicas=0