feat: add infrastructure components for observability and service discovery

Add modular infrastructure stack with:
- Consul: service discovery and configuration center
- Jaeger: distributed tracing
- Loki + Promtail: log aggregation
- Prometheus: metrics collection with alert rules
- Grafana: unified visualization dashboard

All components are optional and can be enabled on-demand using Docker profiles.
No changes required to existing microservices.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
hailin 2025-12-06 17:51:29 -08:00
parent 9c36e6772b
commit bc82f58549
13 changed files with 2101 additions and 0 deletions


@ -0,0 +1,47 @@
# =============================================================================
# RWA Infrastructure - Environment variable configuration
# =============================================================================
# Copy this file to .env and adjust the values
# =============================================================================
# -----------------------------------------------------------------------------
# Consul configuration
# -----------------------------------------------------------------------------
CONSUL_HTTP_PORT=8500
CONSUL_DNS_PORT=8600
# -----------------------------------------------------------------------------
# Jaeger configuration
# -----------------------------------------------------------------------------
JAEGER_UI_PORT=16686
# -----------------------------------------------------------------------------
# Loki configuration
# -----------------------------------------------------------------------------
LOKI_PORT=3100
# -----------------------------------------------------------------------------
# Grafana configuration
# -----------------------------------------------------------------------------
GRAFANA_PORT=3030
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin123
GRAFANA_ROOT_URL=http://localhost:3030
GRAFANA_LOG_LEVEL=info
# -----------------------------------------------------------------------------
# Prometheus configuration
# -----------------------------------------------------------------------------
PROMETHEUS_PORT=9090
# -----------------------------------------------------------------------------
# Backend server IPs (scraped by Prometheus)
# -----------------------------------------------------------------------------
BACKEND_SERVER_IP=192.168.1.111
KONG_SERVER_IP=192.168.1.100
# -----------------------------------------------------------------------------
# PostgreSQL configuration (used by the Grafana data source)
# -----------------------------------------------------------------------------
POSTGRES_USER=rwa_user
POSTGRES_PASSWORD=your_password_here


@ -0,0 +1,257 @@
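# RWA Infrastructure - Observability and Service Governance
Pluggable infrastructure components that can be enabled on demand without changing existing microservice code.
## Architecture Overview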
```
        ┌───────────────────────────────────────────────────┐
        │                      Grafana                      │
        │            (unified dashboards :3030)             │
        └───┬─────────────────────┬─────────────────────┬───┘
            │                     │                     │
   ┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
   │   Prometheus    │   │      Loki       │   │     Jaeger      │
   │ (metrics :9090) │   │  (logs :3100)   │   │ (traces :16686) │
   └────────┬────────┘   └────────┬────────┘   └────────┬────────┘
            │                     │                     │
            │            ┌────────▼────────┐            │
            │            │    Promtail     │            │
            │            │  (log shipper)  │            │
            │            └────────┬────────┘            │
            │                     │                     │
┌───────────▼─────────────────────▼─────────────────────▼────────────┐
│                      RWA microservice cluster                      │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │
│  │ identity │ │  wallet  │ │   mpc    │ │  reward  │ │ presence │  │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘  │
│                                ...                                 │
└─────────────────────────────────┬──────────────────────────────────┘
                    ┌─────────────▼─────────────┐
                    │          Consul           │
                    │ (service discovery :8500) │
                    └───────────────────────────┘
```
## Quick Start
### 1. Start all components
```bash
cd infrastructure
./deploy.sh up
```
### 2. Start components on demand
```bash
# Start service discovery only
./deploy.sh up consul
# Start the logging stack
./deploy.sh up loki grafana
# Start the tracing stack
./deploy.sh up jaeger grafana
# Start the monitoring stack
./deploy.sh up prometheus grafana
```
### 3. Access URLs
| Service | URL | Purpose |
|------|------|------|
| Consul | http://localhost:8500 | Service discovery & configuration center |
| Jaeger | http://localhost:16686 | Tracing UI |
| Grafana | http://localhost:3030 | Unified dashboards |
| Prometheus | http://localhost:9090 | Metrics queries |
| Loki | http://localhost:3100 | Log API |
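## Components
### Consul - Service Discovery and Configuration Center
**Features:**
- Service registration and discovery
- Health checks
- KV configuration store
- Multi-datacenter support
**Configuration files:**
- `consul/config/services.json` - service registration definitions
- `consul/config/kv-defaults.json` - default KV configuration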
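As far as the Consul documentation describes it, `consul kv import` consumes the layout produced by `consul kv export`: a JSON array of objects with `key`, `flags`, and a base64-encoded `value`. `kv-defaults.json` wraps plain-text values in a `config` array instead, so a conversion step may be needed before importing. The sketch below shows one possible conversion; `toConsulImportFormat`, `toBase64`, and the interface names are hypothetical and not part of this repository.

```typescript
// Hypothetical types describing kv-defaults.json and the assumed import layout.
interface KvDefaultEntry {
  key: string;
  value: string;
}

interface KvDefaultsFile {
  description?: string;
  usage?: string;
  config: KvDefaultEntry[];
}

interface ConsulKvImportEntry {
  key: string;
  flags: number;
  value: string; // base64-encoded
}

// Encode a UTF-8 string as base64 using standard web/Node.js globals.
function toBase64(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary);
}

// Convert the `config` entries into the assumed `consul kv export` layout.
function toConsulImportFormat(file: KvDefaultsFile): ConsulKvImportEntry[] {
  return file.config.map((entry) => ({
    key: entry.key,
    flags: 0,
    value: toBase64(entry.value),
  }));
}

// Two entries copied from kv-defaults.json, used here as sample input.
const kvDefaultsSample: KvDefaultsFile = {
  config: [
    { key: "rwa/config/global/log_level", value: "info" },
    { key: "rwa/config/features/kyc_required", value: "true" },
  ],
};

console.log(JSON.stringify(toConsulImportFormat(kvDefaultsSample), null, 2));
```

The printed array could be saved to a file and passed to `consul kv import`, assuming the import layout described above matches the Consul version in use.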
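**Usage examples:**
```bash
# List registered services
curl http://localhost:8500/v1/catalog/services
# Read a configuration value (URL quoted so the shell does not treat "?" as a glob)
curl 'http://localhost:8500/v1/kv/rwa/config/global/log_level?raw'
# Update a configuration value
curl -X PUT -d 'debug' http://localhost:8500/v1/kv/rwa/config/global/log_level
```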
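Without `?raw`, the KV endpoint returns a JSON array whose `Value` fields are base64-encoded, so a service reading its configuration has to decode them. A minimal sketch of that decoding step follows; `decodeKvResponse` and `fromBase64` are hypothetical helper names, and the sample payload is illustrative rather than captured from a running Consul.

```typescript
// One item of the array returned by GET /v1/kv/<key> (only the fields used here).
interface ConsulKvItem {
  Key: string;
  Value: string | null; // base64-encoded; assumed to be null for empty values
}

// Decode base64 into a UTF-8 string using standard web/Node.js globals.
function fromBase64(encoded: string): string {
  const binary = atob(encoded);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder().decode(bytes);
}

// Turn the raw KV response into a plain key -> decoded value object.
function decodeKvResponse(items: ConsulKvItem[]): Record<string, string> {
  const result: Record<string, string> = {};
  for (const item of items) {
    result[item.Key] = item.Value === null ? "" : fromBase64(item.Value);
  }
  return result;
}

// Illustrative payload for two keys defined in kv-defaults.json.
const kvResponseSample: ConsulKvItem[] = [
  { Key: "rwa/config/global/log_level", Value: "aW5mbw==" },
  { Key: "rwa/config/global/environment", Value: "cHJvZHVjdGlvbg==" },
];

console.log(decodeKvResponse(kvResponseSample));
```

In a service, the array would come from an HTTP call to `http://localhost:8500/v1/kv/rwa/config/?recurse` or a similar prefix query.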
### Jaeger - Distributed Tracing
**Features:**
- Request tracing
- Performance bottleneck analysis
- Service dependency visualization
- Error localization
**Integration (NestJS):**
```typescript
// Install the dependencies first (shell command, not TypeScript):
//   npm install @opentelemetry/sdk-node @opentelemetry/exporter-jaeger
// Initialize in main.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
const sdk = new NodeSDK({
  // Send spans to the Jaeger collector HTTP endpoint
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces',
  }),
  serviceName: 'identity-service',
});
sdk.start();
```
### Loki + Promtail - Log Aggregation
**Features:**
- Automatic collection of Docker container logs
- Log labeling and indexing
- LogQL queries
- Deep integration with Grafana
**Log query examples (Grafana):**
```logql
# Show all error logs
{job="rwa-backend"} |~ "error|Error|ERROR"
# Filter by service
{service="identity-service"} | json | level="error"
# Show a specific trace
{trace_id="abc123"}
```
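The same LogQL queries can also be sent to Loki's HTTP API (`/loki/api/v1/query_range`) outside Grafana. The sketch below only builds the request URL; `buildLokiQueryRangeUrl` and the `LokiRangeQuery` type are hypothetical names, and the default limit of 100 is an arbitrary choice for this example.

```typescript
interface LokiRangeQuery {
  baseUrl: string; // e.g. http://localhost:3100
  logql: string;   // LogQL expression
  startMs: number; // range start, Unix epoch milliseconds
  endMs: number;   // range end, Unix epoch milliseconds
  limit?: number;  // maximum number of log lines
}

function buildLokiQueryRangeUrl(q: LokiRangeQuery): string {
  // Loki accepts nanosecond epoch timestamps. Appending six zeros to the
  // millisecond value as text avoids floating-point precision loss.
  const toNanos = (ms: number): string => `${Math.trunc(ms)}000000`;
  const params = [
    `query=${encodeURIComponent(q.logql)}`,
    `start=${toNanos(q.startMs)}`,
    `end=${toNanos(q.endMs)}`,
    `limit=${q.limit ?? 100}`,
  ];
  return `${q.baseUrl}/loki/api/v1/query_range?${params.join("&")}`;
}

// Example: one minute of error logs from the rwa-backend job.
const lokiUrl = buildLokiQueryRangeUrl({
  baseUrl: "http://localhost:3100",
  logql: '{job="rwa-backend"} |~ "error"',
  startMs: 1700000000000,
  endMs: 1700000060000,
  limit: 50,
});

console.log(lokiUrl);
```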
### Prometheus - Metrics Monitoring
**Features:**
- Metrics collection
- Alerting rules
- PromQL queries
**Alerting rules:**
- `prometheus/rules/rwa-alerts.yml` - predefined alerting rules
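PromQL results are also available over Prometheus's HTTP API, for example `GET /api/v1/query?query=up`. The sketch below shows how the JSON for an instant `up` query might be inspected to find targets that are down; `findDownJobs` is a hypothetical helper, and the sample response is illustrative (the real job names depend on `prometheus/prometheus.yml`).

```typescript
// One sample of an instant-vector result.
interface PromVectorSample {
  metric: Record<string, string>;
  value: [number, string]; // [Unix timestamp in seconds, sample value as text]
}

interface PromInstantResponse {
  status: "success" | "error";
  data: { resultType: string; result: PromVectorSample[] };
}

// List the jobs whose `up` sample is anything other than 1.
function findDownJobs(resp: PromInstantResponse): string[] {
  if (resp.status !== "success" || resp.data.resultType !== "vector") {
    throw new Error("unexpected Prometheus response");
  }
  return resp.data.result
    .filter((sample) => Number(sample.value[1]) !== 1)
    .map((sample) => sample.metric.job ?? "unknown");
}

// Illustrative response body for the query `up`.
const upSample: PromInstantResponse = {
  status: "success",
  data: {
    resultType: "vector",
    result: [
      { metric: { __name__: "up", job: "identity-service" }, value: [1700000000, "1"] },
      { metric: { __name__: "up", job: "wallet-service" }, value: [1700000000, "0"] },
    ],
  },
};

console.log(findDownJobs(upSample));
```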
### Grafana - Unified Visualization
**Provisioned dashboards:**
- RWA Services Overview - service overview
- Kong Dashboard - API gateway monitoring
- Presence Dashboard - user online status
**Data sources:**
- Prometheus (metrics)
- Loki (logs)
- Jaeger (traces)
## Directory Structure
```
infrastructure/
├── docker-compose.yml           # Main compose file
├── deploy.sh                    # Deployment script
├── .env.example                 # Environment variable template
├── README.md                    # This document
├── consul/
│   └── config/
│       ├── services.json        # Service registrations
│       └── kv-defaults.json     # Default KV configuration
├── jaeger/                      # Jaeger configuration (defaults are used)
├── loki/
│   ├── loki-config.yml          # Loki configuration
│   └── promtail-config.yml      # Promtail configuration
├── prometheus/
│   ├── prometheus.yml           # Prometheus configuration
│   └── rules/
│       └── rwa-alerts.yml       # Alerting rules
└── grafana/
    └── provisioning/
        ├── datasources/
        │   └── datasources.yml  # Data source configuration
        └── dashboards/
            ├── dashboards.yml   # Dashboard provisioning
            └── rwa-services-overview.json
```
## Common Commands
```bash
# Start
./deploy.sh up                  # Start everything
./deploy.sh up consul jaeger    # Start selected components
# Manage
./deploy.sh status              # Show status
./deploy.sh health              # Run health checks
./deploy.sh logs grafana        # Show logs
./deploy.sh restart             # Restart
# Stop
./deploy.sh down                # Stop everything
```
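## Integrating with Existing Services
These components are **entirely optional**. Several capabilities are available without modifying existing microservice code; the rest need small changes:
| Capability | No code changes | Small changes required |
|------|-----------|-------------|
| Service health monitoring | ✅ Consul health checks | - |
| Log aggregation | ✅ Automatic Docker log collection | - |
| Basic metrics | ✅ Kong Prometheus plugin | - |
| Detailed metrics | - | Add Prometheus middleware |
| Tracing | - | Add the OpenTelemetry SDK |
| Dynamic configuration | - | Integrate with Consul KV |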
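To give a sense of what "add Prometheus middleware" involves, the sketch below counts finished requests and renders them in the Prometheus text exposition format. It is deliberately dependency-free; a real service would more likely use a metrics library such as `prom-client`. The `RequestCounter` class and the `method`/`status` label names are assumptions for this example, while the metric name `http_requests_total` matches the query used by the provisioned dashboard.

```typescript
// Minimal in-memory counter keyed by its rendered label set.
class RequestCounter {
  private counts: Record<string, number> = {};

  // Record one finished request.
  observe(method: string, status: number): void {
    const labels = `method="${method}",status="${status}"`;
    this.counts[labels] = (this.counts[labels] ?? 0) + 1;
  }

  // Render all series in the Prometheus text exposition format.
  render(): string {
    const lines = [
      "# HELP http_requests_total Total number of HTTP requests.",
      "# TYPE http_requests_total counter",
    ];
    for (const labels of Object.keys(this.counts).sort()) {
      lines.push(`http_requests_total{${labels}} ${this.counts[labels]}`);
    }
    return lines.join("\n") + "\n";
  }
}

const requestCounter = new RequestCounter();
requestCounter.observe("GET", 200);
requestCounter.observe("GET", 200);
requestCounter.observe("POST", 500);

console.log(requestCounter.render());
```

In a NestJS service, `observe` would typically be called from a middleware or interceptor, and `render` would back the endpoint that Prometheus scrapes.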
## Extending the Configuration
### Add alert notifications
Edit `prometheus/prometheus.yml` and configure Alertmanager:
```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```
### Add more services to Consul
Edit `consul/config/services.json` and add the new service definition.
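The existing entries in `services.json` all follow the same shape, so a new definition can be generated rather than copied by hand. The sketch below mirrors that shape; `buildServiceDefinition` is a hypothetical helper, and the `notification` service on port 3012 is an invented example, not a service in this repository.

```typescript
interface ConsulHttpCheck {
  http: string;
  interval: string;
  timeout: string;
  deregister_critical_service_after: string;
}

interface ConsulServiceDefinition {
  name: string;
  id: string;
  address: string;
  port: number;
  tags: string[];
  meta: Record<string, string>;
  checks: ConsulHttpCheck[];
}

// Build an entry with the same fields and defaults as the existing ones.
function buildServiceDefinition(
  shortName: string,
  address: string,
  port: number,
  healthPath = "/api/v1/health",
): ConsulServiceDefinition {
  const name = `${shortName}-service`;
  return {
    name,
    id: `${name}-1`,
    address,
    port,
    tags: ["rwa", "api", shortName],
    meta: { version: "1.0.0", environment: "production" },
    checks: [
      {
        http: `http://${address}:${port}${healthPath}`,
        interval: "10s",
        timeout: "5s",
        deregister_critical_service_after: "1m",
      },
    ],
  };
}

// Invented example values; adjust the name, address, and port to the real service.
const newService = buildServiceDefinition("notification", "192.168.1.111", 3012);
console.log(JSON.stringify(newService, null, 2));
```

The printed object can be appended to the `services` array. Some existing services use a different health path (for example `/health`), which is why `healthPath` is a parameter.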
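### Customize Grafana dashboards
Place JSON files in the `grafana/provisioning/dashboards/` directory and they are loaded automatically.
## Production Recommendations
1. **Persistent storage**: the stack currently uses Docker volumes; external storage is recommended for production
2. **High availability**: run Consul as a 3-5 node cluster
3. **Security**: configure TLS and access control
4. **Resource limits**: add Docker resource limits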


@ -0,0 +1,46 @@
{
"description": "Default Consul KV configuration - can be changed dynamically via the API or UI",
"usage": "consul kv import @kv-defaults.json",
"config": [
{
"key": "rwa/config/global/log_level",
"value": "info"
},
{
"key": "rwa/config/global/environment",
"value": "production"
},
{
"key": "rwa/config/rate-limit/default",
"value": "{\"requests_per_minute\": 100, \"requests_per_hour\": 5000}"
},
{
"key": "rwa/config/rate-limit/auth",
"value": "{\"requests_per_minute\": 20, \"requests_per_hour\": 200}"
},
{
"key": "rwa/config/cache/ttl",
"value": "{\"default\": 300, \"session\": 3600, \"static\": 86400}"
},
{
"key": "rwa/config/features/maintenance_mode",
"value": "false"
},
{
"key": "rwa/config/features/new_user_registration",
"value": "true"
},
{
"key": "rwa/config/features/kyc_required",
"value": "true"
},
{
"key": "rwa/config/database/pool_size",
"value": "20"
},
{
"key": "rwa/config/kafka/batch_size",
"value": "100"
}
]
}


@ -0,0 +1,232 @@
{
"services": [
{
"name": "identity-service",
"id": "identity-service-1",
"address": "192.168.1.111",
"port": 3000,
"tags": ["rwa", "api", "identity"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3000/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "wallet-service",
"id": "wallet-service-1",
"address": "192.168.1.111",
"port": 3001,
"tags": ["rwa", "api", "wallet"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3001/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "backup-service",
"id": "backup-service-1",
"address": "192.168.1.111",
"port": 3002,
"tags": ["rwa", "api", "backup", "mpc"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3002/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "planting-service",
"id": "planting-service-1",
"address": "192.168.1.111",
"port": 3003,
"tags": ["rwa", "api", "planting"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3003/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "referral-service",
"id": "referral-service-1",
"address": "192.168.1.111",
"port": 3004,
"tags": ["rwa", "api", "referral"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3004/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "reward-service",
"id": "reward-service-1",
"address": "192.168.1.111",
"port": 3005,
"tags": ["rwa", "api", "reward"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3005/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "mpc-service",
"id": "mpc-service-1",
"address": "192.168.1.111",
"port": 3006,
"tags": ["rwa", "api", "mpc", "crypto"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3006/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "leaderboard-service",
"id": "leaderboard-service-1",
"address": "192.168.1.111",
"port": 3007,
"tags": ["rwa", "api", "leaderboard"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3007/api/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "reporting-service",
"id": "reporting-service-1",
"address": "192.168.1.111",
"port": 3008,
"tags": ["rwa", "api", "reporting"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3008/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "authorization-service",
"id": "authorization-service-1",
"address": "192.168.1.111",
"port": 3009,
"tags": ["rwa", "api", "authorization", "rbac"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3009/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "admin-service",
"id": "admin-service-1",
"address": "192.168.1.111",
"port": 3010,
"tags": ["rwa", "api", "admin"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3010/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
},
{
"name": "presence-service",
"id": "presence-service-1",
"address": "192.168.1.111",
"port": 3011,
"tags": ["rwa", "api", "presence", "realtime"],
"meta": {
"version": "1.0.0",
"environment": "production"
},
"checks": [
{
"http": "http://192.168.1.111:3011/api/v1/health",
"interval": "10s",
"timeout": "5s",
"deregister_critical_service_after": "1m"
}
]
}
]
}


@ -0,0 +1,273 @@
#!/bin/bash
# =============================================================================
# RWA Infrastructure - Deployment script
# =============================================================================
#
# Usage:
#   ./deploy.sh up [component...]       Start components (default: full)
#   ./deploy.sh down                    Stop all components
#   ./deploy.sh restart [component...]  Restart components
#   ./deploy.sh logs [component]        Show logs
#   ./deploy.sh status                  Show status
#   ./deploy.sh health                  Run health checks
#
# Available components:
#   consul     - service discovery and configuration center
#   jaeger     - distributed tracing
#   loki       - log aggregation (includes promtail)
#   grafana    - visualization dashboards
#   prometheus - metrics collection
#   full       - all components
#
# Examples:
#   ./deploy.sh up                  # Start all components
#   ./deploy.sh up consul jaeger    # Start only Consul and Jaeger
#   ./deploy.sh up loki grafana     # Start logging and visualization
#   ./deploy.sh logs jaeger         # Show Jaeger logs
#
# =============================================================================
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration
COMPOSE_FILE="docker-compose.yml"
ENV_FILE=".env"
ENV_EXAMPLE=".env.example"
# =============================================================================
# Utility functions
# =============================================================================
log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}
# Make sure a .env file exists
check_env() {
    if [ ! -f "$ENV_FILE" ]; then
        if [ -f "$ENV_EXAMPLE" ]; then
            log_info "Creating .env file..."
            cp "$ENV_EXAMPLE" "$ENV_FILE"
            log_warning "Please review the .env file and set the required environment variables"
        else
            log_warning ".env file not found, using default configuration"
        fi
    fi
}
# Build the --profile arguments
# Errors go to stderr because callers capture stdout with $(...)
get_profiles() {
    local profiles=""
    if [ $# -eq 0 ] || [ "$1" = "full" ]; then
        profiles="--profile full"
    else
        for component in "$@"; do
            case "$component" in
                consul|jaeger|loki|logging|grafana|prometheus|metrics|tracing)
                    profiles="$profiles --profile $component"
                    ;;
                *)
                    log_error "Unknown component: $component" >&2
                    echo "Available components: consul, jaeger, loki, grafana, prometheus, full" >&2
                    exit 1
                    ;;
            esac
        done
    fi
    echo "$profiles"
}
# =============================================================================
# Command implementations
# =============================================================================
cmd_up() {
    check_env
    # Declare and assign separately so a get_profiles failure aborts under `set -e`
    local profiles
    profiles=$(get_profiles "$@")
    log_info "Starting infrastructure components..."
    # $profiles is intentionally unquoted so it splits into separate arguments
    docker compose -f "$COMPOSE_FILE" $profiles up -d
    log_success "Components started!"
    echo ""
    cmd_status
}
cmd_down() {
    log_info "Stopping all components..."
    docker compose -f "$COMPOSE_FILE" --profile full down
    log_success "All components stopped"
}
cmd_restart() {
    local profiles
    profiles=$(get_profiles "$@")
    log_info "Restarting components..."
    docker compose -f "$COMPOSE_FILE" $profiles restart
    log_success "Components restarted"
}
cmd_logs() {
    local service="${1:-}"
    if [ -n "$service" ]; then
        docker compose -f "$COMPOSE_FILE" logs -f "$service"
    else
        docker compose -f "$COMPOSE_FILE" --profile full logs -f
    fi
}
cmd_status() {
    echo ""
    echo "=========================================="
    echo "  RWA Infrastructure Status"
    echo "=========================================="
    echo ""
    docker compose -f "$COMPOSE_FILE" --profile full ps --format "table {{.Name}}\t{{.Status}}\t{{.Ports}}"
    echo ""
    echo "=========================================="
    echo "  Access URLs"
    echo "=========================================="
    echo ""
    echo "  Consul UI:   http://localhost:${CONSUL_HTTP_PORT:-8500}"
    echo "  Jaeger UI:   http://localhost:${JAEGER_UI_PORT:-16686}"
    echo "  Grafana:     http://localhost:${GRAFANA_PORT:-3030}"
    echo "  Prometheus:  http://localhost:${PROMETHEUS_PORT:-9090}"
    echo "  Loki:        http://localhost:${LOKI_PORT:-3100}"
    echo ""
}
cmd_health() {
    echo ""
    echo "=========================================="
    echo "  Health Checks"
    echo "=========================================="
    echo ""
    # curl -f returns a non-zero exit code on HTTP error responses (e.g. 503),
    # so a component that answers but is not ready is reported as unhealthy.
    # Consul
    if curl -sf "http://localhost:${CONSUL_HTTP_PORT:-8500}/v1/status/leader" > /dev/null 2>&1; then
        echo -e "  Consul:      ${GREEN}✓ Healthy${NC}"
    else
        echo -e "  Consul:      ${RED}✗ Unhealthy${NC}"
    fi
    # Jaeger
    if curl -sf "http://localhost:${JAEGER_UI_PORT:-16686}" > /dev/null 2>&1; then
        echo -e "  Jaeger:      ${GREEN}✓ Healthy${NC}"
    else
        echo -e "  Jaeger:      ${RED}✗ Unhealthy${NC}"
    fi
    # Grafana
    if curl -sf "http://localhost:${GRAFANA_PORT:-3030}/api/health" > /dev/null 2>&1; then
        echo -e "  Grafana:     ${GREEN}✓ Healthy${NC}"
    else
        echo -e "  Grafana:     ${RED}✗ Unhealthy${NC}"
    fi
    # Prometheus
    if curl -sf "http://localhost:${PROMETHEUS_PORT:-9090}/-/healthy" > /dev/null 2>&1; then
        echo -e "  Prometheus:  ${GREEN}✓ Healthy${NC}"
    else
        echo -e "  Prometheus:  ${RED}✗ Unhealthy${NC}"
    fi
    # Loki
    if curl -sf "http://localhost:${LOKI_PORT:-3100}/ready" > /dev/null 2>&1; then
        echo -e "  Loki:        ${GREEN}✓ Healthy${NC}"
    else
        echo -e "  Loki:        ${RED}✗ Unhealthy${NC}"
    fi
    echo ""
}
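cmd_help() {
    echo ""
    echo "RWA Infrastructure deployment tool"
    echo ""
    echo "Usage: $0 <command> [args...]"
    echo ""
    echo "Commands:"
    echo "  up [component...]       Start components (default: all)"
    echo "  down                    Stop all components"
    echo "  restart [component...]  Restart components"
    echo "  logs [component]        Show logs"
    echo "  status                  Show running status"
    echo "  health                  Run health checks"
    echo "  help                    Show this help"
    echo ""
    echo "Available components:"
    echo "  consul      Service discovery and configuration center"
    echo "  jaeger      Distributed tracing"
    echo "  loki        Log aggregation"
    echo "  grafana     Visualization dashboards"
    echo "  prometheus  Metrics collection"
    echo "  full        All components (default)"
    echo ""
    echo "Examples:"
    echo "  $0 up                    # Start all components"
    echo "  $0 up consul jaeger      # Start only Consul and Jaeger"
    echo "  $0 logs grafana          # Show Grafana logs"
    echo "  $0 health                # Check the health of all components"
    echo ""
}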
# =============================================================================
# Main entry point
# =============================================================================
case "${1:-help}" in
    up)
        shift
        cmd_up "$@"
        ;;
    down)
        cmd_down
        ;;
    restart)
        shift
        cmd_restart "$@"
        ;;
    logs)
        shift
        cmd_logs "$@"
        ;;
    status)
        cmd_status
        ;;
    health)
        cmd_health
        ;;
    help|--help|-h)
        cmd_help
        ;;
    *)
        log_error "Unknown command: $1"
        cmd_help
        exit 1
        ;;
esac


@ -0,0 +1,254 @@
# =============================================================================
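# RWA Infrastructure - Observability and service governance infrastructure
# =============================================================================
#
# Modular design; enable components on demand:
#   - consul:     service discovery and configuration center
#   - jaeger:     distributed tracing
#   - loki:       log aggregation
#   - grafana:    unified visualization dashboards
#   - prometheus: metrics collection
#
# Usage:
#   ./deploy.sh up              # Start all components
#   ./deploy.sh up consul       # Start only Consul
#   ./deploy.sh up jaeger loki  # Start selected components
#   ./deploy.sh down            # Stop all components
#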
# =============================================================================
services:
  # ===========================================================================
  # Consul - service discovery and configuration center
  # ===========================================================================
  # Features:
  #   - Service registration and discovery
  #   - Health checks
  #   - KV configuration store
  #   - Multi-datacenter support
  # ===========================================================================
  consul:
    image: docker.io/hashicorp/consul:1.18
    container_name: rwa-consul
    command: agent -server -bootstrap-expect=1 -ui -client=0.0.0.0 -datacenter=rwa-dc1
    environment:
      CONSUL_BIND_INTERFACE: eth0
    ports:
      - "${CONSUL_HTTP_PORT:-8500}:8500"     # HTTP API + UI
      - "${CONSUL_DNS_PORT:-8600}:8600/udp"  # DNS
      - "8301:8301"                          # Serf LAN
      - "8302:8302"                          # Serf WAN
    volumes:
      - consul_data:/consul/data
      - ./consul/config:/consul/config:ro
    healthcheck:
      test: ["CMD", "consul", "members"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - rwa-infra
    profiles:
      - consul
      - full
  # ===========================================================================
  # Jaeger - distributed tracing
  # ===========================================================================
  # Features:
  #   - Request tracing
  #   - Performance bottleneck analysis
  #   - Service dependency visualization
  #   - Error localization
  # ===========================================================================
  jaeger:
    image: docker.io/jaegertracing/all-in-one:1.54
    container_name: rwa-jaeger
    environment:
      # Values are quoted so YAML passes them through as plain strings
      COLLECTOR_ZIPKIN_HOST_PORT: ":9411"
      COLLECTOR_OTLP_ENABLED: "true"
      SPAN_STORAGE_TYPE: badger
      BADGER_EPHEMERAL: "false"
      BADGER_DIRECTORY_VALUE: /badger/data
      BADGER_DIRECTORY_KEY: /badger/key
    ports:
      - "${JAEGER_UI_PORT:-16686}:16686"  # UI
      - "6831:6831/udp"                   # Thrift compact (agent)
      - "6832:6832/udp"                   # Thrift binary (agent)
      - "4317:4317"                       # OTLP gRPC
      - "4318:4318"                       # OTLP HTTP
      - "14250:14250"                     # gRPC (collector)
      - "14268:14268"                     # HTTP (collector)
      - "9411:9411"                       # Zipkin compatible
    volumes:
      - jaeger_data:/badger
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:16686 || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - rwa-infra
    profiles:
      - jaeger
      - tracing
      - full
  # ===========================================================================
  # Loki - log aggregation
  # ===========================================================================
  # Features:
  #   - Log collection and storage
  #   - Log queries (LogQL)
  #   - Deep integration with Grafana
  #   - Low resource usage
  # ===========================================================================
  loki:
    image: docker.io/grafana/loki:2.9.4
    container_name: rwa-loki
    command: -config.file=/etc/loki/loki-config.yml
    ports:
      - "${LOKI_PORT:-3100}:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki_data:/loki
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3100/ready || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - rwa-infra
    profiles:
      - loki
      - logging
      - full
  # ===========================================================================
  # Promtail - log shipping agent
  # ===========================================================================
  # Features:
  #   - Collects Docker container logs
  #   - Labels log streams
  #   - Pushes logs to Loki
  # ===========================================================================
  promtail:
    image: docker.io/grafana/promtail:2.9.4
    container_name: rwa-promtail
    command: -config.file=/etc/promtail/promtail-config.yml
    volumes:
      - ./loki/promtail-config.yml:/etc/promtail/promtail-config.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - promtail_positions:/tmp
    depends_on:
      loki:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - rwa-infra
    profiles:
      - loki
      - logging
      - full
  # ===========================================================================
  # Grafana - unified visualization platform
  # ===========================================================================
  # Features:
  #   - Multiple data sources (Prometheus, Loki, Jaeger)
  #   - Custom dashboards
  #   - Alert management
  #   - Team collaboration
  # ===========================================================================
  grafana:
    image: docker.io/grafana/grafana:10.3.1
    container_name: rwa-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin123}
      - GF_USERS_ALLOW_SIGN_UP=false
      # Server settings
      - GF_SERVER_ROOT_URL=${GRAFANA_ROOT_URL:-http://localhost:3030}
      - GF_SERVER_SERVE_FROM_SUB_PATH=false
      # Security settings
      - GF_SECURITY_ALLOW_EMBEDDING=true
      - GF_SECURITY_COOKIE_SAMESITE=lax
      # Feature toggles
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor tempoSearch tempoBackendSearch
      # Log level
      - GF_LOG_LEVEL=${GRAFANA_LOG_LEVEL:-info}
    ports:
      - "${GRAFANA_PORT:-3030}:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3000/api/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - rwa-infra
    profiles:
      - grafana
      - full
  # ===========================================================================
  # Prometheus - metrics collection (optional; skip if api-gateway already provides one)
  # ===========================================================================
  prometheus:
    image: docker.io/prom/prometheus:v2.49.1
    container_name: rwa-prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:9090/-/healthy || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - rwa-infra
    profiles:
      - prometheus
      - metrics
      - full
# =============================================================================
# Volumes - persistent storage
# =============================================================================
volumes:
  consul_data:
    driver: local
  jaeger_data:
    driver: local
  loki_data:
    driver: local
  promtail_positions:
    driver: local
  grafana_data:
    driver: local
  prometheus_data:
    driver: local
# =============================================================================
# Networks
# =============================================================================
networks:
  rwa-infra:
    driver: bridge
    name: rwa-infra


@ -0,0 +1,18 @@
# =============================================================================
# Grafana Dashboard Provisioning
# =============================================================================
apiVersion: 1
providers:
  - name: 'RWA Dashboards'
    orgId: 1
    folder: 'RWA'
    folderUid: 'rwa'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false


@ -0,0 +1,387 @@
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "Monitoring overview of the RWA microservice cluster",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"gridPos": { "h": 3, "w": 24, "x": 0, "y": 0 },
"id": 1,
"options": {
"code": { "language": "plaintext", "showLineNumbers": false },
"content": "# RWA Microservices Monitoring Center\n\nReal-time monitoring of the health, performance metrics, and logs of 12 microservices",
"mode": "markdown"
},
"pluginVersion": "10.3.1",
"title": "",
"transparent": true,
"type": "text"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"mappings": [
{ "options": { "0": { "color": "red", "text": "DOWN" }, "1": { "color": "green", "text": "UP" } }, "type": "value" }
],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"unit": "none"
}
},
"gridPos": { "h": 4, "w": 4, "x": 0, "y": 3 },
"id": 2,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"pluginVersion": "10.3.1",
"targets": [
{
"expr": "up{job=\"identity-service\"}",
"refId": "A"
}
],
"title": "Identity Service",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"mappings": [
{ "options": { "0": { "color": "red", "text": "DOWN" }, "1": { "color": "green", "text": "UP" } }, "type": "value" }
],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
}
}
},
"gridPos": { "h": 4, "w": 4, "x": 4, "y": 3 },
"id": 3,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"expr": "up{job=\"wallet-service\"}",
"refId": "A"
}
],
"title": "Wallet Service",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"mappings": [
{ "options": { "0": { "color": "red", "text": "DOWN" }, "1": { "color": "green", "text": "UP" } }, "type": "value" }
],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
}
}
},
"gridPos": { "h": 4, "w": 4, "x": 8, "y": 3 },
"id": 4,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"expr": "up{job=\"mpc-service\"}",
"refId": "A"
}
],
"title": "MPC Service",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"mappings": [
{ "options": { "0": { "color": "red", "text": "DOWN" }, "1": { "color": "green", "text": "UP" } }, "type": "value" }
],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
}
}
},
"gridPos": { "h": 4, "w": 4, "x": 12, "y": 3 },
"id": 5,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"expr": "up{job=\"reward-service\"}",
"refId": "A"
}
],
"title": "Reward Service",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"mappings": [
{ "options": { "0": { "color": "red", "text": "DOWN" }, "1": { "color": "green", "text": "UP" } }, "type": "value" }
],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
}
}
},
"gridPos": { "h": 4, "w": 4, "x": 16, "y": 3 },
"id": 6,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"expr": "up{job=\"presence-service\"}",
"refId": "A"
}
],
"title": "Presence Service",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"mappings": [
{ "options": { "0": { "color": "red", "text": "DOWN" }, "1": { "color": "green", "text": "UP" } }, "type": "value" }
],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
}
}
},
"gridPos": { "h": 4, "w": 4, "x": 20, "y": 3 },
"id": 7,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"expr": "up{job=\"backup-service\"}",
"refId": "A"
}
],
"title": "Backup Service",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
"insertNulls": false,
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": { "type": "linear" },
"showPoints": "never",
"spanNulls": false,
"stacking": { "group": "A", "mode": "none" },
"thresholdsStyle": { "mode": "off" }
},
"unit": "reqps"
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 7 },
"id": 8,
"options": {
"legend": { "calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom", "showLegend": true },
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"expr": "sum(rate(http_requests_total{job=~\".*-service\"}[5m])) by (job)",
"legendFormat": "{{job}}",
"refId": "A"
}
],
"title": "Request Rate by Service",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "prometheus" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
"insertNulls": false,
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": { "type": "linear" },
"showPoints": "never",
"spanNulls": false,
"stacking": { "group": "A", "mode": "none" },
"thresholdsStyle": { "mode": "off" }
},
"unit": "ms"
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 7 },
"id": 9,
"options": {
"legend": { "calcs": ["mean", "p95"], "displayMode": "table", "placement": "bottom", "showLegend": true },
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=~\".*-service\"}[5m])) by (le, job)) * 1000",
"legendFormat": "{{job}} P95",
"refId": "A"
}
],
"title": "Response Time P95 by Service",
"type": "timeseries"
},
{
"datasource": { "type": "loki", "uid": "loki" },
"gridPos": { "h": 10, "w": 24, "x": 0, "y": 15 },
"id": 10,
"options": {
"dedupStrategy": "none",
"enableLogDetails": true,
"prettifyLogMessage": true,
"showCommonLabels": false,
"showLabels": true,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"targets": [
{
"expr": "{job=\"rwa-backend\"} |~ \"error|Error|ERROR|warn|Warn|WARN\" | json",
"refId": "A"
}
],
"title": "Error & Warning Logs",
"type": "logs"
}
],
"refresh": "10s",
"schemaVersion": 39,
"tags": ["rwa", "microservices", "overview"],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "browser",
"title": "RWA Services Overview",
"uid": "rwa-services-overview",
"version": 1,
"weekStart": ""
}


@ -0,0 +1,104 @@
# =============================================================================
# Grafana Datasources - automatically provisioned data sources
# =============================================================================
apiVersion: 1

datasources:
  # ===========================================================================
  # Prometheus - metrics data source
  # ===========================================================================
  - name: Prometheus
    type: prometheus
    # Fixed uid: the dashboards and the Jaeger tracesToMetrics link refer to
    # this data source as "prometheus".
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: 2.49.1

  # ===========================================================================
  # Loki - logs data source
  # ===========================================================================
  - name: Loki
    type: loki
    # Fixed uid: the dashboards and the Jaeger tracesToLogsV2 link refer to
    # this data source as "loki".
    uid: loki
    access: proxy
    url: http://loki:3100
    editable: false
    jsonData:
      maxLines: 1000
      derivedFields:
        # Extract the trace ID from log lines and link to Jaeger.
        # The regex accepts both "trace_id" and "traceId" because Promtail
        # reads the ID from the "traceId" JSON key.
        # When datasourceUid is set, url is interpreted as the query sent to
        # that data source, so it holds only the trace ID.
        - name: TraceID
          matcherRegex: '"(?:trace_id|traceId)":"([a-f0-9]+)"'
          url: '$${__value.raw}'
          datasourceUid: jaeger
          urlDisplayLabel: View Trace

  # ===========================================================================
  # Jaeger - distributed tracing data source
  # ===========================================================================
  - name: Jaeger
    type: jaeger
    uid: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: false
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
      nodeGraph:
        enabled: true
      traceQuery:
        timeShiftEnabled: true
        spanStartTimeShift: '1h'
        spanEndTimeShift: '-1h'

  # ===========================================================================
  # Kong Prometheus (if the api-gateway's Prometheus is deployed separately)
  # ===========================================================================
  - name: Kong-Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.1.100:9099
    editable: false
    jsonData:
      httpMethod: POST

  # ===========================================================================
  # PostgreSQL - query the database directly (optional)
  # ===========================================================================
  # Grafana provisioning expands $VAR and ${VAR}, but not the shell-style
  # ${VAR:-default} form. Set POSTGRES_USER (for example rwa_user) and
  # POSTGRES_PASSWORD in the Grafana container environment.
  - name: PostgreSQL-RWA
    type: postgres
    access: proxy
    url: 192.168.1.111:5432
    user: ${POSTGRES_USER}
    editable: false
    jsonData:
      database: rwa_identity
      sslmode: disable
      maxOpenConns: 5
      maxIdleConns: 2
      connMaxLifetime: 14400
    secureJsonData:
      password: ${POSTGRES_PASSWORD}

  # ===========================================================================
  # Redis - cache monitoring (requires the Redis plugin)
  # ===========================================================================
  # - name: Redis
  #   type: redis-datasource
  #   access: proxy
  #   url: redis://192.168.1.111:6379
  #   editable: false


@ -0,0 +1,81 @@
# =============================================================================
# Loki Configuration - log aggregation system
# =============================================================================
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache

ruler:
  alertmanager_url: http://localhost:9093
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  ring:
    kvstore:
      store: inmemory
  enable_api: true

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7 days
  max_cache_freshness_per_query: 10m
  split_queries_by_interval: 15m
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 15MB
  # 14 days retention. With the TSDB index, retention is enforced by the
  # compactor and read from limits_config; this replaces the former
  # table_manager.retention_period, which does not apply to TSDB.
  # The former chunk_store_config.max_look_back_period (0s = unlimited) is
  # dropped: Loki 3.x, which schema v13 and delete_request_store imply,
  # no longer accepts that key. Use max_query_lookback here if a limit is needed.
  retention_period: 336h

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: filesystem


@ -0,0 +1,129 @@
# =============================================================================
# Promtail Configuration - log collection agent
# =============================================================================
# Features:
# - Automatically discover Docker container logs
# - Parse and label log lines
# - Push to Loki
# =============================================================================
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: rwa

scrape_configs:
  # ===========================================================================
  # Docker container log collection
  # ===========================================================================
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["com.docker.compose.project"]
    relabel_configs:
      # Use the container name as a log label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      # Compose project name
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'project'
      # Compose service name
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'service'
      # Container ID
      - source_labels: ['__meta_docker_container_id']
        target_label: 'container_id'
    pipeline_stages:
      # Parse JSON-formatted logs (NestJS default format)
      - json:
          expressions:
            level: level
            message: message
            context: context
            timestamp: timestamp
            trace_id: traceId
            span_id: spanId
      # Promote low-cardinality fields to labels
      - labels:
          level:
          context:
      # trace_id and span_id are unique per request; as labels they would
      # create one stream per trace and exhaust Loki's stream limits. They are
      # attached as structured metadata instead (assumes Promtail 2.9+ and a
      # Loki schema that supports it, such as v13).
      - structured_metadata:
          trace_id:
          span_id:
      # Timestamp parsing
      - timestamp:
          source: timestamp
          format: RFC3339Nano
          fallback_formats:
            - RFC3339
            - UnixMs
      # Drop health-check logs (optional, reduces noise).
      # The pattern only matches URL path segments such as /health or /ready,
      # so ordinary messages containing words like "already" or "delivered"
      # are kept. Adjust it to the services' actual access-log format.
      - match:
          selector: '{service=~".+"}'
          stages:
            - drop:
                expression: '.*/(health|healthcheck|ready|live)\b.*'
                drop_counter_reason: healthcheck_noise

  # ===========================================================================
  # RWA microservice logs (collected directly from the host)
  # ===========================================================================
  - job_name: rwa-services
    static_configs:
      - targets:
          - localhost
        labels:
          job: rwa-backend
          __path__: /var/log/rwa/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            context: context
            service: service
      - labels:
          level:
          context:
          service:

  # ===========================================================================
  # System logs (optional)
  # ===========================================================================
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: system
          __path__: /var/log/syslog
    pipeline_stages:
      - regex:
          expression: '^(?P<timestamp>\w+\s+\d+\s+\d+:\d+:\d+)\s+(?P<host>\S+)\s+(?P<process>\S+):\s+(?P<message>.*)$'
      - labels:
          host:
          process:
      # Syslog pads single-digit days with a space ("Dec  6"), which the Go
      # layout expresses as "_2"; "02" would only match zero-padded days.
      - timestamp:
          source: timestamp
          format: 'Jan _2 15:04:05'


@ -0,0 +1,147 @@
# =============================================================================
# Prometheus Configuration - RWA microservice monitoring
# =============================================================================
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'rwa-production'
    env: 'production'

# Alert rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager configuration (optional)
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets:
#             - alertmanager:9093

scrape_configs:
  # ===========================================================================
  # Prometheus self-monitoring
  # ===========================================================================
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # ===========================================================================
  # Kong API Gateway monitoring
  # ===========================================================================
  - job_name: 'kong'
    static_configs:
      - targets: ['192.168.1.100:8001']
    metrics_path: /metrics
    scrape_interval: 10s

  # ===========================================================================
  # RWA microservice monitoring
  # ===========================================================================
  - job_name: 'identity-service'
    static_configs:
      - targets: ['192.168.1.111:3000']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'wallet-service'
    static_configs:
      - targets: ['192.168.1.111:3001']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'backup-service'
    static_configs:
      - targets: ['192.168.1.111:3002']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'planting-service'
    static_configs:
      - targets: ['192.168.1.111:3003']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'referral-service'
    static_configs:
      - targets: ['192.168.1.111:3004']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'reward-service'
    static_configs:
      - targets: ['192.168.1.111:3005']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'mpc-service'
    static_configs:
      - targets: ['192.168.1.111:3006']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'leaderboard-service'
    static_configs:
      - targets: ['192.168.1.111:3007']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'reporting-service'
    static_configs:
      - targets: ['192.168.1.111:3008']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'authorization-service'
    static_configs:
      - targets: ['192.168.1.111:3009']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'admin-service'
    static_configs:
      - targets: ['192.168.1.111:3010']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  - job_name: 'presence-service'
    static_configs:
      - targets: ['192.168.1.111:3011']
    metrics_path: /api/v1/metrics
    scrape_interval: 15s

  # ===========================================================================
  # Infrastructure monitoring
  # ===========================================================================
  - job_name: 'consul'
    static_configs:
      - targets: ['consul:8500']
    metrics_path: /v1/agent/metrics
    params:
      format: ['prometheus']

  - job_name: 'jaeger'
    static_configs:
      - targets: ['jaeger:14269']
    metrics_path: /metrics

  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']
    metrics_path: /metrics

  # ===========================================================================
  # Docker container monitoring (requires cAdvisor)
  # ===========================================================================
  # - job_name: 'cadvisor'
  #   static_configs:
  #     - targets: ['cadvisor:8080']

  # ===========================================================================
  # Node Exporter (host monitoring)
  # ===========================================================================
  # - job_name: 'node'
  #   static_configs:
  #     - targets: ['192.168.1.111:9100', '192.168.1.100:9100']


@ -0,0 +1,126 @@
# =============================================================================
# RWA microservice alert rules
# =============================================================================
groups:
  # ===========================================================================
  # Service availability alerts
  # ===========================================================================
  - name: service-availability
    rules:
      - alert: ServiceDown
        expr: up{job=~".*-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable: {{ $labels.job }}"
          description: "Service {{ $labels.job }} has not responded for more than 1 minute"

      - alert: ServiceHighErrorRate
        expr: |
          sum(rate(http_requests_total{job=~".*-service", status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total{job=~".*-service"}[5m])) by (job)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High service error rate: {{ $labels.job }}"
          description: "The 5xx error rate of service {{ $labels.job }} exceeds 5%"

      - alert: ServiceHighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=~".*-service"}[5m])) by (le, job))
          > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High service response latency: {{ $labels.job }}"
          description: "The P95 response time of service {{ $labels.job }} exceeds 2 seconds"

  # ===========================================================================
  # Kong gateway alerts
  # ===========================================================================
  - name: kong-gateway
    rules:
      - alert: KongHighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(kong_latency_bucket{type="request"}[5m])) by (le))
          > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Kong gateway latency"
          description: "Kong gateway P99 latency exceeds 5 seconds"

      - alert: KongHighErrorRate
        expr: |
          sum(rate(kong_http_status{code=~"5.."}[5m]))
          /
          sum(rate(kong_http_status[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High Kong gateway error rate"
          description: "Kong gateway 5xx error rate exceeds 1%"

      # rate() returns a per-second value, so it is multiplied by 60 to match
      # the per-minute threshold stated in the description.
      - alert: KongRateLimitTriggered
        expr: sum(rate(kong_rate_limiting_requests_total{status="over_limit"}[5m])) * 60 > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting triggered"
          description: "More than 10 requests per minute are being rate limited"

  # ===========================================================================
  # Infrastructure alerts
  # ===========================================================================
  - name: infrastructure
    rules:
      # Note: consul_health_service_status is exposed by consul_exporter, not
      # by Consul's built-in /v1/agent/metrics endpoint, so this rule only
      # fires if that exporter is also scraped.
      - alert: ConsulServiceUnhealthy
        expr: consul_health_service_status{status!="passing"} > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Consul service health check failing: {{ $labels.service }}"
          description: "The Consul health check for service {{ $labels.service }} is not passing"

      - alert: LokiIngestionErrors
        expr: sum(rate(loki_distributor_bytes_received_total[5m])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Loki is not receiving logs"
          description: "Loki has not received any log data in the last 5 minutes"

  # ===========================================================================
  # Business metric alerts (examples)
  # ===========================================================================
  - name: business-metrics
    rules:
      - alert: LowDAU
        expr: rwa_presence_dau < 100
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Low daily active users"
          description: "Current DAU is only {{ $value }}"

      - alert: HighPendingRewards
        expr: rwa_rewards_pending_count > 10000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Backlog of pending rewards"
          description: "The number of pending rewards exceeds 10000"