docs: add API architecture and usage guide
parent 1a8fc81549
commit 34650cd09a

@@ -0,0 +1,431 @@

# Open-Sora Inference API: Architecture and Usage Guide

## 1. Overall Architecture

```
                      ┌───────────────────────────────────────────────┐
                      │         Training server 89.185.24.182         │
                      │                                               │
  External requests   │  ┌─────────────────────────────────────────┐  │
  ─────────────► :80  │  │  NGINX (reverse proxy + rate limiting)  │  │
                      │  │  • Rate limit: 5 requests/min per IP    │  │
                      │  │  • Timeout 3600s (for long generations) │  │
                      │  └────────────────────┬────────────────────┘  │
                      │                       │ :8000                 │
                      │  ┌────────────────────▼────────────────────┐  │
                      │  │  Gunicorn + FastAPI (4 workers)         │  │
                      │  │  • POST /v1/generate                    │  │
                      │  │  • GET  /v1/jobs/{id}                   │  │
                      │  │  • GET  /v1/videos/{id}                 │  │
                      │  │  • GET  /health                         │  │
                      │  └────────────────────┬────────────────────┘  │
                      │                       │                       │
                      │  ┌────────────────────▼────────────────────┐  │
                      │  │  Redis (task queue + result store)      │  │
                      │  │  • AOF persistence (survives restarts)  │  │
                      │  │  • Results kept for 24 hours            │  │
                      │  └──┬──┬──┬──┬──┬──┬──┬──┬─────────────────┘  │
                      │     │  │  │  │  │  │  │  │                    │
                      │  ┌──▼──▼──▼──▼──▼──▼──▼──▼─────────────────┐  │
                      │  │  8× Celery Worker (GPU 0–7)             │  │
                      │  │  • One dedicated A100 80GB per worker   │  │
                      │  │  • concurrency=1, serial tasks per GPU  │  │
                      │  │  • Auto-restart every 10 tasks          │  │
                      │  │    (prevents fragmentation)             │  │
                      │  └──┬──────────────────────────────────────┘  │
                      │     │ subprocess                              │
                      │  ┌──▼──────────────────────────────────────┐  │
                      │  │  torchrun --nproc_per_node=1            │  │
                      │  │  scripts/diffusion/inference.py         │  │
                      │  │  • Each inference fully isolated        │  │
                      │  │  • A crash leaves other GPUs unaffected │  │
                      │  └──┬──────────────────────────────────────┘  │
                      │     │                                         │
                      │  ┌──▼──────────────────────────────────────┐  │
                      │  │  /data/train-output/api-outputs/        │  │
                      │  │  Video storage (written to Huawei NVMe) │  │
                      │  └─────────────────────────────────────────┘  │
                      └───────────────────────────────────────────────┘
```

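The bottom half of the diagram (a Worker pinned to one GPU launching `torchrun` as a subprocess) can be sketched roughly as follows. This is an illustration only, not the contents of `api/tasks.py`: the function names, the config path, and the `--save-dir` / `--prompt` arguments passed to `inference.py` are assumptions, and pinning the GPU through `CUDA_VISIBLE_DEVICES` is one common way to give a process a single device.

```python
import os
import shlex
from typing import Dict, List, Optional


def build_inference_command(
    prompt: str,
    save_dir: str,
    config_path: str = "configs/inference.py",  # hypothetical config path
) -> List[str]:
    """Build the torchrun command line for one single-GPU inference run."""
    return [
        "torchrun",
        "--nproc_per_node=1",
        "scripts/diffusion/inference.py",
        config_path,
        "--save-dir", save_dir,   # assumed CLI argument of inference.py
        "--prompt", prompt,       # assumed CLI argument of inference.py
    ]


def build_worker_env(gpu_id: int, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


if __name__ == "__main__":
    cmd = build_inference_command("a red ball bouncing", "/tmp/api-outputs/demo")
    # A real task would run something like:
    #   subprocess.run(cmd, env=build_worker_env(3), check=True, timeout=3600)
    print(shlex.join(cmd))
```

Because the command runs in its own process, an out-of-memory error or crash ends only that process; the Celery Worker that launched it can record the failure and move on to the next task.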
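### Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Invoke torchrun via subprocess | Process-level isolation: a crash on one GPU does not affect the rest of the service |
| One GPU dedicated to each Worker | Avoids GPU memory contention and maximizes concurrent throughput (8 concurrent tasks) |
| Celery `task_acks_late=True` | If a Worker crashes, the task is re-queued rather than lost |
| Redis AOF persistence | Task state survives a server restart |
| systemd supervises all processes | Any crashed service is restarted automatically within 5 seconds |
| NGINX rate limit of 5r/m | Prevents a single caller from filling the GPU queue |
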
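As a rough illustration of how the Celery-related decisions above map onto configuration, the following sketch collects them in a plain dictionary. Only `task_acks_late`, the concurrency of 1, the restart after 10 tasks, and the 24-hour result retention come from this document; `task_reject_on_worker_lost`, `worker_prefetch_multiplier`, and the Redis `visibility_timeout` value are assumptions added here because they commonly accompany late acknowledgement, and the actual `api/config.py` may differ.

```python
# Hypothetical settings; in a real app they would be applied with
# `celery_app.conf.update(**CELERY_SETTINGS)`.
CELERY_SETTINGS = {
    # Acknowledge a task only after it finishes, so an unfinished task
    # can be redelivered if the Worker dies.
    "task_acks_late": True,
    # Assumption: also re-queue when the child process is killed mid-task.
    "task_reject_on_worker_lost": True,
    # One task at a time per Worker, i.e. per GPU.
    "worker_concurrency": 1,
    # Assumption: do not reserve extra tasks while the GPU is busy.
    "worker_prefetch_multiplier": 1,
    # Recycle the worker process after 10 tasks.
    "worker_max_tasks_per_child": 10,
    # Keep results for 24 hours (value in seconds).
    "result_expires": 24 * 60 * 60,
    # Assumption: with a Redis broker, unacknowledged tasks are redelivered
    # after this many seconds, so it should exceed the longest job.
    "broker_transport_options": {"visibility_timeout": 2 * 60 * 60},
}

if __name__ == "__main__":
    for key, value in CELERY_SETTINGS.items():
        print(f"{key} = {value!r}")
```

The visibility timeout matters for long jobs: if it were shorter than a 768px generation, the broker could hand the same task to a second Worker while the first is still running.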
---

## 2. Service Components

### File Layout

```
my-sora/
├── api/
│   ├── config.py                 # Paths, GPU count, and other settings
│   ├── schemas.py                # Pydantic request/response models
│   ├── tasks.py                  # Celery tasks (core inference logic)
│   └── main.py                   # FastAPI routes
├── deploy/
│   ├── nginx.conf                # NGINX site configuration
│   ├── opensora-limit.conf       # Rate-limit zone definition (http context)
│   ├── opensora-api.service      # FastAPI systemd unit
│   ├── opensora-worker@.service  # Worker systemd template (GPU 0-7)
│   └── setup.sh                  # One-step deployment script
└── requirements-api.txt
```

### Process List

| Process | Count | systemd unit |
|---------|-------|--------------|
| NGINX | 1 | nginx.service |
| Redis | 1 | redis-server.service |
| Gunicorn (FastAPI) | 1×4 workers | opensora-api.service |
| Celery Worker | 8 (GPU 0-7) | opensora-worker@{0-7}.service |

---

## 3. API Endpoints

### Basics

- **Base URL**: `http://89.185.24.182`
- **Authentication**: none (internal network use only)
- **Content-Type**: `application/json`
- **Encoding**: UTF-8

---

### 3.1 Health Check

```
GET /health
```

**Example response:**

```json
{
  "status": "ok",
  "time": "2026-03-06T12:00:00.000000+00:00"
}
```

---

### 3.2 Submit a Generation Task

```
POST /v1/generate
```

**Request parameters:**

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `prompt` | string | ✅ | (none) | Text prompt, 1–2000 characters |
| `resolution` | string | | `"256px"` | Resolution: `"256px"` or `"768px"` |
| `aspect_ratio` | string | | `"16:9"` | Aspect ratio: `"16:9"`, `"9:16"`, `"1:1"`, or `"2.39:1"` |
| `num_frames` | int | | `49` | Frame count; suggested values are `49` (≈2s), `97` (≈4s), and `129` (≈5s) |
| `motion_score` | int | | `4` | Motion intensity, 1–7: 1 = nearly static, 7 = intense motion |
| `num_steps` | int | | `50` | Diffusion steps, 10–100; more steps give higher quality but take longer |
| `seed` | int | | random | Random seed; fix it to get reproducible results |
| `cond_type` | string | | `"t2v"` | `"t2v"` (text-to-video) or `"i2v_head"` (image-to-video) |

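The constraints in the parameter table can be checked on the client before a request is sent, which avoids spending one of the five per-minute requests on a payload the server would reject. The sketch below uses only the standard library and is a hypothetical client-side helper; the server's own validation lives in its Pydantic models (`api/schemas.py`) and may be stricter or differ in detail.

```python
from typing import List

ALLOWED_RESOLUTIONS = {"256px", "768px"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "2.39:1"}
ALLOWED_COND_TYPES = {"t2v", "i2v_head"}


def validate_request(payload: dict) -> List[str]:
    """Return a list of problems found in the payload (empty if it looks valid)."""
    errors = []

    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not 1 <= len(prompt) <= 2000:
        errors.append("prompt must be a string of 1-2000 characters")

    # Optional fields fall back to the documented defaults.
    if payload.get("resolution", "256px") not in ALLOWED_RESOLUTIONS:
        errors.append("resolution must be '256px' or '768px'")
    if payload.get("aspect_ratio", "16:9") not in ALLOWED_ASPECT_RATIOS:
        errors.append("aspect_ratio is not one of the supported ratios")
    if payload.get("cond_type", "t2v") not in ALLOWED_COND_TYPES:
        errors.append("cond_type must be 't2v' or 'i2v_head'")

    motion = payload.get("motion_score", 4)
    if not isinstance(motion, int) or not 1 <= motion <= 7:
        errors.append("motion_score must be an integer from 1 to 7")

    steps = payload.get("num_steps", 50)
    if not isinstance(steps, int) or not 10 <= steps <= 100:
        errors.append("num_steps must be an integer from 10 to 100")

    return errors


if __name__ == "__main__":
    print(validate_request({"prompt": "a red ball bouncing", "num_steps": 20}))
    print(validate_request({"prompt": "", "motion_score": 9}))
```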
**Example request:**

```bash
curl -X POST http://89.185.24.182/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a golden retriever running on the beach at sunset, slow motion",
    "resolution": "256px",
    "aspect_ratio": "16:9",
    "num_frames": 49,
    "motion_score": 5,
    "num_steps": 50
  }'
```

**Example response (HTTP 202):**

```json
{
  "job_id": "3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a",
  "message": "任务已提交"
}
```

---

### 3.3 Query Task Status

```
GET /v1/jobs/{job_id}
```

**Response fields:**

| Field | Description |
|-------|-------------|
| `job_id` | Task ID |
| `status` | `pending` (queued) / `processing` (generating) / `completed` (done) / `failed` (failed) |
| `video_url` | Download path of the finished video (present only when `status` is `completed`) |
| `error` | Failure reason (present only when `status` is `failed`) |
| `completed_at` | Completion time (ISO 8601) |

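A small helper can turn a status response into a one-line summary and make the field rules from the table explicit. This is a minimal sketch with a hypothetical function name; it assumes `completed_at` uses the ISO 8601 form shown in the examples in this section, which `datetime.fromisoformat` can parse.

```python
from datetime import datetime

TERMINAL_STATUSES = {"completed", "failed"}


def summarize_job(data: dict) -> str:
    """Describe a /v1/jobs/{job_id} response in one line."""
    status = data["status"]
    if status == "completed":
        # completed_at is an ISO 8601 timestamp with a UTC offset.
        finished = datetime.fromisoformat(data["completed_at"])
        return f"completed at {finished.isoformat()} -> {data['video_url']}"
    if status == "failed":
        return f"failed: {data['error']}"
    # pending or processing: not terminal yet, keep polling.
    return f"still {status}"


if __name__ == "__main__":
    print(summarize_job({"job_id": "demo", "status": "processing",
                         "video_url": None, "error": None, "completed_at": None}))
```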
**Polling example:**

```bash
curl http://89.185.24.182/v1/jobs/3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a
```

**Response while generating:**

```json
{
  "job_id": "3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a",
  "status": "processing",
  "video_url": null,
  "error": null,
  "completed_at": null
}
```

**Response when completed:**

```json
{
  "job_id": "3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a",
  "status": "completed",
  "video_url": "/v1/videos/3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a",
  "error": null,
  "completed_at": "2026-03-06T12:01:05.123456+00:00"
}
```

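Note that `video_url` in the completed response is a path, not a full URL, so a client has to combine it with the Base URL before downloading. A minimal sketch using the standard library (the helper name is hypothetical):

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "http://89.185.24.182"


def absolute_video_url(job: dict, base_url: str = BASE_URL) -> Optional[str]:
    """Return the full download URL for a completed job, or None if there is none yet."""
    path = job.get("video_url")
    return urljoin(base_url, path) if path else None


if __name__ == "__main__":
    job = {"status": "completed",
           "video_url": "/v1/videos/3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a"}
    print(absolute_video_url(job))
```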
---

### 3.4 Download Video

```
GET /v1/videos/{job_id}
```

Returns a `video/mp4` file stream; the file name is `{job_id}.mp4`.

```bash
# -o sets the local file name explicitly (plain -O would save it without the .mp4 extension)
curl -o 3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a.mp4 \
  http://89.185.24.182/v1/videos/3b7e2a1c-4f8d-4b9e-a12c-8d7f3e6b2c1a
```

---

## 4. Usage Examples

### Python Client (Full Polling Flow)

```python
import time

import requests

BASE_URL = "http://89.185.24.182"


def generate_video(prompt: str, **kwargs) -> str:
    """Submit a job, wait for it to finish, and return the local video path."""
    # 1. Submit the job
    resp = requests.post(
        f"{BASE_URL}/v1/generate", json={"prompt": prompt, **kwargs}, timeout=30
    )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    print(f"Job submitted: {job_id}")

    # 2. Poll until the job finishes (one query every 10 seconds)
    while True:
        status_resp = requests.get(f"{BASE_URL}/v1/jobs/{job_id}", timeout=30)
        status_resp.raise_for_status()
        data = status_resp.json()
        status = data["status"]
        print(f"Status: {status}")

        if status == "completed":
            break
        if status == "failed":
            raise RuntimeError(f"Generation failed: {data['error']}")

        time.sleep(10)

    # 3. Download the video in chunks
    video_resp = requests.get(
        f"{BASE_URL}/v1/videos/{job_id}", stream=True, timeout=30
    )
    video_resp.raise_for_status()
    local_path = f"{job_id}.mp4"
    with open(local_path, "wb") as f:
        for chunk in video_resp.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Video saved: {local_path}")
    return local_path


if __name__ == "__main__":
    generate_video(
        prompt="a panda eating bamboo in a misty forest, cinematic",
        resolution="256px",
        num_frames=49,
        motion_score=4,
    )
```

### Shell Commands (Quick Test)

```bash
# Submit -> capture job_id -> wait 90 seconds -> download
JOB=$(curl -s -X POST http://89.185.24.182/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"ocean waves at sunset","num_frames":49}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['job_id'])")
echo "job_id: $JOB"
# The fixed wait assumes the job finishes within 90 seconds; if it has not, the download fails
sleep 90
curl -o "$JOB.mp4" "http://89.185.24.182/v1/videos/$JOB"
```

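A fixed `sleep 90` is convenient for a quick test but breaks whenever the queue is busy or the job is slow. A more robust client polls until a terminal status or a deadline, whichever comes first. The sketch below keeps the HTTP call out of the loop by taking a `fetch_status` callable, so it can be exercised without a server; the function and parameter names are illustrative, not part of the API.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout_s: float = 3600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status() until the job is completed or failed, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        data = fetch_status()
        if data["status"] in ("completed", "failed"):
            return data
        # Give up if the next poll would fall after the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"job still {data['status']} after {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated server: pending, then processing, then completed.
    responses = iter([{"status": "pending"}, {"status": "processing"},
                      {"status": "completed", "video_url": "/v1/videos/demo"}])
    print(wait_for_job(lambda: next(responses), sleep=lambda s: None))
```

With the real API, `fetch_status` could be `lambda: requests.get(f"{BASE_URL}/v1/jobs/{job_id}", timeout=30).json()`.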
---

## 5. Performance Reference

These figures are derived from the official H100 benchmark data; the A100 80GB is somewhat slower, so treat them as rough guidance:

| Resolution | Frames | GPUs | Estimated time | Peak GPU memory |
|------------|--------|------|----------------|-----------------|
| 256px | 49 | 1 | ~60s | ~52GB |
| 256px | 129 | 1 | ~90s | ~55GB |
| 768px | 49 | 1 | ~900s | ~62GB |
| 768px | 129 | 8 | ~350s | ~44GB per GPU |

> **Maximum concurrency**: eight 256px tasks can run at the same time (one per GPU). For high-quality 768px videos, run tasks serially or reduce concurrency.

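The table can be turned into a back-of-the-envelope throughput estimate. The arithmetic below is a sketch that assumes all eight GPUs stay busy with identical single-GPU jobs and ignores queueing and process start-up time; it also compares the result with the NGINX limit of 5 requests per minute per IP.

```python
def videos_per_hour(gpus: int, seconds_per_video: float) -> float:
    """Upper bound on throughput when every GPU runs one job at a time."""
    return gpus * 3600 / seconds_per_video


RATE_LIMIT_PER_MINUTE = 5  # NGINX limit per client IP

cluster_256px = videos_per_hour(8, 60)     # 256px, 49 frames, ~60s each
cluster_768px = videos_per_hour(8, 900)    # 768px, 49 frames, ~900s each
per_ip_cap = RATE_LIMIT_PER_MINUTE * 60    # submissions per hour from one IP

if __name__ == "__main__":
    print(f"256px/49 frames: up to {cluster_256px:.0f} videos/hour")
    print(f"768px/49 frames: up to {cluster_768px:.0f} videos/hour")
    print(f"Single-IP submission cap: {per_ip_cap} requests/hour")
```

Under these assumptions a single caller can submit at most 300 jobs per hour, below the roughly 480 per hour the cluster could process at 256px/49 frames, but far more than the roughly 32 per hour it could process at 768px/49 frames, so a 768px workload from one caller can still build up a long queue.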
---

## 6. Test Procedure

### 6.1 Environment Readiness Check

```bash
# All services should report active
ssh ceshi@89.185.24.182 "sudo systemctl is-active opensora-api opensora-worker@{0..7} nginx redis-server"

# API health check
curl http://89.185.24.182/health

# Redis connectivity
ssh ceshi@89.185.24.182 "redis-cli ping"

# Confirm the weight files are present
ssh ceshi@89.185.24.182 "ls -lh /data/train-input/ckpts/*.safetensors"
```

### 6.2 Smoke Test (First-Time Verification)

```bash
# Submit the simplest possible task
curl -X POST http://89.185.24.182/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a red ball bouncing", "num_frames": 17, "num_steps": 20}'
```

Poll the job until its status is `completed`; this is expected to take 30–60 seconds.

### 6.3 Concurrency Test (All 8 GPUs Busy)

```bash
# Submit 8 tasks at once to verify 8-GPU concurrency
for i in $(seq 1 8); do
  curl -s -X POST http://89.185.24.182/v1/generate \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"test video $i, nature scene\", \"num_frames\": 17, \"num_steps\": 10}" &
done
wait
echo "8 tasks submitted"
```

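When this test is run through NGINX from a single IP, the 5 requests/minute limit from section 1 applies to these eight submissions too. Whether the rate-limit zone allows a burst is not stated in this document; if it does not, a rate of 5r/m admits roughly one request every 12 seconds and the rest are rejected (NGINX's default rejection status is 503). The sketch below computes a submission schedule that stays within such a limit; the function name is hypothetical and the zero-burst behaviour is an assumption about the deployed `opensora-limit.conf`.

```python
from typing import List


def submission_offsets(n_requests: int, rate_per_minute: float) -> List[float]:
    """Seconds after the first request at which each request may be sent."""
    spacing = 60.0 / rate_per_minute
    return [i * spacing for i in range(n_requests)]


if __name__ == "__main__":
    offsets = submission_offsets(8, 5)
    print(offsets)
    # A client would sleep until each offset before POSTing to /v1/generate,
    # for example with time.sleep(offset - elapsed) inside a loop.
```

With eight requests spaced this way, the last submission goes out 84 seconds after the first, so short test jobs on the first GPUs may already have finished by then; to observe all eight GPUs busy at once, the test can be run from several client IPs or directly against port 8000 on the server.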
### 6.4 Failure Recovery Test

```bash
# Simulate an API service crash followed by automatic recovery
sudo systemctl kill opensora-api
sleep 8
curl http://89.185.24.182/health   # should return 200 again within 5-10 seconds

# Simulate a crash of one Worker followed by recovery
sudo systemctl kill opensora-worker@3
sleep 15
sudo systemctl is-active opensora-worker@3   # should be active
```

---

## 7. Operations

### Viewing Logs

```bash
# Live API service logs
sudo journalctl -u opensora-api -f

# GPU 0 Worker log
tail -f /data/train-output/logs/worker-gpu0.log

# NGINX access log
tail -f /data/train-output/logs/api-access.log
```

### Restarting Services

```bash
# Restart the API
sudo systemctl restart opensora-api

# Restart a single Worker
sudo systemctl restart opensora-worker@2

# Restart all Workers
for i in $(seq 0 7); do sudo systemctl restart opensora-worker@$i; done

# Restart the full stack
sudo systemctl restart opensora-api redis-server nginx
for i in $(seq 0 7); do sudo systemctl restart opensora-worker@$i; done
```

### Disk Cleanup

Video files are stored in `/data/train-output/api-outputs/`; each task takes up roughly 50–300MB.

```bash
# Check disk usage
du -sh /data/train-output/api-outputs/

# Delete videos older than 7 days
# (-mindepth 1 keeps the api-outputs/ directory itself from being matched and removed)
find /data/train-output/api-outputs/ -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
```

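Before running a destructive `find ... -exec rm -rf`, it can help to list what would be removed. The following sketch approximates the same selection in Python as a dry run: it returns the immediate subdirectories whose modification time is more than 7 days old and deletes nothing. It assumes, as the `-type d` filter in the command above suggests, that each task writes into its own subdirectory; the function names are illustrative.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def stale_job_dirs(root: Path, max_age_days: float = 7,
                   now: Optional[float] = None) -> List[Path]:
    """Return immediate subdirectories of root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.stat().st_mtime < cutoff
    )


def demo() -> List[str]:
    """Build a throwaway tree with one old and one fresh directory and list the stale one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "old-job").mkdir()
        (root / "new-job").mkdir()
        eight_days_ago = time.time() - 8 * 86400
        os.utime(root / "old-job", (eight_days_ago, eight_days_ago))
        return [p.name for p in stale_job_dirs(root)]


if __name__ == "__main__":
    print(demo())
```

On the server, the same function could be pointed at `Path("/data/train-output/api-outputs")`, and the listed directories removed with `shutil.rmtree` once the output has been reviewed.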
### Checking GPU Usage

```bash
# Monitor GPU memory on every GPU in real time
watch -n 2 nvidia-smi --query-gpu=index,name,memory.used,memory.free,utilization.gpu --format=csv
```

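The same query can be consumed by a script, for example to spot GPUs that are idle while tasks sit in `pending`. The sketch below parses text in the shape produced when `noheader,nounits` is appended to `--format=csv` (an assumption about how the command would be invoked for scripting); the sample rows are made-up illustrative values, not measurements from this server.

```python
import csv
import io
from typing import Dict, List

# Illustrative sample only, in the shape of:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.free,utilization.gpu \
#              --format=csv,noheader,nounits
SAMPLE = """\
0, NVIDIA A100-SXM4-80GB, 53248, 28672, 97
1, NVIDIA A100-SXM4-80GB, 3, 81917, 0
"""


def parse_gpu_csv(text: str) -> List[Dict[str, object]]:
    """Parse headerless, unitless nvidia-smi CSV rows into dictionaries."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        index, name, used, free, util = fields
        rows.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(used),
            "memory_free_mib": int(free),
            "utilization_pct": int(util),
        })
    return rows


def idle_gpus(rows: List[Dict[str, object]], max_used_mib: int = 1024) -> List[int]:
    """Indices of GPUs whose used memory is at or below the threshold."""
    return [r["index"] for r in rows if r["memory_used_mib"] <= max_used_mib]


if __name__ == "__main__":
    print(idle_gpus(parse_gpu_csv(SAMPLE)))
```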
---

## 8. FAQ

| Symptom | Cause | Fix |
|---------|-------|-----|
| Task stays `pending` for a long time | All Workers are busy | Wait, or reduce the number of concurrent requests |
| Task is `failed` with an OOM error | Not enough GPU memory for 768px | Switch to 256px or reduce `num_frames` |
| `/health` returns 502 | The API service is not running | `sudo systemctl restart opensora-api` |
| Task is `failed` with a weights-not-found error | The weights have not finished downloading | Check the `/data/train-input/ckpts/` directory |
| Worker keeps restarting | Model loading failed | `tail /data/train-output/logs/worker-gpu0.log` |