# How to reproduce the GPT-OSS results with SGLang
### Install the latest SGLang
```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.5.1.post3

pip install --upgrade pip
pip install -e "python[all]"
```
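
To confirm the editable install picked up the tagged release, print the version; a minimal sanity check, assuming the package exposes the usual `__version__` attribute:

```bash
# Print the installed SGLang version; it should correspond to v0.5.1.post3.
python3 -c "import sglang; print(sglang.__version__)"
```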

### Reproduce the throughput benchmark results (Batch Size 1)
Launch Command
```bash
# MXFP4 120B on H100
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8 --attention-backend triton

# BF16 120B on H100
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 8 --attention-backend triton

# MXFP4 120B on B200
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 4

# BF16 120B on B200
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 4
```
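
Loading the 120B weights takes a while; before benchmarking, you can poll the server until it is ready. A small readiness loop, assuming the `/health` endpoint that SGLang's HTTP server exposes:

```bash
# Block until the server responds on /health (assumed SGLang health route).
until curl -sf http://localhost:30000/health > /dev/null; do
    sleep 5
done
echo "Server is ready"
```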
Benchmark Command
```bash
# MXFP4 120B on H100; the same command works for the other servers above,
# since --base-url targets whichever server is running.
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 1 --input-len 1024 --output-len 512 --show-report
```
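
`--batch-size`, `--input-len`, and `--output-len` each accept multiple values (the Batch Size 32 command below already passes two input lengths), so a single invocation can sweep several configurations. A sketch:

```bash
# Hypothetical sweep over two batch sizes and two input lengths in one run,
# assuming all three flags take multiple values as --input-len does below.
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 1 32 --input-len 1024 8192 --output-len 512 --show-report
```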

### Reproduce the throughput benchmark results (Batch Size 32)
Launch Command
```bash
# MXFP4 120B on H100
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8

# BF16 120B on H100
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 8

# MXFP4 120B on B200
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 4

# BF16 120B on B200
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 4
```
Benchmark Command
```bash
# Two input lengths are passed, so this sweeps both 1024- and 8192-token prompts.
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 32 --input-len 1024 8192 --output-len 512 --show-report
```

### Reproduce the evaluation results
Install gpt-oss
```bash
git clone https://github.com/openai/gpt-oss.git
cd gpt-oss
pip install -e .
```
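
A quick import check confirms the editable install is visible to the interpreter (the module path matches the `python -m gpt_oss.evals` invocation below):

```bash
# Should exit silently if gpt-oss installed correctly.
python -c "import gpt_oss.evals"
```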
Evaluation Command
```bash
DATASET=gpqa
BASE_URL=YOUR_BASE_URL  # e.g. http://localhost:30000 for a locally launched server
OPENAI_API_KEY=dummy python -m gpt_oss.evals \
    --base-url ${BASE_URL}/v1 \
    --model dummy \
    --reasoning-effort low,medium,high \
    --eval $DATASET \
    --n-threads 1000
```
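
Before kicking off a full eval run, a single request confirms the endpoint is reachable; a minimal smoke test using the standard OpenAI chat-completions schema (the placeholder model name mirrors the `--model dummy` flag above):

```bash
# One chat completion against the server the evals will hit.
curl -s ${BASE_URL}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "dummy", "messages": [{"role": "user", "content": "Say hi"}]}'
```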

### Reproduce the acceptance length benchmark results

> Note: On B200, if top-k is 1 (`--speculative-eagle-topk 1`), set `--attention-backend trtllm_mha`.
```bash
git clone https://github.com/sgl-project/SpecForge.git
cd SpecForge/benchmarks

# Each entry is assumed to be "batch size, speculative steps, EAGLE top-k,
# draft tokens", matching the launch flags used elsewhere in this guide;
# "1,0,0,0" disables speculation and serves as the baseline.
config_list=(
    "1,0,0,0"
    "1,3,1,4"
    "1,5,4,8"
)

python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 \
    --output lmsys_gpt-oss-120b_Eagle3_result.jsonl

python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path nvidia/gpt-oss-120b-Eagle3 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 \
    --output nv_gpt-oss-120b_Eagle3_result.jsonl
```
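
Each run appends JSON records to the output file; to skim the first record, assuming `jq` is installed (the field names come from bench_model_speedup.py's own schema):

```bash
# Pretty-print the first result record from the JSONL output.
head -n 1 lmsys_gpt-oss-120b_Eagle3_result.jsonl | jq .
```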

### Reproduce the speculative decoding speedup results
Launch Command
```bash
# On Hopper:
# - Tree decoding (topk > 1) and chain decoding (topk = 1) are supported on both the FA3 and Triton backends.
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --tp 4

# On Blackwell:
# - Chain decoding (topk = 1) is supported on the TRTLLM-MHA backend. Tree decoding (topk > 1) is in progress, stay tuned!
# - Both tree decoding (topk > 1) and chain decoding (topk = 1) are supported on the Triton backend.
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --attention-backend triton --tp 4
```
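
With a server up, a single request shows end-to-end generation with speculative decoding enabled; a minimal probe against the OpenAI-compatible completions route that sglang.launch_server serves:

```bash
# One generation request against the speculative-decoding server (default port 30000).
curl -s http://localhost:30000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "openai/gpt-oss-120b", "prompt": "The capital of France is", "max_tokens": 32}'
```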
Benchmark Command
```bash
# Run from the SpecForge/benchmarks directory cloned in the previous section.
config_list=(
    "1,0,0,0"
    "1,3,1,4"
    "1,5,4,8"
)

python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list gsm8k:200 humaneval:200 math500:200 \
    --output lmsys_gpt-oss-120b_Eagle3_result.jsonl
```

We observed the best speedups with the following settings:
- **1.39x** speedup with `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- **1.52x** speedup with `--speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8`.