# How to reproduce the results of GPT-OSS with SGLang
## Install the latest SGLang
```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.5.1.post3
pip install --upgrade pip
pip install -e "python[all]"
```
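To confirm the editable install picked up the pinned tag, a quick version check helps (recent SGLang releases expose `__version__`):

```bash
python3 -c "import sglang; print(sglang.__version__)"  # expect 0.5.1.post3
```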
## Reproduce the benchmark throughput results (Batch Size 1)
### Launch Command
```bash
# MXFP4 120B on H100
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8 --attention-backend triton

# BF16 120B on H100
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 8 --attention-backend triton

# MXFP4 120B on B200
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 4

# BF16 120B on B200
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 4
```
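Before benchmarking, it is worth confirming the server is up. A minimal smoke test, assuming the default port 30000 and SGLang's OpenAI-compatible API:

```bash
# Liveness check against SGLang's health endpoint.
curl -s http://localhost:30000/health

# One short completion through the OpenAI-compatible endpoint.
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
```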
### Benchmark Command
```bash
# MXFP4 120B on H100
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 1 --input-len 1024 --output-len 512 --show-report
```
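The same benchmark should apply to the other three configurations; only `--model` changes to match the served checkpoint, e.g. for the BF16 variant:

```bash
# BF16 120B (H100 or B200): swap the model name, keep the other flags.
python3 -m sglang.bench_one_batch_server --model lmsys/gpt-oss-120b-bf16 --base-url http://localhost:30000 --batch-size 1 --input-len 1024 --output-len 512 --show-report
```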
## Reproduce the benchmark throughput results (Batch Size 32)
### Launch Command
```bash
# MXFP4 120B on H100
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8

# BF16 120B on H100
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 8

# MXFP4 120B on B200
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 4

# BF16 120B on B200
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 4
```
### Benchmark Command
```bash
# --input-len accepts multiple values; this run covers both 1024- and 8192-token inputs.
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 32 --input-len 1024 8192 --output-len 512 --show-report
```
## Reproduce the evaluation results
### Install gpt-oss
```bash
git clone https://github.com/openai/gpt-oss.git
cd gpt-oss
pip install -e .
```
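A quick import check confirms the editable install is visible to Python:

```bash
python3 -c "import gpt_oss; print(gpt_oss.__file__)"
```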
### Evaluation Command
```bash
DATASET=gpqa
BASE_URL=YOUR_BASE_URL

OPENAI_API_KEY=dummy python -m gpt_oss.evals \
    --base-url ${BASE_URL}/v1 \
    --model dummy \
    --reasoning-effort low,medium,high \
    --eval $DATASET \
    --n-threads 1000
```
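`BASE_URL` is the root of a running OpenAI-compatible server; the command appends `/v1` itself. For a local SGLang server launched as in the sections above:

```bash
BASE_URL=http://localhost:30000
```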
## Reproduce the acceptance-length benchmark results
Note: On B200, if the speculative top-k is 1, set `--attention-backend trtllm_mha`.
```bash
git clone https://github.com/sgl-project/SpecForge.git
cd SpecForge/benchmarks
```
```bash
# Each entry appears to encode batch_size,num_steps,eagle_topk,num_draft_tokens;
# "1,0,0,0" runs the no-speculation baseline.
config_list=(
    "1,0,0,0"
    "1,3,1,4"
    "1,5,4,8"
)

python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 \
    --output lmsys_gpt-oss-120b_Eagle3_result.jsonl
```
```bash
python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path nvidia/gpt-oss-120b-Eagle3 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 \
    --output nv_gpt-oss-120b_Eagle3_result.jsonl
```
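Each run writes JSON-lines records to the `--output` file. A minimal sketch for skimming them (the exact field names are whatever `bench_model_speedup.py` writes and are not documented here):

```bash
python3 - <<'EOF'
import json

# Print every record from the lmsys draft-model run.
with open("lmsys_gpt-oss-120b_Eagle3_result.jsonl") as f:
    for line in f:
        print(json.loads(line))
EOF
```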
## Reproduce the speculative decoding speedup results
### Launch Command
```bash
# On Hopper:
# - Tree decoding (topk > 1) and chain decoding (topk = 1) are supported on both the FA3 and Triton backends.
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --tp 4

# On Blackwell:
# - Chain decoding (topk = 1) is supported on the TRTLLM-MHA backend. Tree decoding (topk > 1) is in progress, stay tuned!
# - Both tree decoding (topk > 1) and chain decoding (topk = 1) are supported on the Triton backend.
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --attention-backend triton --tp 4
```
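To sanity-check a launched server before the full sweep, the single-batch probe from the throughput sections works unchanged (a sketch assuming the default port 30000):

```bash
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 1 --input-len 1024 --output-len 512 --show-report
```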
### Benchmark Command
```bash
config_list=(
    "1,0,0,0"
    "1,3,1,4"
    "1,5,4,8"
)

python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list gsm8k:200 humaneval:200 math500:200 \
    --output lmsys_gpt-oss-120b_Eagle3_result.jsonl
```
We can gain the best speedup with the following settings:

- 1.39x speedup with the `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` setting.
- 1.52x speedup with the `--speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8` setting.