# How to Reproduce the GPT-OSS Results with SGLang

## Install SGLang

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.5.1.post3

pip install --upgrade pip
pip install -e "python[all]"
```
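
To verify the installation before launching a server, a quick optional sanity check:

```bash
# Print the installed SGLang version; an ImportError here means the
# editable install above did not complete.
python3 -c "import sglang; print(sglang.__version__)"
```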

## Reproduce the Throughput Benchmark Result (Batch Size 1)

### Launch Command

```bash
# MXFP4 120B on H100
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8 --attention-backend triton

# BF16 120B on H100
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 8 --attention-backend triton

# MXFP4 120B on B200
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 4

# BF16 120B on B200
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 4
```
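
Weight loading can take several minutes. Below is a minimal readiness probe to run before benchmarking, assuming the default port 30000 and the OpenAI-compatible `/v1/models` endpoint the server exposes:

```bash
# Poll until the server answers; -sf keeps curl quiet and fails on HTTP errors.
until curl -sf http://localhost:30000/v1/models > /dev/null; do
    sleep 5
done
echo "server is ready"
```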

### Benchmark Command

```bash
# MXFP4 120B on H100 (for the other configurations, adjust --model to match the launched server)
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 1 --input-len 1024 --output-len 512 --show-report
```

## Reproduce the Throughput Benchmark Result (Batch Size 32)

### Launch Command

```bash
# MXFP4 120B on H100
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8

# BF16 120B on H100
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 8

# MXFP4 120B on B200
python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 4

# BF16 120B on B200
python3 -m sglang.launch_server --model lmsys/gpt-oss-120b-bf16 --tp 4
```

### Benchmark Command

```bash
# --input-len takes multiple values; this run sweeps input lengths 1024 and 8192.
python3 -m sglang.bench_one_batch_server --model openai/gpt-oss-120b --base-url http://localhost:30000 --batch-size 32 --input-len 1024 8192 --output-len 512 --show-report
```

## Reproduce the Evaluation Result

### Install gpt-oss

```bash
git clone https://github.com/openai/gpt-oss.git
cd gpt-oss
pip install -e .
```

### Evaluation Command

```bash
DATASET=gpqa
BASE_URL=YOUR_BASE_URL  # e.g., http://localhost:30000 for a server launched as above
OPENAI_API_KEY=dummy python -m gpt_oss.evals \
    --base-url ${BASE_URL}/v1 \
    --model dummy \
    --reasoning-effort low,medium,high \
    --eval $DATASET \
    --n-threads 1000
```
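
Before kicking off the full eval, it can help to confirm the endpoint answers a single request. A minimal probe, assuming the server is the SGLang instance launched above (the eval passes `--model dummy`, so the served model name does not need to match):

```bash
# One-off chat completion against the OpenAI-compatible endpoint.
curl -s ${BASE_URL}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "dummy", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
```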

## Reproduce the Acceptance-Length Benchmark Result

**Note:** On B200, if top-k is 1, set `--attention-backend trtllm_mha`.

```bash
git clone https://github.com/sgl-project/SpecForge.git
cd SpecForge/benchmarks

# Each entry appears to encode "batch_size,num_steps,topk,num_draft_tokens";
# "1,0,0,0" is the baseline without speculative decoding.
config_list=(
    "1,0,0,0"
    "1,3,1,4"
    "1,5,4,8"
)

# Acceptance length with the lmsys draft model
python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 \
    --output lmsys_gpt-oss-120b_Eagle3_result.jsonl

# Acceptance length with the NVIDIA draft model
python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path nvidia/gpt-oss-120b-Eagle3 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 \
    --output nv_gpt-oss-120b_Eagle3_result.jsonl
```
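
Each run writes one JSON record per benchmark configuration to its `--output` file. To skim the results, a sketch using `jq`; the `accept_length` key below is a guess, so check the actual field names in the JSONL first:

```bash
# Discover the record schema, then pull the acceptance-length field
# (key name assumed; adjust to whatever the file actually contains).
head -n 1 lmsys_gpt-oss-120b_Eagle3_result.jsonl | jq 'keys'
jq '.accept_length' lmsys_gpt-oss-120b_Eagle3_result.jsonl
```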

## Reproduce the Speculative Decoding Speedup Result

### Launch Command

```bash
# On Hopper:
# - Tree decoding (topk > 1) and chain decoding (topk = 1) are supported on both the FA3 and Triton backends.
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --tp 4

# On Blackwell:
# - Chain decoding (topk = 1) is supported on the TRTLLM-MHA backend. Tree decoding (topk > 1) is in progress, stay tuned!
# - Both tree decoding (topk > 1) and chain decoding (topk = 1) are supported on the Triton backend.
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --attention-backend triton --tp 4
```

### Benchmark Command

```bash
# Same config-list format as in the acceptance-length benchmark above.
config_list=(
    "1,0,0,0"
    "1,3,1,4"
    "1,5,4,8"
)
python3 bench_model_speedup.py \
    --model-path openai/gpt-oss-120b \
    --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --tp-size 4 \
    --attention-backend fa3 \
    --config-list "${config_list[@]}" \
    --benchmark-list gsm8k:200 humaneval:200 math500:200 \
    --output lmsys_gpt-oss-120b_Eagle3_result.jsonl
```

The best speedups come from the following settings:

- 1.39x speedup with `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- 1.52x speedup with `--speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8`.