Attention-benchmark

Contents

  • Dockerfile to create a Docker container with the dependencies.
  • benchmark_flash_attention.py, which benchmarks cuDNN and PyTorch attention up to a 64k sequence length.
  • Compares PyTorch's native SDPA operation with the cuDNN SDPA operation (a minimal call sketch follows this list).
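For reference, here is a minimal sketch of the PyTorch-side SDPA call that the "Pytorch" rows measure; the head count (16) and the explicit FLASH_ATTENTION backend selection are assumptions for illustration, and the actual benchmark script may set things up differently.

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Shapes follow the sample run below: batch_size=32, seqlen=512, headdim=128.
# The number of heads (16) is an assumption for illustration only.
batch, heads, seqlen, headdim = 32, 16, 512, 128
q, k, v = (torch.randn(batch, heads, seqlen, headdim,
                       device="cuda", dtype=torch.bfloat16,
                       requires_grad=True) for _ in range(3))

# Restrict SDPA to the flash kernel so the timing reflects a single backend.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
    out.sum().backward()  # backward pass, as in the fwd + bwd measurement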

Steps to run

Lock the GPU clocks. For example, on an H200, use nvidia-smi -q -d SUPPORTED_CLOCKS to list the supported clocks, then:

sudo nvidia-smi -pm 1           # enable persistence mode
sudo nvidia-smi -ac 3201,1980   # lock application clocks (memory MHz, graphics MHz)
sudo nvidia-smi -pl 700         # set the power limit (W)

Build the Docker image and run the container.

docker build -t cudnn_attention_benchmark . && docker run --gpus=all --rm --shm-size=16g -it cudnn_attention_benchmark

Sample output

$ python benchmark_flash_attention.py 
Is flash sdp enabled in Pytorch : True
cudnn backend version : 90100
### causal=False, headdim=128, batch_size=32, seqlen=512 ###
Pytorch fwd: 302.38 TFLOPs/s, bwd: 169.09 TFLOPs/s, fwd + bwd: 193.45 TFLOPs/s
cudnn_bf16 fwd: 501.13 TFLOPs/s, bwd: 351.28 TFLOPs/s, fwd + bwd: 384.09 TFLOPs/s
cudnn_fp8 fwd: 678.07 TFLOPs/s, bwd: 418.37 TFLOPs/s, fwd + bwd: 469.78 TFLOPs/s
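The first two lines of the output can be reproduced with standard PyTorch queries; a minimal sketch (the benchmark script itself may print them differently):

import torch

# Report whether the flash SDPA kernel is enabled and which cuDNN build PyTorch sees.
print("Is flash sdp enabled in Pytorch :", torch.backends.cuda.flash_sdp_enabled())
print("cudnn backend version :", torch.backends.cudnn.version())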

Please refer to benchmark_results.csv for the full sample output.
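As a rough guide to how such TFLOPs/s figures are typically derived, the common flash-attention accounting counts 4 * batch * heads * seqlen^2 * headdim FLOPs for the forward pass, halves that under causal masking, and scales by 2.5x for backward and 3.5x for forward + backward. A minimal sketch, assuming that convention (the benchmark script may count FLOPs differently):

def attention_tflops_per_s(batch, heads, seqlen, headdim, time_s,
                           causal=False, mode="fwd"):
    # Two seqlen x seqlen x headdim matmuls (QK^T and PV) per head: 4*b*h*s^2*d FLOPs.
    flops = 4 * batch * heads * seqlen**2 * headdim
    if causal:
        flops /= 2  # only the lower triangle is computed
    scale = {"fwd": 1.0, "bwd": 2.5, "fwd_bwd": 3.5}[mode]
    return flops * scale / time_s / 1e12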

Results

Forward

[Figure: comparison of PyTorch and cuDNN forward throughput]

Bprop

[Figure: comparison of PyTorch and cuDNN backward (bprop) throughput]

Fwd + Bprop

[Figure: comparison of PyTorch and cuDNN fwd + bprop throughput]

Pytorch adoption

cuDNN v9 can achieve over 2x the performance of the comparable PyTorch eager implementation, as detailed in (Beta) Implementing High-Performance Transformers with Scaled Dot Product Attention (SDPA). PyTorch eager-mode SDPA does not use cuDNN today, but a cuDNN-based implementation is in progress (see the PyTorch PRs for Fprop and Bprop).
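In newer PyTorch releases (after this README was written) a cuDNN attention backend can be selected explicitly; a hedged sketch, assuming torch.nn.attention.sdpa_kernel and SDPBackend.CUDNN_ATTENTION are available in the installed build:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(32, 16, 512, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Restrict SDPA to the cuDNN kernel; the call errors if that backend
# is unavailable for these inputs on the installed build/GPU.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)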