History

hailin 06e45b5ff9 local source code for flashinfer_0.3.1 && torch vision_0.22.1		2025-09-20 12:33:58 +08:00
..
.gitignore	local source code for flashinfer_0.3.1 && torch vision_0.22.1	2025-09-20 12:33:58 +08:00
README.md	local source code for flashinfer_0.3.1 && torch vision_0.22.1	2025-09-20 12:33:58 +08:00
batch_attention.py	local source code for flashinfer_0.3.1 && torch vision_0.22.1	2025-09-20 12:33:58 +08:00
mla.py	local source code for flashinfer_0.3.1 && torch vision_0.22.1	2025-09-20 12:33:58 +08:00

README.md

FlashInfer Profiler (Experimental)

FlashInfer Profiler is a tool for intra-kernel profiling for diagnosing kernel performance.

Prerequisites

The Chrome tracing do not support overlapping events inside a single thread (the wgmma instructions are asynchronous, and the execution of several wgmma instructions might overlap). We use our fork of tg4perfetto, modified to use the latest protobuf, to generate perfetto traces.

# pip install from github
pip install protobuf
pip install git+https://github.com/flashinfer-ai/tg4perfetto.git

Examples

MLA

Run the following command to profile the MLA kernel for different configurations.

python mla.py --batch-size 64 --seq-len 1024 --num-heads 128 --profiler-buffer-size 1048576

The generated traces will be saved in the current directory.

ls *.perfetto-trace

User can use ui.perfetto.dev to visualize the traces.

Below is a screenshot of the trace generated by the above command.

Limitations

The instrumentation is intrusive (we insert __threadfence_block() in the kernel to avoid instruction reordering) and will slow down the kernel execution.

Acknowledgements

This work is in-part inspired by Mosaic GPU DSL's warp-level profiling, as well as Proton Intra-kernel profiling in Triton.

We thank tg4perfetto for providing examples of generating perfetto traces from python.