# FlashInfer Profiler (Experimental)

FlashInfer Profiler is a tool for intra-kernel profiling, used to diagnose kernel performance.

## Prerequisites

Chrome tracing does not support overlapping events inside a single thread, which the profiler needs: wgmma instructions are asynchronous, and the execution of several wgmma instructions might overlap. We therefore generate Perfetto traces instead, using our fork of [tg4perfetto](https://github.com/ihavnoid/tg4perfetto), modified to use the latest protobuf.

```bash
# pip install from github
pip install protobuf
pip install git+https://github.com/flashinfer-ai/tg4perfetto.git
```

## Examples

### MLA

Run the following command to profile the MLA kernel. Change the arguments to profile other configurations.

```bash
python mla.py --batch-size 64 --seq-len 1024 --num-heads 128 --profiler-buffer-size 1048576
```

The generated traces are saved in the current directory.

```bash
ls *.perfetto-trace
```

You can use [ui.perfetto.dev](https://ui.perfetto.dev/) to visualize the traces. Below is a screenshot of the trace generated by the above command.

![MLA Trace](https://raw.githubusercontent.com/flashinfer-ai/web-data/main/examples/flashinfer-profiler-mla.png)

## Limitations

- The instrumentation is intrusive (we insert `__threadfence_block()` in the kernel to avoid instruction reordering) and will slow down the kernel execution.

## Acknowledgements

This work is inspired in part by [Mosaic GPU DSL](https://github.com/jax-ml/jax/tree/main/jax/experimental/mosaic)'s warp-level profiling, as well as [Proton Intra-kernel profiling](https://github.com/triton-lang/triton/pull/4861) in Triton. We thank [tg4perfetto](https://github.com/ihavnoid/tg4perfetto) for providing examples of generating Perfetto traces from Python.
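
## Appendix: Overlapping Events (Illustrative Sketch)

The Prerequisites section notes that events inside a single thread may overlap when asynchronous wgmma instructions are in flight. The sketch below illustrates the problem. It takes a list of `(name, start, end)` intervals and greedily assigns each one to a lane so that no two intervals in the same lane overlap. This is not how FlashInfer or tg4perfetto handles overlap. The function `assign_lanes`, the event names, and the timestamps are all hypothetical and chosen only for illustration.

```python
import heapq
from typing import List, Tuple

Event = Tuple[str, int, int]  # (name, start, end); hypothetical layout


def assign_lanes(events: List[Event]) -> List[Tuple[str, int]]:
    """Assign each event to a lane so events sharing a lane never overlap.

    Events are processed in order of start time. An event reuses the lane
    that frees up earliest if that lane is already free; otherwise it opens
    a new lane.
    """
    ordered = sorted(events, key=lambda e: (e[1], e[2]))
    busy_until: list = []  # min-heap of (end_time, lane)
    result: List[Tuple[str, int]] = []
    next_lane = 0
    for name, start, end in ordered:
        if busy_until and busy_until[0][0] <= start:
            # The earliest-finishing lane is free again: reuse it.
            _, lane = heapq.heappop(busy_until)
        else:
            # Every existing lane is still busy: open a new one.
            lane = next_lane
            next_lane += 1
        heapq.heappush(busy_until, (end, lane))
        result.append((name, lane))
    return result


# Made-up timestamps: wgmma_1 starts before wgmma_0 has finished.
events = [("wgmma_0", 0, 10), ("wgmma_1", 4, 14), ("wgmma_2", 10, 20)]
lanes = assign_lanes(events)
for name, lane in lanes:
    print(f"{name} -> lane {lane}")
```

Here `wgmma_1` overlaps `wgmma_0` and needs a second lane, while `wgmma_2` starts once `wgmma_0` has ended and can reuse the first lane. A trace format that allows only one open event per thread cannot show the first two intervals on a single track.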