# FlashInfer Profiler (Experimental)
FlashInfer Profiler is an intra-kernel profiling tool for diagnosing kernel performance.
## Prerequisites
The Chrome tracing format does not support overlapping events within a single thread, but wgmma instructions are asynchronous, so the execution of several wgmma instructions may overlap. We therefore generate Perfetto traces instead, using our fork of [tg4perfetto](https://github.com/ihavnoid/tg4perfetto), modified to work with the latest protobuf.
```bash
# pip install from github
pip install protobuf
pip install git+https://github.com/flashinfer-ai/tg4perfetto.git
```
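To make the overlap constraint concrete, here is a minimal sketch in plain Python. It is not part of the profiler. It assumes events on one thread are given as `(start, end)` timestamp pairs and that, as we understand the Chrome format, events may nest but must not partially overlap. The helper name `find_partial_overlaps` and the sample timestamps are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) timestamps in arbitrary units


def find_partial_overlaps(events: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return pairs of events that overlap without one containing the other."""
    conflicts = []
    ordered = sorted(events)  # sort by start time, then by end time
    for i, (s1, e1) in enumerate(ordered):
        for s2, e2 in ordered[i + 1:]:
            if s2 >= e1:
                break  # sorted by start: no later event can overlap this one
            # The second event starts strictly inside the first and ends after it.
            if s1 < s2 and e1 < e2:
                conflicts.append(((s1, e1), (s2, e2)))
    return conflicts


# Hypothetical timings: three asynchronous operations issued back to back.
async_ops = [(0, 10), (4, 14), (8, 18)]
# Properly nested events, which a single-thread track can represent.
nested_ops = [(0, 20), (2, 8), (10, 18)]

print(find_partial_overlaps(async_ops))
print(find_partial_overlaps(nested_ops))
```

The first list yields conflicting pairs, which is the situation asynchronous wgmma instructions can produce. The second list yields none.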
## Examples
### MLA
Run the following command to profile the MLA kernel. Change the flag values to profile other configurations.
```bash
python mla.py --batch-size 64 --seq-len 1024 --num-heads 128 --profiler-buffer-size 1048576
```
The generated traces will be saved in the current directory.
```bash
ls *.perfetto-trace
```
You can open the traces in [ui.perfetto.dev](https://ui.perfetto.dev/) to visualize them.
Below is a screenshot of the trace generated by the command above.
![MLA Trace](https://raw.githubusercontent.com/flashinfer-ai/web-data/main/examples/flashinfer-profiler-mla.png)
## Limitations
- The instrumentation is intrusive (we insert `__threadfence_block()` in the kernel to avoid instruction reordering) and will slow down the kernel execution.
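
One way to gauge the slowdown is to time the same workload with and without instrumentation and compare the means. The sketch below is a generic timing harness, not part of the profiler. The names `mean_runtime_ms` and `slowdown` are hypothetical, and the `sync` argument is an assumption: for GPU kernels it should block until queued work finishes (for example `torch.cuda.synchronize` if the kernel is launched through PyTorch).

```python
import time
from typing import Callable


def mean_runtime_ms(
    fn: Callable[[], object],
    sync: Callable[[], object] = lambda: None,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Average wall-clock time of `fn` in milliseconds over `iters` calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    sync()  # make sure warm-up work has finished before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for all timed work to finish before stopping the clock
    return (time.perf_counter() - start) * 1e3 / iters


def slowdown(baseline_ms: float, profiled_ms: float) -> float:
    """Ratio of profiled runtime to baseline runtime (1.0 means no overhead)."""
    return profiled_ms / baseline_ms


# Stand-in CPU workloads; replace them with launches of the real kernel.
baseline = mean_runtime_ms(lambda: sum(range(10_000)))
profiled = mean_runtime_ms(lambda: sum(range(20_000)))
print(f"slowdown: {slowdown(baseline, profiled):.2f}x")
```

Because the instrumentation changes the timing it measures, treat per-event durations in the trace as relative rather than absolute.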
## Acknowledgements
This work is inspired in part by warp-level profiling in the [Mosaic GPU DSL](https://github.com/jax-ml/jax/tree/main/jax/experimental/mosaic) and by [Proton intra-kernel profiling](https://github.com/triton-lang/triton/pull/4861) in Triton.
We thank [tg4perfetto](https://github.com/ihavnoid/tg4perfetto) for providing examples of generating Perfetto traces from Python.