# FlashInfer Profiler (Experimental)

FlashInfer Profiler is an intra-kernel profiling tool for diagnosing kernel performance.

## Prerequisites

Chrome tracing does not support overlapping events within a single thread, but `wgmma` instructions are asynchronous, so the execution of several of them may overlap. We therefore generate Perfetto traces using our [fork](https://github.com/flashinfer-ai/tg4perfetto) of [tg4perfetto](https://github.com/ihavnoid/tg4perfetto), which is modified to use the latest protobuf.

```bash
# Install protobuf from PyPI and the tg4perfetto fork from GitHub
pip install protobuf
pip install git+https://github.com/flashinfer-ai/tg4perfetto.git
```
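
The overlap restriction described above can be illustrated with a small check. This is only a sketch: it assumes that events on one thread track must be either disjoint or properly nested, and the event names and timestamps are hypothetical, not taken from a real trace.

```python
from typing import List, Tuple

# (name, start, end); hypothetical timestamps in arbitrary units
Event = Tuple[str, int, int]


def partially_overlapping(events: List[Event]) -> List[Tuple[str, str]]:
    """Return pairs of events that overlap without one containing the other."""
    pairs = []
    for i, (name_a, start_a, end_a) in enumerate(events):
        for name_b, start_b, end_b in events[i + 1:]:
            overlap = start_a < end_b and start_b < end_a
            nested = (start_a <= start_b and end_b <= end_a) or (
                start_b <= start_a and end_a <= end_b
            )
            if overlap and not nested:
                pairs.append((name_a, name_b))
    return pairs


# Two asynchronous instructions in flight at once, followed by a third.
events = [("wgmma_0", 0, 10), ("wgmma_1", 5, 15), ("wgmma_2", 20, 30)]
print(partially_overlapping(events))  # prints [('wgmma_0', 'wgmma_1')]
```

Any pair reported here is the kind of event pair that cannot share a single thread track under that assumption.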

## Examples

### MLA

Run the following command to profile the MLA kernel for different configurations.

```bash
python mla.py --batch-size 64 --seq-len 1024 --num-heads 128 --profiler-buffer-size 1048576
```
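
To profile several configurations in one go, the command above can be generated programmatically. The sketch below reuses only the flags shown in that command. The `build_commands` helper and the sweep values are hypothetical, and it assumes `mla.py` accepts other values for these flags.

```python
import itertools
import shlex


def build_commands(batch_sizes, seq_lens, num_heads=128,
                   buffer_size=1048576, script="mla.py"):
    """Build one command line per (batch size, sequence length) pair."""
    commands = []
    for batch_size, seq_len in itertools.product(batch_sizes, seq_lens):
        commands.append([
            "python", script,
            "--batch-size", str(batch_size),
            "--seq-len", str(seq_len),
            "--num-heads", str(num_heads),
            "--profiler-buffer-size", str(buffer_size),
        ])
    return commands


# Hypothetical sweep values; print the commands rather than running them.
for command in build_commands([16, 64], [1024, 4096]):
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or each list can be passed to `subprocess.run(command, check=True)` to launch the runs from Python.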

The generated traces will be saved in the current directory.

```bash
ls *.perfetto-trace
```
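
The same listing can be done from Python, for example when a script needs to collect the traces after a run. This is a minimal sketch; `find_traces` is a hypothetical helper that relies only on the `.perfetto-trace` suffix shown above.

```python
from pathlib import Path


def find_traces(directory="."):
    """Return the *.perfetto-trace files in `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.perfetto-trace"))


# Print each trace in the current directory with its size.
for trace in find_traces():
    print(f"{trace.name}\t{trace.stat().st_size} bytes")
```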

You can visualize the traces with [ui.perfetto.dev](https://ui.perfetto.dev/).

Below is a screenshot of the trace generated by the above command.

![MLA Trace](https://raw.githubusercontent.com/flashinfer-ai/web-data/main/examples/flashinfer-profiler-mla.png)

## Limitations

- The instrumentation is intrusive: we insert `__threadfence_block()` in the kernel to prevent instruction reordering, which slows down kernel execution.

## Acknowledgements

This work is inspired in part by [Mosaic GPU DSL](https://github.com/jax-ml/jax/tree/main/jax/experimental/mosaic)'s warp-level profiling and by [Proton Intra-kernel profiling](https://github.com/triton-lang/triton/pull/4861) in Triton.

We thank [tg4perfetto](https://github.com/ihavnoid/tg4perfetto) for providing examples of generating Perfetto traces from Python.