# FE - Programming Samples
## Python Interface Samples
Samples leveraging FE's Python interface are located in `samples/python`.
- `01_epilogue` Shows how to fuse elementwise functions to a GEMM graph (a minimal sketch follows this list).
- `02_serialization` Shows how to serialize and deserialize a graph for future execution.
- `03_mixed_precision` Shows how to multiply tensors of different data types.
- `50_sdpa` Shows how to run causal self-attention with dropout in the forward pass.
- `51_sdpa` Shows how to run causal self-attention in bprop.
- `52_sdpa` Shows how to run scaled dot product attention where the K and V caches are stored in non-contiguous memory.
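
As a taste of the interface, here is a minimal sketch of the pattern `01_epilogue` demonstrates: a GEMM with an elementwise bias-add and relu fused into its epilogue. The shapes, tensor names, and choice of epilogue ops are illustrative, and the builder calls follow our reading of the FE v1.x Python bindings; the sample itself is authoritative.

```python
import cudnn
import torch

# Hypothetical problem sizes for illustration only.
a_gpu = torch.randn(1, 64, 32, device="cuda", dtype=torch.half)
b_gpu = torch.randn(1, 32, 16, device="cuda", dtype=torch.half)
bias_gpu = torch.randn(1, 1, 16, device="cuda", dtype=torch.half)
out_gpu = torch.empty(1, 64, 16, device="cuda", dtype=torch.half)

handle = cudnn.create_handle()
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

A = graph.tensor_like(a_gpu)
B = graph.tensor_like(b_gpu)
Bias = graph.tensor_like(bias_gpu)

C = graph.matmul(name="GEMM", A=A, B=B)     # virtual intermediate
C_b = graph.add(name="bias", a=C, b=Bias)   # elementwise add fused into the epilogue
OUT = graph.relu(name="relu", input=C_b)    # activation fused into the epilogue
OUT.set_output(True).set_data_type(cudnn.data_type.HALF)

graph.build([cudnn.heur_mode.A])

workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({A: a_gpu, B: b_gpu, Bias: bias_gpu, OUT: out_gpu}, workspace)
```

Because `C` and `C_b` are never marked as outputs, they stay virtual and cudnn is free to fuse the whole chain into a single kernel.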
## C++ Interface Samples
Samples leveraging FE's C++ interface are located in `samples/cpp`.
### Building the samples
```bash
mkdir build
cd build
cmake -DCUDNN_PATH=/path/to/cudnn -DCUDAToolkit_ROOT=/path/to/cuda ../
cmake --build . -j16
bin/samples
```
To run a single sample, e.g. `TEST_CASE("Cached sdpa", "[graph][sdpa][flash]")`:

```bash
./bin/samples "Cached sdpa"
```
### Scaled dot product attention (SDPA) examples
`samples/cpp/sdpa` shows how to use cudnn's SDPA operation.
- Users are expected to build a graph once and then execute it multiple times. This example shows how to cache cudnn SDPA graph building (a minimal caching sketch follows this list).
- cudnn's SDPA operation supports various customizations. These examples show how to build a graph with the SDPA operation for your own custom SDPA needs.
- Similar to Fwd SDPA, but with the ability to use non-contiguous K and V caches in combination with page tables, as described in the PagedAttention paper.
- Fwd FP8 SDPA and Bwd SDPA extend the SDPA sample to FP8 precision.
- Demonstrates building and executing a CUDA graph representing the SDPA operation, followed by updating (and re-executing) the CUDA graph with new variant pointers.
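
The caching pattern is build-once, execute-many: graph validation and plan building are expensive, so built graphs are keyed by problem configuration and reused. A minimal sketch using the Python interface for brevity (the C++ sample follows the same shape; the cache key, helper name, and `sdpa` argument list are illustrative assumptions):

```python
import cudnn

# Hypothetical cache: maps a problem-shape key -> (built graph, tensor handles).
_graph_cache = {}

def get_cached_sdpa_graph(b, h, s, d):
    key = (b, h, s, d)
    if key not in _graph_cache:
        graph = cudnn.pygraph(
            handle=cudnn.create_handle(),
            io_data_type=cudnn.data_type.HALF,
            intermediate_data_type=cudnn.data_type.FLOAT,
            compute_data_type=cudnn.data_type.FLOAT,
        )
        strides = [h * s * d, s * d, d, 1]
        q = graph.tensor(name="Q", dim=[b, h, s, d], stride=strides)
        k = graph.tensor(name="K", dim=[b, h, s, d], stride=strides)
        v = graph.tensor(name="V", dim=[b, h, s, d], stride=strides)
        o, _ = graph.sdpa(name="sdpa", q=q, k=k, v=v,
                          is_inference=True,
                          attn_scale=d ** -0.5,
                          use_causal_mask=True)
        o.set_output(True).set_stride(strides)
        graph.build([cudnn.heur_mode.A])    # expensive; done once per shape
        _graph_cache[key] = (graph, (q, k, v, o))
    return _graph_cache[key]                # cheap on every later call
```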
### Convolution fusion examples
`samples/cpp/convolution` shows how to use cudnn's fprop, dgrad, and wgrad operations and some fusions with them.
- Showcases a simple fprop; fprop with a pointwise fusion of scale, bias, and relu (a sketch follows this list); fprop with bias and relu for channels-first layout; and fusions before convolution in the form of scale+bias+relu+conv with stats. Also shows an epilogue fusion of concatenate.
- Showcases FP8 convolution with scaling and amax reduction.
- Showcases INT8 convolution.
- Has samples for simple dgrad, as well as the dgrad + drelu and dgrad + drelu + dBNweight fused operations.
- Similarly to dgrad, has samples for simple wgrad and a scale+bias+relu+wgrad fused operation.
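
For reference, a conv fprop with a fused scale+bias+relu epilogue looks roughly like the following, sketched with the Python interface for brevity (the C++ samples express the same graph; shapes and op names here are illustrative):

```python
import cudnn
import torch

N, C, H, W, K, R, S = 4, 16, 32, 32, 32, 3, 3  # hypothetical problem size
x_gpu = torch.randn(N, C, H, W, device="cuda", dtype=torch.half).to(
    memory_format=torch.channels_last)
w_gpu = torch.randn(K, C, R, S, device="cuda", dtype=torch.half).to(
    memory_format=torch.channels_last)
s_gpu = torch.randn(1, K, 1, 1, device="cuda", dtype=torch.half)
b_gpu = torch.randn(1, K, 1, 1, device="cuda", dtype=torch.half)

graph = cudnn.pygraph(
    handle=cudnn.create_handle(),
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
X = graph.tensor_like(x_gpu)
Wt = graph.tensor_like(w_gpu)
Sc = graph.tensor_like(s_gpu)
Bi = graph.tensor_like(b_gpu)

conv = graph.conv_fprop(image=X, weight=Wt,
                        padding=[1, 1], stride=[1, 1], dilation=[1, 1])
scaled = graph.mul(name="scale", a=conv, b=Sc)   # pointwise scale, fused
biased = graph.add(name="bias", a=scaled, b=Bi)  # pointwise bias, fused
Y = graph.relu(name="relu", input=biased)        # activation, fused
Y.set_output(True)
graph.build([cudnn.heur_mode.A])
```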
### Matmul fusion examples
`samples/cpp/matmul` showcases different matmul samples.
- Has samples for a simple matmul and matmul fusions such as matmul+abs, matmul+bias, and matmul+scale+bias+relu.
- Showcases FP8 matmul with scaling and amax reduction.
- Showcases INT8 matmul.
- Shows mixed-precision multiplication between INT8 and BF16 data types, with the INT8 operand being upcast to BF16 (a sketch of the upcast follows this list).
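
The mixed-precision pattern is to cast the INT8 operand up to BF16 inside the graph and then run the matmul in BF16. A sketch with the Python interface, assuming the pointwise `identity` op is available as a type cast (shapes and names are illustrative):

```python
import cudnn

m, k, n = 64, 32, 16  # hypothetical shapes
graph = cudnn.pygraph(handle=cudnn.create_handle(),
                      compute_data_type=cudnn.data_type.FLOAT)

A_i8 = graph.tensor(name="A", dim=[1, m, k], stride=[m * k, k, 1],
                    data_type=cudnn.data_type.INT8)
B_bf16 = graph.tensor(name="B", dim=[1, k, n], stride=[k * n, n, 1],
                      data_type=cudnn.data_type.BFLOAT16)

# Pointwise identity used purely as a type cast: INT8 -> BF16.
A_bf16 = graph.identity(name="upcast", input=A_i8)
A_bf16.set_data_type(cudnn.data_type.BFLOAT16)

C = graph.matmul(name="matmul", A=A_bf16, B=B_bf16)
C.set_output(True).set_data_type(cudnn.data_type.FLOAT)
graph.build([cudnn.heur_mode.A])
```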
### Normalization examples
`samples/cpp/norm` showcases different normalization samples.
- Examples for layernorm training, inference, and backpropagation (a forward-pass sketch follows this list).
- Examples for adaptive layernorm training, inference, and backpropagation.
- Examples for rmsnorm training, inference, and backpropagation.
- Shows different fusions in batch norm fprop and bprop, as well as split batch norm fusions.
- Showcases normalization with a block-scale quantize epilogue fusion.
- Showcases layer normalization with zero-centered gamma.
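
A layernorm forward graph, sketched with the Python interface for brevity (inference phase; shapes and names are illustrative, and the `layernorm`/`norm_forward_phase` spellings follow our reading of the FE v1.x Python bindings):

```python
import cudnn
import torch

B, S, E = 4, 128, 768  # hypothetical batch, sequence, embedding sizes
x_gpu = torch.randn(B * S, E, 1, 1, device="cuda", dtype=torch.half)
scale_gpu = torch.randn(1, E, 1, 1, device="cuda", dtype=torch.half)
bias_gpu = torch.randn(1, E, 1, 1, device="cuda", dtype=torch.half)

graph = cudnn.pygraph(
    handle=cudnn.create_handle(),
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
X = graph.tensor_like(x_gpu)
Scale = graph.tensor_like(scale_gpu)
Bias = graph.tensor_like(bias_gpu)
Eps = graph.tensor(name="epsilon", dim=[1, 1, 1, 1], stride=[1, 1, 1, 1],
                   is_pass_by_value=True, data_type=cudnn.data_type.FLOAT)

# Inference phase: only the normalized output is produced; the mean and
# inverse variance returned for training are not needed here.
Y, _, _ = graph.layernorm(name="LN",
                          norm_forward_phase=cudnn.norm_forward_phase.INFERENCE,
                          input=X, scale=Scale, bias=Bias, epsilon=Eps)
Y.set_output(True)
graph.build([cudnn.heur_mode.A])
```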
### Miscellaneous examples
`samples/cpp/misc` contains miscellaneous samples.
- Pointwise fusions with a scalar.
- The resample fprop operation with different resampling modes.
- How to serialize a graph into a file and read it back in another thread or process (a sketch follows this list).
- How to choose the best-performing plan among the multiple plans suggested by the heuristics.
- Shows how to use the native CUDA graph API. The samples show how to create cudnn's CUDA graph and how to repeatedly update it with new device buffers for multiple executions.
- Showcases a batch norm example where only a subset of the SMs participate in executing the kernel.
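
The serialization flow is: build once, serialize the finished plan to bytes, then deserialize in the consumer thread or process so it can execute without rebuilding. A sketch with the Python interface, assuming pygraph exposes `serialize()`/`deserialize()` as in the FE v1.x bindings; the misc sample is the authoritative reference:

```python
import cudnn

# Build a trivial matmul graph to have something to serialize.
graph = cudnn.pygraph(handle=cudnn.create_handle(),
                      io_data_type=cudnn.data_type.HALF,
                      compute_data_type=cudnn.data_type.FLOAT)
A = graph.tensor(name="A", dim=[1, 64, 32], stride=[64 * 32, 32, 1])
B = graph.tensor(name="B", dim=[1, 32, 16], stride=[32 * 16, 16, 1])
C = graph.matmul(name="matmul", A=A, B=B)
C.set_output(True)
graph.build([cudnn.heur_mode.A])

payload = graph.serialize()        # bytes; safe to persist to a file

# In another thread or process:
restored = cudnn.pygraph(handle=cudnn.create_handle())
restored.deserialize(payload)      # restores the built plan without rebuilding
workspace_size = restored.get_workspace_size()
```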
## [Deprecated] C++ v0.x Interface Samples
Samples leveraging FE's C++ 0.x interface are located in `samples/legacy_samples`.