# FE - Programming Samples
## Python Interface Samples
Samples leveraging FE's Python interface are located in `samples/python`.
- `01_epilogue` Shows how to fuse elementwise functions to a GEMM graph (a minimal sketch follows this list).
- `02_serialization` Shows how to serialize and deserialize a graph for future execution.
- `03_mixed_precision` Shows how to multiply tensors of different data types.
- `50_sdpa` Shows how to run causal self-attention with dropout in the forward pass.
- `51_sdpa` Shows how to run causal self-attention in bprop.
- `52_sdpa` Shows how to run scaled dot product attention where the K and V caches are stored in non-contiguous memory.
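
As a taste of the interface, here is a minimal sketch of the pattern `01_epilogue` demonstrates: a GEMM with an elementwise bias-add and relu fused into its epilogue. The shapes, tensor names, and choice of epilogue ops are illustrative, and the builder calls follow our reading of the FE v1.x Python bindings; the sample itself is authoritative.

```python
import cudnn
import torch

# Hypothetical problem sizes for illustration only.
a_gpu = torch.randn(1, 64, 32, device="cuda", dtype=torch.half)
b_gpu = torch.randn(1, 32, 16, device="cuda", dtype=torch.half)
bias_gpu = torch.randn(1, 1, 16, device="cuda", dtype=torch.half)
out_gpu = torch.empty(1, 64, 16, device="cuda", dtype=torch.half)

handle = cudnn.create_handle()
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

A = graph.tensor_like(a_gpu)
B = graph.tensor_like(b_gpu)
Bias = graph.tensor_like(bias_gpu)

C = graph.matmul(name="GEMM", A=A, B=B)     # virtual intermediate
C_b = graph.add(name="bias", a=C, b=Bias)   # elementwise add fused into the epilogue
OUT = graph.relu(name="relu", input=C_b)    # activation fused into the epilogue
OUT.set_output(True).set_data_type(cudnn.data_type.HALF)

graph.build([cudnn.heur_mode.A])

workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({A: a_gpu, B: b_gpu, Bias: bias_gpu, OUT: out_gpu}, workspace)
```

Because `C` and `C_b` are never marked as outputs, they stay virtual and cudnn is free to fuse the whole chain into a single kernel.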
## C++ Interface Samples
Samples leveraging FE's C++ interface are located in `samples/cpp`.
### Building the samples
```bash
mkdir build
cd build
cmake -DCUDNN_PATH=/path/to/cudnn -DCUDAToolkit_ROOT=/path/to/cuda ../
cmake --build . -j16
bin/samples
```
To run a single sample, e.g. `TEST_CASE("Cached sdpa", "[graph][sdpa][flash]")`:

```bash
./bin/samples "Cached sdpa"
```
### Scaled dot product attention (SDPA) examples
`samples/cpp/sdpa` shows how to use cudnn's SDPA operation.
- Users are expected to build a graph once and then execute it multiple times. This example shows how to cache cudnn SDPA graph building (a minimal caching sketch follows this list).
- cudnn's SDPA operation supports various customizations. These examples show how to build a graph with the SDPA operation for your own custom SDPA needs.
- Similar to Fwd SDPA, but with the ability to use non-contiguous K and V caches in combination with page tables, as described in the PagedAttention paper.
- Fwd FP8 SDPA and Bwd SDPA extend the SDPA sample to FP8 precision.
- Demonstrates building and executing a CUDA graph representing the SDPA operation, followed by updating (and re-executing) the CUDA graph with new variant pointers.
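
The caching pattern is build-once, execute-many: graph validation and plan building are expensive, so built graphs are keyed by problem configuration and reused. A minimal sketch using the Python interface for brevity (the C++ sample follows the same shape; the cache key, helper name, and `sdpa` argument list are illustrative assumptions):

```python
import cudnn

# Hypothetical cache: maps a problem-shape key -> (built graph, tensor handles).
_graph_cache = {}

def get_cached_sdpa_graph(b, h, s, d):
    key = (b, h, s, d)
    if key not in _graph_cache:
        graph = cudnn.pygraph(
            handle=cudnn.create_handle(),
            io_data_type=cudnn.data_type.HALF,
            intermediate_data_type=cudnn.data_type.FLOAT,
            compute_data_type=cudnn.data_type.FLOAT,
        )
        strides = [h * s * d, s * d, d, 1]
        q = graph.tensor(name="Q", dim=[b, h, s, d], stride=strides)
        k = graph.tensor(name="K", dim=[b, h, s, d], stride=strides)
        v = graph.tensor(name="V", dim=[b, h, s, d], stride=strides)
        o, _ = graph.sdpa(name="sdpa", q=q, k=k, v=v,
                          is_inference=True,
                          attn_scale=d ** -0.5,
                          use_causal_mask=True)
        o.set_output(True).set_stride(strides)
        graph.build([cudnn.heur_mode.A])    # expensive; done once per shape
        _graph_cache[key] = (graph, (q, k, v, o))
    return _graph_cache[key]                # cheap on every later call
```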
### Convolution fusion examples
`samples/cpp/convolution` shows how to use cudnn's fprop, dgrad, and wgrad operations and some fusions with them.
- Showcases a simple fprop; fprop with a pointwise fusion of scale, bias, and relu (a sketch follows this list); fprop with bias and relu for channels-first layout; and fusions before convolution in the form of scale+bias+relu+conv with stats. Also shows an epilogue fusion of concatenate.
- Showcases FP8 convolution with scaling and amax reduction.
- Showcases INT8 convolution.
- Has samples for simple dgrad, as well as the dgrad + drelu and dgrad + drelu + dBNweight fused operations.
- Similarly to dgrad, has samples for simple wgrad and a scale+bias+relu+wgrad fused operation.
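
For reference, a conv fprop with a fused scale+bias+relu epilogue looks roughly like the following, sketched with the Python interface for brevity (the C++ samples express the same graph; shapes and op names here are illustrative):

```python
import cudnn
import torch

N, C, H, W, K, R, S = 4, 16, 32, 32, 32, 3, 3  # hypothetical problem size
x_gpu = torch.randn(N, C, H, W, device="cuda", dtype=torch.half).to(
    memory_format=torch.channels_last)
w_gpu = torch.randn(K, C, R, S, device="cuda", dtype=torch.half).to(
    memory_format=torch.channels_last)
s_gpu = torch.randn(1, K, 1, 1, device="cuda", dtype=torch.half)
b_gpu = torch.randn(1, K, 1, 1, device="cuda", dtype=torch.half)

graph = cudnn.pygraph(
    handle=cudnn.create_handle(),
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
X = graph.tensor_like(x_gpu)
Wt = graph.tensor_like(w_gpu)
Sc = graph.tensor_like(s_gpu)
Bi = graph.tensor_like(b_gpu)

conv = graph.conv_fprop(image=X, weight=Wt,
                        padding=[1, 1], stride=[1, 1], dilation=[1, 1])
scaled = graph.mul(name="scale", a=conv, b=Sc)   # pointwise scale, fused
biased = graph.add(name="bias", a=scaled, b=Bi)  # pointwise bias, fused
Y = graph.relu(name="relu", input=biased)        # activation, fused
Y.set_output(True)
graph.build([cudnn.heur_mode.A])
```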
### Matmul fusion examples
`samples/cpp/matmul` showcases different matmul samples.
- Has samples for a simple matmul and matmul fusions such as matmul+abs, matmul+bias, and matmul+scale+bias+relu.
- Showcases FP8 matmul with scaling and amax reduction.
- Showcases INT8 matmul.
- Shows mixed-precision multiplication between INT8 and BF16 data types, with the INT8 operand being upcast to BF16 (a sketch of the upcast follows this list).
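
The mixed-precision pattern is to cast the INT8 operand up to BF16 inside the graph and then run the matmul in BF16. A sketch with the Python interface, assuming the pointwise `identity` op is available as a type cast (shapes and names are illustrative):

```python
import cudnn

m, k, n = 64, 32, 16  # hypothetical shapes
graph = cudnn.pygraph(handle=cudnn.create_handle(),
                      compute_data_type=cudnn.data_type.FLOAT)

A_i8 = graph.tensor(name="A", dim=[1, m, k], stride=[m * k, k, 1],
                    data_type=cudnn.data_type.INT8)
B_bf16 = graph.tensor(name="B", dim=[1, k, n], stride=[k * n, n, 1],
                      data_type=cudnn.data_type.BFLOAT16)

# Pointwise identity used purely as a type cast: INT8 -> BF16.
A_bf16 = graph.identity(name="upcast", input=A_i8)
A_bf16.set_data_type(cudnn.data_type.BFLOAT16)

C = graph.matmul(name="matmul", A=A_bf16, B=B_bf16)
C.set_output(True).set_data_type(cudnn.data_type.FLOAT)
graph.build([cudnn.heur_mode.A])
```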
### Normalization examples
`samples/cpp/norm` showcases different normalization samples.
- Examples for layernorm training, inference, and backpropagation (a forward-pass sketch follows this list).
- Examples for adaptive layernorm training, inference, and backpropagation.
- Examples for rmsnorm training, inference, and backpropagation.
- Shows different fusions in batch norm fprop and bprop, as well as split batch norm fusions.
- Showcases normalization with a block-scale quantize epilogue fusion.
- Showcases layer normalization with zero-centered gamma.
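
A layernorm forward graph, sketched with the Python interface for brevity (inference phase; shapes and names are illustrative, and the `layernorm`/`norm_forward_phase` spellings follow our reading of the FE v1.x Python bindings):

```python
import cudnn
import torch

B, S, E = 4, 128, 768  # hypothetical batch, sequence, embedding sizes
x_gpu = torch.randn(B * S, E, 1, 1, device="cuda", dtype=torch.half)
scale_gpu = torch.randn(1, E, 1, 1, device="cuda", dtype=torch.half)
bias_gpu = torch.randn(1, E, 1, 1, device="cuda", dtype=torch.half)

graph = cudnn.pygraph(
    handle=cudnn.create_handle(),
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
X = graph.tensor_like(x_gpu)
Scale = graph.tensor_like(scale_gpu)
Bias = graph.tensor_like(bias_gpu)
Eps = graph.tensor(name="epsilon", dim=[1, 1, 1, 1], stride=[1, 1, 1, 1],
                   is_pass_by_value=True, data_type=cudnn.data_type.FLOAT)

# Inference phase: only the normalized output is produced; the mean and
# inverse variance returned for training are not needed here.
Y, _, _ = graph.layernorm(name="LN",
                          norm_forward_phase=cudnn.norm_forward_phase.INFERENCE,
                          input=X, scale=Scale, bias=Bias, epsilon=Eps)
Y.set_output(True)
graph.build([cudnn.heur_mode.A])
```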
### Miscellaneous examples
`samples/cpp/misc` contains miscellaneous samples.
- Pointwise fusions with a scalar.
- The resample fprop operation with different resampling modes.
- How to serialize a graph into a file and read it back in another thread or process (a sketch follows this list).
- How to choose the best-performing plan among the multiple plans suggested by the heuristics.
- Shows how to use the native CUDA graph API. The samples show how to create cudnn's CUDA graph and how to repeatedly update it with new device buffers for multiple executions.
- Showcases a batch norm example where only a subset of the SMs participate in executing the kernel.
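
The serialization flow is: build once, serialize the finished plan to bytes, then deserialize in the consumer thread or process so it can execute without rebuilding. A sketch with the Python interface, assuming pygraph exposes `serialize()`/`deserialize()` as in the FE v1.x bindings; the misc sample is the authoritative reference:

```python
import cudnn

# Build a trivial matmul graph to have something to serialize.
graph = cudnn.pygraph(handle=cudnn.create_handle(),
                      io_data_type=cudnn.data_type.HALF,
                      compute_data_type=cudnn.data_type.FLOAT)
A = graph.tensor(name="A", dim=[1, 64, 32], stride=[64 * 32, 32, 1])
B = graph.tensor(name="B", dim=[1, 32, 16], stride=[32 * 16, 16, 1])
C = graph.matmul(name="matmul", A=A, B=B)
C.set_output(True)
graph.build([cudnn.heur_mode.A])

payload = graph.serialize()        # bytes; safe to persist to a file

# In another thread or process:
restored = cudnn.pygraph(handle=cudnn.create_handle())
restored.deserialize(payload)      # restores the built plan without rebuilding
workspace_size = restored.get_workspace_size()
```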
## [Deprecated] C++ v0.x Interface Samples
Samples leveraging FE's C++ 0.x interface are located in `samples/legacy_samples`.