.. _faqs:

FAQs
====

General
---------------------

**Are the DSLs replacing C++ templates?**

TL;DR: No - but also yes. The CUTLASS 4.0 release (CuTe DSL), along with all
future extensions to our Python-native programming models, does not come at the
expense of CUTLASS C++. The CUTLASS 2.x and 3.x C++ APIs will both continue to
receive fixes and updates for the architectures on which we support them. However,
CUTLASS 4.x CuTe DSL is fully isomorphic in its programming model and performance
with CuTe C++ for Blackwell, and we hope the community embraces it for custom
kernel development that is much easier while remaining equally performant. This is
why we are releasing CuTe DSL with support for all architectures starting with the
NVIDIA Ampere Architecture.

**What is the difference between CuTe DSL, CUTLASS Python, and CUTLASS DSLs?**

CUTLASS Python was the Python interface for instantiating C++ kernels via a Python
frontend. This is now deprecated with the release of CUTLASS 4.0. CUTLASS DSLs are
a family of Python DSLs for native device programming in Python. Currently, this is
limited to our initial release of CuTe DSL, but future versions will include higher-level
abstractions that gradually trade off control for convenience.

**What should I learn, CUTLASS C++ or the Python DSLs?**

We believe the Python DSLs will significantly improve the learning curve and recommend starting
with them for all newcomers, as they eliminate the inherent complexity of learning C++
metaprogramming for GPU kernel programming. Since CuTe C++ and CuTe DSL share fully isomorphic
programming models and patterns, any knowledge gained can eventually be applied to C++.

**Where will the code live? PIP wheel or GitHub repo? Do I have to build it myself?**

This is a major change compared to CUTLASS C++. Going forward, the GitHub code
primarily exists as a place for users to file issues and pull requests against.
While it can be used in place of the pip wheel, we do not recommend most users do so unless they are
hacking on the DSL itself. For all other users, we recommend they
simply ``pip install nvidia-cutlass-dsl`` and use the pip wheel as the single source
of truth for the dialect compiler and DSL implementation. The CUTLASS GitHub repository will
contain a ``requirements.txt`` file pinning the version of the wheel consistent with the state
of the OSS repository (please see :doc:`quick_start`). This means getting started with
CUTLASS is easier than ever: no more CMake command lines to learn and no more builds to kick
off. Simply install the pip wheel and start running the examples.
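
As a quick smoke test after installing, something like the following should work (a minimal sketch; the ``cutlass.__version__`` attribute is an assumption here, and a successful import is already confirmation):

.. code-block:: python

    # Verify that the wheel is installed and importable.
    import cutlass
    import cutlass.cute as cute  # the CuTe DSL itself

    # The version attribute is assumed for illustration.
    print(cutlass.__version__)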

Migration
---------------------

**Should I port my code from C++ templates to Python?**

Almost certainly not, unless you need extremely fast JIT times for your kernel and C++ compile times
are a blocker for you. The 2.x and 3.x APIs will continue to be supported, and the 3.x API for
NVIDIA's Hopper and Blackwell architectures will continue to improve in terms of features
and performance.

**Are portability promises different with Python?**

For the initial release, while the DSL is still in beta, we do not make any portability promises,
as we may make changes to the DSL itself. While we do not expect any changes to the CuTe operations,
the DSL utilities, decorators, and helper classes such as pipelines and schedulers may change as we refine them
with community feedback. We encourage users to file issues and discussions on GitHub during this
beta period with their feedback!

In the long term, we plan to continue to treat the OSS community with care.
Just as in the prior history of CUTLASS, we plan not to break users unless necessary,
but we reserve the right to make limited breaking changes when we believe they are a
net benefit to the community and project. These will be announced ahead of time and/or
clearly highlighted in the CHANGELOG of each release.

Technical
---------------------

**What NVIDIA architectures will it support?**

CuTe DSL will support all NVIDIA GPU architectures starting with NVIDIA Ampere Architecture (SM80).

**Will it be compatible with DL frameworks (e.g., PyTorch, JAX)?**

Yes, we will provide utilities to convert from DLPack-supported tensor formats
to ``cute.Tensor``. This should allow a user to never have to leave Python
when writing model code in their framework of choice. Our JAX interoperability story is not
as strong as PyTorch's today; however, we are actively working on improving it
and welcome contributions in this space.
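
For instance, a PyTorch tensor can be wrapped without a copy via DLPack. A minimal sketch, assuming the ``from_dlpack`` helper used throughout the public CuTe DSL examples:

.. code-block:: python

    import torch
    import cutlass.cute as cute
    from cutlass.cute.runtime import from_dlpack

    # Any DLPack-capable framework tensor works; here, a PyTorch GPU tensor.
    t = torch.randn(128, 64, device="cuda", dtype=torch.float16)

    # Zero-copy wrap of the framework tensor as a cute.Tensor.
    a = from_dlpack(t)

    @cute.jit
    def show(x: cute.Tensor):
        # Plain Python print runs at trace (compile) time, so the static
        # layout is visible here.
        print(x.layout)

    show(a)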

**Does it compile to PTX or SASS?**

CuTe DSL compiles the program down to PTX. After that, we currently use the PTX compiler that
ships with the CUDA toolkit to compile the PTX down to SASS. We plan to remove
this limitation in the future and allow the use of the PTX JIT included in the
CUDA driver for users who do not have a CUDA toolkit installed.

**Do I need to use NVCC or NVRTC?**

No, the ``nvidia-cutlass-dsl`` wheel packages everything needed to generate GPU kernels. It
shares the driver requirements of the CUDA 12.9 toolkit, which can be found
`here <https://developer.nvidia.com/cuda-toolkit-archive>`__.

**How would one debug the code?**

Since CuTe DSL is an embedded DSL rather than native Python, tools like ``pdb``
cannot be used. However, if you have experience with GPU kernel programming, the debugging
techniques will be nearly identical. Typically, compile-time and runtime printing
of types and values are the most expedient. Please see the `documentation on printing <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/print.ipynb>`__
to learn how to print types and values at both compile time and runtime.
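
A rough sketch of the two flavors (see the notebook above for the authoritative version):

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    @cute.jit
    def print_demo(n: cutlass.Int32):
        # Compile-time print: plain Python print executes while the DSL
        # traces this function, so only statically known information shows.
        print("tracing print_demo")

        # Runtime print: cute.printf is compiled into the program and
        # prints the dynamic value when the program executes.
        cute.printf("dynamic n = {}", n)

    print_demo(cutlass.Int32(7))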

You can also use ``cuda-gdb`` to set breakpoints in the program and step through the execution,
or use tools such as ``compute-sanitizer`` to detect and triage bugs in your program. As the DSL
matures, our source location tracking from Python user programs will also improve to provide
more helpful source-level mapping when setting breakpoints and using other tools such as Nsight.

**How would one implement warp specialization in CuTe DSL?**

Exactly the same way you would in C++, but in Python-native syntax instead.
Consult our :doc:`cute_dsl_general/dsl_control_flow` documentation and the
`Blackwell kernel example <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py>`__
for a detailed how-to guide.
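
In miniature, the pattern is to branch on the warp index inside a kernel. A simplified sketch (the role assignment is hypothetical, and the ``cute.arch`` helpers follow the Blackwell example above):

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    PRODUCER_WARP = 0  # hypothetical role assignment for this sketch

    @cute.kernel
    def ws_kernel():
        # Each warp chooses a role from its warp index, just as in CuTe C++.
        warp_idx = cute.arch.make_warp_uniform(cute.arch.warp_idx())
        if warp_idx == PRODUCER_WARP:
            # Producer warp: would issue async copies and feed a pipeline.
            cute.printf("warp {}: producer", warp_idx)
        else:
            # Consumer warps: would wait on the pipeline and run the MMAs.
            cute.printf("warp {}: consumer", warp_idx)

    @cute.jit
    def run():
        # Two warps, so both branches are exercised; call run() from host
        # code once a CUDA context exists.
        ws_kernel().launch(grid=[1, 1, 1], block=[64, 1, 1])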

**Can I call functions from other functions or use OOP?**

Yes. We frequently call functions from one another and set up class
hierarchies to organize and modularize our code for pipelines and schedulers.
Consult the :doc:`cute_dsl_general/dsl_introduction` documentation or our examples for more details.
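
A small illustration of both, using a hypothetical helper class (not a CUTLASS API):

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    class Scaler:
        """Hypothetical helper class whose methods are traced by the DSL."""

        def __init__(self, factor):
            self.factor = factor

        def apply(self, x):
            return x * self.factor

    def double(x):
        # A plain Python function called from JIT'd code; it is inlined
        # into the traced program.
        return x + x

    @cute.jit
    def compose(n: cutlass.Int32):
        s = Scaler(3)
        cute.printf("result = {}", s.apply(double(n)))

    compose(cutlass.Int32(5))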

License
---------------------

**What is the license for CuTe DSL and the associated GitHub samples?**

CuTe DSL components available `on GitHub <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the ``nvidia-cutlass-dsl`` Python pip wheel
are released under the `NVIDIA Software End User License Agreement (EULA) <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,
it is subject to usage terms and restrictions similar to those of the CUDA SDK. Please refer to the EULA for specific terms of use.

CuTe DSL samples and Jupyter notebooks, released `on GitHub <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL>`__, are provided under
the BSD 3-Clause License and may be used and redistributed under those terms. This distinction ensures that developers have flexibility
when using or modifying the code samples, independent of the compiler and runtime components governed by the EULA.

If you have any questions or need clarification, feel free to contact us.