sglang_v0.5.2/flashinfer_0.3.1/docs/api/attention.rst

95 lines
2.0 KiB
ReStructuredText

.. _apiattention:
FlashInfer Attention Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
flashinfer.decode
=================
.. currentmodule:: flashinfer.decode
Single Request Decoding
-----------------------
.. autosummary::
:toctree: ../generated
single_decode_with_kv_cache
Batch Decoding
--------------
.. autosummary::
:toctree: ../generated
cudnn_batch_decode_with_kv_cache
trtllm_batch_decode_with_kv_cache
.. autoclass:: BatchDecodeWithPagedKVCacheWrapper
:members:
:exclude-members: begin_forward, end_forward, forward, forward_return_lse
.. automethod:: __init__
.. autoclass:: CUDAGraphBatchDecodeWithPagedKVCacheWrapper
:members:
.. automethod:: __init__
flashinfer.prefill
==================
Attention kernels for prefill & append attention in both single request and batch serving setting.
.. currentmodule:: flashinfer.prefill
Single Request Prefill/Append Attention
---------------------------------------
.. autosummary::
:toctree: ../generated
single_prefill_with_kv_cache
single_prefill_with_kv_cache_return_lse
Batch Prefill/Append Attention
------------------------------
.. autosummary::
:toctree: ../generated
cudnn_batch_prefill_with_kv_cache
trtllm_batch_context_with_kv_cache
.. autoclass:: BatchPrefillWithPagedKVCacheWrapper
:members:
:exclude-members: begin_forward, end_forward, forward, forward_return_lse
.. automethod:: __init__
.. autoclass:: BatchPrefillWithRaggedKVCacheWrapper
:members:
:exclude-members: begin_forward, end_forward, forward, forward_return_lse
.. automethod:: __init__
flashinfer.mla
==============
MLA (Multi-head Latent Attention) is an attention mechanism proposed in DeepSeek series of models (
`DeepSeek-V2 <https://arxiv.org/abs/2405.04434>`_, `DeepSeek-V3 <https://arxiv.org/abs/2412.19437>`_,
and `DeepSeek-R1 <https://arxiv.org/abs/2501.12948>`_).
.. currentmodule:: flashinfer.mla
PageAttention for MLA
---------------------
.. autoclass:: BatchMLAPagedAttentionWrapper
:members:
.. automethod:: __init__