.. _apiattention:

FlashInfer Attention Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

flashinfer.decode
=================

.. currentmodule:: flashinfer.decode

Single Request Decoding
-----------------------

.. autosummary::
    :toctree: ../generated

    single_decode_with_kv_cache
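
A minimal sketch of calling this kernel, assuming the default ``NHD`` layout and half-precision inputs; the shapes below (``q``: ``[num_qo_heads, head_dim]``, ``k``/``v``: ``[kv_len, num_kv_heads, head_dim]``) should be checked against the generated function documentation.

.. code-block:: python

    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

    # Query for a single decode step and the accumulated KV cache ("NHD" layout).
    q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
    k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
    v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

    # Grouped-query attention is handled when num_qo_heads != num_kv_heads.
    o = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]
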
Batch Decoding
--------------

.. autosummary::
    :toctree: ../generated

    cudnn_batch_decode_with_kv_cache
    trtllm_batch_decode_with_kv_cache

.. autoclass:: BatchDecodeWithPagedKVCacheWrapper
    :members:
    :exclude-members: begin_forward, end_forward, forward, forward_return_lse

    .. automethod:: __init__
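
A minimal plan/run sketch for this wrapper, assuming the ``NHD`` layout, equal-length requests, and a packed KV cache of shape ``[max_num_pages, 2, page_size, num_kv_heads, head_dim]``; the page-table values are illustrative, and the argument order of ``plan`` should be verified against the class documentation.

.. code-block:: python

    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
    batch_size, pages_per_seq = 4, 8
    max_num_pages = batch_size * pages_per_seq

    # Byte workspace buffer reused across plan/run calls.
    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

    # Page-table metadata: request i owns pages kv_page_indptr[i]:kv_page_indptr[i+1],
    # and its last page is completely filled in this example.
    kv_page_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
    kv_page_indices = torch.arange(max_num_pages, dtype=torch.int32, device="cuda")
    kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

    wrapper.plan(
        kv_page_indptr, kv_page_indices, kv_last_page_len,
        num_qo_heads, num_kv_heads, head_dim, page_size,
    )

    q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
    kv_cache = torch.randn(
        max_num_pages, 2, page_size, num_kv_heads, head_dim, dtype=torch.half, device="cuda"
    )
    o = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
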
.. autoclass:: CUDAGraphBatchDecodeWithPagedKVCacheWrapper
    :members:

    .. automethod:: __init__

flashinfer.prefill
==================

Attention kernels for prefill & append attention in both single-request and batch-serving settings.

.. currentmodule:: flashinfer.prefill

Single Request Prefill/Append Attention
---------------------------------------

.. autosummary::
    :toctree: ../generated

    single_prefill_with_kv_cache
    single_prefill_with_kv_cache_return_lse
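
A minimal sketch of a single-request prefill/append step, assuming the default ``NHD`` layout; ``causal=True`` applies the usual lower-triangular mask, aligned so that the new query tokens sit at the end of the KV cache.

.. code-block:: python

    import torch
    import flashinfer

    qo_len, kv_len = 128, 4096
    num_qo_heads, num_kv_heads, head_dim = 32, 8, 128

    # Append qo_len new tokens against a KV cache of kv_len tokens ("NHD" layout).
    q = torch.randn(qo_len, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
    k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
    v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

    o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)  # [qo_len, num_qo_heads, head_dim]
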
Batch Prefill/Append Attention
------------------------------

.. autosummary::
    :toctree: ../generated

    cudnn_batch_prefill_with_kv_cache
    trtllm_batch_context_with_kv_cache

.. autoclass:: BatchPrefillWithPagedKVCacheWrapper
    :members:
    :exclude-members: begin_forward, end_forward, forward, forward_return_lse

    .. automethod:: __init__
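
A minimal plan/run sketch for this wrapper, assuming the ``NHD`` layout, equal-length requests, and the same packed KV-cache shape as in the decode example; queries are packed along the first dimension and delimited by ``qo_indptr``. The argument order of ``plan`` should be verified against the class documentation.

.. code-block:: python

    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
    batch_size, qo_len_per_req, pages_per_seq = 4, 64, 8
    max_num_pages = batch_size * pages_per_seq

    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

    # Ragged query layout: request i owns rows qo_indptr[i]:qo_indptr[i+1] of q.
    qo_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda") * qo_len_per_req
    kv_page_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
    kv_page_indices = torch.arange(max_num_pages, dtype=torch.int32, device="cuda")
    kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

    wrapper.plan(
        qo_indptr, kv_page_indptr, kv_page_indices, kv_last_page_len,
        num_qo_heads, num_kv_heads, head_dim, page_size, causal=True,
    )

    q = torch.randn(batch_size * qo_len_per_req, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
    kv_cache = torch.randn(
        max_num_pages, 2, page_size, num_kv_heads, head_dim, dtype=torch.half, device="cuda"
    )
    o = wrapper.run(q, kv_cache)  # [total_qo_len, num_qo_heads, head_dim]
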
.. autoclass:: BatchPrefillWithRaggedKVCacheWrapper
    :members:
    :exclude-members: begin_forward, end_forward, forward, forward_return_lse

    .. automethod:: __init__
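
A minimal plan/run sketch for the ragged (non-paged) variant, where both queries and keys/values are packed along the first dimension and delimited by per-request ``indptr`` arrays; shapes and the argument order of ``plan`` are assumptions to verify against the class documentation.

.. code-block:: python

    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
    batch_size, qo_len_per_req, kv_len_per_req = 4, 64, 512

    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, kv_layout="NHD")

    # Packed (ragged) layout: request i owns rows indptr[i]:indptr[i+1].
    qo_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda") * qo_len_per_req
    kv_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda") * kv_len_per_req

    wrapper.plan(qo_indptr, kv_indptr, num_qo_heads, num_kv_heads, head_dim, causal=True)

    q = torch.randn(batch_size * qo_len_per_req, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
    k = torch.randn(batch_size * kv_len_per_req, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
    v = torch.randn(batch_size * kv_len_per_req, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
    o = wrapper.run(q, k, v)  # [total_qo_len, num_qo_heads, head_dim]
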
flashinfer.mla
==============

MLA (Multi-head Latent Attention) is an attention mechanism proposed in the DeepSeek series of models
(`DeepSeek-V2 <https://arxiv.org/abs/2405.04434>`_, `DeepSeek-V3 <https://arxiv.org/abs/2412.19437>`_,
and `DeepSeek-R1 <https://arxiv.org/abs/2501.12948>`_).

.. currentmodule:: flashinfer.mla

PageAttention for MLA
---------------------

.. autoclass:: BatchMLAPagedAttentionWrapper
    :members:

    .. automethod:: __init__
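
A rough plan/run sketch for the MLA wrapper, assuming one query token per request, ``page_size=1``, and DeepSeek-style head dimensions (``head_dim_ckv=512``, ``head_dim_kpe=64``); the positional arguments of ``plan``, the ``backend`` choice, and the cache shapes are best-effort assumptions and should be verified against the class documentation.

.. code-block:: python

    import torch
    import flashinfer

    batch_size, num_heads = 4, 128
    head_dim_ckv, head_dim_kpe, page_size = 512, 64, 1
    kv_len = 1024

    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrapper = flashinfer.mla.BatchMLAPagedAttentionWrapper(workspace, backend="fa2")

    # One query token per request; the compressed-KV (ckv) and rope-key (kpe) caches
    # are paged with page_size = 1, so each cached token occupies its own page.
    q_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda")
    kv_indptr = torch.arange(0, batch_size + 1, dtype=torch.int32, device="cuda") * kv_len
    kv_indices = torch.arange(batch_size * kv_len, dtype=torch.int32, device="cuda")
    kv_lens = torch.full((batch_size,), kv_len, dtype=torch.int32, device="cuda")

    # Softmax scale follows the pre-compression qk head dim (128 nope + 64 rope).
    sm_scale = 1.0 / ((128 + 64) ** 0.5)
    wrapper.plan(
        q_indptr, kv_indptr, kv_indices, kv_lens,
        num_heads, head_dim_ckv, head_dim_kpe, page_size,
        False, sm_scale, torch.half, torch.half,
    )

    q_nope = torch.randn(batch_size, num_heads, head_dim_ckv, dtype=torch.half, device="cuda")
    q_pe = torch.randn(batch_size, num_heads, head_dim_kpe, dtype=torch.half, device="cuda")
    ckv_cache = torch.randn(batch_size * kv_len, page_size, head_dim_ckv, dtype=torch.half, device="cuda")
    kpe_cache = torch.randn(batch_size * kv_len, page_size, head_dim_kpe, dtype=torch.half, device="cuda")
    o = wrapper.run(q_nope, q_pe, ckv_cache, kpe_cache)  # [batch_size, num_heads, head_dim_ckv]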