# Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graphs
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
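
Several of the features above are exposed directly through vLLM's Python API. The sketch below is a minimal illustration rather than a tuned setup: the model name is only an example, and the flags shown (prefix caching, chunked prefill) assume a recent vLLM release running on a CUDA-capable GPU.

```python
from vllm import LLM, SamplingParams

# Example model; substitute any Hugging Face model you have access to.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    gpu_memory_utilization=0.90,    # fraction of GPU memory vLLM may use for weights + KV cache
    enable_prefix_caching=True,     # reuse KV-cache blocks for prompts sharing a common prefix
    enable_chunked_prefill=True,    # split long prefills into chunks so decodes are not starved
)

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```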

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs; AMD CPUs and GPUs; Intel CPUs, GPUs, and Gaudi® accelerators; IBM Power CPUs; TPUs; and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
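
As one concrete example of the serving features above, the OpenAI-compatible server can be queried with the standard `openai` client, including streamed outputs. This is a minimal sketch assuming a server was started locally with `vllm serve` on the default port; the model name is only illustrative.

```python
# Assumes a server is already running, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint (default port)
    api_key="EMPTY",                      # no auth configured; placeholder value
)

# Stream a chat completion so tokens print as they are generated.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "What is continuous batching?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```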

For more information, check out the following: