# Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graphs
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
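
Several of the features above are exposed directly through vLLM's Python API. The sketch below is a minimal illustration rather than a tuned setup: the model name is only an example, and the flags shown (prefix caching, chunked prefill) assume a recent vLLM release running on a CUDA-capable GPU.

```python
from vllm import LLM, SamplingParams

# Example model; substitute any Hugging Face model you have access to.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    gpu_memory_utilization=0.90,    # fraction of GPU memory vLLM may use for weights + KV cache
    enable_prefix_caching=True,     # reuse KV-cache blocks for prompts sharing a common prefix
    enable_chunked_prefill=True,    # split long prefills into chunks so decodes are not starved
)

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```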

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs; AMD CPUs and GPUs; Intel CPUs, GPUs, and Gaudi® accelerators; IBM Power CPUs; TPUs; and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
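
As one concrete example of the serving features above, the OpenAI-compatible server can be queried with the standard `openai` client, including streamed outputs. This is a minimal sketch assuming a server was started locally with `vllm serve` on the default port; the model name is only illustrative.

```python
# Assumes a server is already running, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint (default port)
    api_key="EMPTY",                      # no auth configured; placeholder value
)

# Stream a chat completion so tokens print as they are generated.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "What is continuous batching?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```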

For more information, check out the following: