Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
   :width: 60%
   :align: center
   :alt: vLLM
   :class: no-scaled-link

.. raw:: html
   <p style="text-align:center">
   <strong>Easy, fast, and cheap LLM serving for everyone</strong>
   </p>
vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Dynamic batching of incoming requests
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
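As a quick taste of the Python API, the snippet below is a minimal sketch of offline batch inference with the ``LLM`` class and ``SamplingParams``; the prompts and the ``facebook/opt-125m`` model name are illustrative placeholders and can be swapped for any supported HuggingFace model.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Placeholder prompts and sampling settings; adjust for your use case.
   prompts = ["Hello, my name is", "The future of AI is"]
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

   # Load any supported HuggingFace model by name.
   llm = LLM(model="facebook/opt-125m")

   # generate() batches the prompts and returns one RequestOutput per prompt.
   outputs = llm.generate(prompts, sampling_params)
   for output in outputs:
       print(output.prompt, output.outputs[0].text)

The same models can also be served over HTTP through the OpenAI-compatible API server mentioned above.

For more information, please refer to our `blog post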