Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
  :width: 60%
  :align: center
  :alt: vLLM
  :class: no-scaled-link

.. raw:: html

   Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graph
* Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA GPUs and AMD GPUs
* (Experimental) Prefix caching support
* (Experimental) Multi-LoRA support

For more information, check out the following:

* vLLM announcement blog post (intro to PagedAttention)
* vLLM paper (SOSP 2023)
* "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency" by Cade Daniel et al.

Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/quickstart

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/distributed_serving
   serving/run_on_sky
   serving/deploying_with_triton
   serving/deploying_with_docker
   serving/serving_with_langchain
   serving/metrics

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model
   models/engine_args

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/auto_awq

.. toctree::
   :maxdepth: 2
   :caption: Developer Documentation

   dev/engine/engine_index

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`