Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
   :width: 60%
   :align: center
   :alt: vLLM
   :class: no-scaled-link

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graph
* Quantization: `GPTQ `_, `AWQ `_, INT4, INT8, and FP8
* Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
* Speculative decoding
* Chunked prefill

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia accelerators
* Prefix caching support
* Multi-LoRA support

A brief usage sketch appears further down this page.

For more information, check out the following:

* `vLLM announcing blog post `_ (intro to PagedAttention)
* `vLLM paper `_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency `_ by Cade Daniel et al.
* :ref:`vLLM Meetups `.

Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/openvino-installation
   getting_started/cpu-installation
   getting_started/gaudi-installation
   getting_started/arm-installation
   getting_started/neuron-installation
   getting_started/tpu-installation
   getting_started/xpu-installation
   getting_started/quickstart
   getting_started/debugging
   getting_started/examples/examples_index

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/deploying_with_k8s
   serving/deploying_with_helm
   serving/deploying_with_nginx
   serving/distributed_serving
   serving/metrics
   serving/integrations
   serving/tensorizer

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/generative_models
   models/pooling_models
   models/adding_model
   models/enabling_multimodal_inputs

.. toctree::
   :maxdepth: 1
   :caption: Usage

   usage/lora
   usage/multimodal_inputs
   usage/tool_calling
   usage/structured_outputs
   usage/spec_decode
   usage/compatibility_matrix
   usage/performance
   usage/faq
   usage/engine_args
   usage/env_vars
   usage/usage_stats
   usage/disagg_prefill

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/supported_hardware
   quantization/auto_awq
   quantization/bnb
   quantization/gguf
   quantization/int8
   quantization/fp8
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache

.. toctree::
   :maxdepth: 1
   :caption: Automatic Prefix Caching

   automatic_prefix_caching/apc
   automatic_prefix_caching/details

.. toctree::
   :maxdepth: 1
   :caption: Performance

   performance/benchmarks

.. Community: User community resources

.. toctree::
   :maxdepth: 1
   :caption: Community

   community/meetups
   community/sponsors

.. API Documentation: API reference aimed at vllm library usage

.. toctree::
   :maxdepth: 2
   :caption: API Documentation

   dev/sampling_params
   dev/pooling_params
   dev/offline_inference/offline_index
   dev/engine/engine_index

.. Design: docs about vLLM internals

.. toctree::
   :maxdepth: 2
   :caption: Design

   design/arch_overview
   design/huggingface_integration
   design/plugin_system
   design/input_processing/model_inputs_index
   design/kernel/paged_attention
   design/multimodal/multimodal_index
   design/multiprocessing
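To make the library usage described above concrete, here is a minimal offline-inference sketch in the spirit of the quickstart. The model name and sampling settings are illustrative placeholders; any model supported by vLLM can be substituted, and the same models can also be exposed through the OpenAI-compatible API server covered under Serving.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Illustrative model; substitute any HuggingFace model supported by vLLM.
   llm = LLM(model="facebook/opt-125m")

   # Illustrative sampling settings.
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

   prompts = [
       "Hello, my name is",
       "The capital of France is",
   ]

   # generate() batches the prompts and returns one RequestOutput per prompt.
   outputs = llm.generate(prompts, sampling_params)

   for output in outputs:
       print(output.prompt, "->", output.outputs[0].text)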
.. For Developers: contributing to the vLLM project

.. toctree::
   :maxdepth: 2
   :caption: For Developers

   contributing/overview
   contributing/profiling/profiling_index
   contributing/dockerfile/dockerfile

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`