# Welcome to vLLM!

```{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
```

**Easy, fast, and cheap LLM serving for everyone**


vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models (see the example below)
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
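As a quick taste of the Python API, here is a minimal sketch of offline batched inference; the model name and sampling values are placeholders, and the quickstart guide covers installation and the full set of options.

```python
from vllm import LLM, SamplingParams

# Load any HuggingFace model by name; "facebook/opt-125m" is just a small placeholder.
llm = LLM(model="facebook/opt-125m")

# Illustrative sampling settings; see SamplingParams for all available options.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Hello, my name is", "The capital of France is"]

# Prompts are batched and scheduled automatically by the engine.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```

The same models can also be served over HTTP through the OpenAI-compatible API server described in the Serving section.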
For more information, check out the following:

- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- {ref}`vLLM Meetups <meetups>`

## Documentation

```{toctree}
:caption: Getting Started
:maxdepth: 1

getting_started/installation
getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation
getting_started/gaudi-installation
getting_started/arm-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
getting_started/quickstart
getting_started/debugging
getting_started/examples/examples_index
```

```{toctree}
:caption: Serving
:maxdepth: 1

serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
```

```{toctree}
:caption: Models
:maxdepth: 1

models/supported_models
models/generative_models
models/pooling_models
models/adding_model
models/enabling_multimodal_inputs
```

```{toctree}
:caption: Usage
:maxdepth: 1

usage/lora
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/faq
usage/engine_args
usage/env_vars
usage/usage_stats
usage/disagg_prefill
```

```{toctree}
:caption: Quantization
:maxdepth: 1

quantization/supported_hardware
quantization/auto_awq
quantization/bnb
quantization/gguf
quantization/int8
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
```

```{toctree}
:caption: Automatic Prefix Caching
:maxdepth: 1

automatic_prefix_caching/apc
automatic_prefix_caching/details
```

```{toctree}
:caption: Performance
:maxdepth: 1

performance/benchmarks
```

% Community: User community resources

```{toctree}
:caption: Community
:maxdepth: 1

community/meetups
community/sponsors
```

% API Documentation: API reference aimed at vllm library usage

```{toctree}
:caption: API Documentation
:maxdepth: 2

dev/sampling_params
dev/pooling_params
dev/offline_inference/offline_index
dev/engine/engine_index
```

% Design: docs about vLLM internals

```{toctree}
:caption: Design
:maxdepth: 2

design/arch_overview
design/huggingface_integration
design/plugin_system
design/input_processing/model_inputs_index
design/kernel/paged_attention
design/multimodal/multimodal_index
design/multiprocessing
```

% For Developers: contributing to the vLLM project

```{toctree}
:caption: For Developers
:maxdepth: 2

contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
```

# Indices and tables

- {ref}`genindex`
- {ref}`modindex`