Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
   :width: 60%
   :align: center
   :alt: vLLM
   :class: no-scaled-link

.. raw:: html
   <p style="text-align:center">
   <strong>Easy, fast, and cheap LLM serving for everyone</strong>
   </p>
vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Dynamic batching of incoming requests
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
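As a quick taste of the Python API, the snippet below is a minimal sketch of offline batch inference with the ``LLM`` class and ``SamplingParams``; the prompts and the ``facebook/opt-125m`` model name are illustrative placeholders and can be swapped for any supported HuggingFace model.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Placeholder prompts and sampling settings; adjust for your use case.
   prompts = ["Hello, my name is", "The future of AI is"]
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

   # Load any supported HuggingFace model by name.
   llm = LLM(model="facebook/opt-125m")

   # generate() batches the prompts and returns one RequestOutput per prompt.
   outputs = llm.generate(prompts, sampling_params)
   for output in outputs:
       print(output.prompt, output.outputs[0].text)

The same models can also be served over HTTP through the OpenAI-compatible API server mentioned above.

For more information, please refer to our `blog post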