Welcome to vLLM!
================

**vLLM** is a fast and easy-to-use library for LLM inference and serving.
Its core features include:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

For more information, please refer to our `blog post <>`_.
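
As a quick taste, the sketch below shows offline batched inference with vLLM's ``LLM`` class; the model name, prompts, and sampling settings are only illustrative placeholders.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Example prompts to complete in a single batch (placeholders).
   prompts = [
       "Hello, my name is",
       "The future of AI is",
   ]

   # Sampling settings; the temperature and top_p values are arbitrary.
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

   # Load any supported HuggingFace model; OPT-125M is used here only as an example.
   llm = LLM(model="facebook/opt-125m")

   # Generate completions for all prompts at once.
   outputs = llm.generate(prompts, sampling_params)

   for output in outputs:
       print(output.prompt, output.outputs[0].text)

vLLM can also be deployed as an OpenAI-compatible API server with streaming outputs; see the quickstart below for details.
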
Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/quickstart

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model