Welcome to vLLM!
================
**vLLM** is a fast and easy-to-use library for LLM inference and serving.

Its core features include:

- State-of-the-art performance in serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

For more information, please refer to our `blog post <>`_.
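
As a quick taste of the library, the following is a minimal sketch of offline batched inference with vLLM's Python API. The model name ``facebook/opt-125m`` is only an illustrative placeholder; see the Quickstart for a full walkthrough.

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Prompts to complete, processed together as one batch.
    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]

    # Sampling settings: temperature and nucleus (top-p) sampling.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Load the model; any supported HuggingFace model name can be used here.
    llm = LLM(model="facebook/opt-125m")

    # Generate completions for all prompts in a single call.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, output.outputs[0].text)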
Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/quickstart

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model