Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
  :width: 60%
  :align: center
  :alt: vLLM
  :class: no-scaled-link

.. raw:: html

   <p style="text-align:center">
   <strong>Easy, fast, and cheap LLM serving for everyone
   </strong>
   </p>

   <p style="text-align:center">
   <script async defer src="https://buttons.github.io/buttons.js"></script>
   <a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
   </p>


vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graphs
* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular Hugging Face models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA and AMD GPUs
* (Experimental) Prefix caching support
* (Experimental) Multi-LoRA support

For more information, check out the following:

* `vLLM announcement blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `vLLM paper <https://arxiv.org/abs/2309.06180>`_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.


Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/neuron-installation
   getting_started/cpu-installation
   getting_started/quickstart

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/distributed_serving
   serving/metrics
   serving/usage_stats
   serving/integrations

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model
   models/engine_args
   models/lora

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/auto_awq
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache

.. toctree::
   :maxdepth: 2
   :caption: Developer Documentation

   dev/sampling_params
   dev/engine/engine_index
   dev/kernel/paged_attention

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`