Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
  :width: 60%
  :align: center
  :alt: vLLM
  :class: no-scaled-link

.. raw:: html

   <p style="text-align:center">
   <strong>Easy, fast, and cheap LLM serving for everyone
   </strong>
   </p>

   <p style="text-align:center">
   <script async defer src="https://buttons.github.io/buttons.js"></script>
   <a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
   </p>


vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graphs
* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular Hugging Face models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA GPUs and AMD GPUs
* (Experimental) Prefix caching support
* (Experimental) Multi-LoRA support

For more information, check out the following:

* `vLLM announcement blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `vLLM paper <https://arxiv.org/abs/2309.06180>`_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
* :ref:`vLLM Meetups <meetups>`


Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/openvino-installation
   getting_started/cpu-installation
   getting_started/neuron-installation
   getting_started/tpu-installation
   getting_started/xpu-installation
   getting_started/quickstart
   getting_started/debugging
   getting_started/examples/examples_index

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/distributed_serving
   serving/metrics
   serving/env_vars
   serving/usage_stats
   serving/integrations
   serving/tensorizer
   serving/faq

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model
   models/enabling_multimodal_inputs
   models/engine_args
   models/lora
   models/vlm
   models/spec_decode
   models/performance

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/supported_hardware
   quantization/auto_awq
   quantization/bnb
   quantization/fp8
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache

.. toctree::
   :maxdepth: 1
   :caption: Automatic Prefix Caching

   automatic_prefix_caching/apc
   automatic_prefix_caching/details

.. toctree::
   :maxdepth: 1
   :caption: Performance Benchmarks

   performance_benchmark/benchmarks

.. toctree::
   :maxdepth: 2
   :caption: Developer Documentation

   dev/sampling_params
   dev/offline_inference/offline_index
   dev/engine/engine_index
   dev/kernel/paged_attention
   dev/input_processing/model_inputs_index
   dev/multimodal/multimodal_index
   dev/dockerfile/dockerfile

.. toctree::
   :maxdepth: 1
   :caption: Community

   community/meetups
   community/sponsors

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`