Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
  :width: 60%
  :align: center
  :alt: vLLM
  :class: no-scaled-link

.. raw:: html

   <p style="text-align:center">
   <strong>Easy, fast, and cheap LLM serving for everyone
   </strong>
   </p>

   <p style="text-align:center">
   <script async defer src="https://buttons.github.io/buttons.js"></script>
   <a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
   </p>


vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graph
* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, INT4, INT8, and FP8
* Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
* Speculative decoding
* Chunked prefill

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
* Prefix caching support
* Multi-LoRA support
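
As a rough illustration of the OpenAI-compatible API server listed above, the sketch below builds (but does not send) a completion request using only the Python standard library. The base URL ``http://localhost:8000``, the model name ``my-model``, and the helper ``build_completion_request`` are assumptions made for illustration and are not part of vLLM; substitute the address and model of your own running server.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000", max_tokens=32):
    """Build an HTTP request for an OpenAI-style /v1/completions endpoint.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "my-model" is a placeholder; use the model your server was started with.
request = build_completion_request("Hello, my name is", model="my-model")
print(request.full_url)

# To actually send the request, a server must be listening at base_url:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Because the request follows the OpenAI completions format, existing OpenAI client code can usually be pointed at the server by changing only its base URL.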

For more information, check out the following:

* `vLLM announcement blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `vLLM paper <https://arxiv.org/abs/2309.06180>`_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
* :ref:`vLLM Meetups <meetups>`


Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/openvino-installation
   getting_started/cpu-installation
   getting_started/gaudi-installation
   getting_started/neuron-installation
   getting_started/tpu-installation
   getting_started/xpu-installation
   getting_started/quickstart
   getting_started/debugging
   getting_started/examples/examples_index

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/deploying_with_k8s
   serving/deploying_with_nginx
   serving/distributed_serving
   serving/metrics
   serving/env_vars
   serving/usage_stats
   serving/integrations
   serving/tensorizer
   serving/compatibility_matrix
   serving/faq

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model
   models/enabling_multimodal_inputs
   models/engine_args
   models/lora
   models/vlm
   models/spec_decode
   models/performance

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/supported_hardware
   quantization/auto_awq
   quantization/bnb
   quantization/gguf
   quantization/int8
   quantization/fp8
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache

.. toctree::
   :maxdepth: 1
   :caption: Automatic Prefix Caching

   automatic_prefix_caching/apc
   automatic_prefix_caching/details

.. toctree::
   :maxdepth: 1
   :caption: Performance

   performance/benchmarks

.. Community: User community resources

.. toctree::
   :maxdepth: 1
   :caption: Community

   community/meetups
   community/sponsors

.. API Documentation: API reference aimed at vllm library usage

.. toctree::
   :maxdepth: 2
   :caption: API Documentation

   dev/sampling_params
   dev/pooling_params
   dev/offline_inference/offline_index
   dev/engine/engine_index

.. Design: docs about vLLM internals

.. toctree::
   :maxdepth: 2
   :caption: Design

   design/class_hierarchy
   design/huggingface_integration
   design/input_processing/model_inputs_index
   design/kernel/paged_attention
   design/multimodal/multimodal_index

.. For Developers: contributing to the vLLM project

.. toctree::
   :maxdepth: 2
   :caption: For Developers

   contributing/overview
   contributing/profiling/profiling_index
   contributing/dockerfile/dockerfile

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`