[Doc][5/N] Move Community and API Reference to the bottom (#11896)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Authored by Cyrus Leung on 2025-01-10 11:10:12 +08:00, committed by GitHub
parent 36f5303578
commit c3cf54dda4
3 changed files with 40 additions and 26 deletions


@@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.


@@ -2,7 +2,7 @@
# Automatic Prefix Caching
-The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
+The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
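The observation above lends itself to a simple hashing scheme. Below is a minimal Python sketch, not vLLM's actual implementation (the fixed `BLOCK_SIZE` and the `block_hashes` helper are hypothetical names), showing how chaining each block's hash with its parent's hash makes a block's identity depend on both its own tokens and the entire prefix before it:

```python
from typing import List, Optional, Tuple

BLOCK_SIZE = 16  # hypothetical fixed number of tokens per KV block


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Return one hash per full block; equal prefixes yield equal hashes."""
    hashes: List[int] = []
    parent: Optional[int] = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block: Tuple[int, ...] = tuple(token_ids[start:start + block_size])
        # A block's identity = the identity of its prefix (the parent hash)
        # combined with the tokens stored in the block itself.
        parent = hash((parent, block))
        hashes.append(parent)
    return hashes


# Two prompts that share their first 32 tokens produce identical hashes for
# their first two blocks, so those KV blocks could be reused from a cache.
a = block_hashes(list(range(48)))
b = block_hashes(list(range(32)) + [999] * 16)
assert a[:2] == b[:2] and a[2] != b[2]
```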


@@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
@@ -54,6 +54,8 @@ For more information, check out the following:
## Documentation
+% How to start using vLLM?
```{toctree}
:caption: Getting Started
:maxdepth: 1
@@ -65,6 +67,8 @@ getting_started/troubleshooting
getting_started/faq
```
+% What does vLLM support?
```{toctree}
:caption: Models
:maxdepth: 1
@@ -75,6 +79,8 @@ models/supported_models
models/extensions/index
```
+% Additional capabilities
```{toctree}
:caption: Features
:maxdepth: 1
@@ -89,6 +95,8 @@ features/spec_decode
features/compatibility_matrix
```
+% Details about running vLLM
```{toctree}
:caption: Inference and Serving
:maxdepth: 1
@@ -104,6 +112,8 @@ serving/usage_stats
serving/integrations/index
```
+% Scaling up vLLM for production
```{toctree}
:caption: Deployment
:maxdepth: 1
@@ -115,6 +125,8 @@ deployment/frameworks/index
deployment/integrations/index
```
+% Making the most out of vLLM
```{toctree}
:caption: Performance
:maxdepth: 1
@@ -123,28 +135,7 @@ performance/optimization
performance/benchmarks
```
-% Community: User community resources
-```{toctree}
-:caption: Community
-:maxdepth: 1
-community/meetups
-community/sponsors
-```
-```{toctree}
-:caption: API Reference
-:maxdepth: 2
-api/offline_inference/index
-api/engine/index
-api/inference_params
-api/multimodal/index
-api/model/index
-```
-% Design Documents: Details about vLLM internals
+% Explanation of vLLM internals
```{toctree}
:caption: Design Documents
@@ -159,7 +150,7 @@ design/automatic_prefix_caching
design/multiprocessing
```
-% Developer Guide: How to contribute to the vLLM project
+% How to contribute to the vLLM project
```{toctree}
:caption: Developer Guide
@@ -172,6 +163,29 @@ contributing/model/index
contributing/vulnerability_management
```
+% Technical API specifications
+```{toctree}
+:caption: API Reference
+:maxdepth: 2
+api/offline_inference/index
+api/engine/index
+api/inference_params
+api/multimodal/index
+api/model/index
+```
+% Latest news and acknowledgements
+```{toctree}
+:caption: Community
+:maxdepth: 1
+community/meetups
+community/sponsors
+```
# Indices and tables
- {ref}`genindex`