[Doc][5/N] Move Community and API Reference to the bottom (#11896)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Simon Mo <simon.mo@hey.com>

parent 36f5303578
commit c3cf54dda4
@@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
 vLLM is fast with:
 
 - State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
 - Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
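As a concrete illustration of the quantization options listed in the feature bullets above, here is a minimal offline-inference sketch. It assumes vLLM's `LLM` entrypoint and its `quantization` parameter; the checkpoint name is illustrative.

```python
# Minimal sketch: selecting a quantization method in vLLM's offline API.
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint (model name illustrative);
# `quantization` also accepts values such as "gptq" or "fp8".
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```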
@@ -2,7 +2,7 @@
 
 # Automatic Prefix Caching
 
-The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
+The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
 
 To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
 
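To make the key observation above concrete, here is a minimal sketch, not vLLM's actual implementation: each full KV block is identified by hashing its own tokens together with the hash of the blocks before it, so a block's identity depends on the entire prefix. `BLOCK_SIZE` and the hashing scheme are illustrative.

```python
# Minimal sketch of prefix-based KV block identification.
from hashlib import sha256

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Identify each full KV block by its tokens plus the hash of its prefix."""
    hashes: list[str] = []
    prefix_hash = ""
    for i in range(len(token_ids) // BLOCK_SIZE):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        # Chaining in prefix_hash makes each hash depend on all prior tokens.
        h = sha256(f"{prefix_hash}|{block}".encode()).hexdigest()
        hashes.append(h)
        prefix_hash = h
    return hashes

# Two requests sharing their first 16 tokens produce the same first block
# hash, so that block's KV cache can be reused; their second blocks differ.
a = block_hashes(list(range(32)))
b = block_hashes(list(range(16)) + list(range(100, 116)))
assert a[0] == b[0] and a[1] != b[1]
```

A cache keyed by these hashes then lets a new request reuse any leading run of blocks whose hashes match its own.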
@@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
 vLLM is fast with:
 
 - State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
@@ -54,6 +54,8 @@ For more information, check out the following:
 
 ## Documentation
 
+% How to start using vLLM?
+
 ```{toctree}
 :caption: Getting Started
 :maxdepth: 1
@@ -65,6 +67,8 @@ getting_started/troubleshooting
 getting_started/faq
 ```
 
+% What does vLLM support?
+
 ```{toctree}
 :caption: Models
 :maxdepth: 1
@@ -75,6 +79,8 @@ models/supported_models
 models/extensions/index
 ```
 
+% Additional capabilities
+
 ```{toctree}
 :caption: Features
 :maxdepth: 1
@@ -89,6 +95,8 @@ features/spec_decode
 features/compatibility_matrix
 ```
 
+% Details about running vLLM
+
 ```{toctree}
 :caption: Inference and Serving
 :maxdepth: 1
@@ -104,6 +112,8 @@ serving/usage_stats
 serving/integrations/index
 ```
 
+% Scaling up vLLM for production
+
 ```{toctree}
 :caption: Deployment
 :maxdepth: 1
@@ -115,6 +125,8 @@ deployment/frameworks/index
 deployment/integrations/index
 ```
 
+% Making the most out of vLLM
+
 ```{toctree}
 :caption: Performance
 :maxdepth: 1
@@ -123,28 +135,7 @@ performance/optimization
 performance/benchmarks
 ```
 
-% Community: User community resources
+% Explanation of vLLM internals
 
-```{toctree}
-:caption: Community
-:maxdepth: 1
-
-community/meetups
-community/sponsors
-```
-
-```{toctree}
-:caption: API Reference
-:maxdepth: 2
-
-api/offline_inference/index
-api/engine/index
-api/inference_params
-api/multimodal/index
-api/model/index
-```
-
-% Design Documents: Details about vLLM internals
-
 ```{toctree}
 :caption: Design Documents
@@ -159,7 +150,7 @@ design/automatic_prefix_caching
 design/multiprocessing
 ```
 
-% Developer Guide: How to contribute to the vLLM project
+% How to contribute to the vLLM project
 
 ```{toctree}
 :caption: Developer Guide
@@ -172,6 +163,29 @@ contributing/model/index
 contributing/vulnerability_management
 ```
 
+% Technical API specifications
+
+```{toctree}
+:caption: API Reference
+:maxdepth: 2
+
+api/offline_inference/index
+api/engine/index
+api/inference_params
+api/multimodal/index
+api/model/index
+```
+
+% Latest news and acknowledgements
+
+```{toctree}
+:caption: Community
+:maxdepth: 1
+
+community/meetups
+community/sponsors
+```
+
 # Indices and tables
 
 - {ref}`genindex`