vllm/docs/source/performance/optimization.md

(optimization-and-tuning)=

# Optimization and Tuning

## Preemption

Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed:

```
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```

While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
If you frequently encounter preemptions from the vLLM engine, consider the following actions:

- Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.

You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.

(chunked-prefill)=

## Chunked Prefill

vLLM supports an experimental feature chunked prefill. Chunked prefill allows to chunk large prefills into smaller chunks and batch them together with decode requests.

You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.

```python
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
# Set max_num_batched_tokens to tune performance.
# NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=2048)
```

By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
This policy optimizes the TTFT (time to the first token), but incurs slower ITL (inter token latency) and inefficient GPU utilization.

Once chunked prefill is enabled, the policy is changed to prioritize decode requests.
It batches all pending decode requests to the batch before scheduling any prefill.
When there are available token_budget (`max_num_batched_tokens`), it schedules pending prefills.
If a last pending prefill request cannot fit into `max_num_batched_tokens`, it chunks it.

This policy has two benefits:

- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.

You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.

- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.

We recommend you set `max_num_batched_tokens > 2048` for throughput.

See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).

Please try out this feature and let us know your feedback via GitHub issues!
[Doc][2/N] Reorganize Models and Usage sections (#11755) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-01-06 21:40:31 +08:00			`(optimization-and-tuning)=`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
[Doc][2/N] Reorganize Models and Usage sections (#11755) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-01-06 21:40:31 +08:00			`# Optimization and Tuning`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
			`## Preemption`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
[Scheduler] Warning upon preemption and Swapping (#4647) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> 2024-05-13 23:50:44 +09:00			`Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.`
			`The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes`
			`available again. When this occurs, the following warning is printed:`

			```
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1`
[Scheduler] Warning upon preemption and Swapping (#4647) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> 2024-05-13 23:50:44 +09:00			```

			`While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.`
			`If you frequently encounter preemptions from the vLLM engine, consider the following actions:`

			- Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
			- Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
			- Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.

			`You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.`

[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`(chunked-prefill)=`
[Doc] Compatibility matrix for mutual exclusive features (#8512) Signed-off-by: Wallas Santos <wallashss@ibm.com> 2024-10-11 15:18:50 -03:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`## Chunked Prefill`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`vLLM supports an experimental feature chunked prefill. Chunked prefill allows to chunk large prefills into smaller chunks and batch them together with decode requests.`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			```python
			`llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)`
			`# Set max_num_batched_tokens to tune performance.`
Update default max_num_batch_tokens for chunked prefill (#11694) 2025-01-02 19:25:53 -05:00			`# NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.`
			`# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=2048)`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			```
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00			`By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.`
			`This policy optimizes the TTFT (time to the first token), but incurs slower ITL (inter token latency) and inefficient GPU utilization.`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00			`Once chunked prefill is enabled, the policy is changed to prioritize decode requests.`
			`It batches all pending decode requests to the batch before scheduling any prefill.`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			When there are available token_budget (`max_num_batched_tokens`), it schedules pending prefills.
			If a last pending prefill request cannot fit into `max_num_batched_tokens`, it chunks it.
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00			`This policy has two benefits:`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00			`- It improves ITL and generation decode because decode requests are prioritized.`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00			`- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.`

Update default max_num_batch_tokens for chunked prefill (#11694) 2025-01-02 19:25:53 -05:00			You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
			Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
Update default max_num_batch_tokens for chunked prefill (#11694) 2025-01-02 19:25:53 -05:00			- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			We recommend you set `max_num_batched_tokens > 2048` for throughput.
chunked-prefill-doc-syntax (#4603) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com> 2024-05-09 22:13:23 -07:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).`
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 16:18:00 +09:00
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`Please try out this feature and let us know your feedback via GitHub issues!`