[Doc] Add more tips to avoid OOM (#16765)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
commit 61a44a0b22
parent a6481525b8
@@ -28,6 +28,8 @@ Please refer to the above pages for more details about each API.
 [API Reference](/api/offline_inference/index)
 :::
 
+(configuration-options)=
+
 ## Configuration Options
 
 This section lists the most common options for running the vLLM engine.
@@ -184,6 +186,29 @@ llm = LLM(model="google/gemma-3-27b-it",
           limit_mm_per_prompt={"image": 0})
 ```
 
+#### Multi-modal processor arguments
+
+For certain models, you can adjust the multi-modal processor arguments to
+reduce the size of the processed multi-modal inputs, which in turn saves memory.
+
+Here are some examples:
+
+```python
+from vllm import LLM
+
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_kwargs={
+              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+          })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+          mm_processor_kwargs={
+              "max_dynamic_patch": 4,  # Default is 12
+          })
+```
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by finetuning various options.
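The two memory-saving knobs in this hunk compose: `limit_mm_per_prompt` (visible in the surrounding context) caps how many multi-modal items a prompt may carry, while `mm_processor_kwargs` shrinks each processed item. A minimal sketch combining both, reusing the Qwen2.5-VL model from the example above; the cap of 2 images is illustrative, not from the diff:

```python
from vllm import LLM

# Cap the number of images per prompt *and* the processed size of each image;
# both reduce peak memory for multi-modal inputs.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"image": 2},               # illustrative cap
    mm_processor_kwargs={"max_pixels": 768 * 768},  # default noted above: 1280 * 28 * 28
)
```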
@@ -33,11 +33,13 @@ print(completion.choices[0].message)
 vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
 You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, e.g. `extra_body={"top_k": 50}` for `top_k`.
 :::
 
 :::{important}
 By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
 
 To disable this behavior, please pass `--generation-config vllm` when launching the server.
 :::
 
 ## Supported APIs
 
 We currently support the following OpenAI APIs:
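For concreteness, here is a minimal sketch of the `extra_body` pattern described above, assuming a vLLM server already running locally on the default port; the model name is illustrative:

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative; use whatever model you serve
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"top_k": 50},  # vLLM-only sampling parameter, tunneled via extra_body
)
print(completion.choices[0].message)
```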
@@ -172,6 +174,12 @@ print(completion._request_id)
 
 The `vllm serve` command is used to launch the OpenAI-compatible server.
 
+:::{tip}
+The vast majority of command-line arguments are based on those for offline inference.
+
+See [here](configuration-options) for some common options.
+:::
+
 :::{argparse}
 :module: vllm.entrypoints.openai.cli_args
 :func: create_parser_for_docs
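To make the new tip concrete: most offline `LLM(...)` keyword arguments have a matching `vllm serve` flag. A hedged sketch using `max_model_len`; both the model name and the 4096 value are illustrative:

```python
from vllm import LLM

# Offline inference: cap the context length, which bounds KV-cache memory.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", max_model_len=4096)

# The equivalent when launching the OpenAI-compatible server (shell command):
#   vllm serve Qwen/Qwen2.5-VL-3B-Instruct --max-model-len 4096
```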