(offline-inference)=

# Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```

After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:

- [Generative models](#generative-models) output logprobs, from which the final output text is sampled.
- [Pooling models](#pooling-models) output their hidden states directly.

Please refer to the above pages for more details about each API.
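
For instance, generative models are typically driven through `LLM.generate`, which takes a batch of prompts plus sampling parameters and returns one completion per prompt. A minimal sketch (the prompts and sampling settings are only illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Sample up to 32 new tokens for each prompt.
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)

for output in outputs:
    print(output.outputs[0].text)
```
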
:::{seealso}
[API Reference](/api/offline_inference/index)
:::

## Configuration Options

This section lists the most common options for running the vLLM engine.
For a full list, refer to the [Engine Arguments](#engine-args) page.

(model-resolution)=

### Model resolution

vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:

- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.

To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:

```python
from vllm import LLM

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```

Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.

### Reducing memory usage

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

#### Tensor Parallelism (TP)

Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

```python
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

:::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
:::
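
For example, to make a 2-way tensor-parallel run use only the first two GPUs, you could set the variable before vLLM (and CUDA) is initialized. This is only a sketch; the device IDs are illustrative:

```python
import os

# Select the GPUs to use; this must happen before vLLM (and CUDA) is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

Setting the variable in the shell that launches the script works just as well.
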
#### Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
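
For example, you can ask vLLM to quantize the weights at load time by naming a method. This is only a sketch; the model and the `"fp8"` method shown here are illustrative, and the method you pick must be supported by your hardware:

```python
from vllm import LLM

# Quantize the model dynamically at load time instead of
# downloading a pre-quantized checkpoint.
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          quantization="fp8")
```
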
#### Context length and batch size

You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).

```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)
```

#### Adjust cache size

If you run out of CPU RAM, try the following options:

- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
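
Both are environment variables, so set them before vLLM is initialized, either in the launching shell or at the top of your script. A minimal sketch with illustrative values:

```python
import os

# Shrink both caches; the values are in GiB and purely illustrative.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"   # multi-modal models only
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"    # CPU backend only

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
```
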
#### Disable unused modalities

You can disable unused modalities (except for text) by setting their limits to zero.

For example, if your application only accepts image input, there is no need to allocate any memory for videos.

```python
from vllm import LLM

# Accept images but not videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```

You can even run a multi-modal model for text-only inference:

```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```

### Performance optimization and tuning

You can potentially improve the performance of vLLM by tuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.