(offline-inference)=

# Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```

After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:

- [Generative models](#generative-models) output logprobs, from which the final output text is sampled.
- [Pooling models](#pooling-models) output their hidden states directly.

Please refer to the above pages for more details about each API.
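
For instance, generative models are typically driven through `LLM.generate`, which takes a batch of prompts plus sampling parameters and returns one completion per prompt. A minimal sketch (the prompts and sampling settings are only illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Sample up to 32 new tokens for each prompt.
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)

for output in outputs:
    print(output.outputs[0].text)
```
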
:::{seealso}
[API Reference](/api/offline_inference/index)
:::

## Configuration Options

This section lists the most common options for running the vLLM engine.
For a full list, refer to the [Engine Arguments](#engine-args) page.

(model-resolution)=

### Model resolution

vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:

- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.

To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:

```python
from vllm import LLM

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```

Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.

### Reducing memory usage

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

#### Tensor Parallelism (TP)

Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

```python
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

:::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
:::
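
For example, to make a 2-way tensor-parallel run use only the first two GPUs, you could set the variable before vLLM (and CUDA) is initialized. This is only a sketch; the device IDs are illustrative:

```python
import os

# Select the GPUs to use; this must happen before vLLM (and CUDA) is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

Setting the variable in the shell that launches the script works just as well.
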
#### Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
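
For example, you can ask vLLM to quantize the weights at load time by naming a method. This is only a sketch; the model and the `"fp8"` method shown here are illustrative, and the method you pick must be supported by your hardware:

```python
from vllm import LLM

# Quantize the model dynamically at load time instead of
# downloading a pre-quantized checkpoint.
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          quantization="fp8")
```
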
#### Context length and batch size

You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).

```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)
```

#### Adjust cache size

If you run out of CPU RAM, try the following options:

- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
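
Both are environment variables, so set them before vLLM is initialized, either in the launching shell or at the top of your script. A minimal sketch with illustrative values:

```python
import os

# Shrink both caches; the values are in GiB and purely illustrative.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"   # multi-modal models only
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"    # CPU backend only

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
```
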
#### Disable unused modalities

You can disable unused modalities (except for text) by setting their limits to zero.

For example, if your application only accepts image input, there is no need to allocate any memory for videos.

```python
from vllm import LLM

# Accept images but not videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```

You can even run a multi-modal model for text-only inference:

```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```

### Performance optimization and tuning

You can potentially improve the performance of vLLM by tuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.