(offline-inference)=

# Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```

After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:

- [Generative models](#generative-models) output log-probabilities, from which tokens are sampled to produce the final output text.
- [Pooling models](#pooling-models) output their hidden states directly.

Please refer to the above pages for more details about each API.
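
To make the generative case concrete, the following toy sketch shows what "sampling from log-probabilities" means for a single decoding step. It is not vLLM's sampler: the three-token vocabulary, the scores, and the helper names `log_softmax` and `sample_token` are all made up for illustration.

```python
import math
import random


def log_softmax(logits):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sample_token(logprobs, rng):
    """Draw one token index with probability exp(logprob)."""
    weights = [math.exp(lp) for lp in logprobs]
    return rng.choices(range(len(logprobs)), weights=weights, k=1)[0]


vocab = ["cat", "dog", "fish"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.1]        # made-up scores for the next token
logprobs = log_softmax(logits)

rng = random.Random(0)          # seeded only to make the sketch repeatable
next_token = vocab[sample_token(logprobs, rng)]
```

A pooling model, by contrast, would return a vector derived from its hidden states and skip the sampling step entirely.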

:::{seealso}
[API Reference](/api/offline_inference/index)
:::

## Configuration Options

This section lists the most common options for running the vLLM engine.
For a full list, refer to the [Engine Arguments](#engine-args) page.

(model-resolution)=

### Model resolution

vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Model resolution can fail for the following reasons:

- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.

To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:

```python
from vllm import LLM

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```

Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
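
The lookup described in this section can be pictured with the simplified sketch below. It is an assumption-laden illustration, not vLLM's actual code: `REGISTRY` and `resolve_architecture` are hypothetical names, and the real registry contains many more entries.

```python
# Hypothetical stand-in for the set of architectures registered to vLLM.
REGISTRY = {"GPT2LMHeadModel", "OPTForCausalLM"}


def resolve_architecture(config, hf_overrides=None):
    """Return the first architecture name that has a registered implementation.

    `config` plays the role of the parsed config.json; `hf_overrides`
    replaces matching keys before the lookup happens.
    """
    merged = {**config, **(hf_overrides or {})}
    architectures = merged.get("architectures", [])
    for name in architectures:
        if name in REGISTRY:
            return name
    raise ValueError(f"No registered implementation for {architectures!r}")


# A repository whose config.json lacks the field resolves once overridden.
resolved = resolve_architecture({}, {"architectures": ["GPT2LMHeadModel"]})
```

In this sketch, a missing `architectures` field and an unrecognized alternative name both end in the same `ValueError`, which is why the override addresses both cases.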

### Reducing memory usage

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

#### Tensor Parallelism (TP)

Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

```python
from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```
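
As a rough guide to how much this helps, the sketch below estimates weight memory per GPU. It assumes the weights are sharded evenly, and it ignores activations, the KV cache, and any layers that are replicated rather than split. The 8-billion parameter count and 2 bytes per parameter are illustrative assumptions, and `weight_bytes_per_gpu` is a hypothetical helper.

```python
GIB = 2**30


def weight_bytes_per_gpu(num_params, bytes_per_param, tensor_parallel_size):
    """Approximate weight memory per GPU, assuming an even split."""
    return num_params * bytes_per_param / tensor_parallel_size


# Illustrative numbers: ~8e9 parameters stored in a 16-bit dtype.
single_gpu_gib = weight_bytes_per_gpu(8e9, 2, 1) / GIB  # about 14.9 GiB
two_gpu_gib = weight_bytes_per_gpu(8e9, 2, 2) / GIB     # about 7.5 GiB each
```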

:::{important}
To ensure that vLLM initializes CUDA correctly, avoid calling CUDA-related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

To control which devices are used, set the `CUDA_VISIBLE_DEVICES` environment variable instead.
:::
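
One way to set the variable from Python, rather than from the shell, is sketched below. The device IDs are illustrative, and the vLLM lines are left commented out so that the snippet runs on a machine without a GPU.

```python
import os

# Set this before anything initializes CUDA in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # illustrative device IDs

# Only afterwards import vLLM and construct the engine:
# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
#           tensor_parallel_size=2)
```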

#### Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from the Hugging Face Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option. See the [quantization documentation](#quantization-index) for more details.

#### Context length and batch size

You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).

```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)
```
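
To see why these two options matter, consider a back-of-the-envelope estimate of the worst-case KV cache demand, where every sequence in the batch fills the whole context. This is only a sketch: `kv_cache_bytes` is a hypothetical helper, the layer and head dimensions are illustrative rather than those of any particular model, and vLLM's actual allocation also depends on how much GPU memory is available.

```python
MIB = 2**20


def kv_cache_bytes(max_model_len, max_num_seqs,
                   num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Worst-case KV cache size if every sequence reaches max_model_len."""
    # One key vector and one value vector per layer, per KV head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * max_model_len * max_num_seqs


dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # illustrative

small_mib = kv_cache_bytes(2048, 2, **dims) / MIB    # 512 MiB
large_mib = kv_cache_bytes(8192, 256, **dims) / MIB  # 262144 MiB (256 GiB)
```

The estimate scales linearly in both options, so halving either one halves the worst-case demand.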

#### Adjust cache size

If you run out of CPU RAM, try the following options:

- (Multi-modal models only) Set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) Set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
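
A minimal sketch of lowering both values from Python follows. The sizes are illustrative and are taken to be in GiB, matching the defaults quoted above. Setting them before importing vLLM is a conservative assumption rather than a documented requirement.

```python
import os

# Illustrative sizes in GiB, both below the 4 GiB defaults.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "1"  # multi-modal models only
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "2"   # CPU backend only

# Import vLLM and create the engine after this point:
# from vllm import LLM
```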

### Performance optimization and tuning

You can potentially improve the performance of vLLM by tuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.