vllm/docs/source/contributing/model/multimodal.md

(enabling-multimodal-inputs)=

# Enabling Multimodal Inputs

This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](#multimodal-inputs).

## 1. Update the base vLLM model

It is assumed that you have already implemented the model in vLLM according to [these steps](#new-model-basic).
Further update the model as follows:

- Implement the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

  ```diff
  + from vllm.model_executor.models.interfaces import SupportsMultiModal

  - class YourModelForImage2Seq(nn.Module):
  + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
  ```

  ```{note}
  The model class does not have to be named {code}`*ForCausalLM`.
  Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
  ```

- If you haven't already done so, reserve a keyword parameter in {meth}`~torch.nn.Module.forward`
  for each input tensor that corresponds to a multi-modal input, as shown in the following example:

  ```diff
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: List[torch.Tensor],
        attn_metadata: AttentionMetadata,
  +     pixel_values: torch.Tensor,
    ) -> SamplerOutput:
  ```

## 2. Register input mappers

For each modality type that the model accepts as input, decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in {meth}`~torch.nn.Module.forward`.

```diff
  from vllm.model_executor.models.interfaces import SupportsMultiModal
+ from vllm.multimodal import MULTIMODAL_REGISTRY

+ @MULTIMODAL_REGISTRY.register_image_input_mapper()
  class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```

A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.

```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```

## 3. Register maximum number of multi-modal tokens

For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data item
and register it via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_max_multimodal_tokens>`.

```diff
  from vllm.inputs import INPUT_REGISTRY
  from vllm.model_executor.models.interfaces import SupportsMultiModal
  from vllm.multimodal import MULTIMODAL_REGISTRY

  @MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
  @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
  class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```

Here are some examples:

- Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)

```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```

## 4. (Optional) Register dummy data

During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.

```diff
  from vllm.inputs import INPUT_REGISTRY
  from vllm.model_executor.models.interfaces import SupportsMultiModal
  from vllm.multimodal import MULTIMODAL_REGISTRY

  @MULTIMODAL_REGISTRY.register_image_input_mapper()
  @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
+ @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
  class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```

```{note}
The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
```

Here are some examples:

- Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)

```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```

## 5. (Optional) Register input processor

Sometimes, there is a need to process inputs at the {class}`~vllm.LLMEngine` level before they are passed to the model executor.
This is often due to the fact that unlike implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside model's {meth}`~torch.nn.Module.forward` call.
You can register input processors via {meth}`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.

```diff
  from vllm.inputs import INPUT_REGISTRY
  from vllm.model_executor.models.interfaces import SupportsMultiModal
  from vllm.multimodal import MULTIMODAL_REGISTRY

  @MULTIMODAL_REGISTRY.register_image_input_mapper()
  @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
  @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
+ @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
  class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```

A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:

- Insert static number of image tokens: [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Insert dynamic number of image tokens: [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)

```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`(enabling-multimodal-inputs)=`

			`# Enabling Multimodal Inputs`

[Doc][2/N] Reorganize Models and Usage sections (#11755) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-01-06 21:40:31 +08:00			`This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](#multimodal-inputs).`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
			`## 1. Update the base vLLM model`

[Doc][2/N] Reorganize Models and Usage sections (#11755) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-01-06 21:40:31 +08:00			`It is assumed that you have already implemented the model in vLLM according to [these steps](#new-model-basic).`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`Further update the model as follows:`

			- Implement the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

			```diff
			`+ from vllm.model_executor.models.interfaces import SupportsMultiModal`

			`- class YourModelForImage2Seq(nn.Module):`
			`+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):`
			```

			```{note}
			The model class does not have to be named {code}`*ForCausalLM`.
			`Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.`
			```

			- If you haven't already done so, reserve a keyword parameter in {meth}`~torch.nn.Module.forward`
			`for each input tensor that corresponds to a multi-modal input, as shown in the following example:`

			```diff
			`def forward(`
			`self,`
			`input_ids: torch.Tensor,`
			`positions: torch.Tensor,`
			`kv_caches: List[torch.Tensor],`
			`attn_metadata: AttentionMetadata,`
			`+ pixel_values: torch.Tensor,`
			`) -> SamplerOutput:`
			```

			`## 2. Register input mappers`

			For each modality type that the model accepts as input, decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
			This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in {meth}`~torch.nn.Module.forward`.

			```diff
			`from vllm.model_executor.models.interfaces import SupportsMultiModal`
			`+ from vllm.multimodal import MULTIMODAL_REGISTRY`

			`+ @MULTIMODAL_REGISTRY.register_image_input_mapper()`
			`class YourModelForImage2Seq(nn.Module, SupportsMultiModal):`
			```

			`A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.`

			```{seealso}
			`[Input Processing Pipeline](#input-processing-pipeline)`
			```

			`## 3. Register maximum number of multi-modal tokens`

			`For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data item`
			and register it via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_max_multimodal_tokens>`.

			```diff
			`from vllm.inputs import INPUT_REGISTRY`
			`from vllm.model_executor.models.interfaces import SupportsMultiModal`
			`from vllm.multimodal import MULTIMODAL_REGISTRY`

			`@MULTIMODAL_REGISTRY.register_image_input_mapper()`
			`+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)`
			`@INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)`
			`class YourModelForImage2Seq(nn.Module, SupportsMultiModal):`
			```

			`Here are some examples:`

[Doc] Improve GitHub links (#11491) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2024-12-26 06:49:26 +08:00			`- Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)`
			`- Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
			```{seealso}
			`[Input Processing Pipeline](#input-processing-pipeline)`
			```

			`## 4. (Optional) Register dummy data`

			`During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.`
			In such cases, you can define your own dummy data by registering a factory method via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.

			```diff
			`from vllm.inputs import INPUT_REGISTRY`
			`from vllm.model_executor.models.interfaces import SupportsMultiModal`
			`from vllm.multimodal import MULTIMODAL_REGISTRY`

			`@MULTIMODAL_REGISTRY.register_image_input_mapper()`
			`@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)`
			`+ @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)`
			`class YourModelForImage2Seq(nn.Module, SupportsMultiModal):`
			```

			```{note}
			`The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.`
			```

			`Here are some examples:`

[Doc] Improve GitHub links (#11491) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2024-12-26 06:49:26 +08:00			`- Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)`
			`- Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
			```{seealso}
			`[Input Processing Pipeline](#input-processing-pipeline)`
			```

			`## 5. (Optional) Register input processor`

			Sometimes, there is a need to process inputs at the {class}`~vllm.LLMEngine` level before they are passed to the model executor.
			This is often due to the fact that unlike implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside model's {meth}`~torch.nn.Module.forward` call.
			You can register input processors via {meth}`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.

			```diff
			`from vllm.inputs import INPUT_REGISTRY`
			`from vllm.model_executor.models.interfaces import SupportsMultiModal`
			`from vllm.multimodal import MULTIMODAL_REGISTRY`

			`@MULTIMODAL_REGISTRY.register_image_input_mapper()`
			`@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)`
			`@INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)`
			`+ @INPUT_REGISTRY.register_input_processor(<your_input_processor>)`
			`class YourModelForImage2Seq(nn.Module, SupportsMultiModal):`
			```

			`A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.`
			`Here are some examples:`

[Doc] Improve GitHub links (#11491) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2024-12-26 06:49:26 +08:00			`- Insert static number of image tokens: [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)`
			`- Insert dynamic number of image tokens: [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
			```{seealso}
			`[Input Processing Pipeline](#input-processing-pipeline)`
			```