.. _enabling_multimodal_inputs:
Enabling Multimodal Inputs
==========================
This document walks you through the steps to extend a vLLM model so that it accepts :ref:`multi-modal <multi_modality>` inputs.
.. seealso::
    :ref:`adding_a_new_model`
1. Update the base vLLM model
-----------------------------
It is assumed that you have already implemented the model in vLLM according to :ref:`these steps <adding_a_new_model>`.
Further update the model as follows:
- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

  .. code-block:: diff

    + from vllm.model_executor.models.interfaces import SupportsMultiModal

    - class YourModelForImage2Seq(nn.Module):
    + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

  .. note::
      The model class does not have to be named :code:`*ForCausalLM`.
      Check out `the HuggingFace Transformers documentation <https://huggingface.co/docs/transformers/model_doc/auto#multimodal>`__ for some examples.

- If you haven't already done so, reserve a keyword parameter in :meth:`~torch.nn.Module.forward`
  for each input tensor that corresponds to a multi-modal input, as shown in the following example
  (a toy illustration of how such a parameter is typically consumed follows this list):

  .. code-block:: diff

      def forward(
          self,
          input_ids: torch.Tensor,
          positions: torch.Tensor,
          kv_caches: List[torch.Tensor],
          attn_metadata: AttentionMetadata,
    +     pixel_values: torch.Tensor,
      ) -> SamplerOutput:
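
For intuition, here is a small, self-contained sketch of how such a keyword argument is typically consumed inside :meth:`~torch.nn.Module.forward`: the vision features are projected into the language model's hidden size and scattered into the placeholder-token positions of the token embeddings. The class, layer names, and sizes below are purely illustrative toys, not vLLM's actual LLaVA code, which applies the same merging idea within the full forward signature shown above.

.. code-block:: python

    import torch
    import torch.nn as nn

    IMAGE_TOKEN_ID = 3  # hypothetical placeholder token id in this toy vocabulary


    class ToyImage2Seq(nn.Module):
        """Toy model; a real vLLM model would also inherit SupportsMultiModal."""

        def __init__(self, vocab_size: int = 8, hidden_size: int = 16):
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
            # Stand-in for a vision tower + projector that emits one feature per image.
            self.vision_proj = nn.Linear(4, hidden_size)

        def forward(self, input_ids: torch.Tensor, pixel_values: torch.Tensor) -> torch.Tensor:
            inputs_embeds = self.embed_tokens(input_ids)
            # Encode the image(s) and overwrite the placeholder positions.
            image_embeds = self.vision_proj(pixel_values)
            inputs_embeds[input_ids == IMAGE_TOKEN_ID] = image_embeds
            return inputs_embeds


    with torch.no_grad():
        model = ToyImage2Seq()
        input_ids = torch.tensor([1, IMAGE_TOKEN_ID, 2])  # prompt with one image placeholder
        pixel_values = torch.randn(1, 4)                  # one "image"
        print(model(input_ids, pixel_values).shape)       # torch.Size([3, 16])
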
2. Register input mappers
-------------------------
For each modality type that the model accepts as input, decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in :meth:`~torch.nn.Module.forward`.

.. code-block:: diff

      from vllm.model_executor.models.interfaces import SupportsMultiModal
    + from vllm.multimodal import MULTIMODAL_REGISTRY

    + @MULTIMODAL_REGISTRY.register_image_input_mapper()
      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.
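
If you do provide your own mapper, it conceptually performs the preprocessing sketched below: turn the raw multi-modal data into tensors keyed by the same names as the :meth:`~torch.nn.Module.forward` parameters reserved in step 1. The sketch deliberately avoids vLLM-specific types; the real hook also receives an input context and must return vLLM's multi-modal input type, whose exact signature may differ between versions, and the checkpoint name is only an example.

.. code-block:: python

    from typing import Dict

    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor

    # Example HF processor; pick the one matching your model's vision encoder.
    image_processor = CLIPImageProcessor.from_pretrained(
        "openai/clip-vit-large-patch14-336")


    def map_image_input(image: Image.Image) -> Dict[str, torch.Tensor]:
        """Preprocess a raw image into the keyword tensors expected by forward()."""
        pixel_values = image_processor(image, return_tensors="pt")["pixel_values"]
        return {"pixel_values": pixel_values}


    batch = map_image_input(Image.new("RGB", (336, 336)))
    print(batch["pixel_values"].shape)  # torch.Size([1, 3, 336, 336])
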
.. seealso::
    :ref:`input_processing_pipeline`
3. Register maximum number of multi-modal tokens
------------------------------------------------
For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data instance
and register it via :meth:`MULTIMODAL_REGISTRY.register_max_multimodal_tokens <vllm.multimodal.MultiModalRegistry.register_max_multimodal_tokens>`.

.. code-block:: diff

      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_input_mapper()
    + @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
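
As a concrete illustration of what ``<your_calculation>`` might compute, a ViT-style encoder produces one feature token per image patch, so the maximum is usually derived from the largest supported image resolution and the patch size. The helper below is a standalone sketch with a hypothetical name; vLLM's LLaVA model computes the equivalent value from its vision config in ``llava.py``.

.. code-block:: python

    def max_image_feature_tokens(image_size: int = 336, patch_size: int = 14) -> int:
        """Upper bound on image feature tokens for a ViT-style encoder.

        A 336x336 image split into 14x14 patches yields a 24x24 grid,
        i.e. 576 feature tokens (the figure used by LLaVA-1.5).
        """
        return (image_size // patch_size) ** 2


    print(max_image_feature_tokens())  # 576
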
Here are some examples:

- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
    :ref:`input_processing_pipeline`
4. (Optional) Register dummy data
---------------------------------
During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.

.. code-block:: diff

    + from vllm.inputs import INPUT_REGISTRY
      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_input_mapper()
      @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
    + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

.. note::
    The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
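
For intuition, the sketch below shows the two ingredients a dummy-data factory typically produces for an image model: a token sequence whose placeholder count matches the maximum registered in step 3, and a dummy image at the largest resolution the encoder accepts. It deliberately avoids vLLM-specific types; the real factory receives an input context and a target sequence length and must return vLLM's own sequence-data and multi-modal-data types, whose exact signatures vary between versions, and the constants here are illustrative.

.. code-block:: python

    from typing import List, Tuple

    from PIL import Image

    IMAGE_TOKEN_ID = 32000   # hypothetical placeholder token id of your model
    MAX_IMAGE_TOKENS = 576   # must agree with the value registered in step 3


    def build_dummy_inputs(seq_len: int) -> Tuple[List[int], Image.Image]:
        """Return dummy token ids plus a dummy image for memory profiling."""
        # Fill the sequence with the maximum number of image placeholders,
        # then pad the remainder with an arbitrary token id.
        token_ids = [IMAGE_TOKEN_ID] * MAX_IMAGE_TOKENS
        token_ids += [0] * (seq_len - MAX_IMAGE_TOKENS)
        # A blank image at the largest resolution the vision encoder accepts.
        dummy_image = Image.new("RGB", (336, 336), color=0)
        return token_ids, dummy_image


    token_ids, image = build_dummy_inputs(seq_len=4096)
    print(len(token_ids), image.size)  # 4096 (336, 336)
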
Here are some examples:

- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
    :ref:`input_processing_pipeline`
5. (Optional) Register input processor
--------------------------------------
Sometimes, there is a need to process inputs at the :class:`~vllm.LLMEngine` level before they are passed to the model executor.
This is often because, unlike the implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside the model's :meth:`~torch.nn.Module.forward` call.
You can register input processors via :meth:`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.

.. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_input_mapper()
      @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
    + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
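
For intuition, the token-level transformation such a processor performs usually looks like the standalone sketch below: every single image placeholder in the prompt is expanded into as many placeholder tokens as the image will contribute feature vectors, so that vLLM can account for them when building attention masks and allocating KV cache. The real hook additionally receives an input context and operates on vLLM's input structure, whose exact shape varies between versions; the constants here are illustrative.

.. code-block:: python

    from typing import List

    IMAGE_TOKEN_ID = 32000   # hypothetical placeholder token id of your model
    NUM_IMAGE_TOKENS = 576   # feature tokens contributed by one image


    def expand_image_placeholders(prompt_token_ids: List[int]) -> List[int]:
        """Expand each image placeholder into one token per image feature."""
        expanded: List[int] = []
        for token_id in prompt_token_ids:
            if token_id == IMAGE_TOKEN_ID:
                expanded.extend([IMAGE_TOKEN_ID] * NUM_IMAGE_TOKENS)
            else:
                expanded.append(token_id)
        return expanded


    print(len(expand_image_placeholders([1, IMAGE_TOKEN_ID, 2])))  # 578
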
Here are some examples:

- Insert static number of image tokens: `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Insert dynamic number of image tokens: `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
    :ref:`input_processing_pipeline`