[Docs] Add Docs on Limitations of VLM Support (#5383)

Roger Wang 2024-06-10 09:53:50 -07:00 committed by GitHub
parent c5602f0baa
commit 856c990041
2 changed files with 9 additions and 1 deletion


@@ -92,6 +92,7 @@ autodoc_mock_imports = [
     "vllm._C",
     "PIL",
     "numpy",
+    "triton",
     "tqdm",
     "tensorizer",
 ]
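
For context on the hunk above: ``autodoc_mock_imports`` is the Sphinx autodoc option that replaces the listed modules with mock objects while the documentation builds, so heavy or GPU-only dependencies such as ``triton`` do not have to be installed on the docs builder. A minimal, illustrative ``conf.py`` fragment (not the full vLLM configuration) might look like this:

    # conf.py -- illustrative fragment, not vLLM's actual configuration
    extensions = ["sphinx.ext.autodoc"]

    # Modules that Sphinx autodoc replaces with mocks at build time, so the
    # docs can be built without compiled or GPU-specific packages installed.
    autodoc_mock_imports = [
        "vllm._C",
        "triton",  # added by this commit
    ]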


@@ -16,6 +16,13 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
     :prog: -m vllm.entrypoints.openai.api_server
     :nodefaultconst:
 
+.. important::
+    Currently, vLLM's support for vision language models has the following limitations:
+
+    * Only a single image input is supported per text prompt.
+    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the Hugging Face implementation.
+
+    We are continuously improving the user and developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
 
 Offline Batched Inference
 -------------------------
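
To make the second limitation concrete: because ``image_input_shape`` is static, every input image is resized to that shape before encoding, which is why outputs can drift from the Hugging Face implementation. A minimal sketch of pre-resizing an image yourself, assuming a LLaVA-1.5-style shape of ``1,3,336,336`` (the exact shape is an assumption, not part of this diff):

    from PIL import Image

    # Assumed static shape, as configured via the ``image_input_shape``
    # engine argument ("batch,channels,height,width"), e.g. "1,3,336,336".
    STATIC_HEIGHT, STATIC_WIDTH = 336, 336

    def to_static_shape(path: str) -> Image.Image:
        """Resize an image to the static shape up front.

        vLLM resizes mismatched images to ``image_input_shape`` on its own;
        doing it explicitly here only makes that behaviour visible.
        """
        image = Image.open(path).convert("RGB")
        return image.resize((STATIC_WIDTH, STATIC_HEIGHT))
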
@@ -31,7 +38,7 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
         image_feature_size=576,
     )
 
-For now, we only support a single image per text prompt. To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
+To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
 * ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
 * ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.
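
Putting the two bullets together, here is a sketch of offline inference with a single image, assuming the LLaVA-1.5 setup implied by ``image_feature_size=576`` (the model name, the other engine arguments, and the image path are illustrative assumptions, not taken from this diff):

    from PIL import Image
    from vllm import LLM
    from vllm.multimodal.image import ImagePixelData

    # Engine arguments other than image_feature_size are assumed values for a
    # LLaVA-1.5-style model; see the VLM engine arguments referenced above.
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

    # The prompt must contain exactly image_feature_size ``<image>`` tokens.
    prompt = "<image>" * 576 + "\nUSER: What is shown in this image?\nASSISTANT:"

    # Only a single image per text prompt is supported (see the note above).
    image = Image.open("example.jpg")  # hypothetical local image

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": ImagePixelData(image),
    })
    print(outputs[0].outputs[0].text)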