[Bugfix] Enforce no chunked prefill for embedding models (#10470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung 2024-11-20 13:12:51 +08:00 committed by GitHub
parent ad44437ba3
commit b4be5a8adb
2 changed files with 60 additions and 15 deletions

View File

@@ -39,12 +39,13 @@ Feature x Feature
      - :abbr:`prmpt adptr (Prompt Adapter)`
      - :ref:`SD <spec_decode>`
      - CUDA graph
+     - :abbr:`emd (Embedding Models)`
      - :abbr:`enc-dec (Encoder-Decoder Models)`
      - :abbr:`logP (Logprobs)`
      - :abbr:`prmpt logP (Prompt Logprobs)`
      - :abbr:`async output (Async Output Processing)`
      - multi-step
-     - :abbr:`MM (Multimodal)`
+     - :abbr:`mm (Multimodal)`
      - best-of
      - beam-search
      - :abbr:`guided dec (Guided Decoding)`
@@ -64,6 +65,7 @@ Feature x Feature
      -
      -
      -
+     -
    * - :ref:`APC <apc>`
      - ✅
      -
@@ -80,6 +82,7 @@ Feature x Feature
      -
      -
      -
+     -
    * - :ref:`LoRA <lora>`
      - `✗ <https://github.com/vllm-project/vllm/pull/9057>`__
      - ✅
@@ -96,6 +99,7 @@ Feature x Feature
      -
      -
      -
+     -
    * - :abbr:`prmpt adptr (Prompt Adapter)`
      - ✅
      - ✅
@@ -112,6 +116,7 @@ Feature x Feature
      -
      -
      -
+     -
    * - :ref:`SD <spec_decode>`
      - ✗
      - ✅
@@ -128,6 +133,7 @@ Feature x Feature
      -
      -
      -
+     -
    * - CUDA graph
      - ✅
      - ✅
@@ -144,6 +150,24 @@ Feature x Feature
      -
      -
      -
+     -
+   * - :abbr:`emd (Embedding Models)`
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     -
+     -
+     -
+     -
+     -
+     -
+     -
+     -
+     -
+     -
    * - :abbr:`enc-dec (Encoder-Decoder Models)`
      - ✗
      - `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
@@ -151,6 +175,7 @@ Feature x Feature
      - ✗
      - `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
      - ✅
+     - ✅
      -
      -
      -
@@ -167,6 +192,7 @@ Feature x Feature
      - ✅
      - ✅
      - ✅
+     - ✗
      - ✅
      -
      -
@@ -183,6 +209,7 @@ Feature x Feature
      - ✅
      - `✗ <https://github.com/vllm-project/vllm/pull/8199>`__
      - ✅
+     - ✗
      - ✅
      - ✅
      -
@@ -200,6 +227,7 @@ Feature x Feature
      - ✗
      - ✅
      - ✗
+     - ✗
      - ✅
      - ✅
      -
@@ -216,6 +244,7 @@ Feature x Feature
      - ✗
      - ✅
      - ✗
+     - ✗
      - ✅
      - `✗ <https://github.com/vllm-project/vllm/issues/8198>`__
      - ✅
@@ -224,14 +253,15 @@ Feature x Feature
      -
      -
      -
-   * - :abbr:`MM (Multimodal)`
-     - `✗ <https://github.com/vllm-project/vllm/pull/8346>`__
+   * - :abbr:`mm (Multimodal)`
+     - ✅
      - `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
      - `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
      - ?
      - ?
      - ✅
-     - ✗
+     - ✅
+     - ✅
      - ✅
      - ✅
      - ✅
@@ -247,6 +277,7 @@ Feature x Feature
      - ✅
      - `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
      - ✅
+     - ✗
      - ✅
      - ✅
      - ✅
@@ -263,6 +294,7 @@ Feature x Feature
      - ✅
      - `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
      - ✅
+     - ✗
      - ✅
      - ✅
      - ✅
@@ -279,6 +311,7 @@ Feature x Feature
      - ?
      - ✅
      - ✅
+     - ✗
      - ?
      - ✅
      - ✅
@@ -353,6 +386,14 @@ Feature x Hardware
      - ✅
      - ✗
      - ✅
+   * - :abbr:`emd (Embedding Models)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
    * - :abbr:`enc-dec (Encoder-Decoder Models)`
      - ✅
      - ✅
@@ -361,6 +402,14 @@ Feature x Hardware
      - ✅
      - ✅
      - ✗
+   * - :abbr:`mm (Multimodal)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
    * - :abbr:`logP (Logprobs)`
      - ✅
      - ✅
@@ -393,14 +442,6 @@ Feature x Hardware
      - ✅
      - `✗ <https://github.com/vllm-project/vllm/issues/8477>`__
      - ✅
-   * - :abbr:`MM (Multimodal)`
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ✅
    * - best-of
      - ✅
      - ✅
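The new `emd` rows above cover embedding (pooling) models, which run a single prefill pass and return a pooled vector rather than sampled tokens. For reference, a minimal offline-embedding sketch along the lines of vLLM's examples from this period; the model name is illustrative, any embedding model works:

    from vllm import LLM

    # Illustrative embedding model; e5-mistral-7b-instruct is the one
    # vLLM's own embedding examples used around this release.
    llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")

    # Embedding models are driven via encode() rather than generate();
    # each output carries a pooled embedding instead of generated text.
    outputs = llm.encode(["Hello, my name is"])
    print(len(outputs[0].outputs.embedding))  # embedding dimensionality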

View File

@@ -1014,7 +1014,8 @@ class EngineArgs:
                 use_spec_decode = self.speculative_model is not None
                 if (is_gpu and not use_sliding_window and not use_spec_decode
                         and not self.enable_lora
-                        and not self.enable_prompt_adapter):
+                        and not self.enable_prompt_adapter
+                        and model_config.task != "embedding"):
                     self.enable_chunked_prefill = True
                     logger.warning(
                         "Chunked prefill is enabled by default for models with "
@@ -1031,6 +1032,9 @@ class EngineArgs:
                     "errors during the initial memory profiling phase, or result "
                     "in low performance due to small KV cache space. Consider "
                     "setting --max-model-len to a smaller value.", max_model_len)
+        elif self.enable_chunked_prefill and model_config.task == "embedding":
+            msg = "Chunked prefill is not supported for embedding models"
+            raise ValueError(msg)
 
         speculative_config = SpeculativeConfig.maybe_create_spec_config(
             target_model_config=model_config,
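Net effect of the two hunks: chunked prefill is never auto-enabled when `task == "embedding"`, and explicitly opting in now fails fast at engine-config time. A minimal sketch of the enforced behavior, assuming a vLLM build that includes this commit (the model name is illustrative):

    from vllm.engine.arg_utils import EngineArgs

    # Explicitly requesting chunked prefill for an embedding model now
    # raises during engine-config creation instead of misbehaving later.
    args = EngineArgs(
        model="intfloat/e5-mistral-7b-instruct",
        task="embedding",
        enable_chunked_prefill=True,
    )

    try:
        args.create_engine_config()
    except ValueError as err:
        print(err)  # Chunked prefill is not supported for embedding models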