Compare commits: c16fb5dae8...26507f8973

10 commits:
- 26507f8973
- 9c1d5b456d
- e31045f95c
- aaec845f8e
- 7bdfd29a35
- e78587a64c
- 7eb4255628
- 6a0f547561
- 30ed81b7ca
- 7a4a5de729
@@ -17,10 +17,12 @@ source /etc/environment
 docker run --privileged --net host --shm-size=16G -it \
     -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
     vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
-    && python3 -m pip install pytest \
+    && python3 -m pip install pytest tpu-info \
     && python3 -m pip install lm_eval[api]==0.4.4 \
     && export VLLM_USE_V1=1 \
     && export VLLM_XLA_CHECK_RECOMPILATION=1 \
+    && echo HARDWARE \
+    && tpu-info \
     && echo TEST_0 \
     && pytest -v -s /workspace/vllm/tests/v1/tpu/test_perf.py \
     && echo TEST_1 \
.github/ISSUE_TEMPLATE/200-installation.yml (vendored, 2 changes)

@@ -14,7 +14,7 @@ body:
    description: |
      Please run the following and paste the output below.
      ```sh
-     wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
+     wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```

.github/ISSUE_TEMPLATE/300-usage.yml (vendored, 2 changes)

@@ -14,7 +14,7 @@ body:
    description: |
      Please run the following and paste the output below.
      ```sh
-     wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
+     wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
.github/ISSUE_TEMPLATE/400-bug-report.yml (vendored, 2 changes)

@@ -14,7 +14,7 @@ body:
    description: |
      Please run the following and paste the output below.
      ```sh
-     wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
+     wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```

@@ -35,7 +35,7 @@ body:
    description: |
      Please run the following and paste the output below.
      ```sh
-     wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
+     wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
@@ -13,7 +13,6 @@
 template <typename scalar_t, int bit, int GROUPS>
 __global__ void moe_wna16_gemm_kernel(
     const scalar_t* __restrict__ input, scalar_t* __restrict__ output,
-
     const uint32_t* __restrict__ qweight, const scalar_t* __restrict__ scales,
     const uint32_t* __restrict__ qzeros,
 
@@ -54,8 +53,6 @@ __global__ void moe_wna16_gemm_kernel(
       if (token_index / top_k >= size_m) break;
 
       num_valid_tokens = m + 1;
-      if (blockIdx.z == 0 && offset_n < size_n)
-        output[token_index * size_n + offset_n] = Dtype::int2num(0);
 
       if (expert_id != -1) {
         int k_per_thread = DIVIDE(BLOCK_SIZE_K, BLOCK_SIZE_N);
@@ -284,8 +281,7 @@ torch::Tensor moe_wna16_gemm(torch::Tensor input, torch::Tensor output,
                              int64_t BLOCK_SIZE_M, int64_t BLOCK_SIZE_N,
                              int64_t BLOCK_SIZE_K, int64_t bit) {
   const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
-  auto options =
-      torch::TensorOptions().dtype(input.dtype()).device(input.device());
+  output.zero_();
 
   const int num_experts = b_qweight.size(0);
   const int size_m = input.size(0);
@@ -302,9 +298,9 @@ torch::Tensor moe_wna16_gemm(torch::Tensor input, torch::Tensor output,
   const uint32_t* b_qzeros_ptr;
   if (b_qzeros.has_value())
     b_qzeros_ptr = (const uint32_t*)b_qzeros.value().data_ptr<uint8_t>();
-  const float* topk_weights_ptr;
+  const float* topk_weights_ptr = nullptr;
   if (topk_weights.has_value())
-    topk_weights_ptr = (const float*)topk_weights.value().data_ptr();
+    topk_weights_ptr = (const float*)topk_weights.value().data_ptr<float>();
 
   int groups_per_block_row = BLOCK_SIZE_K / group_size;
   TORCH_CHECK(bit == 4 || bit == 8, "bit must be 4 or 8");
@@ -241,6 +241,7 @@ if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
 fi
 COPY examples examples
 COPY benchmarks benchmarks
+COPY ./vllm/collect_env.py .
 
 # Although we build Flashinfer with AOT mode, there's still
 # some issues w.r.t. JIT compilation. Therefore we need to
@@ -121,6 +121,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 ADD ./tests/ ./tests/
 ADD ./examples/ ./examples/
 ADD ./benchmarks/ ./benchmarks/
+ADD ./vllm/collect_env.py .
 
 # install development dependencies (for testing)
 RUN --mount=type=cache,target=/root/.cache/uv \
@@ -19,6 +19,18 @@ $ docker run --runtime nvidia --gpus all \
     --model mistralai/Mistral-7B-v0.1
 ```
 
+This image can also be used with other container engines such as [Podman](https://podman.io/).
+
+```console
+$ podman run --gpus all \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+    -p 8000:8000 \
+    --ipc=host \
+    vllm/vllm-openai:latest \
+    --model mistralai/Mistral-7B-v0.1
+```
+
 You can add any other <project:#engine-args> you need after the image tag (`vllm/vllm-openai:latest`).
 
 :::{note}

@@ -16,7 +16,7 @@ Ensure that you have a running Kubernetes environment with GPU (you can follow [
 
 ## Deployment using vLLM production stack
 
-The standard vLLM production stack install uses a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/tutorials/install-helm.sh) to install Helm on your GPU server.
+The standard vLLM production stack is installed using a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/utils/install-helm.sh) to install Helm on your GPU server.
 
 To install the vLLM production stack, run the following commands on your desktop:
 
@@ -788,7 +788,7 @@ llm = LLM(
 Online serving:
 
 ```bash
-vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
+vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt '{"image":4}'
 ```
 
 **This is no longer required if you are using vLLM V1.**

@@ -228,7 +228,7 @@ First, launch the OpenAI-compatible server:
 
 ```bash
 vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
-  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
+  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 ```
 
 Then, you can use the OpenAI client as follows:
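The documentation hunks above change `--limit-mm-per-prompt` from the old `key=value` syntax to a JSON string. A small sketch of how a launcher script could build the new flag value; the model name and limit come from the docs above, the script itself is only illustrative:

```python
import json
import shlex

# New flag format introduced by these commits: a JSON object string,
# e.g. '{"image": 4}', instead of the old comma-separated "image=4" form.
limits = {"image": 4}
flag_value = json.dumps(limits)

cmd = [
    "vllm", "serve", "Qwen/Qwen2-VL-7B-Instruct",
    "--limit-mm-per-prompt", flag_value,
]
print(shlex.join(cmd))
# vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt '{"image": 4}'
```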
@@ -37,11 +37,11 @@ def build_llm_with_lmcache():
         '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}')
     # Set GPU memory utilization to 0.8 for an A40 GPU with 40GB
     # memory. Reduce the value if your GPU has less memory.
-    # Note that LMCache is not compatible with chunked prefill for now.
+    # Note: LMCache supports chunked prefill (see vLLM#14505, LMCache#392).
     llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
               kv_transfer_config=ktc,
               max_model_len=8000,
-              enable_chunked_prefill=False,
+              enable_chunked_prefill=True,
               gpu_memory_utilization=0.8)
 
     try:
@@ -16,11 +16,11 @@ from vllm.sampling_params import SamplingParams
 # # Mistral format
 # vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
 #   --tokenizer-mode mistral --config-format mistral --load-format mistral \
-#   --limit-mm-per-prompt 'image=4' --max-model-len 16384
+#   --limit-mm-per-prompt '{"image":4}' --max-model-len 16384
 #
 # # HF format
 # vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
-#   --limit-mm-per-prompt 'image=4' --max-model-len 16384
+#   --limit-mm-per-prompt '{"image":4}' --max-model-len 16384
 # ```
 #
 # - Client:
@@ -9,7 +9,7 @@ vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja
 
 (multi-image inference with Phi-3.5-vision-instruct)
 vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
-    --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
+    --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 
 (audio inference with Ultravox)
 vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b --max-model-len 4096
@@ -24,6 +24,10 @@ from vllm.utils import FlexibleArgumentParser
     }),
 ])
 def test_limit_mm_per_prompt_parser(arg, expected):
+    """This functionality is deprecated and will be removed in the future.
+    This argument should be passed as JSON string instead.
+
+    TODO: Remove with nullable_kvs."""
     parser = EngineArgs.add_cli_args(FlexibleArgumentParser())
     if arg is None:
         args = parser.parse_args([])
@@ -27,7 +27,7 @@ def server():
         "--enforce-eager",
         "--trust-remote-code",
         "--limit-mm-per-prompt",
-        f"audio={MAXIMUM_AUDIOS}",
+        str({"audio": MAXIMUM_AUDIOS}),
     ]
 
     with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:

@@ -31,7 +31,7 @@ def server():
         "--enforce-eager",
         "--trust-remote-code",
         "--limit-mm-per-prompt",
-        f"video={MAXIMUM_VIDEOS}",
+        str({"video": MAXIMUM_VIDEOS}),
     ]
 
     with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:

@@ -35,7 +35,7 @@ def server():
         "--enforce-eager",
         "--trust-remote-code",
         "--limit-mm-per-prompt",
-        f"image={MAXIMUM_IMAGES}",
+        str({"image": MAXIMUM_IMAGES}),
     ]
 
     with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:

@@ -37,7 +37,7 @@ def server():
         "--enforce-eager",
         "--trust-remote-code",
         "--limit-mm-per-prompt",
-        f"image={MAXIMUM_IMAGES}",
+        str({"image": MAXIMUM_IMAGES}),
         "--chat-template",
         str(vlm2vec_jinja_path),
     ]
@@ -48,9 +48,9 @@ def audio(request):
 ])
 def server(request, audio_assets):
     args = [
-        "--dtype=bfloat16", "--max-model-len=4096", "--enforce-eager",
-        f"--limit-mm-per-prompt=audio={len(audio_assets)}",
-        "--trust-remote-code"
+        "--dtype", "bfloat16", "--max-model-len", "4096", "--enforce-eager",
+        "--limit-mm-per-prompt",
+        str({"audio": len(audio_assets)}), "--trust-remote-code"
     ] + [
         f"--{key.replace('_','-')}={value}"
         for key, value in request.param.items()
@@ -27,6 +27,7 @@ _PARTITION_SIZE_ROCM = 256
 
 
 class ROCmFlashAttentionBackend(AttentionBackend):
+    accept_output_buffer: bool = True
 
     @staticmethod
     def get_name() -> str:

@@ -515,7 +516,7 @@ class ROCmFlashAttentionImpl(AttentionImpl):
 
                 from vllm.attention.ops.triton_flash_attention import (  # noqa: F401
                     triton_attention)
-                self.attn_func = triton_attention
+                self.triton_attn_func = triton_attention
                 logger.debug("Using Triton FA in ROCmBackend")
                 if self.sliding_window != (-1, -1):
                     logger.warning("ROCm Triton FA does not currently support "

@@ -531,7 +532,7 @@ class ROCmFlashAttentionImpl(AttentionImpl):
             else:
                 try:
                     from flash_attn import flash_attn_varlen_func  # noqa: F401
-                    self.attn_func = flash_attn_varlen_func
+                    self.fa_attn_func = flash_attn_varlen_func
                     logger.debug("Using CK FA in ROCmBackend")
                 except ModuleNotFoundError:
                     self.use_naive_attn = True

@@ -542,7 +543,7 @@ class ROCmFlashAttentionImpl(AttentionImpl):
                     "ROCm Naive FlashAttention does not support "
                     "attention logits soft capping.")
 
-            self.attn_func = _sdpa_attention
+            self.sdpa_attn_func = _sdpa_attention
             logger.debug("Using naive (SDPA) attention in ROCmBackend")
 
     def repeat_kv(self, x: torch.Tensor, n_rep: int) -> torch.Tensor:
@@ -613,6 +614,8 @@ class ROCmFlashAttentionImpl(AttentionImpl):
         Returns:
             shape = [num_tokens, num_heads * head_size]
         """
+        assert output is not None, "Output tensor must be provided."
+
         query = query.view(-1, self.num_heads, self.head_size)
         if key is not None:
             assert value is not None

@@ -656,7 +659,6 @@ class ROCmFlashAttentionImpl(AttentionImpl):
             assert attn_metadata.num_encoder_tokens is not None
             num_prefill_tokens = attn_metadata.num_encoder_tokens
 
-        output = torch.empty_like(query)
         # Query for decode. KV is not needed because it is already cached.
         decode_query = query[num_prefill_tokens:]
         # QKV for prefill.

@@ -704,11 +706,11 @@ class ROCmFlashAttentionImpl(AttentionImpl):
                     query.dtype,
                     seq_lens,
                     make_attn_mask=causal_mask)  # type: ignore
-                out, _ = self.attn_func(
+                self.triton_attn_func(
                     query,
                     key,
                     value,
-                    None,
+                    output[:num_prefill_tokens],
                     query_seq_start_loc,
                     key_seq_start_loc,
                     query_max_seq_len,
@@ -733,10 +735,11 @@ class ROCmFlashAttentionImpl(AttentionImpl):
                 key = key.movedim(0, key.dim() - 2)
                 value = value.movedim(0, value.dim() - 2)
                 # sdpa math backend attention
-                out = self.attn_func(
+                self.sdpa_attn_func(
                     query,
                     key,
                     value,
+                    output[:num_prefill_tokens],
                     query_seq_start_loc,
                     num_prefill_tokens,
                     self.num_heads,

@@ -745,7 +748,8 @@ class ROCmFlashAttentionImpl(AttentionImpl):
                     attn_masks,
                 )
             else:
-                out = self.attn_func(
+                # upstream FA does not support an output arg, copy
+                output[:num_prefill_tokens] = self.fa_attn_func(
                     q=query,
                     k=key,
                     v=value,

@@ -760,12 +764,6 @@ class ROCmFlashAttentionImpl(AttentionImpl):
                     softcap=self.logits_soft_cap,
                 )
 
-            # common code for prefill
-            assert output[:num_prefill_tokens].shape == out.shape
-            if output.shape[0] > num_prefill_tokens:
-                output[:num_prefill_tokens] = out
-            else:
-                output = out
         else:
             # prefix-enabled attention -
             # not applicable for encoder-only models
|
|||||||
device=output.device,
|
device=output.device,
|
||||||
)
|
)
|
||||||
max_logits = torch.empty_like(exp_sums)
|
max_logits = torch.empty_like(exp_sums)
|
||||||
if num_prefill_tokens > 0:
|
|
||||||
out = output[num_prefill_tokens:]
|
|
||||||
else:
|
|
||||||
out = output
|
|
||||||
|
|
||||||
query_start_loc = None
|
query_start_loc = None
|
||||||
ops.paged_attention_rocm(
|
ops.paged_attention_rocm(
|
||||||
out,
|
output[num_prefill_tokens:],
|
||||||
exp_sums,
|
exp_sums,
|
||||||
max_logits,
|
max_logits,
|
||||||
tmp_output,
|
tmp_output,
|
||||||
@ -878,7 +872,8 @@ def _sdpa_attention(
|
|||||||
query: torch.Tensor,
|
query: torch.Tensor,
|
||||||
key: torch.Tensor,
|
key: torch.Tensor,
|
||||||
value: torch.Tensor,
|
value: torch.Tensor,
|
||||||
seq_lens: List[int],
|
output: torch.Tensor,
|
||||||
|
seq_lens: torch.Tensor,
|
||||||
num_tokens: int,
|
num_tokens: int,
|
||||||
num_heads: int,
|
num_heads: int,
|
||||||
head_size: int,
|
head_size: int,
|
||||||
@ -886,9 +881,9 @@ def _sdpa_attention(
|
|||||||
attn_masks: Optional[List[torch.Tensor]] = None,
|
attn_masks: Optional[List[torch.Tensor]] = None,
|
||||||
) -> torch.Tensor:
|
) -> torch.Tensor:
|
||||||
start = 0
|
start = 0
|
||||||
output = torch.empty((num_tokens, num_heads, head_size),
|
assert output.shape == (num_tokens, num_heads, head_size)
|
||||||
dtype=query.dtype,
|
assert output.dtype == query.dtype
|
||||||
device=query.device)
|
assert output.device == query.device
|
||||||
|
|
||||||
for i, seq_len in enumerate(seq_lens):
|
for i, seq_len in enumerate(seq_lens):
|
||||||
end = start + seq_len
|
end = start + seq_len
|
||||||
|
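The ROCm attention hunks above thread a caller-provided `output` tensor through every path (`accept_output_buffer = True`, writes into `output[:num_prefill_tokens]` and `output[num_prefill_tokens:]`) instead of allocating an intermediate `out` and copying it back. A minimal, self-contained sketch of that output-buffer pattern; the kernel bodies below are placeholders, not the vLLM kernels:

```python
import torch

def prefill_kernel(q: torch.Tensor, out: torch.Tensor) -> None:
    # Placeholder for e.g. the Triton/CK prefill kernel: writes in place.
    out.copy_(q * 2.0)

def decode_kernel(q: torch.Tensor, out: torch.Tensor) -> None:
    # Placeholder for the paged-attention decode kernel: writes in place.
    out.copy_(q + 1.0)

def attention_forward(query: torch.Tensor, num_prefill_tokens: int,
                      output: torch.Tensor) -> torch.Tensor:
    # The caller pre-allocates `output`; each path fills its own slice,
    # so no temporary tensor is allocated and no final copy is needed.
    prefill_kernel(query[:num_prefill_tokens], output[:num_prefill_tokens])
    decode_kernel(query[num_prefill_tokens:], output[num_prefill_tokens:])
    return output

query = torch.randn(8, 4)
output = torch.empty_like(query)
attention_forward(query, num_prefill_tokens=5, output=output)
```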
@@ -283,12 +283,13 @@ def get_vllm_version():
     if __version__ == "dev":
         return "N/A (dev)"
 
     if len(__version_tuple__) == 4:  # dev build
         git_sha = __version_tuple__[-1][1:]  # type: ignore
         return f"{__version__} (git sha: {git_sha}"
 
     return __version__
 
+
 def summarize_vllm_build_flags():
     # This could be a static method if the flags are constant, or dynamic if you need to check environment variables, etc.
     return 'CUDA Archs: {}; ROCm: {}; Neuron: {}'.format(

@@ -502,7 +503,9 @@ def get_pip_packages(run_lambda, patterns=None):
         print("uv is set")
         cmd = ["uv", "pip", "list", "--format=freeze"]
     else:
-        raise RuntimeError("Could not collect pip list output (pip or uv module not available)")
+        raise RuntimeError(
+            "Could not collect pip list output (pip or uv module not available)"
+        )
 
     out = run_and_read_all(run_lambda, cmd)
     return "\n".join(line for line in out.splitlines()

@@ -535,13 +538,12 @@ def is_xnnpack_available():
     else:
         return "N/A"
 
+
 def get_env_vars():
     env_vars = ''
-    secret_terms=('secret', 'token', 'api', 'access', 'password')
-    report_prefix = ("TORCH", "NCCL", "PYTORCH",
-                     "CUDA", "CUBLAS", "CUDNN",
-                     "OMP_", "MKL_",
-                     "NVIDIA")
+    secret_terms = ('secret', 'token', 'api', 'access', 'password')
+    report_prefix = ("TORCH", "NCCL", "PYTORCH", "CUDA", "CUBLAS", "CUDNN",
+                     "OMP_", "MKL_", "NVIDIA")
     for k, v in os.environ.items():
         if any(term in k.lower() for term in secret_terms):
             continue

@@ -552,6 +554,7 @@ def get_env_vars():
 
     return env_vars
 
+
 def get_env_info():
     run_lambda = run
     pip_version, pip_list_output = get_pip_packages(run_lambda)
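The `collect_env.py` hunks above mostly reflow `report_prefix` and the `RuntimeError` message; the surrounding logic (skip anything that looks like a secret, report only variables with known prefixes) is unchanged. A standalone sketch of that filtering, using the term and prefix tuples shown in the diff; how the prefixes are applied is an assumption for illustration (`str.startswith` accepts a tuple), not necessarily the exact logic in `collect_env.py`:

```python
import os

secret_terms = ('secret', 'token', 'api', 'access', 'password')
report_prefix = ("TORCH", "NCCL", "PYTORCH", "CUDA", "CUBLAS", "CUDNN",
                 "OMP_", "MKL_", "NVIDIA")

env_vars = ''
for k, v in os.environ.items():
    # Never echo values of variables whose names suggest they hold credentials.
    if any(term in k.lower() for term in secret_terms):
        continue
    # Only report variables relevant to the ML/CUDA stack.
    if k.startswith(report_prefix):
        env_vars += f"{k}={v}\n"

print(env_vars)
```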
@@ -17,7 +17,7 @@ from dataclasses import (MISSING, dataclass, field, fields, is_dataclass,
 from importlib.util import find_spec
 from pathlib import Path
 from typing import (TYPE_CHECKING, Any, Callable, ClassVar, Final, Literal,
-                    Optional, Protocol, TypeVar, Union)
+                    Optional, Protocol, TypeVar, Union, get_args)
 
 import torch
 from pydantic import BaseModel, Field, PrivateAttr

@@ -2725,6 +2725,7 @@ class PromptAdapterConfig:
             self.prompt_adapter_dtype)
 
 
+@config
 @dataclass
 class MultiModalConfig:
     """Controls the behavior of multimodal models."""

@@ -2732,6 +2733,8 @@ class MultiModalConfig:
     limit_per_prompt: Mapping[str, int] = field(default_factory=dict)
     """
     The maximum number of input items allowed per prompt for each modality.
+    This should be a JSON string that will be parsed into a dictionary.
+    Defaults to 1 (V0) or 999 (V1) for each modality.
     """
 
     def compute_hash(self) -> str:

@@ -2753,24 +2756,20 @@ class MultiModalConfig:
                             usedforsecurity=False).hexdigest()
         return hash_str
 
-    def get_default_limit_per_prompt(self) -> int:
-        """
-        Return the default number of input items allowed per prompt
-        for any modality if not specified by the user.
-        """
-        return 999 if envs.VLLM_USE_V1 else 1
-
     def get_limit_per_prompt(self, modality: str) -> int:
         """
         Get the maximum number of input items allowed per prompt
         for the given modality.
         """
-        default = self.get_default_limit_per_prompt()
-        return self.limit_per_prompt.get(modality, default)
+        return self.limit_per_prompt.get(
+            modality,
+            999 if envs.VLLM_USE_V1 else 1,
+        )
 
     # TODO: Add configs to init vision tower or not.
 
 
+@config
 @dataclass
 class PoolerConfig:
     """Controls the behavior of output pooling in pooling models."""
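The `MultiModalConfig` hunk above folds the removed `get_default_limit_per_prompt()` helper into a single `dict.get` call with the V0/V1 default inline. A reduced sketch of the resulting behavior, with `envs.VLLM_USE_V1` replaced by a plain boolean and a sketch class name:

```python
from dataclasses import dataclass, field
from typing import Mapping

USE_V1 = True  # stand-in for envs.VLLM_USE_V1

@dataclass
class MultiModalConfigSketch:
    # Parsed from the --limit-mm-per-prompt JSON string, e.g. '{"image": 4}'.
    limit_per_prompt: Mapping[str, int] = field(default_factory=dict)

    def get_limit_per_prompt(self, modality: str) -> int:
        # The removed helper is now inlined: 999 items per prompt on V1,
        # 1 on V0, whenever the user did not set a limit for this modality.
        return self.limit_per_prompt.get(
            modality,
            999 if USE_V1 else 1,
        )

cfg = MultiModalConfigSketch(limit_per_prompt={"image": 4})
assert cfg.get_limit_per_prompt("image") == 4
assert cfg.get_limit_per_prompt("video") == 999  # falls back to the V1 default
```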
@@ -3095,15 +3094,28 @@ def get_served_model_name(model: str,
     return served_model_name
 
 
+GuidedDecodingBackendV0 = Literal["auto", "outlines", "lm-format-enforcer",
+                                  "xgrammar"]
+GuidedDecodingBackendV1 = Literal["auto", "xgrammar", "guidance"]
+
+
+@config
 @dataclass
 class DecodingConfig:
-    """Dataclass which contains the decoding strategy of the engine"""
+    """Dataclass which contains the decoding strategy of the engine."""
 
-    # Which guided decoding algo to use.
-    # 'outlines' / 'lm-format-enforcer' / 'xgrammar'
-    guided_decoding_backend: str = "auto" if envs.VLLM_USE_V1 else "xgrammar"
+    guided_decoding_backend: Union[
+        GuidedDecodingBackendV0,
+        GuidedDecodingBackendV1] = "auto" if envs.VLLM_USE_V1 else "xgrammar"
+    """Which engine will be used for guided decoding (JSON schema / regex etc)
+    by default. With "auto", we will make opinionated choices based on request
+    contents and what the backend libraries currently support, so the behavior
+    is subject to change in each release."""
 
     reasoning_backend: Optional[str] = None
+    """Select the reasoning parser depending on the model that you're using.
+    This is used to parse the reasoning content into OpenAI API format.
+    Required for `--enable-reasoning`."""
 
     def compute_hash(self) -> str:
         """
@@ -3125,17 +3137,12 @@ class DecodingConfig:
         return hash_str
 
     def __post_init__(self):
-        v0_valid_guided_backends = [
-            'outlines', 'lm-format-enforcer', 'xgrammar', 'auto'
-        ]
-        v1_valid_guided_backends = ['xgrammar', 'guidance', 'auto']
-
         backend = GuidedDecodingParams(
             backend=self.guided_decoding_backend).backend_name
         if envs.VLLM_USE_V1:
-            valid_guided_backends = v1_valid_guided_backends
+            valid_guided_backends = get_args(GuidedDecodingBackendV1)
         else:
-            valid_guided_backends = v0_valid_guided_backends
+            valid_guided_backends = get_args(GuidedDecodingBackendV0)
         if backend not in valid_guided_backends:
             raise ValueError(f"Invalid guided_decoding_backend '{backend}',"
                              f" must be one of {valid_guided_backends}")
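The `DecodingConfig` hunks above replace hand-maintained backend lists with `Literal` aliases and derive the valid choices via `typing.get_args`, so the type annotation and the runtime validation can no longer drift apart. A minimal sketch of that pattern:

```python
from typing import Literal, get_args

GuidedDecodingBackendV1 = Literal["auto", "xgrammar", "guidance"]

def validate_backend(backend: str) -> str:
    # get_args on a Literal alias returns its values as a tuple.
    valid = get_args(GuidedDecodingBackendV1)  # ("auto", "xgrammar", "guidance")
    if backend not in valid:
        raise ValueError(f"Invalid guided_decoding_backend '{backend}', "
                         f"must be one of {valid}")
    return backend

validate_backend("xgrammar")    # passes
# validate_backend("outlines")  # would raise ValueError under the V1 alias
```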
@@ -20,11 +20,12 @@ from vllm.config import (CacheConfig, CompilationConfig, Config, ConfigFormat,
                          DecodingConfig, Device, DeviceConfig,
                          DistributedExecutorBackend, HfOverrides,
                          KVTransferConfig, LoadConfig, LoadFormat, LoRAConfig,
-                         ModelConfig, ModelImpl, ObservabilityConfig,
-                         ParallelConfig, PoolerConfig, PoolType,
-                         PromptAdapterConfig, SchedulerConfig, SchedulerPolicy,
-                         SpeculativeConfig, TaskOption, TokenizerPoolConfig,
-                         VllmConfig, get_attr_docs, get_field)
+                         ModelConfig, ModelImpl, MultiModalConfig,
+                         ObservabilityConfig, ParallelConfig, PoolerConfig,
+                         PoolType, PromptAdapterConfig, SchedulerConfig,
+                         SchedulerPolicy, SpeculativeConfig, TaskOption,
+                         TokenizerPoolConfig, VllmConfig, get_attr_docs,
+                         get_field)
 from vllm.executor.executor_base import ExecutorBase
 from vllm.logger import init_logger
 from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS

@@ -190,7 +191,8 @@ class EngineArgs:
         TokenizerPoolConfig.pool_type
     tokenizer_pool_extra_config: dict[str, Any] = \
         get_field(TokenizerPoolConfig, "extra_config")
-    limit_mm_per_prompt: Optional[Mapping[str, int]] = None
+    limit_mm_per_prompt: Mapping[str, int] = \
+        get_field(MultiModalConfig, "limit_per_prompt")
     mm_processor_kwargs: Optional[Dict[str, Any]] = None
     disable_mm_preprocessor_cache: bool = False
     enable_lora: bool = False

@@ -252,7 +254,7 @@ class EngineArgs:
 
     additional_config: Optional[Dict[str, Any]] = None
     enable_reasoning: Optional[bool] = None
-    reasoning_parser: Optional[str] = None
+    reasoning_parser: Optional[str] = DecodingConfig.reasoning_backend
     use_tqdm_on_load: bool = LoadConfig.use_tqdm_on_load
 
     def __post_init__(self):
|
|||||||
'Examples:\n'
|
'Examples:\n'
|
||||||
'- 1k → 1000\n'
|
'- 1k → 1000\n'
|
||||||
'- 1K → 1024\n')
|
'- 1K → 1024\n')
|
||||||
parser.add_argument(
|
|
||||||
|
# Guided decoding arguments
|
||||||
|
guided_decoding_kwargs = get_kwargs(DecodingConfig)
|
||||||
|
guided_decoding_group = parser.add_argument_group(
|
||||||
|
title="DecodingConfig",
|
||||||
|
description=DecodingConfig.__doc__,
|
||||||
|
)
|
||||||
|
guided_decoding_group.add_argument(
|
||||||
'--guided-decoding-backend',
|
'--guided-decoding-backend',
|
||||||
type=str,
|
**guided_decoding_kwargs["guided_decoding_backend"])
|
||||||
default=DecodingConfig.guided_decoding_backend,
|
guided_decoding_group.add_argument(
|
||||||
help='Which engine will be used for guided decoding'
|
"--reasoning-parser",
|
||||||
' (JSON schema / regex etc) by default. Currently support '
|
# This choices is a special case because it's not static
|
||||||
'https://github.com/mlc-ai/xgrammar and '
|
choices=list(ReasoningParserManager.reasoning_parsers),
|
||||||
'https://github.com/guidance-ai/llguidance.'
|
**guided_decoding_kwargs["reasoning_backend"])
|
||||||
'Valid backend values are "xgrammar", "guidance", and "auto". '
|
|
||||||
'With "auto", we will make opinionated choices based on request '
|
|
||||||
'contents and what the backend libraries currently support, so '
|
|
||||||
'the behavior is subject to change in each release.')
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
'--logits-processor-pattern',
|
'--logits-processor-pattern',
|
||||||
type=optional_str,
|
type=optional_str,
|
||||||
@ -697,18 +703,14 @@ class EngineArgs:
|
|||||||
**tokenizer_kwargs["extra_config"])
|
**tokenizer_kwargs["extra_config"])
|
||||||
|
|
||||||
# Multimodal related configs
|
# Multimodal related configs
|
||||||
parser.add_argument(
|
multimodal_kwargs = get_kwargs(MultiModalConfig)
|
||||||
'--limit-mm-per-prompt',
|
multimodal_group = parser.add_argument_group(
|
||||||
type=nullable_kvs,
|
title="MultiModalConfig",
|
||||||
default=EngineArgs.limit_mm_per_prompt,
|
description=MultiModalConfig.__doc__,
|
||||||
# The default value is given in
|
)
|
||||||
# MultiModalConfig.get_default_limit_per_prompt
|
multimodal_group.add_argument('--limit-mm-per-prompt',
|
||||||
help=('For each multimodal plugin, limit how many '
|
**multimodal_kwargs["limit_per_prompt"])
|
||||||
'input instances to allow for each prompt. '
|
|
||||||
'Expects a comma-separated list of items, '
|
|
||||||
'e.g.: `image=16,video=2` allows a maximum of 16 '
|
|
||||||
'images and 2 videos per prompt. Defaults to '
|
|
||||||
'1 (V0) or 999 (V1) for each modality.'))
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
'--mm-processor-kwargs',
|
'--mm-processor-kwargs',
|
||||||
default=None,
|
default=None,
|
||||||
@ -1018,16 +1020,6 @@ class EngineArgs:
|
|||||||
"If enabled, the model will be able to generate reasoning content."
|
"If enabled, the model will be able to generate reasoning content."
|
||||||
)
|
)
|
||||||
|
|
||||||
parser.add_argument(
|
|
||||||
"--reasoning-parser",
|
|
||||||
type=str,
|
|
||||||
choices=list(ReasoningParserManager.reasoning_parsers),
|
|
||||||
default=None,
|
|
||||||
help=
|
|
||||||
"Select the reasoning parser depending on the model that you're "
|
|
||||||
"using. This is used to parse the reasoning content into OpenAI "
|
|
||||||
"API format. Required for ``--enable-reasoning``.")
|
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--disable-cascade-attn",
|
"--disable-cascade-attn",
|
||||||
action="store_true",
|
action="store_true",
|
||||||
|
vllm/entrypoints/cli/collect_env.py (new file, 35 lines)
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: Apache-2.0
+
+import argparse
+
+from vllm.collect_env import main as collect_env_main
+from vllm.entrypoints.cli.types import CLISubcommand
+from vllm.entrypoints.openai.cli_args import make_arg_parser
+from vllm.utils import FlexibleArgumentParser
+
+
+class CollectEnvSubcommand(CLISubcommand):
+    """The `serve` subcommand for the vLLM CLI. """
+
+    def __init__(self):
+        self.name = "collect-env"
+        super().__init__()
+
+    @staticmethod
+    def cmd(args: argparse.Namespace) -> None:
+        """Collect information about the environment."""
+        collect_env_main()
+
+    def subparser_init(
+            self,
+            subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
+        serve_parser = subparsers.add_parser(
+            "collect-env",
+            help="Start collecting environment information.",
+            description="Start collecting environment information.",
+            usage="vllm collect-env")
+        return make_arg_parser(serve_parser)
+
+
+def cmd_init() -> list[CLISubcommand]:
+    return [CollectEnvSubcommand()]
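With this new module registered (see the `CMD_MODULES` hunk below), environment information can be gathered with `vllm collect-env` instead of downloading `collect_env.py` by hand. A generic, self-contained sketch of the subcommand-dispatch pattern it plugs into; this is not the actual `vllm` entry point, just an illustration of how `cmd_init()`-style registration wires into argparse:

```python
import argparse

class CollectEnvCommand:
    """Hypothetical stand-in for vLLM's CLISubcommand-based classes."""

    name = "collect-env"

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        print("collecting environment information...")

    def subparser_init(self, subparsers: argparse._SubParsersAction):
        return subparsers.add_parser(
            self.name, help="Start collecting environment information.")

def cmd_init():
    return [CollectEnvCommand()]

def main(argv=None) -> None:
    parser = argparse.ArgumentParser(prog="vllm-cli-sketch")
    subparsers = parser.add_subparsers(required=True)
    for cmd in cmd_init():
        sub = cmd.subparser_init(subparsers)
        # Each subcommand registers the callable that should handle it.
        sub.set_defaults(dispatch_function=cmd.cmd)
    args = parser.parse_args(argv)
    args.dispatch_function(args)

if __name__ == "__main__":
    main(["collect-env"])
```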
@@ -5,6 +5,7 @@ import signal
 import sys
 
 import vllm.entrypoints.cli.benchmark.main
+import vllm.entrypoints.cli.collect_env
 import vllm.entrypoints.cli.openai
 import vllm.entrypoints.cli.serve
 import vllm.version

@@ -15,6 +16,7 @@ CMD_MODULES = [
     vllm.entrypoints.cli.openai,
     vllm.entrypoints.cli.serve,
     vllm.entrypoints.cli.benchmark.main,
+    vllm.entrypoints.cli.collect_env,
 ]
 
 
@@ -422,6 +422,7 @@ class FusedMoE(torch.nn.Module):
 
         if params_dtype is None:
             params_dtype = torch.get_default_dtype()
+        self.params_dtype = params_dtype
 
         # Note: here we guard against accessing the TP and DP groups when
         # uninitialized (this happens when testing)
@@ -51,8 +51,8 @@ class Llama4MoE(nn.Module):
         renormalize: bool,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
         router_scores, router_indices = fast_topk(gating_output, topk, dim=-1)
-        router_scores = torch.sigmoid(router_scores.float()).to(
-            hidden_states.dtype)
+        # psuedo-standard is that the router scores are floats
+        router_scores = torch.sigmoid(router_scores.float())
         return (router_scores, router_indices.to(torch.int32))
 
     def __init__(self,

@@ -672,9 +672,9 @@ class Llama4ForConditionalGeneration(nn.Module, SupportsMultiModal,
             self.config,
             None,
             prefix=maybe_prefix(prefix, "multi_modal_projector"))
 
         self.language_model = _initialize_model(
-            vllm_config=vllm_config.with_hf_config(config.text_config),
+            vllm_config=vllm_config.with_hf_config(config.text_config,
+                                                   ["LlamaForCausalLM"]),
             prefix=maybe_prefix(prefix, "language_model"),
             model_class=Llama4ForCausalLM,
         )
@@ -824,7 +824,7 @@ class Llama4ForConditionalGeneration(nn.Module, SupportsMultiModal,
         # language_model is an Llama4ForCausalLM instance. We load it's
         # using llama4's load_weights routine.
         language_model_weights, other_weights = self.separate_weights(
-            weights, prefix="language_model.model.")
+            weights, prefix="language_model.")
         loader = AutoWeightsLoader(self)
         loaded_language_model_params = loader.load_weights(
             language_model_weights)
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: Apache-2.0
+import json
 import re
 import sys
 from abc import ABC, abstractmethod

@@ -1117,8 +1118,9 @@ class BaseMultiModalProcessor(ABC, Generic[_I]):
 
         if num_items > allowed_limit:
             raise ValueError(
-                f"You set or defaulted to {modality}={allowed_limit} "
-                f"in --limit-mm-per-prompt`, but passed {num_items} "
+                "You set or defaulted to "
+                f"'{json.dumps({modality: allowed_limit})}' in "
+                f"`--limit-mm-per-prompt`, but passed {num_items} "
                 f"{modality} items in the same prompt.")
 
         return mm_items
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 import functools
+import json
 from collections import UserDict
 from collections.abc import Mapping, Sequence
 from dataclasses import dataclass

@@ -194,9 +195,9 @@ class MultiModalRegistry:
             max_items = self._limits_by_model[model_config][data_key]
             if num_items > max_items:
                 raise ValueError(
-                    f"You set {data_key}={max_items} (or defaulted to 1) in "
-                    f"`--limit-mm-per-prompt`, but found {num_items} items "
-                    "in the same prompt.")
+                    f"You set '{json.dumps({data_key: max_items})}' (or "
+                    "defaulted to 1) in `--limit-mm-per-prompt`, but found "
+                    f"{num_items} items in the same prompt.")
 
             input_dict = plugin.map_input(model_config, data_value,
                                           mm_processor_kwargs)
@@ -149,6 +149,7 @@ class Processor:
             "xgrammar", "xgrammar:disable-any-whitespace", "guidance",
             "guidance:disable-any-whitespace", "auto"
         ]
+
         engine_level_backend = self.decoding_config.guided_decoding_backend
         if engine_level_backend not in supported_backends:
             raise ValueError(f"Only {supported_backends} structured output is "

@@ -169,8 +170,15 @@ class Processor:
         if engine_level_backend.startswith("xgrammar"):
             # xgrammar with no fallback
             validate_xgrammar_grammar(params)
-            params.guided_decoding.backend = engine_level_backend
-        elif engine_level_backend == "auto":
+        elif engine_level_backend.startswith("guidance"):
+            # TODO: ideally we would have the LLTokenizer here as Lark syntax
+            # allows <|special_token|> and similar, see
+            # https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md#special-tokens
+            # Without tokenizer these are disallowed in grammars.
+            validate_guidance_grammar(params, tokenizer=None)
+        else:
+            # NOTE: engine_level_backend must be "auto" here, because we have
+            # checked supported_backends above.
             # "auto" is an opt-in to opinionated behavior where we try to
             # choose a backend based on request contents. This is not the
             # default as it is less predictable and subject to change
@@ -183,14 +191,6 @@ class Processor:
             # are not supported in xgrammar. Fall back to guidance.
             params.guided_decoding.backend = "guidance"
 
-        if engine_level_backend.startswith("guidance"):
-            # TODO ideally we would have the LLTokenizer here as Lark syntax
-            # allows <|special_token|> and similar, see
-            # https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md#special-tokens
-            # Without tokenizer these are disallowed in grammars.
-            validate_guidance_grammar(params, tokenizer=None)
-            params.guided_decoding.backend = engine_level_backend
-
     def process_inputs(
         self,
         request_id: str,