
(reasoning-outputs)=
# Reasoning Outputs

vLLM offers support for reasoning models like DeepSeek R1, which are designed to generate outputs containing both reasoning steps and final conclusions.

Reasoning models return an additional `reasoning_content` field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.

## Supported Models

vLLM currently supports the following reasoning models:

- DeepSeek R1 series (`deepseek_r1`, which looks for `<think> ... </think>`)

## Quickstart

To use reasoning models, specify the `--enable-reasoning` and `--reasoning-parser` flags when starting the server. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning --reasoning-parser deepseek_r1
```

Next, make a request to the model, which should return the reasoning content in the response.

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
```

The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.

## Streaming chat completions

Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field of chat completion response chunks.

```json
{
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",
    "created": 1694268190,
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "system_fingerprint": "fp_44709d6fcb",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "reasoning_content": "is"
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}
```

Please note that streaming the reasoning content is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests, as in the sketch below.

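As a rough illustration only, the following sketch streams the example prompt with `requests` and prints reasoning tokens as they arrive. It assumes the server started in the Quickstart is running on `localhost:8000` and that chunks arrive as OpenAI-style server-sent events in the format shown above.

```python
import json

import requests

# A minimal sketch, not an official client: stream a chat completion and
# split reasoning tokens from answer tokens as they arrive.
url = "http://localhost:8000/v1/chat/completions"  # assumes the Quickstart server
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    "stream": True,
}

with requests.post(url, json=payload, stream=True) as resp:
    for raw_line in resp.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        # Reasoning steps arrive in `reasoning_content`; the final answer in `content`.
        if delta.get("reasoning_content"):
            print(delta["reasoning_content"], end="", flush=True)
        elif delta.get("content"):
            print(delta["content"], end="", flush=True)
```
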
## How to support a new reasoning model

You can add a new `ReasoningParser` similar to `vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py`.

```python
# import the required packages
from typing import Optional, Sequence, Tuple, Union

from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage)
from vllm.entrypoints.openai.reasoning_parsers.abs_reasoning_parsers import (
    ReasoningParser, ReasoningParserManager)
from vllm.transformers_utils.tokenizer import AnyTokenizer


# define a reasoning parser and register it to vllm
# the name list in register_module can be used
# in --reasoning-parser.
@ReasoningParserManager.register_module(["example"])
class ExampleParser(ReasoningParser):
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    def extract_reasoning_content_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> Union[DeltaMessage, None]:
        """
        Instance method that should be implemented for extracting reasoning
        from an incomplete response; for use when handling reasoning calls and
        streaming. Has to be an instance method because it requires state -
        the current tokens/diffs, but also the information about what has
        previously been parsed and extracted (see constructor)
        """

    def extract_reasoning_content(
        self, model_output: str, request: ChatCompletionRequest
    ) -> Tuple[Optional[str], Optional[str]]:
        """
        Extract reasoning content from a complete model-generated string.

        Used for non-streaming responses where we have the entire model
        response available before sending to the client.

        Parameters:
        model_output: str
            The model-generated string to extract reasoning content from.
        request: ChatCompletionRequest
            The request object that was used to generate the model_output.

        Returns:
        Tuple[Optional[str], Optional[str]]
            A tuple containing the reasoning content and the content.
        """
```

After defining the reasoning parser, you can use it by specifying the `--reasoning-parser` flag when starting the server.

```bash
vllm serve <model_tag> \
    --enable-reasoning --reasoning-parser example
```

## Limitations

- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
- It is not compatible with the `structured_outputs` and `tool_calling` features.
- The reasoning content is not available for all models. Check the model's documentation to see if it supports reasoning.