# SPDX-License-Identifier: Apache-2.0

import os
import subprocess
import sys
import time
from multiprocessing import Pool
from pathlib import Path

import pytest
import requests


def _query_server(prompt: str, max_tokens: int = 5) -> dict:
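    # Query the local server's /generate endpoint. Temperature 0 keeps the
    # output deterministic, and ignore_eos lets generation run for the full
    # max_tokens budget instead of stopping at the EOS token.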
    response = requests.post("http://localhost:8000/generate",
                             json={
                                 "prompt": prompt,
                                 "max_tokens": max_tokens,
                                 "temperature": 0,
                                 "ignore_eos": True
                             })
    response.raise_for_status()
    return response.json()


def _query_server_long(prompt: str) -> dict:
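    # Longer generations keep requests in flight long enough for the
    # cancellation test below to abort them mid-request.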
    return _query_server(prompt, max_tokens=500)


@pytest.fixture
def api_server(tokenizer_pool_size: int, distributed_executor_backend: str):
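    # Launch the standalone API server (api_server_async_engine.py) in a
    # subprocess, yield while the test runs, then terminate the server.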
|
2023-09-07 13:43:45 -07:00
|
|
|
script_path = Path(__file__).parent.joinpath(
|
|
|
|
"api_server_async_engine.py").absolute()
|
2024-04-16 14:24:53 +09:00
|
|
|
commands = [
|
2025-01-25 03:45:20 +08:00
|
|
|
sys.executable,
|
|
|
|
"-u",
|
|
|
|
str(script_path),
|
|
|
|
"--model",
|
|
|
|
"facebook/opt-125m",
|
|
|
|
"--host",
|
|
|
|
"127.0.0.1",
|
|
|
|
"--tokenizer-pool-size",
|
|
|
|
str(tokenizer_pool_size),
|
|
|
|
"--distributed-executor-backend",
|
|
|
|
distributed_executor_backend,
|
2024-04-16 14:24:53 +09:00
|
|
|
]
|
2024-08-14 13:44:27 -03:00
|
|
|
|
2025-03-15 01:02:20 -04:00
|
|
|
# API Server Test Requires V0.
|
|
|
|
my_env = os.environ.copy()
|
|
|
|
my_env["VLLM_USE_V1"] = "0"
|
|
|
|
uvicorn_process = subprocess.Popen(commands, env=my_env)
|
2023-09-07 13:43:45 -07:00
|
|
|
yield
|
|
|
|
uvicorn_process.terminate()
|
|
|
|
|
|
|
|
|
2024-03-15 16:37:01 -07:00
|
|
|
@pytest.mark.parametrize("tokenizer_pool_size", [0, 2])
@pytest.mark.parametrize("distributed_executor_backend", ["mp", "ray"])
def test_api_server(api_server, tokenizer_pool_size: int,
                    distributed_executor_backend: str):
    """
    Run the API server and test it.

    We run both the server and requests in separate processes.

    We test that the server can handle incoming requests, including
    multiple requests at the same time, and that it can handle requests
    being cancelled without crashing.
    """
    with Pool(32) as pool:
        # Wait until the server is ready
        prompts = ["warm up"] * 1
        result = None
        while not result:
            try:
                for r in pool.map(_query_server, prompts):
                    result = r
                    break
            except requests.exceptions.ConnectionError:
                time.sleep(1)

        # Actual tests start here
        # Try with 1 prompt
        for result in pool.map(_query_server, prompts):
            assert result

        num_aborted_requests = requests.get(
            "http://localhost:8000/stats").json()["num_aborted_requests"]
        assert num_aborted_requests == 0

        # Try with 100 prompts
        prompts = ["test prompt"] * 100
        for result in pool.map(_query_server, prompts):
            assert result
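
    # Terminating the worker pool drops the in-flight HTTP connections, so the
    # server should record these requests as aborted rather than crash.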
    with Pool(32) as pool:
        # Cancel requests
        prompts = ["canceled requests"] * 100
        pool.map_async(_query_server_long, prompts)
        time.sleep(0.01)
        pool.terminate()
        pool.join()

        # check cancellation stats
        # give it some time to update the stats
        time.sleep(1)

        num_aborted_requests = requests.get(
            "http://localhost:8000/stats").json()["num_aborted_requests"]
        assert num_aborted_requests > 0

    # check that server still runs after cancellations
    with Pool(32) as pool:
        # Try with 100 prompts
        prompts = ["test prompt after canceled"] * 100
        for result in pool.map(_query_server, prompts):
            assert result