vllm/tests/entrypoints/openai/test_models.py

import openai  # use the official client for correctness check
import pytest
# using Ray for overall ease of process management, parallel requests,
# and debugging.
import ray
# downloading lora to test lora requests
from huggingface_hub import snapshot_download
from ...utils import VLLM_PATH, RemoteOpenAIServer

# any model with a chat template should work here
MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"
# technically this needs Mistral-7B-v0.1 as base, but we're not testing
# generation quality here
LORA_NAME = "typeof/zephyr-7b-beta-lora"
@pytest.fixture(scope="module")
def zephyr_lora_files():
    return snapshot_download(repo_id=LORA_NAME)


@pytest.fixture(scope="module")
def ray_ctx():
    ray.init(runtime_env={"working_dir": VLLM_PATH})
    yield
    ray.shutdown()


@pytest.fixture(scope="module")
def server(zephyr_lora_files, ray_ctx):
    return RemoteOpenAIServer([
"--model",
MODEL_NAME,
# use half precision for speed and memory savings in CI environment
"--dtype",
"bfloat16",
"--max-model-len",
"8192",
"--enforce-eager",
# lora config below
"--enable-lora",
"--lora-modules",
f"zephyr-lora={zephyr_lora_files}",
f"zephyr-lora2={zephyr_lora_files}",
"--max-lora-rank",
"64",
"--max-cpu-loras",
"2",
"--max-num-seqs",
"128",
    ])


@pytest.fixture(scope="module")
def client(server):
    return server.get_async_client()
@pytest.mark.asyncio
async def test_check_models(client: openai.AsyncOpenAI):
    models = await client.models.list()
    models = models.data
    served_model = models[0]
    lora_models = models[1:]
    assert served_model.id == MODEL_NAME
    assert all(model.root == MODEL_NAME for model in models)
    assert lora_models[0].id == "zephyr-lora"
    assert lora_models[1].id == "zephyr-lora2"
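

# Not part of the original file: a minimal follow-up sketch, assuming the LoRA
# adapters registered above via --lora-modules also serve completion requests
# through the standard OpenAI completions API; the prompt and token budget are
# purely illustrative.
@pytest.mark.asyncio
async def test_lora_completion_sketch(client: openai.AsyncOpenAI):
    completion = await client.completions.create(
        model="zephyr-lora",  # adapter name registered with the server
        prompt="Hello, my name is",
        max_tokens=5,
        temperature=0.0,
    )
    # we only check that the adapter answers, not the generation quality
    assert len(completion.choices) == 1
    assert completion.choices[0].text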