vllm/tests/distributed/test_basic_distributed_correctness.py

"""Compare the outputs of HF and distributed vLLM when using greedy sampling.
vLLM will allocate all the available memory, so the tests must be run one at
a time, each in a separate pytest invocation. To select the model for a run,
pass its name through the TEST_DIST_MODEL environment variable.
Run:
```sh
cd $VLLM_PATH/tests

TEST_DIST_MODEL=facebook/opt-125m pytest \
    distributed/test_basic_distributed_correctness.py
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest \
    distributed/test_basic_distributed_correctness.py
```
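The one-model-per-invocation pattern above can also be scripted. The following
is a minimal sketch, not part of vLLM: `build_env`, `run_all`, and the model
list are hypothetical names chosen for illustration, and the script is assumed
to be started from $VLLM_PATH/tests.
```python
import os
import subprocess
import sys

# Hypothetical driver (an assumption, not shipped with vLLM).
TEST_FILE = "distributed/test_basic_distributed_correctness.py"
DIST_MODELS = ["facebook/opt-125m", "meta-llama/Llama-2-7b-hf"]


def build_env(model, base_env=None):
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["TEST_DIST_MODEL"] = model
    return env


def run_all(models=DIST_MODELS):
    # One pytest process per model, so memory is released between runs.
    for model in models:
        subprocess.run([sys.executable, "-m", "pytest", TEST_FILE],
                       env=build_env(model), check=True)


if __name__ == "__main__":
    run_all()
```
With check=True the loop stops at the first failing model, which keeps a
failure from being buried under later runs.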
"""
import os

import pytest

from vllm.utils import cuda_device_count_stateless

from ..models.utils import check_outputs_equal

MODELS = [
    os.environ["TEST_DIST_MODEL"],
]
DISTRIBUTED_EXECUTOR_BACKEND = "DISTRIBUTED_EXECUTOR_BACKEND"


@pytest.mark.skipif(cuda_device_count_stateless() < 2,
                    reason="Need at least 2 GPUs to run the test.")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [5])
def test_models(
    hf_runner,
    vllm_runner,
    example_prompts,
    model: str,
    dtype: str,
    max_tokens: int,
) -> None:
    distributed_executor_backend = os.getenv(DISTRIBUTED_EXECUTOR_BACKEND)

    # NOTE: the order matters. Run vLLM first, and then run HF.
    # vLLM needs a fresh process in which CUDA has not been initialized.
    # If HF runs first, CUDA is already initialized in this process, which
    # hurts the multiprocessing backend with the fork method (the default).
    with vllm_runner(model,
                     dtype=dtype,
                     tensor_parallel_size=2,
                     distributed_executor_backend=distributed_executor_backend
                     ) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)

    with hf_runner(model, dtype=dtype) as hf_model:
        hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)

    check_outputs_equal(
        outputs_0_lst=hf_outputs,
        outputs_1_lst=vllm_outputs,
        name_0="hf",
        name_1="vllm",
    )