vllm/tests/distributed/test_basic_distributed_correctness.py

"""Compare the outputs of HF and distributed vLLM when using greedy sampling.
vLLM will allocate all the available memory, so we need to run the tests one
by one. The solution is to pass arguments (model name) by environment
variables.
Run:
```sh
cd $VLLM_PATH/tests

TEST_DIST_MODEL=facebook/opt-125m pytest \
    distributed/test_basic_distributed_correctness.py
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf \
    distributed/test_basic_distributed_correctness.py
```
"""
import os

import pytest
import torch

MODELS = [
    os.environ["TEST_DIST_MODEL"],
]
DISTRIBUTED_EXECUTOR_BACKEND = "DISTRIBUTED_EXECUTOR_BACKEND"
VLLM_ATTENTION_BACKEND = "VLLM_ATTENTION_BACKEND"


@pytest.mark.skipif(torch.cuda.device_count() < 2,
                    reason="Need at least 2 GPUs to run the test.")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [5])
def test_models(
    hf_runner,
    vllm_runner,
    example_prompts,
    model: str,
    dtype: str,
    max_tokens: int,
) -> None:
    distributed_executor_backend = os.getenv(DISTRIBUTED_EXECUTOR_BACKEND)

    backend_by_env_var = os.getenv(VLLM_ATTENTION_BACKEND)
    enforce_eager = backend_by_env_var == "FLASHINFER"

    with hf_runner(model, dtype=dtype) as hf_model:
        hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)

    vllm_model = vllm_runner(
        model,
        dtype=dtype,
        tensor_parallel_size=2,
        enforce_eager=enforce_eager,
        distributed_executor_backend=distributed_executor_backend)
    vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
    del vllm_model

    for i in range(len(example_prompts)):
        hf_output_ids, hf_output_str = hf_outputs[i]
        vllm_output_ids, vllm_output_str = vllm_outputs[i]
        assert hf_output_str == vllm_output_str, (
            f"Test{i}:\nHF: {hf_output_str!r}\nvLLM: {vllm_output_str!r}")
        assert hf_output_ids == vllm_output_ids, (
            f"Test{i}:\nHF: {hf_output_ids}\nvLLM: {vllm_output_ids}")
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00			`"""Compare the outputs of HF and distributed vLLM when using greedy sampling.`
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00			`vLLM will allocate all the available memory, so we need to run the tests one`
			`by one. The solution is to pass arguments (model name) by environment`
			`variables.`
			`Run:`
			```sh
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py. 2024-05-13 22:50:09 +08:00			`cd $VLLM_PATH/tests`

[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00			`TEST_DIST_MODEL=facebook/opt-125m pytest \`
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py. 2024-05-13 22:50:09 +08:00			`distributed/test_basic_distributed_correctness.py`
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00			`TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf \`
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py. 2024-05-13 22:50:09 +08:00			`distributed/test_basic_distributed_correctness.py`
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00			```
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00			`"""`
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00			`import os`

[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00			`import pytest`
			`import torch`

			`MODELS = [`
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00			`os.environ["TEST_DIST_MODEL"],`
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00			`]`
[Core] Add MultiprocessingGPUExecutor (#4539) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com> 2024-05-14 10:38:59 -07:00			`DISTRIBUTED_EXECUTOR_BACKEND = "DISTRIBUTED_EXECUTOR_BACKEND"`
[Kernel] Use flashinfer for decoding (#4353) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com> 2024-05-03 15:51:27 -07:00			`VLLM_ATTENTION_BACKEND = "VLLM_ATTENTION_BACKEND"`
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00

			`@pytest.mark.skipif(torch.cuda.device_count() < 2,`
			`reason="Need at least 2 GPUs to run the test.")`
			`@pytest.mark.parametrize("model", MODELS)`
			`@pytest.mark.parametrize("dtype", ["half"])`
			`@pytest.mark.parametrize("max_tokens", [5])`
			`def test_models(`
			`hf_runner,`
			`vllm_runner,`
			`example_prompts,`
			`model: str,`
			`dtype: str,`
			`max_tokens: int,`
			`) -> None:`
[Core] Add MultiprocessingGPUExecutor (#4539) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com> 2024-05-14 10:38:59 -07:00			`distributed_executor_backend = os.getenv(DISTRIBUTED_EXECUTOR_BACKEND)`

[Kernel] Use flashinfer for decoding (#4353) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com> 2024-05-03 15:51:27 -07:00			`backend_by_env_var = os.getenv(VLLM_ATTENTION_BACKEND)`
[Core] Add MultiprocessingGPUExecutor (#4539) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com> 2024-05-14 10:38:59 -07:00			`enforce_eager = backend_by_env_var == "FLASHINFER"`
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-11 09:56:48 +09:00
[CI/Test] improve robustness of test (hf_runner) (#5347) [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347) 2024-06-07 22:31:32 -07:00			`with hf_runner(model, dtype=dtype) as hf_model:`
			`hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)`
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
[Core] Add MultiprocessingGPUExecutor (#4539) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com> 2024-05-14 10:38:59 -07:00			`vllm_model = vllm_runner(`
			`model,`
			`dtype=dtype,`
			`tensor_parallel_size=2,`
			`enforce_eager=enforce_eager,`
			`distributed_executor_backend=distributed_executor_backend)`
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00			`vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)`
			`del vllm_model`

			`for i in range(len(example_prompts)):`
			`hf_output_ids, hf_output_str = hf_outputs[i]`
			`vllm_output_ids, vllm_output_str = vllm_outputs[i]`
			`assert hf_output_str == vllm_output_str, (`
			`f"Test{i}:\nHF: {hf_output_str!r}\nvLLM: {vllm_output_str!r}")`
			`assert hf_output_ids == vllm_output_ids, (`
			`f"Test{i}:\nHF: {hf_output_ids}\nvLLM: {vllm_output_ids}")`