vllm/vllm/utils.py

import enum
import os
import socket
import uuid
from platform import uname
from typing import List

import psutil
import torch

from vllm._C import cuda_utils


class Device(enum.Enum):
    GPU = enum.auto()
    CPU = enum.auto()


class Counter:
    """A monotonically increasing counter."""

    def __init__(self, start: int = 0) -> None:
        self.counter = start

    def __next__(self) -> int:
        i = self.counter
        self.counter += 1
        return i

    def reset(self) -> None:
        self.counter = 0
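
# Illustrative usage (a sketch, not part of the original file): Counter
# implements the iterator protocol via __next__, so consecutive integers can
# be drawn with the builtin next():
#
#     request_counter = Counter()
#     assert next(request_counter) == 0
#     assert next(request_counter) == 1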


def is_hip() -> bool:
    # torch.version.hip is a version string on ROCm builds of PyTorch and
    # None on CUDA builds.
    return torch.version.hip is not None


def get_max_shared_memory_bytes(gpu: int = 0) -> int:
    """Returns the maximum shared memory per thread block in bytes."""
    # https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
    # 97 is cudaDevAttrMaxSharedMemoryPerBlockOptin in the CUDA runtime API;
    # 74 is the analogous device attribute on ROCm/HIP.
    cudaDevAttrMaxSharedMemoryPerBlockOptin = 97 if not is_hip() else 74
    max_shared_mem = cuda_utils.get_device_attribute(
        cudaDevAttrMaxSharedMemoryPerBlockOptin, gpu)
    return int(max_shared_mem)
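
# Background note (not part of the original file): on recent NVIDIA GPUs,
# kernels can opt in to more shared memory per block than the default 48 KiB,
# which is why the "Optin" attribute is queried, e.g.:
#
#     max_smem = get_max_shared_memory_bytes(gpu=0)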


def get_cpu_memory() -> int:
    """Returns the total CPU memory of the node in bytes."""
    return psutil.virtual_memory().total


def random_uuid() -> str:
    return str(uuid.uuid4().hex)


def in_wsl() -> bool:
    # Reference: https://github.com/microsoft/WSL/issues/4071
    # WSL kernels report a release string containing "microsoft"
    # (e.g. "...-microsoft-standard-WSL2"), which platform.uname() exposes.
    return "microsoft" in " ".join(uname()).lower()


def get_ip() -> str:
    return socket.gethostbyname(socket.gethostname())


def get_open_port() -> int:
    # Bind to port 0 so the OS assigns a free ephemeral port, then return it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
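
# Illustrative usage (a sketch, not part of the original file): these two
# helpers are typically combined to build a distributed init address:
#
#     distributed_init_method = f"tcp://{get_ip()}:{get_open_port()}"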


def set_cuda_visible_devices(device_ids: List[int]) -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, device_ids))
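
# Illustrative usage (a sketch, not part of the original file). Note that
# CUDA_VISIBLE_DEVICES is read when CUDA is first initialized, so this must
# be called before any CUDA context is created in the process:
#
#     set_cuda_visible_devices([0, 2])  # restrict the process to GPUs 0 and 2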