.. _installation_tpu:

#####################
Installation with TPU
#####################

Tensor Processing Units (TPUs) are Google's custom-developed
application-specific integrated circuits (ASICs) used to accelerate machine
learning workloads. TPUs are available in different versions, each with
different hardware specifications. For more information about TPUs, see
`TPU System Architecture `_. For more information on the TPU versions
supported with vLLM, see:

* `TPU v6e `_
* `TPU v5e `_
* `TPU v5p `_
* `TPU v4 `_

These TPU versions allow you to configure the physical arrangement of the TPU
chips, which can improve throughput and networking performance. For more
information, see:

* `TPU v6e topologies `_
* `TPU v5e topologies `_
* `TPU v5p topologies `_
* `TPU v4 topologies `_

To use Cloud TPUs, you need TPU quota granted to your Google Cloud Platform
project. TPU quotas specify how many TPUs you can use in a GCP project and
are specified in terms of TPU version, the number of TPUs you want to use,
and quota type. For more information, see `TPU quota `_.

For TPU pricing information, see `Cloud TPU pricing `_.

You may need additional persistent storage for your TPU VMs. For more
information, see `Storage options for Cloud TPU data `_.

Requirements
------------

* Google Cloud TPU VM
* TPU versions: v6e, v5e, v5p, v4
* Python: 3.10 or newer

Provision Cloud TPUs
====================

You can provision Cloud TPUs using the `Cloud TPU API `_ or the
`queued resources `_ API. This section shows how to create TPUs using the
queued resource API. For more information about using the Cloud TPU API, see
`Create a Cloud TPU using the Create Node API `_.

`Queued resources `_ enable you to request Cloud TPU resources in a queued
manner. When you request queued resources, the request is added to a queue
maintained by the Cloud TPU service. When the requested resource becomes
available, it's assigned to your Google Cloud project for your immediate
exclusive use.

Provision a Cloud TPU with the queued resource API
--------------------------------------------------

Create a TPU v5e with 4 TPU chips:

.. code-block:: console

   gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
     --node-id TPU_NAME \
     --project PROJECT_ID \
     --zone ZONE \
     --accelerator-type ACCELERATOR_TYPE \
     --runtime-version RUNTIME_VERSION \
     --service-account SERVICE_ACCOUNT

.. list-table:: Parameter descriptions
   :header-rows: 1

   * - Parameter name
     - Description
   * - QUEUED_RESOURCE_ID
     - The user-assigned ID of the queued resource request.
   * - TPU_NAME
     - The user-assigned name of the TPU which is created when the queued
       resource request is allocated.
   * - PROJECT_ID
     - Your Google Cloud project.
   * - ZONE
     - The `zone `_ where you want to create your Cloud TPU.
   * - ACCELERATOR_TYPE
     - The TPU version you want to use. Specify the TPU version, followed by
       a '-' and the number of TPU cores. For example `v5e-4` specifies a v5e
       TPU with 4 cores. For more information, see `TPU versions `_.
   * - RUNTIME_VERSION
     - The TPU VM runtime version to use. For more information, see
       `TPU VM images `_.
   * - SERVICE_ACCOUNT
     - The email address for your service account. You can find it in the IAM
       Cloud Console under *Service Accounts*. For example:
       `tpu-service-account@.iam.gserviceaccount.com`
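A queued resource request may wait in the queue before the TPU is allocated.
To check its progress, you can describe the request; this is a sketch that
assumes the same ``gcloud alpha`` command group used above is available in
your gcloud version:

.. code-block:: console

   # Check the state of the queued resource request; once the output reports
   # an ACTIVE state, the TPU VM has been allocated and is ready to use.
   gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID \
     --project PROJECT_ID \
     --zone ZONE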
Connect to your TPU using SSH:

.. code-block:: bash

   gcloud compute tpus tpu-vm ssh TPU_NAME

Create and activate a Conda environment for vLLM:

.. code-block:: bash

   conda create -n vllm python=3.10 -y
   conda activate vllm

Clone the vLLM repository and go to the vLLM directory:

.. code-block:: bash

   git clone https://github.com/vllm-project/vllm.git && cd vllm

Uninstall the existing `torch` and `torch_xla` packages:

.. code-block:: bash

   pip uninstall torch torch-xla -y

Install build dependencies:

.. code-block:: bash

   pip install -r requirements-tpu.txt
   sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev

Run the setup script:

.. code-block:: bash

   VLLM_TARGET_DEVICE="tpu" python setup.py develop

Provision Cloud TPUs with GKE
-----------------------------

For more information about using TPUs with GKE, see:

* https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus

.. _build_docker_tpu:

Build a docker image with :code:`Dockerfile.tpu`
------------------------------------------------

You can use `Dockerfile.tpu `_ to build a Docker image with TPU support.

.. code-block:: console

   $ docker build -f Dockerfile.tpu -t vllm-tpu .

Run the Docker image with the following command:

.. code-block:: console

   $ # Make sure to add `--privileged --net host --shm-size=16G`.
   $ docker run --privileged --net host --shm-size=16G -it vllm-tpu

.. note::

   Since TPU relies on XLA, which requires static shapes, vLLM bucketizes the
   possible input shapes and compiles an XLA graph for each shape. The
   compilation may take 20~30 minutes on the first run. Afterwards, the
   compilation time drops to about 5 minutes because the XLA graphs are cached
   on disk (in :code:`VLLM_XLA_CACHE_PATH` or :code:`~/.cache/vllm/xla_cache`
   by default).

.. tip::

   If you encounter the following error:

   .. code-block:: console

      from torch._C import * # noqa: F403
      ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory

   Install OpenBLAS with the following command:

   .. code-block:: console

      $ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
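After running the setup script (or inside the Docker container), you can
sanity-check the installation by running one of the offline inference
examples shipped in the repository. This is a minimal smoke test; the script
path below is an assumption about the repository layout, so adjust it if your
checkout differs. Expect the long first-run XLA compilation described in the
note above:

.. code-block:: console

   $ # Run from the root of the vLLM checkout; this downloads a small model
   $ # from the HuggingFace Hub and prints a few generated completions.
   $ python examples/offline_inference.py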