.. _installation_tpu:

#####################
Installation with TPU
#####################

Tensor Processing Units (TPUs) are Google's custom-developed application-specific 
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs 
are available in different versions, each with different hardware specifications.
For more information about TPUs, see `TPU System Architecture <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm>`_. 
For more information on the TPU versions supported with vLLM, see:

* `TPU v6e <https://cloud.google.com/tpu/docs/v6e>`_
* `TPU v5e <https://cloud.google.com/tpu/docs/v5e>`_
* `TPU v5p <https://cloud.google.com/tpu/docs/v5p>`_
* `TPU v4 <https://cloud.google.com/tpu/docs/v4>`_

These TPU versions allow you to configure the physical arrangement (topology)
of the TPU chips. This can improve throughput and networking performance. For
more information, see:

* `TPU v6e topologies <https://cloud.google.com/tpu/docs/v6e#configurations>`_
* `TPU v5e topologies <https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config>`_
* `TPU v5p topologies <https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config>`_
* `TPU v4 topologies <https://cloud.google.com/tpu/docs/v4#tpu-v4-config>`_

To use Cloud TPUs, you need TPU quota granted to your Google Cloud project.
TPU quotas specify how many TPUs you can use in a Google Cloud project and are
defined in terms of TPU version, the number of TPUs you want to use, and quota
type. For more information, see `TPU quota <https://cloud.google.com/tpu/docs/quota#tpu_quota>`_.

For TPU pricing information, see `Cloud TPU pricing <https://cloud.google.com/tpu/pricing>`_.

You may need additional persistent storage for your TPU VMs. For more
information, see `Storage options for Cloud TPU data <https://cloud.google.com/tpu/docs/storage-options>`_.

Requirements
------------

* Google Cloud TPU VM 
* TPU versions: v6e, v5e, v5p, v4
* Python: 3.10 or newer

Provision Cloud TPUs
====================

You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_
API. This section shows how to create TPUs using the queued resources API.
For more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.

`Queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_
let you request Cloud TPU resources in a queued manner. When you request
queued resources, the request is added to a queue maintained by the Cloud TPU
service. When the requested resource becomes available, it's assigned to your
Google Cloud project for your immediate exclusive use.

Provision a Cloud TPU with the queued resource API
--------------------------------------------------

Create a TPU v5e with 4 TPU chips:

.. code-block:: console

    gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
    --node-id TPU_NAME \
    --project PROJECT_ID \
    --zone ZONE \
    --accelerator-type ACCELERATOR_TYPE \
    --runtime-version RUNTIME_VERSION \
    --service-account SERVICE_ACCOUNT

.. list-table:: Parameter descriptions
    :header-rows: 1

    * - Parameter name
      - Description
    * - QUEUED_RESOURCE_ID
      - The user-assigned ID of the queued resource request.
    * - TPU_NAME
      - The user-assigned name of the TPU which is created when the queued 
        resource request is allocated.
    * - PROJECT_ID
      - Your Google Cloud project ID.
    * - ZONE
      - The `zone <https://cloud.google.com/tpu/docs/regions-zones>`_ where you 
        want to create your Cloud TPU.
    * - ACCELERATOR_TYPE
      - The TPU version you want to use. Specify the TPU version, followed by a
        '-' and the number of TPU cores. For example, ``v5e-4`` specifies a v5e TPU
        with 4 cores. For more information, see `TPU versions <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
    * - RUNTIME_VERSION
      - The TPU VM runtime version to use. For more information, see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
    * - SERVICE_ACCOUNT
      - The email address for your service account. You can find it in the IAM
        Cloud Console under *Service Accounts*. For example:
        ``tpu-service-account@<your_project_ID>.iam.gserviceaccount.com``
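
As an illustration, the placeholders above could be filled in as follows.
Every value in this sketch is a hypothetical example (including the zone), not
a recommendation, and ``RUNTIME_VERSION`` is left as a placeholder because the
correct value depends on your TPU version. The command is only printed, so you
can review it before running it:

.. code-block:: bash

    # Hypothetical example values; replace each one with your own.
    QUEUED_RESOURCE_ID="my-queued-resource"
    TPU_NAME="my-vllm-tpu"
    PROJECT_ID="my-project-id"
    ZONE="us-west4-a"
    TPU_VERSION="v5e"
    TPU_CORES=4
    # Accelerator type: TPU version, a '-', then the number of TPU cores.
    ACCELERATOR_TYPE="${TPU_VERSION}-${TPU_CORES}"
    # Placeholder: look up the runtime for your TPU version before running.
    RUNTIME_VERSION="RUNTIME_VERSION"
    SERVICE_ACCOUNT="tpu-service-account@${PROJECT_ID}.iam.gserviceaccount.com"

    # Dry run: echo prints the command instead of executing it.
    echo gcloud alpha compute tpus queued-resources create "$QUEUED_RESOURCE_ID" \
        --node-id "$TPU_NAME" \
        --project "$PROJECT_ID" \
        --zone "$ZONE" \
        --accelerator-type "$ACCELERATOR_TYPE" \
        --runtime-version "$RUNTIME_VERSION" \
        --service-account "$SERVICE_ACCOUNT"

Remove the leading ``echo`` to submit the request once the printed command
looks right.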

Connect to your TPU using SSH:

.. code-block:: bash

    gcloud compute tpus tpu-vm ssh TPU_NAME

Create and activate a Conda environment for vLLM:

.. code-block:: bash

    conda create -n vllm python=3.10 -y
    conda activate vllm

Clone the vLLM repository and go to the vLLM directory:

.. code-block:: bash

    git clone https://github.com/vllm-project/vllm.git && cd vllm

Uninstall the existing ``torch`` and ``torch_xla`` packages:

.. code-block:: bash

    pip uninstall torch torch-xla -y

Install build dependencies:

.. code-block:: bash

    pip install -r requirements-tpu.txt
    sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev 

Run the setup script:

.. code-block:: bash

   VLLM_TARGET_DEVICE="tpu" python setup.py develop

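To confirm that the build finished and the packages can be imported, a quick
check along these lines may help. This is a sketch rather than part of the
official procedure: ``check_module`` is a hypothetical helper, and it assumes
that ``python`` resolves to the interpreter of the ``vllm`` Conda environment.

.. code-block:: bash

    # check_module NAME: report whether Python can import the module NAME.
    check_module() {
        if python -c "import $1" >/dev/null 2>&1; then
            echo "$1: ok"
        else
            echo "$1: import failed"
        fi
    }

    # Check the packages installed by the steps above.
    for module in torch torch_xla vllm; do
        check_module "$module"
    done

An ``import failed`` line suggests that one of the earlier steps did not
complete, or that a shared library needed by the package is missing.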

Provision Cloud TPUs with GKE 
-----------------------------

For more information about using TPUs with GKE, see:

* https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus

.. _build_docker_tpu:

Build a Docker image with :code:`Dockerfile.tpu`
------------------------------------------------

You can use `Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ 
to build a Docker image with TPU support.

.. code-block:: console

    $ docker build -f Dockerfile.tpu -t vllm-tpu .

Run the Docker image with the following command:

.. code-block:: console

    $ # Make sure to add `--privileged --net host --shm-size=16G`.
    $ docker run --privileged --net host --shm-size=16G -it vllm-tpu

.. note::

    TPUs rely on XLA, which requires static shapes, so vLLM groups the possible input shapes into buckets and compiles an XLA graph for each bucket shape.
    Compilation may take 20 to 30 minutes on the first run.
    On later runs, it drops to about 5 minutes because the XLA graphs are cached on disk (in :code:`VLLM_XLA_CACHE_PATH`, or :code:`~/.cache/vllm/xla_cache` by default).
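
One way to keep the XLA cache across container restarts is to store it on the
host and mount it into the container. The following is a sketch under stated
assumptions: the mount target ``/root/.cache/vllm/xla_cache`` assumes that the
container runs as ``root``, and the variable names are hypothetical. The
``docker run`` command is printed rather than executed so that you can review
it first:

.. code-block:: bash

    # Host directory for the XLA cache. Uses VLLM_XLA_CACHE_PATH if it is
    # already set; otherwise falls back to the default location.
    HOST_CACHE_DIR="${VLLM_XLA_CACHE_PATH:-$HOME/.cache/vllm/xla_cache}"
    mkdir -p "$HOST_CACHE_DIR"

    # Assumption: the container runs as root, so its home directory is /root.
    CONTAINER_CACHE_DIR="/root/.cache/vllm/xla_cache"

    # Dry run: echo prints the command instead of executing it.
    echo docker run --privileged --net host --shm-size=16G \
        -v "$HOST_CACHE_DIR:$CONTAINER_CACHE_DIR" \
        -e "VLLM_XLA_CACHE_PATH=$CONTAINER_CACHE_DIR" \
        -it vllm-tpu

Remove the leading ``echo`` to start the container with the cache directory
mounted.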

.. tip::

    If you encounter the following error:

    .. code-block:: console

        from torch._C import *  # noqa: F403
        ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory


    Install OpenBLAS with the following command:

    .. code-block:: console

        $ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev