193 lines
6.6 KiB
Markdown
193 lines
6.6 KiB
Markdown
(installation-tpu)=
|
|
|
|
# Installation with TPU
|
|
|
|
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
|
|
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
|
|
are available in different versions each with different hardware specifications.
|
|
For more information about TPUs, see [TPU System Architecture](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).
|
|
For more information on the TPU versions supported with vLLM, see:
|
|
|
|
- [TPU v6e](https://cloud.google.com/tpu/docs/v6e)
|
|
- [TPU v5e](https://cloud.google.com/tpu/docs/v5e)
|
|
- [TPU v5p](https://cloud.google.com/tpu/docs/v5p)
|
|
- [TPU v4](https://cloud.google.com/tpu/docs/v4)
|
|
|
|
These TPU versions allow you to configure the physical arrangements of the TPU
|
|
chips. This can improve throughput and networking performance. For more
|
|
information see:
|
|
|
|
- [TPU v6e topologies](https://cloud.google.com/tpu/docs/v6e#configurations)
|
|
- [TPU v5e topologies](https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config)
|
|
- [TPU v5p topologies](https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config)
|
|
- [TPU v4 topologies](https://cloud.google.com/tpu/docs/v4#tpu-v4-config)
|
|
|
|
In order for you to use Cloud TPUs you need to have TPU quota granted to your
|
|
Google Cloud Platform project. TPU quotas specify how many TPUs you can use in a
|
|
GPC project and are specified in terms of TPU version, the number of TPU you
|
|
want to use, and quota type. For more information, see [TPU quota](https://cloud.google.com/tpu/docs/quota#tpu_quota).
|
|
|
|
For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tpu/pricing).
|
|
|
|
You may need additional persistent storage for your TPU VMs. For more
|
|
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp.google.com/tpu/docs/storage-options).
|
|
|
|
## Requirements
|
|
|
|
- Google Cloud TPU VM
|
|
- TPU versions: v6e, v5e, v5p, v4
|
|
- Python: 3.10 or newer
|
|
|
|
### Provision Cloud TPUs
|
|
|
|
You can provision Cloud TPUs using the [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest)
|
|
or the [queued resources](https://cloud.google.com/tpu/docs/queued-resources)
|
|
API. This section shows how to create TPUs using the queued resource API. For
|
|
more information about using the Cloud TPU API, see [Create a Cloud TPU using the Create Node API](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api).
|
|
Queued resources enable you to request Cloud TPU resources in a queued manner.
|
|
When you request queued resources, the request is added to a queue maintained by
|
|
the Cloud TPU service. When the requested resource becomes available, it's
|
|
assigned to your Google Cloud project for your immediate exclusive use.
|
|
|
|
```{note}
|
|
In all of the following commands, replace the ALL CAPS parameter names with
|
|
appropriate values. See the parameter descriptions table for more information.
|
|
```
|
|
|
|
## Provision a Cloud TPU with the queued resource API
|
|
|
|
Create a TPU v5e with 4 TPU chips:
|
|
|
|
```console
|
|
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
|
|
--node-id TPU_NAME \
|
|
--project PROJECT_ID \
|
|
--zone ZONE \
|
|
--accelerator-type ACCELERATOR_TYPE \
|
|
--runtime-version RUNTIME_VERSION \
|
|
--service-account SERVICE_ACCOUNT
|
|
```
|
|
|
|
```{eval-rst}
|
|
.. list-table:: Parameter descriptions
|
|
:header-rows: 1
|
|
|
|
* - Parameter name
|
|
- Description
|
|
* - QUEUED_RESOURCE_ID
|
|
- The user-assigned ID of the queued resource request.
|
|
* - TPU_NAME
|
|
- The user-assigned name of the TPU which is created when the queued
|
|
resource request is allocated.
|
|
* - PROJECT_ID
|
|
- Your Google Cloud project
|
|
* - ZONE
|
|
- The GCP zone where you want to create your Cloud TPU. The value you use
|
|
depends on the version of TPUs you are using. For more information, see
|
|
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
|
|
* - ACCELERATOR_TYPE
|
|
- The TPU version you want to use. Specify the TPU version, for example
|
|
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
|
|
see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
|
|
* - RUNTIME_VERSION
|
|
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
|
|
* - SERVICE_ACCOUNT
|
|
- The email address for your service account. You can find it in the IAM
|
|
Cloud Console under *Service Accounts*. For example:
|
|
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
|
|
```
|
|
|
|
Connect to your TPU using SSH:
|
|
|
|
```bash
|
|
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
|
|
```
|
|
|
|
Install Miniconda
|
|
|
|
```bash
|
|
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
|
|
bash Miniconda3-latest-Linux-x86_64.sh
|
|
source ~/.bashrc
|
|
```
|
|
|
|
Create and activate a Conda environment for vLLM:
|
|
|
|
```bash
|
|
conda create -n vllm python=3.10 -y
|
|
conda activate vllm
|
|
```
|
|
|
|
Clone the vLLM repository and go to the vLLM directory:
|
|
|
|
```bash
|
|
git clone https://github.com/vllm-project/vllm.git && cd vllm
|
|
```
|
|
|
|
Uninstall the existing `torch` and `torch_xla` packages:
|
|
|
|
```bash
|
|
pip uninstall torch torch-xla -y
|
|
```
|
|
|
|
Install build dependencies:
|
|
|
|
```bash
|
|
pip install -r requirements-tpu.txt
|
|
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
|
|
```
|
|
|
|
Run the setup script:
|
|
|
|
```bash
|
|
VLLM_TARGET_DEVICE="tpu" python setup.py develop
|
|
```
|
|
|
|
## Provision Cloud TPUs with GKE
|
|
|
|
For more information about using TPUs with GKE, see
|
|
<https://cloud.google.com/kubernetes-engine/docs/how-to/tpus>
|
|
<https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
|
|
<https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>
|
|
|
|
(build-docker-tpu)=
|
|
|
|
## Build a docker image with {code}`Dockerfile.tpu`
|
|
|
|
You can use <gh-file:Dockerfile.tpu> to build a Docker image with TPU support.
|
|
|
|
```console
|
|
$ docker build -f Dockerfile.tpu -t vllm-tpu .
|
|
```
|
|
|
|
Run the Docker image with the following command:
|
|
|
|
```console
|
|
$ # Make sure to add `--privileged --net host --shm-size=16G`.
|
|
$ docker run --privileged --net host --shm-size=16G -it vllm-tpu
|
|
```
|
|
|
|
```{note}
|
|
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
|
|
possible input shapes and compiles an XLA graph for each shape. The
|
|
compilation time may take 20~30 minutes in the first run. However, the
|
|
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
|
|
cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
|
|
```
|
|
|
|
````{tip}
|
|
If you encounter the following error:
|
|
|
|
```console
|
|
from torch._C import * # noqa: F403
|
|
ImportError: libopenblas.so.0: cannot open shared object file: No such
|
|
file or directory
|
|
```
|
|
|
|
Install OpenBLAS with the following command:
|
|
|
|
```console
|
|
$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
|
|
```
|
|
````
|