- First, install the required driver and Intel oneAPI 2024.2 or later.
- Second, install the Python packages required to build the vLLM XPU backend:
```console
$ source /opt/intel/oneapi/setvars.sh
$ pip install --upgrade pip
$ pip install -v -r requirements-xpu.txt
```
- Finally, build and install the vLLM XPU backend (an optional sanity check is sketched after the note below):
```console
$ VLLM_TARGET_DEVICE=xpu python setup.py install
```
```{note}
- FP16 is the default data type in the current XPU backend. The BF16 data
type will be supported in the future.
```
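
After installation, you may want to confirm that PyTorch can see the XPU device. This is only a sketch: `torch.xpu.is_available()` assumes a PyTorch build (or Intel Extension for PyTorch) with XPU support, which is what the requirements above install.

```console
$ # Sketch: verify the XPU device is visible to PyTorch.
$ # Assumes a PyTorch/IPEX build with XPU support.
$ python -c "import torch; print(torch.xpu.is_available())"
```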
## Distributed inference and serving
The XPU platform supports tensor parallel inference/serving, and also supports pipeline parallel as a beta feature for online serving. Ray is required as the distributed runtime backend. A reference execution is shown below.
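The following is a sketch of such a launch; the model name is a placeholder, and the flags (`--device`, `--dtype`, `--distributed-executor-backend`, `--pipeline-parallel-size`, `--tensor-parallel-size`) should be checked against your vLLM version:

```console
$ # Sketch only: substitute your own model and parallel sizes.
$ python -m vllm.entrypoints.openai.api_server \
    --model=facebook/opt-13b \
    --dtype=float16 \
    --device=xpu \
    --distributed-executor-backend=ray \
    --pipeline-parallel-size=2 \
    --tensor-parallel-size=8
```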
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend starting a Ray cluster properly before execution; refer to the <gh-file:examples/run_cluster.sh> helper script. A rough manual alternative is sketched below.
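
As a rough illustration (not a substitute for the helper script), a Ray head node can be started manually before launching vLLM, with worker nodes joining it over Ray's default GCS port:

```console
$ # On the head node (see examples/run_cluster.sh for the full multi-node setup):
$ ray start --head
$ # On each worker node, join the head at its advertised address:
$ ray start --address=<head-node-ip>:6379
```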