vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
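If SkyPilot is not set up yet, a minimal sketch of installing it and launching the serving recipe looks like the following (the file name `serving.yaml` is an assumption; use the name of your own recipe file):

```console
# Install SkyPilot and confirm that at least one cloud (or Kubernetes) is enabled.
pip install skypilot-nightly
sky check

# Launch the serving recipe, passing your HuggingFace token for the gated model.
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
```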
Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the LLaMA model for text completion.
```console
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
```
**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:
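For example, the model name and GPU count can be overridden at launch time; a sketch, again assuming the recipe file is `serving.yaml`:

```console
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 \
  --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
```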
SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load balancing, and fault tolerance. You can do it by adding a `service` section to the YAML file.
```yaml
service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1
```
```{raw} html
<details>
<summary>Click to see the full recipe YAML</summary>
```
```yaml
service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512 # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081 # Expose to internet traffic.

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
```

```{raw} html
</details>
```
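With the `service` section in place, the replicated deployment can be brought up with SkyServe; a sketch, assuming the recipe is saved as `serving.yaml` and the service is named `vllm`:

```console
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN

# Check replica status and the single load-balanced endpoint of the service.
sky serve status vllm
```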
It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests sent to the GUI will be load balanced across replicas.
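A sketch of wiring such a GUI to the service, assuming the GUI recipe is saved as `gui.yaml` and the service above is named `vllm`:

```console
# Fetch the load-balanced endpoint of the vLLM service and pass it to the GUI task.
ENDPOINT=$(sky serve status --endpoint vllm)
sky launch -c gui gui.yaml --env ENDPOINT=$ENDPOINT
```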
```{raw} html
<details>
<summary>Click to see the full GUI YAML</summary>
```
```yaml
envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.