(deployment-lws)=
# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that addresses common deployment patterns of AI/ML inference workloads; a major use case is multi-host/multi-node distributed inference.

vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.

## Prerequisites

* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/).
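
To confirm the installation before moving on, you can check that the LWS CRD is registered and the controller is running; a quick sanity check, assuming the default installation namespace `lws-system`:

```bash
# The CRD name follows the API group used in the manifest below
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
# The controller pod should be Running
kubectl get pods -n lws-system
```
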
## Deploy and Serve

Deploy the following YAML file `lws.yaml`:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
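
A note on sizing: each group has `size: 2` pods with 8 GPUs apiece, i.e. 16 GPUs per group, which matches the 8 × 2 = 16 GPUs required by `--tensor-parallel-size 8` and `--pipeline-parallel-size 2`. With `replicas: 2`, two independent model replicas are created, which needs four 8-GPU nodes in total; with only the two-node minimum from the prerequisites, set `replicas: 1`.

The manifest also embeds the Hugging Face token in plain text. In practice you would typically keep it in a Kubernetes Secret and reference it from both container specs; a minimal sketch, assuming a Secret named `hf-token` with key `token` (both names are illustrative):

```bash
kubectl create secret generic hf-token --from-literal=token=<your-hf-token>
```

and in each container's `env`:

```yaml
- name: HUGGING_FACE_HUB_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token
```
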
```bash
kubectl apply -f lws.yaml
```
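
You can also inspect the LeaderWorkerSet resource itself; `kubectl` resolves the lowercase kind from the CRD installed above:

```bash
kubectl get leaderworkerset vllm
```
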
Verify the status of the pods:

```bash
kubectl get pods
```

You should see output similar to this:

```text
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
```
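
LWS names each group's leader `<name>-<group-index>` and its workers `<name>-<group-index>-<worker-index>`, so `vllm-0` and `vllm-1` are the two leaders and `vllm-0-1` and `vllm-1-1` their workers. Because the manifest labels leader pods with `role: leader`, you can list the leaders alone:

```bash
kubectl get pods -l role=leader
```
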
Verify that the distributed tensor-parallel inference works:

```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```

The output should look similar to this:

```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
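
Because the startup script joins the pods of a group into one Ray cluster, you can also verify from the leader that both nodes have registered. This is an optional sanity check; it assumes the `ray` CLI is on the container's PATH, which is the case for the `vllm/vllm-openai` image:

```bash
# Should list two nodes, one per pod in the group
kubectl exec vllm-0 -- ray status
```
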
## Access ClusterIP service

```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following:

```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
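
With the port-forward in place, you can first confirm that the server is reachable and the model is loaded; `/v1/models` is part of vLLM's OpenAI-compatible API:

```bash
curl http://localhost:8080/v1/models
```
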
## Serve the model

Open another terminal and send a request:

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```

The output should be similar to the following:

```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
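
Since the server exposes the OpenAI-compatible API, the same endpoint also serves chat completions, for example:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Name one landmark in San Francisco."}],
    "max_tokens": 32
  }'
```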