(deployment-lws)=

# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is multi-host/multi-node distributed inference.

vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.

## Prerequisites

* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/).

## Deploy and Serve

Deploy the following yaml file `lws.yaml`:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```

```bash
kubectl apply -f lws.yaml
```

Verify the status of the pods:

```bash
kubectl get pods
```

You should get output similar to this:

```text
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
```

Verify that the distributed tensor-parallel inference works:

```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```

You should see something similar to this:

```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```

## Access ClusterIP service

```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following:

```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```

## Serve the model

Open another terminal and send a request:

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
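While the request runs, you can stream the leader pod's logs in yet another terminal to watch it being processed (the exact log lines vary across vLLM versions):

```bash
# Follow the API server logs on the leader pod
kubectl logs -f vllm-0
```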
The output should be similar to the following:

```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
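Since the deployed model is an Instruct variant, you can also exercise the OpenAI-compatible chat endpoint at `/v1/chat/completions`; a minimal sketch of such a request (the prompt content is arbitrary):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "messages": [
      {"role": "user", "content": "What is San Francisco known for?"}
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
```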
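Note that `lws.yaml` above sets `HUGGING_FACE_HUB_TOKEN` as a plain-text `value`. In practice you would typically keep the token in a Kubernetes Secret and reference it from both containers via `valueFrom.secretKeyRef`; a sketch of creating such a Secret, where the name `hf-secret` is an arbitrary choice:

```bash
# Store the token in a Secret instead of hard-coding it in lws.yaml
kubectl create secret generic hf-secret \
  --from-literal=HUGGING_FACE_HUB_TOKEN=<your-hf-token>
```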
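## Clean up

When you are finished, delete the LeaderWorkerSet and its Service:

```bash
kubectl delete -f lws.yaml
```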