diff --git a/docs/source/deployment/frameworks/lws.md b/docs/source/deployment/frameworks/lws.md
index 349fa83f..4e9a03b5 100644
--- a/docs/source/deployment/frameworks/lws.md
+++ b/docs/source/deployment/frameworks/lws.md
@@ -7,5 +7,192 @@ A major use case is for multi-host/multi-node distributed inference.
 vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.
 
-Please see [this guide](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm) for more details on
-deploying vLLM on Kubernetes using LWS.
+## Prerequisites
+
+* At least two Kubernetes nodes, each with 8 GPUs, are required.
+* Install LWS by following the instructions [here](https://lws.sigs.k8s.io/docs/installation/).
+
+## Deploy and Serve
+
+Save the following YAML as `lws.yaml`:
+
+```yaml
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  name: vllm
+spec:
+  replicas: 2
+  leaderWorkerTemplate:
+    size: 2
+    restartPolicy: RecreateGroupOnPodRestart
+    leaderTemplate:
+      metadata:
+        labels:
+          role: leader
+      spec:
+        containers:
+          - name: vllm-leader
+            image: docker.io/vllm/vllm-openai:latest
+            env:
+              - name: HUGGING_FACE_HUB_TOKEN
+                value: <your-hf-token>
+            command:
+              - sh
+              - -c
+              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
+                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+                memory: 1124Gi
+                ephemeral-storage: 800Gi
+              requests:
+                ephemeral-storage: 800Gi
+                cpu: 125
+            ports:
+              - containerPort: 8080
+            readinessProbe:
+              tcpSocket:
+                port: 8080
+              initialDelaySeconds: 15
+              periodSeconds: 10
+            volumeMounts:
+              - mountPath: /dev/shm
+                name: dshm
+        volumes:
+          - name: dshm
+            emptyDir:
+              medium: Memory
+              sizeLimit: 15Gi
+    workerTemplate:
+      spec:
+        containers:
+          - name: vllm-worker
+            image: docker.io/vllm/vllm-openai:latest
+            command:
+              - sh
+              - -c
+              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+                memory: 1124Gi
+                ephemeral-storage: 800Gi
+              requests:
+                ephemeral-storage: 800Gi
+                cpu: 125
+            env:
+              - name: HUGGING_FACE_HUB_TOKEN
+                value: <your-hf-token>
+            volumeMounts:
+              - mountPath: /dev/shm
+                name: dshm
+        volumes:
+          - name: dshm
+            emptyDir:
+              medium: Memory
+              sizeLimit: 15Gi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-leader
+spec:
+  ports:
+    - name: http
+      port: 8080
+      protocol: TCP
+      targetPort: 8080
+  selector:
+    leaderworkerset.sigs.k8s.io/name: vllm
+    role: leader
+  type: ClusterIP
+```
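+
+The manifest above leaves the Hugging Face token as a placeholder. Rather than embedding the token in the manifest, one option is to read it from a Kubernetes Secret; the sketch below assumes a Secret named `hf-token-secret` with a key `token` (both names are illustrative, not part of the manifest above):
+
+```yaml
+# Hypothetical Secret reference; create the Secret first, e.g.:
+#   kubectl create secret generic hf-token-secret --from-literal=token=<your-hf-token>
+env:
+  - name: HUGGING_FACE_HUB_TOKEN
+    valueFrom:
+      secretKeyRef:
+        name: hf-token-secret  # illustrative Secret name
+        key: token             # illustrative key
+```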
+
+Then deploy it:
+
+```bash
+kubectl apply -f lws.yaml
+```
+
+Verify the status of the pods:
+
+```bash
+kubectl get pods
+```
+
+You should see output similar to the following:
+
+```text
+NAME       READY   STATUS    RESTARTS   AGE
+vllm-0     1/1     Running   0          2s
+vllm-0-1   1/1     Running   0          2s
+vllm-1     1/1     Running   0          2s
+vllm-1-1   1/1     Running   0          2s
+```
+
+Verify that the distributed tensor-parallel inference works:
+
+```bash
+kubectl logs vllm-0 | grep -i "Loading model weights took"
+```
+
+You should see something similar to the following:
+
+```text
+INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
+(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
+```
+
+## Access the ClusterIP service
+
+```bash
+# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
+kubectl port-forward svc/vllm-leader 8080:8080
+```
+
+The output should be similar to the following:
+
+```text
+Forwarding from 127.0.0.1:8080 -> 8080
+Forwarding from [::1]:8080 -> 8080
+```
+
+## Serve the model
+
+Open another terminal and send a request:
+
+```bash
+curl http://localhost:8080/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
+    "prompt": "San Francisco is a",
+    "max_tokens": 7,
+    "temperature": 0
+}'
+```
+
+The output should be similar to the following:
+
+```text
+{
+  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
+  "object": "text_completion",
+  "created": 1715138766,
+  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
+  "choices": [
+    {
+      "index": 0,
+      "text": " top destination for foodies, with",
+      "logprobs": null,
+      "finish_reason": "length",
+      "stop_reason": null
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 5,
+    "total_tokens": 12,
+    "completion_tokens": 7
+  }
+}
+```
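+
+Because vLLM's server is OpenAI-compatible, the chat endpoint can be exercised the same way while the port-forward is still running; a minimal sketch (the prompt and `max_tokens` value are arbitrary choices for illustration):
+
+```bash
+# Assumes the kubectl port-forward from above is still active
+curl http://localhost:8080/v1/chat/completions \
+-H "Content-Type: application/json" \
+-d '{
+    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
+    "messages": [{"role": "user", "content": "What is the capital of France?"}],
+    "max_tokens": 32,
+    "temperature": 0
+}'
+```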