(deployment-lws)=
# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that addresses common deployment patterns of AI/ML inference workloads; a major use case is multi-host/multi-node distributed inference.

vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.

## Prerequisites

* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/).
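
To confirm the installation before moving on, you can check that the LWS CRD is registered and the controller is running; a quick sanity check, assuming the default installation namespace `lws-system`:

```bash
# The CRD name follows the API group used in the manifest below
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
# The controller pod should be Running
kubectl get pods -n lws-system
```
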
## Deploy and Serve

Deploy the following YAML file `lws.yaml`:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
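
A note on sizing: each group has `size: 2` pods with 8 GPUs apiece, i.e. 16 GPUs per group, which matches the 8 × 2 = 16 GPUs required by `--tensor-parallel-size 8` and `--pipeline-parallel-size 2`. With `replicas: 2`, two independent model replicas are created, which needs four 8-GPU nodes in total; with only the two-node minimum from the prerequisites, set `replicas: 1`.

The manifest also embeds the Hugging Face token in plain text. In practice you would typically keep it in a Kubernetes Secret and reference it from both container specs; a minimal sketch, assuming a Secret named `hf-token` with key `token` (both names are illustrative):

```bash
kubectl create secret generic hf-token --from-literal=token=<your-hf-token>
```

and in each container's `env`:

```yaml
- name: HUGGING_FACE_HUB_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token
```
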
```bash
kubectl apply -f lws.yaml
```
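
You can also inspect the LeaderWorkerSet resource itself; `kubectl` resolves the lowercase kind from the CRD installed above:

```bash
kubectl get leaderworkerset vllm
```
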
Verify the status of the pods:

```bash
kubectl get pods
```

You should see output similar to this:

```text
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
```
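
LWS names each group's leader `<name>-<group-index>` and its workers `<name>-<group-index>-<worker-index>`, so `vllm-0` and `vllm-1` are the two leaders and `vllm-0-1` and `vllm-1-1` their workers. Because the manifest labels leader pods with `role: leader`, you can list the leaders alone:

```bash
kubectl get pods -l role=leader
```
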
Verify that the distributed tensor-parallel inference works:

```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```

The output should look similar to this:

```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
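
Because the startup script joins the pods of a group into one Ray cluster, you can also verify from the leader that both nodes have registered. This is an optional sanity check; it assumes the `ray` CLI is on the container's PATH, which is the case for the `vllm/vllm-openai` image:

```bash
# Should list two nodes, one per pod in the group
kubectl exec vllm-0 -- ray status
```
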
## Access ClusterIP service

```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following:

```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
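
With the port-forward in place, you can first confirm that the server is reachable and the model is loaded; `/v1/models` is part of vLLM's OpenAI-compatible API:

```bash
curl http://localhost:8080/v1/models
```
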
## Serve the model

Open another terminal and send a request:

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```

The output should be similar to the following:

```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
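
Since the server exposes the OpenAI-compatible API, the same endpoint also serves chat completions, for example:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Name one landmark in San Francisco."}],
    "max_tokens": 32
  }'
```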