(deployment-production-stack)=
# Production stack
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, the production stack is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.

If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!
## Prerequisites
Ensure that you have a running Kubernetes environment with GPUs (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
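
If you want to double-check that the cluster actually exposes GPUs to Kubernetes, one quick sanity check (assuming the NVIDIA device plugin is installed and `kubectl` points at your cluster) is to list the allocatable GPUs per node:

```bash
# List each node together with its allocatable NVIDIA GPUs (requires the NVIDIA device plugin).
sudo kubectl get nodes "-o=custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```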
## Deployment using vLLM production stack
The standard vLLM production stack install uses a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/tutorials/install-helm.sh) to install Helm on your GPU server.

To install the vLLM production stack, run the following commands on your desktop:
```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```
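
You can confirm that the Helm release was created before moving on; `vllm` is the release name used in the install command above:

```bash
# List Helm releases and show the status of the one we just installed.
sudo helm list
sudo helm status vllm
```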
This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (the `facebook/opt-125m` model).
### Validate Installation
Monitor the deployment status using:
```bash
sudo kubectl get pods
```
You should see the pods for the `vllm` deployment transition to the `Running` state:
```text
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s
```
**NOTE**: It may take some time for the containers to download the Docker images and LLM weights.
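
If a pod stays in `Pending` or `ContainerCreating` for a long time, you can inspect it directly. The pod name below is just an example taken from the output above; substitute the name shown in your own cluster:

```bash
# Show scheduling and image-pull events for the serving pod,
# then stream its logs to watch the model weights being downloaded.
sudo kubectl describe pod vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs
sudo kubectl logs -f vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs
```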
### Send a Query to the Stack
Forward the `vllm-router-service` port to the host machine:
```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
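
If the port-forward command reports that the service cannot be found, list the services created by the chart and check the router service's name first:

```bash
# List the services created by the Helm chart; the router service is the one to forward.
sudo kubectl get svc
```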
You can then send a query to the OpenAI-compatible API to check the available models:
```bash
curl -o- http://localhost:30080/models
```
Expected output:
```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
```
To send an actual text-generation request, you can issue a curl request to the OpenAI-compatible `/completions` endpoint:
```bash
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```
Expected output:
```json
{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
```
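
If you only care about the generated text, you can send the same request and extract it from the JSON response. This sketch assumes `jq` is installed on your machine:

```bash
# Send the same completion request and print only the generated text.
curl -s -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Once upon a time,", "max_tokens": 10}' \
  | jq -r '.choices[0].text'
```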
### Uninstall
To remove the deployment, run:
```bash
sudo helm uninstall vllm
```
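
Depending on the chart configuration, the persistent volume claim holding the downloaded model weights may remain after the release is removed. If you want a fully clean state, check for leftover PVCs and delete them manually; `<pvc-name>` is a placeholder for whatever name appears in your cluster:

```bash
# List any PersistentVolumeClaims left behind and delete them if no longer needed.
sudo kubectl get pvc
sudo kubectl delete pvc <pvc-name>
```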
------
### (Advanced) Configuring vLLM production stack
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 1

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "10Gi"
```
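
If you want to preview the Kubernetes objects a given values file will produce before installing anything, you can render the chart locally. This assumes the `vllm` Helm repository added earlier:

```bash
# Render the chart templates locally without touching the cluster.
sudo helm template vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml | less
```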
In this YAML configuration:
* **`modelSpec`** includes:
    * `name`: A nickname of your choice for the model.
    * `repository`: The Docker repository of the vLLM image.
    * `tag`: The Docker image tag.
    * `modelURL`: The LLM model that you want to serve.
* **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required.
* **`pvcStorage`**: Allocates persistent storage for the model.
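
To deploy a customized configuration, copy one of the tutorial values files, adjust these fields, and apply it to the release. The file name `my-values.yaml` below is only a placeholder:

```bash
# Upgrade (or install) the release using your customized values file.
sudo helm upgrade --install vllm vllm/vllm-stack -f my-values.yaml
```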
**NOTE:** If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).

**NOTE:** vLLM production stack offers many more features (*e.g.*, CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!