(deployment-k8s)=

# Using Kubernetes

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.

* [Deployment with CPUs](#deployment-with-cpus)
* [Deployment with GPUs](#deployment-with-gpus)

Alternatively, you can deploy vLLM to Kubernetes using any of the following:

* [Helm](frameworks/helm.md)
* [InftyAI/llmaz](integrations/llmaz.md)
* [KServe](integrations/kserve.md)
* [kubernetes-sigs/lws](frameworks/lws.md)
* [meta-llama/llama-stack](integrations/llamastack.md)
* [substratusai/kubeai](integrations/kubeai.md)
* [vllm-project/aibrix](https://github.com/vllm-project/aibrix)
* [vllm-project/production-stack](integrations/production-stack.md)

## Deployment with CPUs

:::{note}
The use of CPUs here is for demonstration and testing purposes only, and its performance will not be on par with GPUs.
:::

First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

```bash
cat <.
```

2. Create a Kubernetes Service for vLLM

   Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:

   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: mistral-7b
     namespace: default
   spec:
     ports:
     - name: http-mistral-7b
       port: 80
       protocol: TCP
       targetPort: 8000
     # The label selector should match the deployment labels and is useful for the prefix caching feature
     selector:
       app: mistral-7b
     sessionAffinity: None
     type: ClusterIP
   ```

3. Deploy and Test

   Apply the deployment and service configurations using `kubectl apply -f <filename>`:

   ```console
   kubectl apply -f deployment.yaml
   kubectl apply -f service.yaml
   ```

   To test the deployment, run the following `curl` command:

   ```console
   curl http://mistral-7b.default.svc.cluster.local/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "mistralai/Mistral-7B-Instruct-v0.3",
           "prompt": "San Francisco is a",
           "max_tokens": 7,
           "temperature": 0
         }'
   ```

   If the service is correctly deployed, you should receive a response from the vLLM model.

## Conclusion

Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
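The first step above applies a PVC and Secret inline via a heredoc. As a rough sketch of what such manifests can look like (the resource names `vllm-models` and `hf-token-secret`, the 50Gi size, and the placeholder token below are illustrative assumptions, not part of the original guide):

```yaml
# Illustrative sketch only: a PVC to cache downloaded model weights, plus a
# Secret holding a Hugging Face token. Names and sizes are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
```

The Secret is only needed for gated models; the vLLM Deployment would then mount the PVC at its model cache directory and expose the token to the container as an environment variable.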
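The `curl` test above assumes in-cluster DNS resolution for `mistral-7b.default.svc.cluster.local`. The same request can be issued from Python against the OpenAI-compatible API; this is a sketch, and the helper name `build_completion_request` and the `BASE_URL` choice are my own, not part of vLLM:

```python
import json
from urllib import request

# Assumes the in-cluster Service DNS name from the guide above.
BASE_URL = "http://mistral-7b.default.svc.cluster.local"

def build_completion_request(prompt,
                             model="mistralai/Mistral-7B-Instruct-v0.3",
                             max_tokens=7,
                             temperature=0):
    """Build a POST request for vLLM's OpenAI-compatible /v1/completions."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode()
    return request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Sending the request requires network access to the Service:
# with request.urlopen(build_completion_request("San Francisco is a")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

From outside the cluster, `kubectl port-forward service/mistral-7b 8080:80` would let you target `http://localhost:8080` as the base URL instead.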