Deploy an LLM with vLLM on Hyperstack Kubernetes
This guide outlines the process for deploying a Large Language Model (LLM) using vLLM on a Hyperstack Kubernetes virtual machine, covering everything from connecting to the bastion server to configuring vLLM and testing the model.
Hyperstack's on-demand Kubernetes is currently in Beta testing. See more details here: Kubernetes documentation.
Prerequisites
Create a Hyperstack Kubernetes cluster
- For instructions on how to create a Hyperstack Kubernetes cluster, click here.
1. Connect to bastion server
First, we need to connect to the bastion server. This server acts as a gateway to our Kubernetes cluster.
- Define variables
a. Set the BASTION_IP_ADDRESS to the IP address of the bastion server from your Kubernetes cluster in Hyperstack.
b. Set the KEYPAIR_PATH to the path of the SSH key pair on your local machine associated with the Kubernetes cluster.
BASTION_IP_ADDRESS="38.80.122.252"
KEYPAIR_PATH="../_ssh_keys/example-k8s-key_hyperstack"
Note: The script provided in this guide deploys the NousResearch version of Llama-3-8B, allowing you to get started without a HuggingFace account or token. If you prefer to use the original model by Meta, be sure to pass the HF_TOKEN environment variable (see the sketch below).
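If you do use the gated Meta model, one way (an assumption, not the only approach) to supply HF_TOKEN is through a Kubernetes Secret referenced from the vLLM container you will define in section 3. The Secret name hf-token-secret below is hypothetical; create it first, then add the env block to the container spec:
# Hypothetical Secret holding your HuggingFace token (run on the bastion, after creating the vllm-ns namespace in section 3)
kubectl -n vllm-ns create secret generic hf-token-secret --from-literal=token=<your_hf_token>
# Excerpt to add under the vLLM container in vllm_deployment.yaml (section 3)
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token-secret
      key: token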
- Remove SSH known hosts
To avoid potential SSH key conflicts, remove existing SSH known hosts:
rm -rf ~/.ssh/known_hosts
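If you would rather not delete the whole known_hosts file, a more targeted alternative is to remove only the entry for the bastion's IP:
ssh-keygen -R "$BASTION_IP_ADDRESS"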
- Connect to the bastion server
Use the following script to connect to the bastion server with port forwarding for the Kubernetes dashboard:
if [ ! -f $KEYPAIR_PATH ]; then
echo "Keypair not found at $KEYPAIR_PATH"
else
echo "Connecting to bastion server at $BASTION_IP_ADDRESS"
# Connect to the bastion host (with port forwarding for kubernetes dashboard)
ssh -i $KEYPAIR_PATH -L 8443:localhost:8443 ubuntu@$BASTION_IP_ADDRESS
fi
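Once connected, you can confirm that kubectl on the bastion can reach the cluster before continuing; for example:
# Run on the bastion: list the worker nodes (your GPU nodes should show as Ready)
kubectl get nodes -o wide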
2. Optional: install Kubernetes dashboard
The Kubernetes dashboard provides a web-based user interface to manage your Kubernetes resources. While optional, it can be very helpful.
- Open a new screen
To keep the Kubernetes dashboard running in the background, open a separate terminal session on your bastion server by running the command below:
screen
- Add Kubernetes dashboard repository
Add the repository for the Kubernetes dashboard:
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
- Deploy the Kubernetes dashboard
Deploy the dashboard using Helm:
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard
- Create bearer token
Generate a token to access the dashboard:
kubectl -n kubernetes-dashboard create serviceaccount kubernetes-dashboard
kubectl -n kubernetes-dashboard create token kubernetes-dashboard
- Add permissions
Assign the necessary permissions to the dashboard service account:
kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kubernetes-dashboard:kubernetes-dashboard
- Configure kubectl for dashboard access
Forward the dashboard service to access it locally:
kubectl -n kubernetes-dashboard port-forward svc/kubernetes-dashboard-kong-proxy 8443:443
You can detach from the screen session using Ctrl+A, D.
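With the SSH port forwarding from step 1 still active, the dashboard should now be reachable from your local machine at https://localhost:8443; log in with the bearer token generated above. If you need to get back to the background session or issue a fresh token:
# Reattach to the detached screen session
screen -r
# Issue a new login token for the dashboard if the previous one has expired
kubectl -n kubernetes-dashboard create token kubernetes-dashboard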
3. Configure vLLM
We will now set up and deploy vLLM in our Kubernetes cluster.
- Create vLLM namespace
Create a namespace for vLLM:
kubectl create ns vllm-ns
- Set deployment configuration
The following Kubernetes deployment configuration sets up a vllm application with four pod replicas using the vllm/vllm-openai:latest Docker image, running the vLLM API server with the NousResearch/Meta-Llama-3-8B-Instruct model. It includes a rolling update strategy to manage pod updates, liveness and readiness probes to ensure the application is healthy and ready to serve traffic, and allocates one GPU per pod (4 GPUs in total).
Note: You can adjust the number of replicas specified in the configuration below between 1 and 8, depending on the number of GPUs in your cluster. For example, a cluster with 4 GPUs, with 1 GPU per replica, should specify a replicas value of 4.
cat <<EOF > vllm_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - NousResearch/Meta-Llama-3-8B-Instruct
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
      volumes:
      - emptyDir: {}
        name: cache-volume
EOF
- Set service configuration
Create a service to expose vLLM:
cat <<EOF > vllm_service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: vllm-ns
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: ClusterIP
EOF
- Deploy vLLM
Apply the deployment and service configurations:
kubectl apply -f vllm_deployment.yaml
kubectl apply -f vllm_service.yaml
- Check deployment
Verify the deployment status:
kubectl describe deployments -n vllm-ns
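The pods may take several minutes to pull the image and download the model weights before the readiness probe passes. A few extra commands that can help while you wait (these use the deployment name vllm from the manifest above):
# Watch the pods until each reports READY 1/1
kubectl get pods -n vllm-ns -w
# Follow the logs to watch the model download and the API server start
kubectl logs -n vllm-ns deployment/vllm -f
# Example: change the replica count later (one GPU per replica)
kubectl scale deployment vllm -n vllm-ns --replicas=2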
- Port forward vLLM
Forward the vLLM service to access it locally:
kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns
After completing these steps, you will have a fully deployed Large Language Model (LLM) running on a Hyperstack Kubernetes virtual machine, accessible via a local API endpoint at http://localhost:8000.
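As a quick sanity check, you can hit the same /health endpoint used by the liveness and readiness probes from wherever the port-forward is running:
# Should return HTTP 200 once the model is loaded
curl -i http://localhost:8000/health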
4. Test the model
After deploying the LLM, you can test it to ensure it's working correctly. Here’s a simple guide to access the VM, send a request, and check the response.
- Access the VM via SSH
To access your Kubernetes virtual machine, you can establish an SSH connection in one of two ways: directly or through port forwarding to access it on your local machine.
Option A: Direct SSH access:
Execute the following command in a terminal to directly SSH into your VM:
ssh -i [path_to_ssh_key] [os_username]@[vm_ip_address]
- Replace '[path_to_ssh_key]' with the path to your SSH key.
- Replace '[os_username]' with the operating system's username on your VM. For instance: Administrator for Windows, ubuntu for Ubuntu.
- Replace '[vm_ip_address]' with your VM's public IP, available under the 'PUBLIC IP' column on the 'My Virtual Machines' page.
Option B: SSH with port forwarding:
To set up port forwarding to your local machine, include -L 8443:localhost:8443 in your SSH command. This forwards port 8443 from the VM to the same port on your local machine:
ssh -i [path_to_ssh_key] -L 8443:localhost:8443 [os_username]@[vm_ip_address]
For detailed instructions on connecting to your VM via SSH, see this complete guide.
- Send a test request
Use curl to send a test request to the model. Open your terminal and run the following command:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Meta-Llama-3-8B-Instruct",
"prompt": "Explain the benefits of Kubernetes.",
"max_tokens": 50
}'
- Check the response
You should receive a JSON response with the generated text. The response will look similar to this:
{
  "id": "example-id",
  "object": "text_completion",
  "created": 1612802563,
  "model": "NousResearch/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "total_tokens": 27
  }
}
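vLLM's OpenAI-compatible server also exposes /v1/models and /v1/chat/completions, so you can list the served model and send a chat-style request as well; for example:
# List the models served by this deployment
curl http://localhost:8000/v1/models
# Send the same prompt through the chat completions endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain the benefits of Kubernetes."}],
    "max_tokens": 50
  }'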
You have now successfully set up a Large Language Model (LLM) on Hyperstack Kubernetes, with its API running across multiple GPUs to ensure the availability of your AI application.
By default, Kubernetes uses a round-robin distribution of API calls to evenly distribute traffic, which is effective for simple, stateless applications. However, applications requiring session persistence, those managing uneven loads, or those needing advanced traffic management may require a more sophisticated approach.
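For example, if your application needs session persistence, one simple Kubernetes-level option (a sketch, not the only approach) is client-IP session affinity on the Service defined earlier, so repeated requests from the same client IP land on the same pod:
# Excerpt: add to the spec of vllm-openai-svc in vllm_service.yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # keep affinity for up to 3 hours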
For more information on advanced load balancing techniques, see Kubernetes Load Balancing documentation.