Deploy an LLM with vLLM on Hyperstack Kubernetes
This guide outlines the process for deploying a Large Language Model (LLM) using vLLM on a Hyperstack Kubernetes virtual machine, covering everything from connecting to the bastion server to configuring vLLM and testing the model.
Hyperstack's on-demand Kubernetes is currently in Beta testing. See more details here: Kubernetes documentation.
Prerequisites
Create a Hyperstack Kubernetes cluster
- For instructions on how to create a Hyperstack Kubernetes cluster, click here.
1. Connect to bastion server
First, we need to connect to the bastion server. This server acts as a gateway to our Kubernetes cluster.
- Define variables
a. Set the BASTION_IP_ADDRESS to the IP address of the bastion server from your Kubernetes cluster in Hyperstack.
b. Set the KEYPAIR_PATH to the path of the SSH key pair on your local machine associated with the Kubernetes cluster.
BASTION_IP_ADDRESS="38.80.122.252"
KEYPAIR_PATH="../_ssh_keys/example-k8s-key_hyperstack"
Note: The script provided in this guide deploys the NousResearch version of Llama-3-8B, allowing you to get started without a HuggingFace account or token. If you prefer to use the original model by Meta, be sure to pass the HF_TOKEN environment variable (see the sketch below).
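If you do use the gated Meta model, one way (an assumption, not the only approach) to supply HF_TOKEN is through a Kubernetes Secret referenced from the vLLM container you will define in section 3. The Secret name hf-token-secret below is hypothetical; create it first, then add the env block to the container spec:
# Hypothetical Secret holding your HuggingFace token (run on the bastion, after creating the vllm-ns namespace in section 3)
kubectl -n vllm-ns create secret generic hf-token-secret --from-literal=token=<your_hf_token>
# Excerpt to add under the vLLM container in vllm_deployment.yaml (section 3)
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token-secret
      key: token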
- Remove SSH known hosts
To avoid potential SSH key conflicts, remove existing SSH known hosts:
rm -rf ~/.ssh/known_hosts
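If you would rather not delete the whole known_hosts file, a more targeted alternative is to remove only the entry for the bastion's IP:
ssh-keygen -R "$BASTION_IP_ADDRESS"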
- Connect to the bastion server
Use the following script to connect to the bastion server with port forwarding for the Kubernetes dashboard:
if [ ! -f $KEYPAIR_PATH ]; then
echo "Keypair not found at $KEYPAIR_PATH"
else
echo "Connecting to bastion server at $BASTION_IP_ADDRESS"
# Connect to the bastion host (with port forwarding for kubernetes dashboard)
ssh -i $KEYPAIR_PATH -L 8443:localhost:8443 ubuntu@$BASTION_IP_ADDRESS
fi
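Once connected, you can confirm that kubectl on the bastion can reach the cluster before continuing; for example:
# Run on the bastion: list the worker nodes (your GPU nodes should show as Ready)
kubectl get nodes -o wide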
2. Optional: install Kubernetes dashboard
The Kubernetes dashboard provides a web-based user interface to manage your Kubernetes resources. While optional, it can be very helpful.
- Open a new screen
To keep the Kubernetes dashboard running in the background, open a separate terminal session on your bastion server by running the command below:
screen
- Add Kubernetes dashboard repository
Add the repository for the Kubernetes dashboard:
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
- Deploy the Kubernetes dashboard
Deploy the dashboard using Helm:
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard
- Create bearer token
Generate a token to access the dashboard:
kubectl -n kubernetes-dashboard create serviceaccount kubernetes-dashboard
kubectl -n kubernetes-dashboard create token kubernetes-dashboard
- Add permissions
Assign the necessary permissions to the dashboard service account:
kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kubernetes-dashboard:kubernetes-dashboard
- Configure kubectl for dashboard access
Forward the dashboard service to access it locally:
kubectl -n kubernetes-dashboard port-forward svc/kubernetes-dashboard-kong-proxy 8443:443
You can detach from the screen session using Ctrl+A, D.
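With the SSH port forwarding from step 1 still active, the dashboard should now be reachable from your local machine at https://localhost:8443; log in with the bearer token generated above. If you need to get back to the background session or issue a fresh token:
# Reattach to the detached screen session
screen -r
# Issue a new login token for the dashboard if the previous one has expired
kubectl -n kubernetes-dashboard create token kubernetes-dashboard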
3. Configure vLLM
We will now set up and deploy vLLM in our Kubernetes cluster.
- Create vLLM namespace
Create a namespace for vLLM:
kubectl create ns vllm-ns
- Set deployment configuration
The following Kubernetes deployment configuration sets up a vllm application with four pod replicas using the vllm/vllm-openai:latest Docker image, running the vLLM API server with the NousResearch/Meta-Llama-3-8B-Instruct model. It includes a rolling update strategy to manage pod updates, liveness and readiness probes to ensure the application is healthy and ready to serve traffic, and allocates one GPU per pod (4 GPUs in total).
Note: You can adjust the number of replicas specified in the configuration below between 1 and 8, depending on the number of GPUs in your cluster. For example, a cluster with 4 GPUs, with 1 GPU per replica, should specify a replicas value of 4.
cat <<EOF > vllm_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - NousResearch/Meta-Llama-3-8B-Instruct
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
      volumes:
      - emptyDir: {}
        name: cache-volume
EOF
- Set service configuration
Create a service to expose vLLM:
cat <<EOF > vllm_service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: vllm-ns
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: ClusterIP
EOF
- Deploy vLLM
Apply the deployment and service configurations:
kubectl apply -f vllm_deployment.yaml
kubectl apply -f vllm_service.yaml
- Check deployment
Verify the deployment status:
kubectl describe deployments -n vllm-ns
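The pods may take several minutes to pull the image and download the model weights before the readiness probe passes. A few extra commands that can help while you wait (these use the deployment name vllm from the manifest above):
# Watch the pods until each reports READY 1/1
kubectl get pods -n vllm-ns -w
# Follow the logs to watch the model download and the API server start
kubectl logs -n vllm-ns deployment/vllm -f
# Example: change the replica count later (one GPU per replica)
kubectl scale deployment vllm -n vllm-ns --replicas=2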
- Port forward vLLM
Forward the vLLM service to access it locally:
kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns
After completing these steps, you will have a fully deployed Large Language Model (LLM) running on a Hyperstack Kubernetes virtual machine, accessible via a local API endpoint at http://localhost:8000.
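As a quick sanity check, you can hit the same /health endpoint used by the liveness and readiness probes from wherever the port-forward is running:
# Should return HTTP 200 once the model is loaded
curl -i http://localhost:8000/health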
4. Test the model
After deploying the LLM, you can test it to ensure it's working correctly. Here’s a simple guide to access the VM, send a request, and check the response.
- Access the VM via SSH
To access your Kubernetes virtual machine, you can establish an SSH connection in one of two ways: directly or through port forwarding to access it on your local machine.
Option A: Direct SSH access:
Execute the following command in a terminal to directly SSH into your VM:
ssh -i [path_to_ssh_key] [os_username]@[vm_ip_address]
- Replace '[path_to_ssh_key]' with the path to your SSH key.
- Replace '[os_username]' with the operating system's username on your VM. For instance: Administrator for Windows, ubuntu for Ubuntu.
- Replace '[vm_ip_address]' with your VM's public IP, available under the 'PUBLIC IP' column on the 'My Virtual Machines' page.
Option B: SSH with port forwarding:
To set up port forwarding to your local machine, include -L 8443:localhost:8443 in your SSH command. This forwards port 8443 from the VM to the same port on your local machine:
ssh -i [path_to_ssh_key] -L 8443:localhost:8443 [os_username]@[vm_ip_address]
For detailed instructions on connecting to your VM via SSH, see this complete guide.
- Send a test request
Use curl to send a test request to the model. Open your terminal and run the following command:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Meta-Llama-3-8B-Instruct",
"prompt": "Explain the benefits of Kubernetes.",
"max_tokens": 50
}'
- Check the response
You should receive a JSON response with the generated text. The response will look similar to this:
{
  "id": "example-id",
  "object": "text_completion",
  "created": 1612802563,
  "model": "NousResearch/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "total_tokens": 27
  }
}
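vLLM's OpenAI-compatible server also exposes /v1/models and /v1/chat/completions, so you can list the served model and send a chat-style request as well; for example:
# List the models served by this deployment
curl http://localhost:8000/v1/models
# Send the same prompt through the chat completions endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain the benefits of Kubernetes."}],
    "max_tokens": 50
  }'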
You have now successfully set up a Large Language Model (LLM) on Hyperstack Kubernetes, with its API running across multiple GPUs to ensure the availability of your AI application.
By default, Kubernetes uses a round-robin distribution of API calls to evenly distribute traffic, which is effective for simple, stateless applications. However, applications requiring session persistence, those managing uneven loads, or those needing advanced traffic management may require a more sophisticated approach.
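For example, if your application needs session persistence, one simple Kubernetes-level option (a sketch, not the only approach) is client-IP session affinity on the Service defined earlier, so repeated requests from the same client IP land on the same pod:
# Excerpt: add to the spec of vllm-openai-svc in vllm_service.yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # keep affinity for up to 3 hours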
For more information on advanced load balancing techniques, see Kubernetes Load Balancing documentation.