
Deploy an LLM with vLLM on Hyperstack Kubernetes

This guide outlines the process for deploying a Large Language Model (LLM) using vLLM on a Hyperstack Kubernetes virtual machine, covering everything from connecting to the bastion server to configuring vLLM and testing the model.

Prerequisites

Create a Hyperstack Kubernetes cluster

  • For instructions on how to create a Hyperstack Kubernetes cluster, click here.

1. Connect to bastion server

First, we need to connect to the bastion server. This server acts as a gateway to our Kubernetes cluster.

  1. Define variables

    a. Set the BASTION_IP_ADDRESS to the IP address of the bastion server from your Kubernetes cluster in Hyperstack.
    b. Set the KEYPAIR_PATH to the path of the SSH key pair on your local machine associated with the Kubernetes cluster.

    BASTION_IP_ADDRESS="38.80.122.252"
    KEYPAIR_PATH="../_ssh_keys/example-k8s-key_hyperstack"
    note

    The script provided in this guide deploys the NousResearch version of Llama-3-8B, allowing you to get started without a HuggingFace account or token. If you prefer to use the original model by Meta, be sure to pass the HF_TOKEN environment variable.

  2. Remove SSH known hosts

    To avoid potential SSH host key conflicts, remove the existing known_hosts file:

    rm -f ~/.ssh/known_hosts
  3. Connect to the bastion server

    Use the following script to connect to the bastion server with port forwarding for the Kubernetes dashboard:

    if [ ! -f "$KEYPAIR_PATH" ]; then
      echo "Keypair not found at $KEYPAIR_PATH"
    else
      echo "Connecting to bastion server at $BASTION_IP_ADDRESS"
      # Connect to the bastion host (with port forwarding for the Kubernetes dashboard)
      ssh -i "$KEYPAIR_PATH" -L 8443:localhost:8443 ubuntu@$BASTION_IP_ADDRESS
    fi
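
    Once connected, you can confirm that kubectl on the bastion server can reach the cluster before continuing, for example by listing the nodes:

    kubectl get nodes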

2. Optional: install Kubernetes dashboard

The Kubernetes dashboard provides a web-based user interface to manage your Kubernetes resources. While optional, it can be very helpful.

  1. Open a new screen

    To keep the Kubernetes dashboard running in the background, start a screen session on the bastion server:

    screen
  2. Add Kubernetes dashboard repository

    Add the repository for the Kubernetes dashboard:

    helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
  3. Deploy the Kubernetes dashboard

    Deploy the dashboard using Helm:

    helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard
  4. Create bearer token

    Generate a token to access the dashboard:

    kubectl -n kubernetes-dashboard create serviceaccount kubernetes-dashboard
    kubectl -n kubernetes-dashboard create token kubernetes-dashboard
  5. Add permissions

    Assign the necessary permissions to the dashboard service account:

    kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kubernetes-dashboard:kubernetes-dashboard
  6. Configure kubectl for dashboard access

    Forward the dashboard service to access it locally:

    kubectl -n kubernetes-dashboard port-forward svc/kubernetes-dashboard-kong-proxy 8443:443

    You can detach from the screen session using Ctrl+A, D.
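
    With the SSH port forwarding from step 1 still in place, the dashboard should now be reachable from your local machine at https://localhost:8443; log in with the bearer token generated in step 4 (your browser may warn about the dashboard's self-signed certificate). To return to the background session later, for example to restart the port forward, list your screen sessions and reattach:

    screen -ls
    screen -r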

3. Configure vLLM

We will now set up and deploy vLLM in our Kubernetes cluster.

  1. Create vLLM namespace

    Create a namespace for vLLM:

    kubectl create ns vllm-ns
  2. Set deployment configuration

    The following Kubernetes deployment configuration sets up a vLLM application with four pod replicas using the vllm/vllm-openai:latest Docker image, running the vLLM OpenAI-compatible API server with the NousResearch/Meta-Llama-3-8B-Instruct model. It uses a rolling update strategy to manage pod updates, defines liveness and readiness probes to ensure the application is healthy and ready to serve traffic, and allocates one GPU per pod (4 GPUs in total).

    note

    You can adjust the number of replicas specified in the configuration below (between 1 and 8) to match the number of GPUs in your cluster. For example, a cluster with 4 GPUs, at 1 GPU per replica, should set replicas to 4.

    cat <<EOF > vllm_deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-app
      name: vllm
      namespace: vllm-ns
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: vllm-app
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: vllm-app
        spec:
          containers:
            - command:
                - python3
                - -m
                - vllm.entrypoints.openai.api_server
                - --model
                - NousResearch/Meta-Llama-3-8B-Instruct
              image: vllm/vllm-openai:latest
              imagePullPolicy: Always
              livenessProbe:
                failureThreshold: 3
                httpGet:
                  path: /health
                  port: 8000
                  scheme: HTTP
                initialDelaySeconds: 240
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              name: vllm-openai
              ports:
                - containerPort: 8000
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                httpGet:
                  path: /health
                  port: 8000
                  scheme: HTTP
                initialDelaySeconds: 240
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                requests:
                  nvidia.com/gpu: "1"
              volumeMounts:
                - mountPath: /root/.cache/huggingface
                  name: cache-volume
          volumes:
            - emptyDir: {}
              name: cache-volume
    EOF
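
    If you switch to Meta's original gated model rather than the NousResearch mirror (see the note in section 1), the pods also need the HF_TOKEN environment variable mentioned there, and the --model argument above must point at the gated model's name. One possible way to supply the token, sketched below with an illustrative secret name, is to store it in a Kubernetes Secret and inject it into the deployment once the deployment has been applied in step 4:

    # Store your Hugging Face token in a Secret (replace <your_hf_token>; the name hf-token-secret is illustrative)
    kubectl -n vllm-ns create secret generic hf-token-secret --from-literal=HF_TOKEN=<your_hf_token>

    # Expose the Secret's keys (here, HF_TOKEN) as environment variables on the vllm deployment
    kubectl -n vllm-ns set env deployment/vllm --from=secret/hf-token-secret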
  3. Set service configuration

    Create a service to expose vLLM:

    cat <<EOF > vllm_service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: vllm-app
      name: vllm-openai-svc
      namespace: vllm-ns
    spec:
      ports:
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: vllm-app
      type: ClusterIP
    EOF
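
    Optionally, you can check that both manifests parse correctly before applying them by doing a client-side dry run:

    kubectl apply --dry-run=client -f vllm_deployment.yaml -f vllm_service.yaml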
  4. Deploy vLLM

    Apply the deployment and service configurations:

    kubectl apply -f vllm_deployment.yaml
    kubectl apply -f vllm_service.yaml
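
    If you prefer a command that blocks until every replica has been rolled out, you can also wait on the deployment:

    kubectl -n vllm-ns rollout status deployment/vllm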
  5. Check deployment

    Verify the deployment status:

    kubectl describe deployments -n vllm-ns
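
    The pods can take several minutes to become ready on first start, since each one downloads the model weights into its cache volume (hence the 240-second initial probe delay). You can watch their status with:

    kubectl get pods -n vllm-ns -w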
  6. Port forward vLLM

    Forward the vLLM service to access it locally:

    kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns

    After completing these steps, you will have a fully deployed Large Language Model (LLM) running on a Hyperstack Kubernetes virtual machine, accessible via a local API endpoint at http://localhost:8000.
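
    Before running a full test, you can optionally confirm that the API is responding by listing the models it serves through the OpenAI-compatible endpoint:

    curl http://localhost:8000/v1/models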

4. Test the model

After deploying the LLM, you can test it to ensure it's working correctly. Here’s a simple guide to access the VM, send a request, and check the response.

  1. Access the VM via SSH

    To access your Kubernetes virtual machine, you can establish an SSH connection in one of two ways: directly, or with port forwarding so that services on the VM can be reached from your local machine.

    Option A: Direct SSH access:

    Execute the following command in a terminal to directly SSH into your VM:

    ssh -i [path_to_ssh_key] [os_username]@[vm_ip_address]
    • Replace '[path_to_ssh_key]' with the path to your SSH key.
    • Replace '[os_username]' with the operating system's username on your VM. For instance:
      • Administrator for Windows
      • ubuntu for Ubuntu
    • Replace '[vm_ip_address]' with your VM's public IP, available under the 'PUBLIC IP' column on the 'My Virtual Machines' page.

    Option B: SSH with port forwarding:

    To set up port forwarding to your local machine, include -L 8443:localhost:8443 in your SSH command. This forwards port 8443 (used by the Kubernetes dashboard) from the VM to the same port on your local machine; add -L 8000:localhost:8000 as well if you also want to reach the vLLM API from your local machine:

    ssh -i [path_to_ssh_key] -L 8443:localhost:8443 [os_username]@[vm_ip_address]

    For detailed instructions on connecting to your VM via SSH, see this complete guide.

  2. Send a test request

    Use curl to send a test request to the model. Open your terminal and run the following command:

    curl -X POST http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": "Explain the benefits of Kubernetes.",
        "max_tokens": 50
      }'
  3. Check the response

    You should receive a JSON response with the generated text. The response will look similar to this:

    {
      "id": "example-id",
      "object": "text_completion",
      "created": 1612802563,
      "model": "NousResearch/Meta-Llama-3-8B-Instruct",
      "choices": [
        {
          "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
          "index": 0,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "total_tokens": 27
      }
    }
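
    Because Meta-Llama-3-8B-Instruct is an instruction-tuned model, you can also query vLLM's OpenAI-compatible chat endpoint; the request below is just an example prompt:

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain the benefits of Kubernetes."}],
        "max_tokens": 50
      }'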

