NIM on GKE

Before you begin

Before you proceed further, ensure you have an NVIDIA AI Enterprise (NVAIE) license to access the NIMs. To get started, go to build.nvidia.com and provide your company email address.

  1. Get access to NVIDIA NIMs

  2. In the Google Cloud console, on the project selector page, select or create a new project with billing enabled

  3. Ensure you have the following tools installed on your workstation: the gcloud CLI, kubectl, Helm, the NGC CLI, git, curl, and jq (a quick version check is sketched after this list)

  4. Enable the required APIs

    gcloud services enable \
      container.googleapis.com \
      file.googleapis.com
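
A quick way to confirm the tools from step 3 are installed and on your PATH (an optional sanity check; adjust if your workstation differs):

    # Print the version of each CLI used later in this guide.
    gcloud version
    kubectl version --client
    helm version
    ngc --version
    git --version
    curl --version
    jq --version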
    

Set up your GKE Cluster

  1. Choose your region and set your project and machine variables:

    export PROJECT_ID=$(gcloud config get project)
    export REGION=us-central1
    export ZONE=${REGION?}-a
    
  2. Create a GKE cluster:

    gcloud container clusters create nim-demo --location ${REGION?} \
      --workload-pool=${PROJECT_ID?}.svc.id.goog \
      --enable-image-streaming \
      --enable-ip-alias \
      --node-locations ${ZONE?} \
      --addons=GcpFilestoreCsiDriver \
      --machine-type n2d-standard-4 \
      --enable-autoscaling \
      --num-nodes 1 --min-nodes 1 --max-nodes 5 \
      --ephemeral-storage-local-ssd=count=2 \
      --labels=created-by=ai-on-gke,guide=nim-on-gke
    
  3. Get cluster credentials

    gcloud container clusters get-credentials nim-demo --location ${REGION?}
    
  4. Create a node pool

    gcloud container node-pools create g2-standard-24 --cluster nim-demo \
      --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
      --machine-type g2-standard-24 \
      --ephemeral-storage-local-ssd=count=2 \
      --enable-image-streaming \
      --enable-autoscaling \
      --num-nodes=1 --min-nodes=1 --max-nodes=2 \
      --node-locations ${REGION?}-a,${REGION?}-b --region ${REGION?}
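
Before moving on, you can optionally confirm that the cluster is up and that the L4 nodes registered with schedulable GPUs (a quick check using the standard GKE accelerator label and the nvidia.com/gpu resource name; nothing here is specific to this guide):

    # Confirm the cluster reports RUNNING.
    gcloud container clusters describe nim-demo --location ${REGION?} --format="value(status)"

    # List the GPU nodes by their accelerator label and check allocatable GPUs.
    kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4
    kubectl describe nodes -l cloud.google.com/gke-accelerator=nvidia-l4 | grep -i "nvidia.com/gpu"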
    

Set up access to NVIDIA NIMs and prepare the environment

If you have not set up NGC, see NGC Setup to get your access key and begin using NGC.

  1. Get your API key from NGC and export it

    export NGC_CLI_API_KEY="<YOUR_API_KEY>"
    
  2. As part of the NGC setup, configure the NGC CLI

    ngc config set
    
  3. Ensure you have access to the repository by listing the models

    ngc registry model list
    
  4. Create a Kubernetes namespace

    kubectl create namespace nim
    

Deploy a PVC to persist the model

This PVC will dynamically provision a PV with the necessary storage to persist model weights across replicas of your pods.

  1. Create a PVC to persist the model weights; this is recommended for deployments with more than one replica. Save the following YAML as pvc.yaml.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-store-pvc
      namespace: nim
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 30Gi
      storageClassName: standard-rwx
    
  2. Apply PVC

    kubectl apply -f pvc.yaml
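
You can check that the claim was created; depending on the storage class's volume binding mode it may stay Pending until the first pod mounts it, at which point the Filestore CSI driver provisions the backing volume:

    # Inspect the PVC status (Bound or Pending).
    kubectl get pvc model-store-pvc --namespace nim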
    

Deploy the NIM with the generated engine using a Helm chart

  1. Clone the nim-deploy repository

    git clone https://github.com/NVIDIA/nim-deploy.git
    cd nim-deploy/helm
    
  2. Deploy chart with minimal configurations

    helm --namespace nim install demo-nim nim-llm/ \
      --set model.ngcAPIKey=$NGC_CLI_API_KEY \
      --set persistence.enabled=true \
      --set persistence.existingClaim=model-store-pvc
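
While the chart comes up, you can watch the pods and follow the model download with plain kubectl (substitute the pod name printed by the first command):

    # Watch the NIM pods until they report Ready.
    kubectl get pods --namespace nim --watch

    # Stream logs from a pod to follow the model download and startup.
    kubectl logs --namespace nim <pod-name> --follow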
    

Test the NIM

Expect the demo-nim deployment to take a few minutes as the Llama3 model downloads.

  1. Expose the service

    kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000
    
  2. Send a test prompt

    curl -X 'POST' \
      'http://localhost:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "messages": [
        {
          "content": "You are a polite and respectful poet.",
          "role": "system"
        },
        {
          "content": "Write a limerick about the wonders of GPUs and Kubernetes?",
          "role": "user"
        }
      ],
      "model": "meta/llama3-8b-instruct",
      "max_tokens": 256,
      "top_p": 1,
      "n": 1,
      "stream": false,
      "frequency_penalty": 0.0
    }' | jq '.choices[0].message.content' -
    
  3. Browse the API by navigating to http://localhost:8000/docs
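
Besides the chat completion request and the interactive docs, you can list the models the endpoint serves, which is a quick way to confirm the model name used in the request above (NIM's OpenAI-compatible API also exposes /v1/models):

    curl -s 'http://localhost:8000/v1/models' | jq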

Clean up

Remove the cluster and deployment by running the following command:

gcloud container clusters delete nim-demo --location ${REGION}
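
Optionally, before deleting the cluster, you can uninstall the Helm release and delete the PVC so the dynamically provisioned Filestore share is released explicitly (assuming the names used in this guide):

    # Uninstall the NIM release and delete the PVC before tearing down the cluster.
    helm --namespace nim uninstall demo-nim
    kubectl delete pvc model-store-pvc --namespace nim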
