NIM on GKE

Before you begin

Before you proceed further, ensure you have an NVIDIA AI Enterprise (NVAIE) license to access the NIMs. To get started, go to build.nvidia.com and provide your company email address.

  1. Get access to NVIDIA NIMs

  2. In the Google Cloud console, on the project selector page, select or create a new project with billing enabled

  3. Ensure you have the following tools installed on your workstation: the gcloud CLI, kubectl, Helm, the NGC CLI, git, curl, and jq (a quick version check is sketched after this list)

  4. Enable the required APIs

    gcloud services enable \
      container.googleapis.com \
      file.googleapis.com
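
A quick way to confirm the tools from step 3 are installed and on your PATH (an optional sanity check; adjust if your workstation differs):

    # Print the version of each CLI used later in this guide.
    gcloud version
    kubectl version --client
    helm version
    ngc --version
    git --version
    curl --version
    jq --version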
    

Set up your GKE Cluster

  1. Choose your region and set your project and machine variables:

    export PROJECT_ID=$(gcloud config get project)
    export REGION=us-central1
    export ZONE=${REGION?}-a
    
  2. Create a GKE cluster:

    gcloud container clusters create nim-demo --location ${REGION?} \
      --workload-pool=${PROJECT_ID?}.svc.id.goog \
      --enable-image-streaming \
      --enable-ip-alias \
      --node-locations ${ZONE?} \
      --addons=GcpFilestoreCsiDriver \
      --machine-type n2d-standard-4 \
      --enable-autoscaling \
      --num-nodes 1 --min-nodes 1 --max-nodes 5 \
      --ephemeral-storage-local-ssd=count=2 \
      --labels=created-by=ai-on-gke,guide=nim-on-gke
    
  3. Get cluster credentials

    gcloud container clusters get-credentials nim-demo --location ${REGION?}
    
  4. Create a node pool

    gcloud container node-pools create g2-standard-24 --cluster nim-demo \
      --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
      --machine-type g2-standard-24 \
      --ephemeral-storage-local-ssd=count=2 \
      --enable-image-streaming \
      --enable-autoscaling \
      --num-nodes=1 --min-nodes=1 --max-nodes=2 \
      --node-locations ${REGION?}-a,${REGION?}-b --region ${REGION?}
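
Before moving on, you can optionally confirm that the cluster is up and that the L4 nodes registered with schedulable GPUs (a quick check using the standard GKE accelerator label and the nvidia.com/gpu resource name; nothing here is specific to this guide):

    # Confirm the cluster reports RUNNING.
    gcloud container clusters describe nim-demo --location ${REGION?} --format="value(status)"

    # List the GPU nodes by their accelerator label and check allocatable GPUs.
    kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4
    kubectl describe nodes -l cloud.google.com/gke-accelerator=nvidia-l4 | grep -i "nvidia.com/gpu"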
    

Set up access to NVIDIA NIMs and prepare the environment

If you have not set up NGC, see NGC Setup to get your access key and begin using NGC.

  1. Get your API key from NGC and export it

    export NGC_CLI_API_KEY="<YOUR_API_KEY>"
    
  2. As part of the NGC setup, configure the NGC CLI

    ngc config set
    
  3. Ensure you have access to the repository by listing the models

    ngc registry model list
    
  4. Create a Kubernetes namespace

    kubectl create namespace nim
    

Deploy a PVC to persist the model

This PVC will dynamically provision a PV with the necessary storage to persist model weights across replicas of your pods.

  1. Create a PVC to persist the model weights; this is recommended for deployments with more than one replica. Save the following YAML as pvc.yaml.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-store-pvc
      namespace: nim
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 30Gi
      storageClassName: standard-rwx
    
  2. Apply PVC

    kubectl apply -f pvc.yaml
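
You can check that the claim was created; depending on the storage class's volume binding mode it may stay Pending until the first pod mounts it, at which point the Filestore CSI driver provisions the backing volume:

    # Inspect the PVC status (Bound or Pending).
    kubectl get pvc model-store-pvc --namespace nim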
    

Deploy the NIM with the generated engine using a Helm chart

  1. Clone the nim-deploy repository

    git clone https://github.com/NVIDIA/nim-deploy.git
    cd nim-deploy/helm
    
  2. Deploy chart with minimal configurations

    helm --namespace nim install demo-nim nim-llm/ \
      --set model.ngcAPIKey=$NGC_CLI_API_KEY \
      --set persistence.enabled=true \
      --set persistence.existingClaim=model-store-pvc
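
While the chart comes up, you can watch the pods and follow the model download with plain kubectl (substitute the pod name printed by the first command):

    # Watch the NIM pods until they report Ready.
    kubectl get pods --namespace nim --watch

    # Stream logs from a pod to follow the model download and startup.
    kubectl logs --namespace nim <pod-name> --follow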
    

Test the NIM

Expect the demo-nim deployment to take a few minutes as the Llama3 model downloads.

  1. Expose the service

    kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000
    
  2. Send a test prompt

    curl -X 'POST' \
      'http://localhost:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "messages": [
        {
          "content": "You are a polite and respectful poet.",
          "role": "system"
        },
        {
          "content": "Write a limerick about the wonders of GPUs and Kubernetes?",
          "role": "user"
        }
      ],
      "model": "meta/llama3-8b-instruct",
      "max_tokens": 256,
      "top_p": 1,
      "n": 1,
      "stream": false,
      "frequency_penalty": 0.0
    }' | jq '.choices[0].message.content' -
    
  3. Browse the API by navigating to http://localhost:8000/docs
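
Besides the chat completion request and the interactive docs, you can list the models the endpoint serves, which is a quick way to confirm the model name used in the request above (NIM's OpenAI-compatible API also exposes /v1/models):

    curl -s 'http://localhost:8000/v1/models' | jq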

Clean up

Remove the cluster and deployment by running the following command:

gcloud container clusters delete nim-demo --location ${REGION}
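
Optionally, before deleting the cluster, you can uninstall the Helm release and delete the PVC so the dynamically provisioned Filestore share is released explicitly (assuming the names used in this guide):

    # Uninstall the NIM release and delete the PVC before tearing down the cluster.
    helm --namespace nim uninstall demo-nim
    kubectl delete pvc model-store-pvc --namespace nim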
