RAG on GKE
This is a sample for deploying a Retrieval Augmented Generation (RAG) application on GKE.
The latest recommended release is branch release-1.1.
What is RAG?
RAG is a popular approach for boosting the accuracy of LLM responses, particularly for domain-specific or private data sets.
RAG uses a semantically searchable knowledge base (like vector search) to retrieve relevant snippets for a given prompt to provide additional context to the LLM. Augmenting the knowledge base with additional data is typically cheaper than fine-tuning and is more scalable when incorporating current events and other rapidly changing data.
RAG on GKE Architecture
- A GKE service endpoint serving Hugging Face TGI inference using mistral-7b.
- A Cloud SQL pgvector instance with vector embeddings generated from an input dataset.
- A Ray cluster running on GKE that runs jobs to generate embeddings and populate the vector DB.
- A Jupyter notebook running on GKE that reads the dataset using GCS fuse driver integrations and runs a Ray job to populate the vector DB.
- A front end chat interface running on GKE that prompts the inference server with context from the vector DB.
This tutorial walks you through installing the RAG infrastructure in a GCP project, generating vector embeddings for a sample Kaggle Netflix shows dataset and prompting the LLM with context.
Prerequisites
Install tooling (required)
Install the following on your computer:
- Terraform
- Gcloud CLI
- kubectl
Bring your own cluster (optional)
By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.
If you prefer to manage your own cluster, set create_cluster = false and make sure network_name is set to your cluster’s network in the Installation section. Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.
Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features.
gcloud container clusters create-auto rag-cluster \
--location us-central1 \
--labels=created-by=ai-on-gke,guide=rag-on-gke
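If you bring your own cluster, fetch its credentials so the kubectl commands later in this guide can reach it:
gcloud container clusters get-credentials rag-cluster --location us-central1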
Bring your own VPC (optional)
By default, this tutorial creates a new network on your behalf with Private Service Connect already enabled. We highly recommend following the default settings.
If you prefer to use your own VPC, set create_network = false in the Installation section. This also requires enabling Private Service Connect for your VPC. Without Private Service Connect, the RAG components cannot connect to the vector DB.
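As a rough sanity check (the exact requirements depend on how your Cloud SQL connectivity is configured), you can list existing service networking peerings on your VPC, which private Cloud SQL connectivity commonly relies on; <your network name> is a placeholder:
gcloud services vpc-peerings list --network=<your network name>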
Installation
This section sets up the RAG infrastructure in your GCP project using Terraform.
Terraform keeps state metadata in a local file called terraform.tfstate. Deleting the file may cause some resources to not be cleaned up correctly even if you delete the cluster. We suggest using terraform destroy before reapplying/reinstalling.
- If needed, clone the repo:
git clone https://github.com/ai-on-gke/quick-start-guides
cd quick-start-guides/rag
- Edit workloads.tfvars to set your project ID, location, cluster name, and GCS bucket name (a sample file is sketched below). Ensure the gcs_bucket name is globally unique (add a random suffix). Optionally, make the following changes:
  - (Recommended) Enable authenticated access for JupyterHub, frontend chat and Ray dashboard services.
  - (Optional) Set a custom kubernetes_namespace where all k8s resources will be created.
  - (Optional) Set autopilot_cluster = false to deploy using GKE Standard.
  - (Optional) Set create_cluster = false if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in infrastructure/README.md.
  - (Optional) Set create_network = false if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
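For reference, a minimal workloads.tfvars might look like the following sketch. The values are illustrative, and project_id / cluster_name are assumed variable names; check the comments in the file itself for the exact set of variables:
project_id             = "your-gcp-project"         # assumed variable name, illustrative value
cluster_name           = "rag-cluster"              # assumed variable name
cluster_location       = "us-central1"
kubernetes_namespace   = "rag"
gcs_bucket             = "rag-data-<random-suffix>"  # must be globally unique
autopilot_cluster      = true
create_cluster         = true
create_network         = true
jupyter_add_auth       = true
frontend_add_auth      = true
ray_dashboard_add_auth = true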
- Initialize the Terraform template:
terraform init
- Run the Terraform creation template:
terraform apply --var-file=./workloads.tfvars
Creation of the Kubernetes cluster, network and all other required components can take up to 10 minutes. Check the terraform apply logs to get the latest status.
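Once the apply finishes, you can list the generated outputs (passwords, URIs and IP addresses referenced later in this guide) with:
terraform output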
Generate vector embeddings for the dataset
This section generates the vector embeddings for your input dataset. Currently, the default dataset is Netflix shows. We will use a Jupyter notebook to run a Ray job that generates the embeddings and populates them into the pgvector instance created above.
Set the namespace, cluster name and location from workloads.tfvars:
export NAMESPACE=rag
export CLUSTER_LOCATION=us-central1
export CLUSTER_NAME=rag-cluster
Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
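Before continuing, you can confirm the RAG components came up (exact pod names depend on your configuration):
kubectl get pods -n ${NAMESPACE}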
- Connect and login to JupyterHub:
  - If IAP is disabled (jupyter_add_auth = false):
    - Port forward to the JupyterHub service:
    kubectl port-forward service/proxy-public -n ${NAMESPACE} 8081:80 &
    - Go to localhost:8081 in a browser.
    - Login with these credentials:
      - Username: admin
      - Password: Use terraform output jupyterhub_password to fetch the password value.
  - If IAP is enabled (jupyter_add_auth = true):
    - Fetch the domain: terraform output jupyterhub_uri
    - If you used a custom domain, ensure you configured your DNS as described above.
    - Verify the domain status is Active. Note: This can take up to 20 minutes to propagate.
    kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
    - To add additional users to your JupyterHub application, go to Google Cloud Platform IAP, select the rag/proxy-public service and add principals with the role IAP-secured Web App User.
- Load the notebook:
  - Once logged in to JupyterHub, choose the CPU preset with Default storage.
  - Click File -> Open From URL and paste: https://github.com/ai-on-gke/quick-start-guides/rag/example_notebooks/rag-kaggle-ray-sql-interactive.ipynb
- Configure Kaggle:
  - Create a Kaggle account.
  - Generate an API token. See further instructions. This token is used in the notebook to access the Kaggle Netflix shows dataset.
  - Replace the variables in the 1st cell of the notebook with your Kaggle credentials (found in the kaggle.json file created while generating the API token):
    - KAGGLE_USERNAME
    - KAGGLE_KEY
- Generate vector embeddings: Run all the cells in the notebook to generate vector embeddings for the Netflix shows dataset (https://www.kaggle.com/datasets/shivamb/netflix-shows) via a Ray job and store them in the pgvector CloudSQL instance.
  - When the last cell succeeds, the vector embeddings have been generated and we can launch the frontend chat interface. Note that the Ray job can take up to 10 minutes to finish.
  - Ray may take several minutes to create the runtime environment. During this time, the job will appear to be missing (e.g. Status message: PENDING).
  - Connect to the Ray dashboard to check the job status or logs (alternatively, use the Ray CLI as sketched after this list):
    - If IAP is disabled (ray_dashboard_add_auth = false):
      - Port forward to the Ray dashboard service:
      kubectl port-forward -n ${NAMESPACE} service/ray-cluster-kuberay-head-svc 8265:8265
      - Go to localhost:8265 in a browser.
    - If IAP is enabled (ray_dashboard_add_auth = true):
      - Fetch the domain: terraform output ray-dashboard-managed-cert
      - If you used a custom domain, ensure you configured your DNS as described above.
      - Verify the domain status is Active. Note: This can take up to 20 minutes to propagate.
      kubectl get managedcertificates ray-dashboard-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
      - Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
      - To add additional users to your Ray dashboard, go to Google Cloud Platform IAP, select the rag/ray-cluster-kuberay-head-svc service and add principals with the role IAP-secured Web App User.
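As an alternative to the dashboard UI, if you have the Ray CLI installed locally you can list jobs through the same port-forwarded address (a sketch; it assumes the port-forward shown above is running):
ray job list --address http://localhost:8265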
Launch the frontend chat interface
- Connect to the frontend:
  - If IAP is disabled (frontend_add_auth = false):
    - Port forward to the frontend service:
    kubectl port-forward service/rag-frontend -n ${NAMESPACE} 8080:8080 &
    - Go to localhost:8080 in a browser.
  - If IAP is enabled (frontend_add_auth = true):
    - Fetch the domain: terraform output frontend_uri
    - If you used a custom domain, ensure you configured your DNS as described above.
    - Verify the domain status is Active. Note: This can take up to 20 minutes to propagate.
    kubectl get managedcertificates frontend-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
    - To add additional users to your frontend application, go to Google Cloud Platform IAP, select the rag/rag-frontend service and add principals with the role IAP-secured Web App User.
- Prompt the LLM:
  - Start chatting! This will fetch context related to your prompt from the vector embeddings in the pgvector CloudSQL instance, augment the original prompt with the context and query the inference model (mistral-7b) with the augmented prompt.
Configure authenticated access via IAP (recommended)
We recommend you configure authenticated access via IAP for your services.
- Make sure the OAuth Consent Screen is configured for your project. Ensure User type is set to Internal.
- Make sure Policy for Restrict Load Balancer Creation Based on Load Balancer Types allows EXTERNAL_HTTP_HTTPS.
- Set the following variables in workloads.tfvars:
jupyter_add_auth = true
frontend_add_auth = true
ray_dashboard_add_auth = true
- Allowlist principals for your services via jupyter_members_allowlist, frontend_members_allowlist and ray_dashboard_members_allowlist.
- Configure custom domain names via jupyter_domain, frontend_domain and ray_dashboard_domain for your services.
- Configure DNS records for your custom domains:
  - Register a domain on Google Cloud Domains or use a domain registrar of your choice.
  - Set up your DNS service to point to the public IP addresses:
    - Run terraform output frontend_ip_address to get the public IP address of the frontend, and add an A record in your DNS configuration pointing to it.
    - Run terraform output jupyterhub_ip_address to get the public IP address of JupyterHub, and add an A record in your DNS configuration pointing to it.
    - Run terraform output ray_dashboard_ip_address to get the public IP address of the Ray dashboard, and add an A record in your DNS configuration pointing to it.
  - Add an A record: If the DNS service of your domain is managed by a Google Cloud DNS managed zone, there are two options to add the A record (an example with illustrative values follows this list):
    - Go to https://console.cloud.google.com/net-services/dns/zones, select the zone, click ADD STANDARD, and fill in your domain name and public IP address.
    - Run gcloud dns record-sets create <domain address>. --zone=<zone name> --type="A" --ttl=<ttl in seconds> --rrdatas="<public ip address>"
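For example, with purely illustrative values (substitute your own domain, zone name and the IP address returned by terraform output):
gcloud dns record-sets create rag.example.com. --zone=my-dns-zone --type="A" --ttl=300 --rrdatas="203.0.113.10"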
Cleanup
- Run terraform destroy --var-file="workloads.tfvars"
- Network deletion issue: terraform destroy fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it (a sketch of the manual deletion follows).
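A sketch of the manual deletion, assuming the network created by this guide; any remaining subnets, firewall rules or peerings that still reference the network must be deleted first, otherwise the command fails:
gcloud compute networks delete <network name>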
Troubleshooting
Set the namespace, cluster name and location from workloads.tfvars:
export NAMESPACE=rag
export CLUSTER_LOCATION=us-central1
export CLUSTER_NAME=rag-cluster
Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
- Troubleshoot Ray job failures:
  - If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
    - Run kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay. There should be a Ray head and a Ray worker pod in Running state. If your Ray pods aren’t running, it’s likely due to quota or stockout issues. Check that your project and selected cluster_location have L4 GPU capacity (a rough quota check is sketched after this list).
  - Often, retrying the Ray job submission (the last cell of the notebook) helps.
  - The Ray job may take 15-20 minutes to run the first time due to environment setup.
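One rough way to check regional L4 GPU quota with gcloud (NVIDIA_L4_GPUS is the metric name regional GPU quotas are typically reported under; adjust if your output differs):
gcloud compute regions describe ${CLUSTER_LOCATION} --format=json | grep -B 1 -A 1 NVIDIA_L4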
- Troubleshoot IAP login issues:
  - Verify the cert is Active:
    - For JupyterHub:
    kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - For the frontend:
    kubectl get managedcertificates frontend-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
  - Verify users are allowlisted for the JupyterHub or frontend services:
    - JupyterHub: Go to Google Cloud Platform IAP, select the rag/proxy-public service and check if the user has the role IAP-secured Web App User.
    - Frontend: Go to Google Cloud Platform IAP, select the rag/rag-frontend service and check if the user has the role IAP-secured Web App User.
  - Org error:
    - The OAuth Consent Screen has User type set to Internal by default, which means principals external to the org your project is in cannot log in. To add external principals, change User type to External.
- Troubleshoot terraform apply failures:
  - Inference server (mistral) fails to deploy:
    - This usually indicates a stockout/quota issue. Verify your project and chosen cluster_location have L4 capacity.
  - GCS bucket already exists:
    - GCS bucket names have to be globally unique; pick a different name with a random suffix.
  - Cloud SQL instance already exists:
    - Ensure the cloudsql_instance name doesn’t already exist in your project.
  - GMP operator webhook connection refused:
    - This is a rare, transient error. Run terraform apply again to resume deployment.
- Troubleshoot terraform destroy failures:
  - Network deletion issue: terraform destroy fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
- Troubleshoot the error Repo model mistralai/Mistral-7B-Instruct-v0.1 is gated. You must be authenticated to access it. for the pod of deployment mistral-7b-instruct:
  The error occurs because the RAG deployment uses Mistral-7B-instruct, which is now a gated model on Hugging Face. Deployments fail because they require a Hugging Face authentication token, which is not part of the current workflow. While we are actively working on a long-term fix, here is how to work around the error:
  - Use the guide as a reference to create an access token.
  - Go to the model card in Hugging Face and click “Agree and access repository”.
  - Create a secret with the Hugging Face credential, called hf-secret, in the namespace where your mistral-7b-instruct deployment is running (a sketch of this command appears after this list).
  - Add the following entry to env within the deployment mistral-7b-instruct via kubectl edit:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: hf_api_token
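A sketch of creating that secret, assuming your token is in the HF_TOKEN environment variable (the key name hf_api_token matches the secretKeyRef in the deployment snippet above):
kubectl create secret generic hf-secret -n ${NAMESPACE} --from-literal=hf_api_token=${HF_TOKEN}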