RAG on GKE
This is a sample for deploying a Retrieval Augmented Generation (RAG) application on GKE.
The latest recommended release is branch release-1.1.
What is RAG?
RAG is a popular approach for boosting the accuracy of LLM responses, particularly for domain-specific or private data sets.
RAG uses a semantically searchable knowledge base (like vector search) to retrieve relevant snippets for a given prompt to provide additional context to the LLM. Augmenting the knowledge base with additional data is typically cheaper than fine-tuning and is more scalable when incorporating current events and other rapidly changing data.
RAG on GKE Architecture
- A GKE service endpoint serving Hugging Face TGI inference using mistral-7b.
- A Cloud SQL pgvector instance with vector embeddings generated from an input dataset.
- A Ray cluster running on GKE that runs jobs to generate embeddings and populate the vector DB.
- A Jupyter notebook running on GKE that reads the dataset using GCS fuse driver integrations and runs a Ray job to populate the vector DB.
- A front end chat interface running on GKE that prompts the inference server with context from the vector DB.
This tutorial walks you through installing the RAG infrastructure in a GCP project, generating vector embeddings for a sample Kaggle Netflix shows dataset and prompting the LLM with context.
Prerequisites
Install tooling (required)
Install the following on your computer:
- Terraform
- Gcloud CLI
- kubectl
Bring your own cluster (optional)
By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.
If you prefer to manage your own cluster, set create_cluster = false and make sure network_name is set to your cluster’s network in the Installation section. Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.
Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features.
gcloud container clusters create-auto rag-cluster \
--location us-central1 \
--labels=created-by=ai-on-gke,guide=rag-on-gke
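If you bring your own cluster, fetch its credentials so the kubectl commands later in this guide can reach it:
gcloud container clusters get-credentials rag-cluster --location us-central1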
Bring your own VPC (optional)
By default, this tutorial creates a new network on your behalf with Private Service Connect already enabled. We highly recommend following the default settings.
If you prefer to use your own VPC, set create_network = false in the Installation section. This also requires enabling Private Service Connect for your VPC. Without Private Service Connect, the RAG components cannot connect to the vector DB.
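As a rough sanity check (the exact requirements depend on how your Cloud SQL connectivity is configured), you can list existing service networking peerings on your VPC, which private Cloud SQL connectivity commonly relies on; <your network name> is a placeholder:
gcloud services vpc-peerings list --network=<your network name>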
Installation
This section sets up the RAG infrastructure in your GCP project using Terraform.
Terraform keeps state metadata in a local file called terraform.tfstate. Deleting the file may cause some resources to not be cleaned up correctly even if you delete the cluster. We suggest using terraform destroy before reapplying/reinstalling.
- If needed, clone the repo:
git clone https://github.com/ai-on-gke/quick-start-guides
cd quick-start-guides/rag
- Edit workloads.tfvars to set your project ID, location, cluster name, and GCS bucket name (a sample file is sketched below). Ensure the gcs_bucket name is globally unique (add a random suffix). Optionally, make the following changes:
  - (Recommended) Enable authenticated access for JupyterHub, frontend chat and Ray dashboard services.
  - (Optional) Set a custom kubernetes_namespace where all k8s resources will be created.
  - (Optional) Set autopilot_cluster = false to deploy using GKE Standard.
  - (Optional) Set create_cluster = false if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in infrastructure/README.md.
  - (Optional) Set create_network = false if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
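For reference, a minimal workloads.tfvars might look like the following sketch. The values are illustrative, and project_id / cluster_name are assumed variable names; check the comments in the file itself for the exact set of variables:
project_id             = "your-gcp-project"         # assumed variable name, illustrative value
cluster_name           = "rag-cluster"              # assumed variable name
cluster_location       = "us-central1"
kubernetes_namespace   = "rag"
gcs_bucket             = "rag-data-<random-suffix>"  # must be globally unique
autopilot_cluster      = true
create_cluster         = true
create_network         = true
jupyter_add_auth       = true
frontend_add_auth      = true
ray_dashboard_add_auth = true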
- Initialize the Terraform template:
terraform init
- Run the Terraform creation template:
terraform apply --var-file=./workloads.tfvars
Creation of the Kubernetes cluster, network and all other required components can take up to 10 minutes. Check the terraform apply logs to get the latest status.
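Once the apply finishes, you can list the generated outputs (passwords, URIs and IP addresses referenced later in this guide) with:
terraform output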
Generate vector embeddings for the dataset
This section generates the vector embeddings for your input dataset. Currently, the default dataset is Netflix shows. We will use a Jupyter notebook to run a Ray job that generates the embeddings and populates them into the pgvector instance created above.
Set the namespace, cluster name and location from workloads.tfvars:
export NAMESPACE=rag
export CLUSTER_LOCATION=us-central1
export CLUSTER_NAME=rag-cluster
Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
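Before continuing, you can confirm the RAG components came up (exact pod names depend on your configuration):
kubectl get pods -n ${NAMESPACE}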
- Connect and login to JupyterHub:
  - If IAP is disabled (jupyter_add_auth = false):
    - Port forward to the JupyterHub service:
    kubectl port-forward service/proxy-public -n ${NAMESPACE} 8081:80 &
    - Go to localhost:8081 in a browser.
    - Login with these credentials:
      - Username: admin
      - Password: Use terraform output jupyterhub_password to fetch the password value.
  - If IAP is enabled (jupyter_add_auth = true):
    - Fetch the domain: terraform output jupyterhub_uri
    - If you used a custom domain, ensure you configured your DNS as described above.
    - Verify the domain status is Active. Note: This can take up to 20 minutes to propagate.
    kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
    - To add additional users to your JupyterHub application, go to Google Cloud Platform IAP, select the rag/proxy-public service and add principals with the role IAP-secured Web App User.
- Load the notebook:
  - Once logged in to JupyterHub, choose the CPU preset with Default storage.
  - Click File -> Open From URL and paste: https://github.com/ai-on-gke/quick-start-guides/rag/example_notebooks/rag-kaggle-ray-sql-interactive.ipynb
- Configure Kaggle:
  - Create a Kaggle account.
  - Generate an API token. See further instructions. This token is used in the notebook to access the Kaggle Netflix shows dataset.
  - Replace the variables in the 1st cell of the notebook with your Kaggle credentials (found in the kaggle.json file created while generating the API token):
    - KAGGLE_USERNAME
    - KAGGLE_KEY
- Generate vector embeddings: Run all the cells in the notebook to generate vector embeddings for the Netflix shows dataset (https://www.kaggle.com/datasets/shivamb/netflix-shows) via a Ray job and store them in the pgvector CloudSQL instance.
  - When the last cell succeeds, the vector embeddings have been generated and we can launch the frontend chat interface. Note that the Ray job can take up to 10 minutes to finish.
  - Ray may take several minutes to create the runtime environment. During this time, the job will appear to be missing (e.g. Status message: PENDING).
  - Connect to the Ray dashboard to check the job status or logs (alternatively, use the Ray CLI as sketched after this list):
    - If IAP is disabled (ray_dashboard_add_auth = false):
      - Port forward to the Ray dashboard service:
      kubectl port-forward -n ${NAMESPACE} service/ray-cluster-kuberay-head-svc 8265:8265
      - Go to localhost:8265 in a browser.
    - If IAP is enabled (ray_dashboard_add_auth = true):
      - Fetch the domain: terraform output ray-dashboard-managed-cert
      - If you used a custom domain, ensure you configured your DNS as described above.
      - Verify the domain status is Active. Note: This can take up to 20 minutes to propagate.
      kubectl get managedcertificates ray-dashboard-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
      - Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
      - To add additional users to your Ray dashboard, go to Google Cloud Platform IAP, select the rag/ray-cluster-kuberay-head-svc service and add principals with the role IAP-secured Web App User.
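As an alternative to the dashboard UI, if you have the Ray CLI installed locally you can list jobs through the same port-forwarded address (a sketch; it assumes the port-forward shown above is running):
ray job list --address http://localhost:8265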
Launch the frontend chat interface
- Connect to the frontend:
  - If IAP is disabled (frontend_add_auth = false):
    - Port forward to the frontend service:
    kubectl port-forward service/rag-frontend -n ${NAMESPACE} 8080:8080 &
    - Go to localhost:8080 in a browser.
  - If IAP is enabled (frontend_add_auth = true):
    - Fetch the domain: terraform output frontend_uri
    - If you used a custom domain, ensure you configured your DNS as described above.
    - Verify the domain status is Active. Note: This can take up to 20 minutes to propagate.
    kubectl get managedcertificates frontend-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
    - To add additional users to your frontend application, go to Google Cloud Platform IAP, select the rag/rag-frontend service and add principals with the role IAP-secured Web App User.
- Prompt the LLM:
  - Start chatting! This will fetch context related to your prompt from the vector embeddings in the pgvector CloudSQL instance, augment the original prompt with the context and query the inference model (mistral-7b) with the augmented prompt.
Configure authenticated access via IAP (recommended)
We recommend you configure authenticated access via IAP for your services.
- Make sure the OAuth Consent Screen is configured for your project. Ensure User type is set to Internal.
- Make sure Policy for Restrict Load Balancer Creation Based on Load Balancer Types allows EXTERNAL_HTTP_HTTPS.
- Set the following variables in workloads.tfvars:
jupyter_add_auth = true
frontend_add_auth = true
ray_dashboard_add_auth = true
- Allowlist principals for your services via jupyter_members_allowlist, frontend_members_allowlist and ray_dashboard_members_allowlist.
- Configure custom domain names via jupyter_domain, frontend_domain and ray_dashboard_domain for your services.
- Configure DNS records for your custom domains:
  - Register a domain on Google Cloud Domains or use a domain registrar of your choice.
  - Set up your DNS service to point to the public IP addresses:
    - Run terraform output frontend_ip_address to get the public IP address of the frontend, and add an A record in your DNS configuration pointing to it.
    - Run terraform output jupyterhub_ip_address to get the public IP address of JupyterHub, and add an A record in your DNS configuration pointing to it.
    - Run terraform output ray_dashboard_ip_address to get the public IP address of the Ray dashboard, and add an A record in your DNS configuration pointing to it.
  - Add an A record: If the DNS service of your domain is managed by a Google Cloud DNS managed zone, there are two options to add the A record (an example with illustrative values follows this list):
    - Go to https://console.cloud.google.com/net-services/dns/zones, select the zone, click ADD STANDARD, and fill in your domain name and public IP address.
    - Run gcloud dns record-sets create <domain address>. --zone=<zone name> --type="A" --ttl=<ttl in seconds> --rrdatas="<public ip address>"
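For example, with purely illustrative values (substitute your own domain, zone name and the IP address returned by terraform output):
gcloud dns record-sets create rag.example.com. --zone=my-dns-zone --type="A" --ttl=300 --rrdatas="203.0.113.10"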
Cleanup
- Run terraform destroy --var-file="workloads.tfvars"
- Network deletion issue: terraform destroy fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it (a sketch of the manual deletion follows).
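A sketch of the manual deletion, assuming the network created by this guide; any remaining subnets, firewall rules or peerings that still reference the network must be deleted first, otherwise the command fails:
gcloud compute networks delete <network name>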
Troubleshooting
Set the namespace, cluster name and location from workloads.tfvars:
export NAMESPACE=rag
export CLUSTER_LOCATION=us-central1
export CLUSTER_NAME=rag-cluster
Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
- Troubleshoot Ray job failures:
  - If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
    - Run kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay. There should be a Ray head and a Ray worker pod in Running state. If your Ray pods aren’t running, it’s likely due to quota or stockout issues. Check that your project and selected cluster_location have L4 GPU capacity (a rough quota check is sketched after this list).
  - Often, retrying the Ray job submission (the last cell of the notebook) helps.
  - The Ray job may take 15-20 minutes to run the first time due to environment setup.
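One rough way to check regional L4 GPU quota with gcloud (NVIDIA_L4_GPUS is the metric name regional GPU quotas are typically reported under; adjust if your output differs):
gcloud compute regions describe ${CLUSTER_LOCATION} --format=json | grep -B 1 -A 1 NVIDIA_L4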
- Troubleshoot IAP login issues:
  - Verify the cert is Active:
    - For JupyterHub:
    kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - For the frontend:
    kubectl get managedcertificates frontend-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
  - Verify users are allowlisted for the JupyterHub or frontend services:
    - JupyterHub: Go to Google Cloud Platform IAP, select the rag/proxy-public service and check if the user has the role IAP-secured Web App User.
    - Frontend: Go to Google Cloud Platform IAP, select the rag/rag-frontend service and check if the user has the role IAP-secured Web App User.
  - Org error:
    - The OAuth Consent Screen has User type set to Internal by default, which means principals external to the org your project is in cannot log in. To add external principals, change User type to External.
- Troubleshoot terraform apply failures:
  - Inference server (mistral) fails to deploy:
    - This usually indicates a stockout/quota issue. Verify your project and chosen cluster_location have L4 capacity.
  - GCS bucket already exists:
    - GCS bucket names have to be globally unique; pick a different name with a random suffix.
  - Cloud SQL instance already exists:
    - Ensure the cloudsql_instance name doesn’t already exist in your project.
  - GMP operator webhook connection refused:
    - This is a rare, transient error. Run terraform apply again to resume deployment.
- Troubleshoot terraform destroy failures:
  - Network deletion issue: terraform destroy fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
- Troubleshoot the error Repo model mistralai/Mistral-7B-Instruct-v0.1 is gated. You must be authenticated to access it. for the pod of deployment mistral-7b-instruct:
  The error occurs because the RAG deployment uses Mistral-7B-instruct, which is now a gated model on Hugging Face. Deployments fail because they require a Hugging Face authentication token, which is not part of the current workflow. While we are actively working on a long-term fix, here is how to work around the error:
  - Use the guide as a reference to create an access token.
  - Go to the model card in Hugging Face and click “Agree and access repository”.
  - Create a secret with the Hugging Face credential, called hf-secret, in the namespace where your mistral-7b-instruct deployment is running (a sketch of this command appears after this list).
  - Add the following entry to env within the deployment mistral-7b-instruct via kubectl edit:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: hf_api_token
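A sketch of creating that secret, assuming your token is in the HF_TOKEN environment variable (the key name hf_api_token matches the secretKeyRef in the deployment snippet above):
kubectl create secret generic hf-secret -n ${NAMESPACE} --from-literal=hf_api_token=${HF_TOKEN}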