LlamaIndex in a GKE cluster

LlamaIndex bridges the gap between LLMs and domain-specific datasets, enabling efficient indexing and querying of unstructured data for intelligent, data-driven responses. Deploying it on a GKE cluster adds scalability and security through containerized workloads, GPU-based nodes, and cloud-native integrations, making it well suited for ML-powered applications such as RAG systems.

Overview

This tutorial will guide you through creating a robust Retrieval-Augmented Generation (RAG) system using LlamaIndex and deploying it on Google Kubernetes Engine (GKE).

What will you learn

  1. Data Preparation and Ingestion: Use LlamaIndex to structure and index your data for efficient querying.
  2. Model Integration: Connect LlamaIndex with an LLM to build a RAG pipeline that can generate precise responses based on your indexed data.
  3. Containerization: Package the application as a container for deployment.
  4. GKE Deployment: Set up a GKE cluster using Terraform and deploy your RAG system, leveraging Kubernetes features.

Before you begin

Ensure you have a GCP project with a billing account.

Ensure you have the following tools installed on your workstation (all of them are used in this guide):

  • gcloud CLI
  • terraform
  • kubectl
  • git
  • curl

If you previously installed the gcloud CLI, get the latest version by running:

gcloud components update

Ensure that you are signed in using the gcloud CLI tool. Run the following command:

gcloud auth application-default login

Download the guide files

Clone the repository with our guides and cd to the llamaindex/rag directory by running these commands:

git clone https://github.com/ai-on-gke/tutorials-and-examples.git
cd tutorials-and-examples/llamaindex/rag

Filesystem structure

  • app - a demo Python application that uses LlamaIndex to ingest data into the RAG and query it through a web API.
  • templates - Kubernetes manifests that require additional processing to fill in values that are not known in advance.
  • terraform - Terraform config that automates provisioning of the required infrastructure resources.

Demo application

The demo application consists of the following components:

Data ingestion

Data ingestion is the process of adding the data that the RAG will use. It is defined in app/cmd/ingest_data.py.

File: app/cmd/ingest_data.py

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        REDIS_HOST, REDIS_PORT, namespace="document_store"
    ),
    vector_store=vector_store,
    cache=cache,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

The pipeline variable defines an IngestionPipeline with the following arguments:

  • transformations: defines a list of data transformation steps applied during the ingestion pipeline. In our case:
    • SentenceSplitter: splits the text while trying to keep sentences and paragraphs together, so compared to the original TokenTextSplitter it is less likely to leave hanging sentences or sentence fragments at the end of a node chunk.
    • embed_model: an embedding model that generates vector representations for the processed text.
  • docstore: defines where the processed documents are stored for later retrieval.
  • vector_store: the vector store where the resulting nodes and their embeddings are placed.
  • cache: an ingestion cache that avoids re-running transformations on documents that have not changed.
  • docstore_strategy: defines how the document store handles incoming data. In our case, it updates existing documents or inserts new ones if they don’t already exist.
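
For reference, here is a minimal sketch of how such a pipeline is typically invoked. The exact code in app/cmd/ingest_data.py may differ; the input directory below is an assumption based on the INPUT_DIR value used later in the ingestion job:

from llama_index.core import SimpleDirectoryReader

# Load raw documents from the mounted data directory (see INPUT_DIR in the ingestion job manifest)
documents = SimpleDirectoryReader(input_dir="/datalake").load_data()
print(f"Loaded {len(documents)} docs")

# Apply the transformations; with the UPSERTS strategy, unchanged documents are skipped
nodes = pipeline.run(documents=documents, show_progress=True)
print(f"Ingested {len(nodes)} Nodes")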

File: app/rag_demo/__init__.py

In this demo we use Redis Stack as our vector store, so we also define an index schema for it:

custom_schema = IndexSchema.from_dict(
    {
        "index": {"name": "bucket", "prefix": "doc"},
        # customize fields that are indexed
        "fields": [
            # required fields for llamaindex
            {"type": "tag", "name": "id"},
            {"type": "tag", "name": "doc_id"},
            {"type": "text", "name": "text"},
            # custom vector field for bge-small-en-v1.5 embeddings
            {
                "type": "vector",
                "name": "vector",
                "attrs": {
                    "dims": 384,
                    "algorithm": "hnsw",
                    "distance_metric": "cosine",
                },
            },
        ],
    }
)
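
The dims attribute of the vector field must match the dimensionality of the embedding model; bge-small-en-v1.5 produces 384-dimensional vectors. A quick sanity check, assuming EMBEDDING_MODEL_NAME resolves to BAAI/bge-small-en-v1.5:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# The length of a single embedding must equal the "dims" value in the schema above
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
print(len(embed_model.get_text_embedding("hello")))  # expected: 384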

RAG server

A simple FastAPI web application that invokes the LlamaIndex RAG system and returns its response, so we can query the RAG via HTTP. It is defined in app/rag_demo/main.py and consists of a FastAPI application with a single /invoke endpoint and a connection to the Redis vector store:

embed_model = HuggingFaceEmbedding(model_name=EMBEDDING_MODEL_NAME)

# Connect to vector store with already ingested data
vector_store = RedisVectorStore(
    schema=custom_schema,
    redis_url=f"redis://{REDIS_HOST}:{REDIS_PORT}",
)

# Create index from a vector store
index = VectorStoreIndex.from_vector_store(
    vector_store, embed_model=embed_model
)

# Connect to LLM using Ollama
llm = Ollama(
    model=MODEL_NAME,
    base_url=OLLAMA_SERVER_URL,
)

# Create query engine that is ready to query our RAG
query_engine = index.as_query_engine(llm=llm)
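
With the query engine in place, the HTTP layer is a thin wrapper around it. Below is a minimal sketch of the /invoke endpoint that continues from the code above; the request model is illustrative, so check app/rag_demo/main.py for the actual field names:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InvokeRequest(BaseModel):
    message: str  # assumed field name; the real app may use a different schema

@app.post("/invoke")
def invoke(request: InvokeRequest):
    # Retrieve relevant nodes from Redis and let the LLM generate the answer
    response = query_engine.query(request.message)
    return {"message": str(response)}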

Infrastructure Setup

In this section we use Terraform to automate the creation of the infrastructure resources. For more details on how this is done, refer to the Terraform config in the terraform folder. By default it creates an Autopilot GKE cluster, but it can be changed to a Standard cluster by setting autopilot_cluster = false.

It creates, among other resources, a GKE cluster, a GCS bucket, a container image repository, and the required IAM service accounts.

  1. Go to the terraform directory:
cd terraform
  2. Specify the following values inside the default_env.tfvars file (or make a separate copy):
    • <PROJECT_ID> – replace with your project ID (you can find it in the project settings).

Other values can be changed if needed, or left at their defaults.

  3. Optional: you can use a GCS bucket as storage for the Terraform state. Create a bucket manually, then uncomment the content of the file terraform/backend.tf and specify your bucket:
terraform {
  backend "gcs" {
    bucket = "<BUCKET_NAME>"
    prefix = "terraform/state/llamaindex"
  }
}
  4. Initialize the Terraform modules:
terraform init
  5. Optionally, run the plan command to view an execution plan:
terraform plan -var-file=default_env.tfvars
  6. Execute the plan:
terraform apply -var-file=default_env.tfvars

You should see your resources created:

Apply complete! Resources: 16 added, 0 changed, 0 destroyed.

Outputs:

bucket_name = "llamaindex-rag-demo-tf"
demo_app_image_repo_name = "llamaindex-rag-demo-tf"
gke_cluster_location = "us-central1"
gke_cluster_name = "llamaindex-rag-demo-tf"
project_id = "akvelon-gke-aieco"
  7. Connect to the cluster:
gcloud container clusters get-credentials $(terraform output -raw gke_cluster_name) --region $(terraform output -raw gke_cluster_location) --project $(terraform output -raw project_id)

Deploy the application to the cluster

1. Deploy Redis-stack.

This guide uses Redis Stack as the vector store, but there are many other options.

IMPORTANT: For the simplicity of this guide, Redis is deployed without persistent volumes, so the database is not persistent either. Please configure proper persistence for production.

  1. Apply the Redis-stack manifest:
kubectl apply -f ../redis-stack.yaml
  2. Wait until Redis-stack is successfully deployed:
kubectl rollout status deployment/redis-stack

2. Deploy Ollama server

Ollama is a tool that runs LLMs. It interacts with LlamaIndex through the Ollama integration and serves the desired model, gemma2:9b in our case.

  1. Deploy the generated Ollama manifest:
kubectl apply -f ../gen/ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  ...
  template:
    metadata:
      ...
      annotations:
        gke-gcsfuse/volumes: 'true' # <- use FUSE to mount our bucket
    spec:
      serviceAccount: ... # <- our service account
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4 # <- specify GPU type for LLM
      containers:
         ...
          volumeMounts: # <- mount bucket volume to be used by Ollama
            - name: ollama-data
              mountPath: /root/.ollama/ # <- Ollama's path where it stores models
          resources:
            limits:
              nvidia.com/gpu: 1 # <- Enable GPU
      volumes:
        - name: ollama-data # <- Volume with a bucket mounted with FUSE
          csi:
  2. Wait until Ollama is successfully deployed:
kubectl rollout status deployment/ollama
  3. Pull the gemma2:9b model inside the Ollama server pod:
kubectl exec $(kubectl get pod -l app=ollama -o name) -c ollama -- ollama pull gemma2:9b
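
Optionally, you can sanity-check the served model from Python with the same LlamaIndex Ollama integration the RAG server uses. This assumes you temporarily port-forward the Ollama service to your machine (for example, kubectl port-forward svc/ollama-service 11434:11434):

from llama_index.llms.ollama import Ollama

# Points at the port-forwarded service; inside the cluster the URL is http://ollama-service:11434
llm = Ollama(model="gemma2:9b", base_url="http://127.0.0.1:11434", request_timeout=120.0)
print(llm.complete("Reply with a single word: hello"))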

3. Build the demo app image

  1. Build the llamaindex-rag-demo container image using Cloud Build and push it to the repository created by Terraform. The build uses the cloudbuild.yaml file, which in turn uses app/Dockerfile. This may take some time:
gcloud builds submit ../app \
--substitutions _IMAGE_REPO_NAME="$(terraform output -raw demo_app_image_repo_name)" \
--config ../cloudbuild.yaml

More information about the container image and demo application can be found in the app folder.

4. Ingest data to the vector database by running a Kubernetes job.

  1. Upload the sample data into the bucket created by Terraform. This data will then be ingested into the RAG's vector store:
curl -s https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt | \
gcloud storage cp - gs://$(terraform output -raw bucket_name)/datalake/paul_graham_essay.txt
  2. Create the ingestion job:

The manifests are generated from the templates in the templates directory and placed in the gen directory.

kubectl apply -f ../gen/ingest-data-job.yaml
apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      ...
      annotations:
        gke-gcsfuse/volumes: 'true'  # <- use FUSE to mount our bucket
    spec:
      serviceAccount: ... # <- our service account
      containers:
         ...
        image:  ${IMAGE_NAME}  # <- image name with the app that we have previously built
        command: ["python3", "cmd/ingest_data.py"]   # <- run ingestion script instead of a default entrypoint.
        env:
         ...
        - name: INPUT_DIR # <- Make ingestion script look into mounted FUSE directory
          value: /datalake
        volumeMounts: # <- same as with Ollama mount. Mount bucket where we put ingestion data
        - name: datalake
          mountPath: /datalake
      volumes:
      - name: datalake
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: ...   # <- our bucket name
            mountOptions: implicit-dirs,only-dir=datalake # <- use a subfolder of the same bucket
      ...
  3. Wait for the data to be ingested. It may take a few minutes:
kubectl wait --for=condition=complete --timeout=600s job/llamaindex-ingest-data
  4. Verify that the data has been ingested:
kubectl logs -f -l name=ingest-data

Expected output:

Loaded 1 docs
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00,  9.01it/s]
Generating embeddings: 100%|██████████| 21/21 [00:22<00:00,  1.05s/it]
Ingested 21 Nodes

NOTE: The job’s pod might no longer be available if you wait too long, as Kubernetes can delete completed jobs after a period of time.

5. Deploy RAG server

  1. Apply the generated manifest:
kubectl apply -f ../gen/rag-deployment.yaml
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: llamaindex-rag
          image:  ${IMAGE_NAME} # <- image name with the app that we have previously built
          ...
          env:
            - name: MODEL_NAME # <- LLM name to use, gemma2:9b in our case
              ...
            - name: OLLAMA_SERVER_URL # <- URL of the previously deployed Ollama server
              value: http://ollama-service:11434
          ...
  2. Wait until the deployment is complete:
kubectl rollout status deployment/llamaindex-rag

Test the RAG

  1. Forward the port to get access from your local machine:
kubectl port-forward svc/llamaindex-rag-service 8000:8000
  2. Go to http://127.0.0.1:8000/docs. You will see a single API endpoint where you can invoke the RAG system.

  3. Do some prompting. If you ask "What did Paul do?", the output should look like this:

{
  "message": "He wrote essays and stories as a young person,  learned to program on an IBM 1401, and later moved on to microcomputers where he created games and programs like a word processor. He initially planned to study philosophy but switched to AI, influenced by a novel and a documentary about artificial intelligence. Paul focused on Lisp and even wrote a book about it called \"On Lisp\".  He explored computer science theory but found more excitement in building things. Eventually, he became interested in art and started taking classes at Harvard while still pursuing his PhD in computer science. He managed to write and submit a dissertation in time to graduate, utilizing parts of his work on \"On Lisp\". \n\n\n"
}
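
You can also call the endpoint programmatically instead of using the Swagger UI. A small example with the requests library; the request field name is an assumption, so check the schema shown at /docs:

import requests

resp = requests.post(
    "http://127.0.0.1:8000/invoke",
    json={"message": "What did Paul do?"},  # field name is an assumption
    timeout=120,
)
print(resp.json()["message"])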

Cleanup

terraform destroy -var-file=default_env.tfvars

Troubleshooting

  • You may see a temporary error in pods that mount buckets using FUSE. It should normally resolve without any additional action:
MountVolume.SetUp failed for volume "datalake" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name gcsfuse.csi.storage.gke.io not found in the list of registered CSI drivers
