Fine-tune gemma-2-9b and track as an experiment in MLFlow
Data scientists often run many experiments, and it is essential that the best run does not get lost among them. MLFlow helps track experiments, datasets, metrics, and other data, while GKE clusters can provide compute resources on demand.
Tutorial overview
In this tutorial we will fine-tune gemma-2-9b using LoRA and track it as an experiment in MLFlow. We will deploy MLFlow on a GKE cluster and configure it to store artifacts in a GCS bucket. At the end, we will deploy the fine-tuned model using KServe.
Before you begin
Ensure you have a GCP project with a billing account. Enable the following APIs for your project:
Ensure you have the following tools installed on your workstation:
gcloud CLI
kubectl
terraform
helm
git
If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Set up
If you don’t have a GCP project, you have to create one. Ensure that your project has access to GKE.
Clone the repository with our guides and cd to the mlflow/finetune-gemma directory by running these commands:
git clone https://github.com/ai-on-gke/tutorials-and-examples.git
cd tutorials-and-examples/mlflow/finetune-gemma
Run these commands to authenticate:
gcloud auth login
gcloud auth application-default login
Enter the terraform-gke-cluster folder:
cd terraform-gke-cluster
Then adjust the following variables in the example_environment.tfvars file:
project_id – your GCP project ID.
cluster_name – any name for your cluster.
kubernetes_namespace – the Kubernetes namespace to use for the tutorial's resources.
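For reference, a filled-in example_environment.tfvars might look like this (the values below are illustrative placeholders, not defaults from the repository):
project_id           = "my-gcp-project"
cluster_name         = "mlflow-finetune-cluster"
kubernetes_namespace = "default"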
Run these commands to create your GKE cluster:
terraform init
terraform plan -var-file=example_environment.tfvars # review the planned changes
terraform apply -var-file=example_environment.tfvars # type 'yes' when prompted
After about 10-15 minutes your cluster will be ready. When it is, run these commands to get credentials for your GKE cluster:
export KUBECONFIG=~/tutorial_gke_access_token.kube
export REGION=us-central1
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_NAME=<CLUSTER_NAME>
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
The last command connects your current terminal session to the GKE cluster and saves the kubectl configuration at the KUBECONFIG path.
Note: in step 1, the “Knative Serving install guide” also includes the steps to install the networking layer, and it prefers Kourier over Istio.
Install MLFlow
Go to the mlflow-configuration directory. We will use Helm to install MLFlow.
Since we want to access our model from the MLFlow UI and easily deploy it with KServe, we need to configure MLFlow to use a GCS bucket as its artifact storage. Go to the Google Cloud Console and create a GCS bucket.
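If you prefer the command line, you can create the bucket with gcloud instead (the location below is an example; choose one that suits you):
gcloud storage buckets create gs://<BUCKET_NAME> --project=${PROJECT_ID} --location=us-central1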
Before installing the MLFlow chart, we need to adjust values.yaml so MLFlow can interact with the GCS bucket. Replace the following value:
<BUCKET_NAME> – your bucket name.
Install the MLFlow chart:
helm install my-mlflow-release oci://registry-1.docker.io/bitnamicharts/mlflow -f values.yaml
The result of this command will show you how to access MLFlow UI. The output should look similar to this:
Pulled: registry-1.docker.io/bitnamicharts/mlflow:2.3.1
Digest: sha256:a9fb6ade4f8b8f4d7e36a8bd45f174dd55f4d0818c246b91bfc99570cd923450
NAME: my-mlflow-release
LAST DEPLOYED: Tue Jan 14 13:38:02 2025
NAMESPACE: default
...
1. Get the running pods
kubectl get pods --namespace default -l "app.kubernetes.io/name=mlflow,app.kubernetes.io/instance=my-mlflow-release"
2. Get into a pod
kubectl exec -ti [POD_NAME] bash
3. Execute your script as you would normally do.
MLflow Tracking Server can be accessed through the following DNS name from within your cluster:
my-mlflow-release-tracking.default.svc.cluster.local (port 80)
To access your MLflow site from outside the cluster follow the steps below:
1. Get the MLflow URL by running these commands:
kubectl port-forward --namespace default svc/my-mlflow-release-tracking 80:80 &
echo "MLflow URL: http://127.0.0.1//"
2. Open a browser and access MLflow using the obtained URL.
3. Login with the following credentials below to see your blog:
...
This chart created several service accounts for MLFlow (you can view them by running kubectl get serviceaccounts). In particular, my-mlflow-release-tracking is used by the MLFlow tracking server. We need to give this service account read permission on the GCS bucket so it can view all experiment artifacts. Replace the following variables and run the command below to grant the permission:
<BUCKET_NAME> : the GCS bucket's name that you just created.
<PROJECT_NUMBER> : your GCP project's number.
<PROJECT_ID> : your GCP project's ID.
gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \
--role=roles/storage.objectViewer \
--member=principal://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<PROJECT_ID>.svc.id.goog/subject/ns/default/sa/my-mlflow-release-tracking \
--condition=None
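Optionally, you can confirm that the binding was applied by inspecting the bucket's IAM policy:
gcloud storage buckets get-iam-policy gs://<BUCKET_NAME>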
Let's check our MLFlow tracking server via the UI. We can port-forward the MLFlow tracking service by running the command below. The process runs in the background because of the trailing &; you can stop it by killing the process (its PID is printed after the command starts).
kubectl port-forward --namespace default svc/my-mlflow-release-tracking 8080:80 &
Open the MLFlow UI at http://127.0.0.1:8080. You should see the MLFlow home page.
Fine-tune gemma-2-9b using LoRA
Go to the finetune directory. Before we start our fine-tuning job, we need to create a Kubernetes secret with a Hugging Face API token and accept Google's license so we can download the model weights. Go to google/gemma-2-9b on Hugging Face and accept the license. Then run the commands below to create the Hugging Face secret:
export HF_TOKEN=<HF_TOKEN>
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
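You can verify the secret exists (and, if needed, decode the stored token) with:
kubectl get secret hf-secret
kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 --decode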
Now we can use MLFlow to track our fine-tuning process as an MLFlow experiment. To do so, we will define several files:
cloudbuild.yaml – builds and pushes the Docker image to Artifact Registry.
Dockerfile – for containerization.
finetune.yaml – the GKE cluster job.
finetune.py – the fine-tuning script (a simplified sketch of its tracking calls follows this list).
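For context, the tracking logic in finetune.py follows the standard MLFlow pattern. The snippet below is a simplified, illustrative sketch of that pattern rather than the actual script; it relies on the MLFLOW_URI, MLFLOW_ARTIFACT_URI, and MLFLOW_EXPERIMENT_NAME environment variables described later in this section, and the logged values are placeholders:
import os
import mlflow

# Point the client at the in-cluster tracking server.
mlflow.set_tracking_uri(os.environ["MLFLOW_URI"])

# Create the experiment with a GCS artifact location if it does not exist yet.
experiment_name = os.environ["MLFLOW_EXPERIMENT_NAME"]
if mlflow.get_experiment_by_name(experiment_name) is None:
    mlflow.create_experiment(experiment_name, artifact_location=os.environ["MLFLOW_ARTIFACT_URI"])
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    mlflow.log_param("base_model", "google/gemma-2-9b")
    # ... run the LoRA fine-tuning loop here ...
    mlflow.log_metric("train_loss", 1.13)  # example value
    # Upload the fine-tuned model files as run artifacts.
    mlflow.log_artifacts("output/model", artifact_path="model")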
Create an Artifact Registry Docker Repository by running the following command:
gcloud artifacts repositories create gemma \
--project=${PROJECT_ID} \
--repository-format=docker \
--location=us \
--description="Gemma Repo"
We need to give the GKE cluster's service account read-only access to Artifact Registry so it can pull the Docker image for the fine-tuning job. Replace the following variable and run the commands below to grant access:
<SERVICE_ACCOUNT_EMAIL> : the email of your GKE cluster's service account.
Get your GKE cluster’s service account email by running:
gcloud container clusters describe $CLUSTER_NAME \
--region $REGION \
--format="value(nodeConfig.serviceAccount)"
And give the permission:
gcloud artifacts repositories add-iam-policy-binding gemma \
--project="${PROJECT_ID}" \
--location="us" \
--member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
--role="roles/artifactregistry.reader"
Now we can build our Docker image for a fine-tuning job. Run the command below:
gcloud builds submit .
In the end, the output should look like this:
DONE
-------------------------------------------------------------------------------------------------------------------------------------
ID CREATE_TIME DURATION SOURCE IMAGES STATUS
44ff24d5-042f-49e9-9666-5d00a6057e63 2025-01-14T09:45:13+00:00 12M43S gs://akvelon-gke-aieco_cloudbuild/source/1736847911.742098-6bbf01cd346f41aeb96cdd0ceb658181.tgz us-docker.pkg.dev/akvelon-gke-aieco/gemma/finetune-gemma-mlflow:1.0.0 SUCCESS
In the finetune.yaml file you can see the crucial MLFlow environment variables (a sketch of how they appear in the manifest follows this list):
MLFLOW_URI : for connecting to the MLFlow tracking server. We set it to http://my-mlflow-release-tracking:80, which is an internal GKE cluster URI.
MLFLOW_ARTIFACT_URI : for connecting experiments to our GCS bucket. Specify it like gs://<BUCKET_NAME>/<ANY_EXISTING_PATH>.
MLFLOW_EXPERIMENT_NAME : the experiment name. If you want to start a fine-tuning job as a new experiment, change this variable.
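In the manifest these are defined as ordinary container environment variables. The snippet below only illustrates the shape (the bucket path and experiment name are placeholders); the repository's finetune.yaml already contains the actual entries:
env:
  - name: MLFLOW_URI
    value: "http://my-mlflow-release-tracking:80"
  - name: MLFLOW_ARTIFACT_URI
    value: "gs://<BUCKET_NAME>/finetune-artifacts"
  - name: MLFLOW_EXPERIMENT_NAME
    value: "gemma-2-9b-finetune"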
Our finetune.py script uses MLFlow to save experiment artifacts under the MLFLOW_ARTIFACT_URI path. In the finetune.yaml file you can see that the job runs under the default Kubernetes service account, so we need to give that service account permission to write the artifacts to the bucket. Replace the following variables and run the command below to grant the permission:
<BUCKET_NAME> : the GCS bucket's name that you just created.
<PROJECT_NUMBER> : your GCP project's number.
<PROJECT_ID> : your GCP project's ID.
gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \
--role=roles/storage.objectUser \
--member=principal://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<PROJECT_ID>.svc.id.goog/subject/ns/default/sa/default \
--condition=None
Then, inside the finetune.yaml file, replace the following variables:
<PROJECT_ID> : your project ID.
<YOUR_BUCKET_PATH> : your GCS bucket path.
Then run this command to create the fine-tuning job:
kubectl apply -f finetune.yaml
You can review your pods by running the following command:
kubectl get pods -w
Once your pod's status is Running, you can follow its logs with this command:
kubectl logs <POD_NAME> -f
After about 15 minutes you should be able to see fine-tuning logs. The output should look like this:
{'loss': 0.8569, 'grad_norm': 1.615104079246521, 'learning_rate': 1.0277984159122733e-07, 'epoch': 0.99}
{'loss': 0.8612, 'grad_norm': 2.9606473445892334, 'learning_rate': 8.391511416816489e-09, 'epoch': 1.0}
{'train_runtime': 247.4258, 'train_samples_per_second': 4.042, 'train_steps_per_second': 2.021, 'train_loss': 1.1304109792709351, 'epoch': 1.0}
100%|██████████| 500/500 [04:07<00:00, 2.02it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00, 1.32s/it]
🏃 View run funny-mare-232 at: http://my-mlflow-release-tracking:80/#/experiments/1/runs/31c95f20219946b9934b2f26e4a863af
🧪 View experiment at: http://my-mlflow-release-tracking:80/#/experiments/1
Or you can go to your MLFlow UI and check the metrics there. The whole fine-tuning experiment should take about 26 minutes.
Model registry
After the fine-tuning job is finished, you can check the result in the MLFlow UI. Find your experiment, go to the Artifacts tab, and register your model from there.
Now you have registered an ML model in MLFlow!
Deployment
This section has two subsections: a simple deployment and a production-like one. If you just want to check the model's output, follow the Simple deployment subsection. If you want to deploy your fine-tuned model using KServe, follow the KServe deployment subsection.
Simple deployment
Go to the deploy-gemma2/simple directory. In this directory you can see the following files:
cloudbuild.yaml : builds and submits the deployment Docker image.
deploy.yaml : creates the deployment and service to invoke our fine-tuned model.
main.py : defines a simple FastAPI application with a single predict API endpoint (a simplified sketch of such an app appears after this list).
requirements.txt : defines the FastAPI dependencies.
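To make the role of main.py concrete, here is a heavily simplified, hypothetical sketch of such an app. It is not the repository's actual script: it assumes the fine-tuned model was logged in a flavor that MLFlow's pyfunc loader can read, and it invents a MODEL_URI environment variable pointing at the logged model.
import os

import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    prompt: str

# Hypothetical: MODEL_URI would point at the logged model, e.g.
# gs://<YOUR_BUCKET_NAME>/<MLFLOW_RUN_ID>/artifacts/model
model = mlflow.pyfunc.load_model(os.environ["MODEL_URI"])

@app.post("/predict")
def predict(request: PredictRequest):
    # pyfunc models take a batch of inputs and return a batch of outputs.
    outputs = model.predict([request.prompt])
    return {"output": outputs[0]}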
To deploy our model, we need to create a new repository in Artifact Registry. Run the command below:
gcloud artifacts repositories create gemma-deployment \
--project=${PROJECT_ID} \
--repository-format=docker \
--location=us \
--description="Gemma Deployment Repo"
And give your GKE cluster service account read access by running:
gcloud artifacts repositories add-iam-policy-binding gemma-deployment \
--project="${PROJECT_ID}" \
--location="us" \
--member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
--role="roles/artifactregistry.reader"
Run this command to submit a deployment Docker image to your repository:
gcloud builds submit .
After the command completes, edit the following variables in the deploy.yaml file:
<PROJECT_ID> : your project ID.
<YOUR_BUCKET_NAME> : the bucket where you store the fine-tuned model.
<MLFLOW_RUN_ID> : the ID of your MLFlow fine-tuning run.
And run the following command to deploy the model:
kubectl apply -f deploy.yaml
To check the status of your deployment, find your pod name by running kubectl get pods and then run this command to check the logs:
kubectl logs <POD_NAME> -f
You can consider your deployment successful and ready to use when your logs look like this:
Loading checkpoint shards: 100%|██████████| 35/35 [00:52<00:00, 1.49s/it]
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:11434 (Press CTRL+C to quit)
Run this command to forward the port for your service:
kubectl port-forward svc/gemma2-9b-finetuned-service 8081:11434
Then open http://127.0.0.1:8081/docs. Here you should be able to invoke your model. For example, for the following input:
Question: What is the total number of attendees with age over 30 at kubecon "eu"?\nContext: CREATE TABLE attendees (name VARCHAR, age INTEGER, kubecon VARCHAR)\n\nAnswer:
The output should look like this:
SELECT COUNT(*) FROM attendees WHERE age > 30 AND kubecon = "eu"
Clean up
Remove the Artifact Registry permission:
gcloud artifacts repositories remove-iam-policy-binding gemma-deployment \
--project="${PROJECT_ID}" \
--location="us" \
--member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
--role="roles/artifactregistry.reader"
KServe deployment
Go to the deploy-gemma2/kserve
directory. In this section we will use KServe to deploy our fine-tuned model.
To install KServe, you can follow the guide in our repo. After a successful installation, we need to patch the deployment mode and create an ingress class. Run the command below:
kubectl patch configmap/inferenceservice-config -n kserve --type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"RawDeployment\"}"}}'
The command above switches deployment mode from Serverless to RawDeployment, which bypasses Knative. RawDeployment mode directly manages Kubernetes resources for model deployment, which can offer more flexibility.
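You can confirm the change by reading the value back from the configmap:
kubectl get configmap inferenceservice-config -n kserve -o jsonpath='{.data.deploy}'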
Also install the ingress class by running this command:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: istio
spec:
controller: istio.io/ingress-controller
EOF
We need this class to be able to invoke the deployed model.
Before deployment, we have to prepare the environment for MLServer. Since the seldonio/mlserver image's libraries might be outdated, we can provide a custom environment with the libraries we need (you can read more details here).
You can see that we have a conda-configs directory. Inside it is a yaml file that will be used to create our custom environment. Create a configmap from this environment.yaml by running the command below:
kubectl create configmap conda-requirements --from-file=conda-configs
Before you run the environment creation job, you need to adjust the following value in create-tarball.yaml and deploy.yaml:
<YOUR_BUCKET_NAME> – your bucket name.
Now we can run our environment creation job:
kubectl apply -f create-tarball.yaml
You can check the progress of the tarball creation by running the following command:
kubectl logs <POD_NAME> -f
After the job is done, you will see an mlflow-gemma2-env.tar.gz tarball in your bucket. We will use it in deploy.yaml.
Change the following variables in the deploy.yaml file (a sketch of the InferenceService shape follows this list):
<PATH_TO_YOUR_MODEL> – you can find the path in the MLFlow UI (it should look like gs://<BUCKET_NAME>/asdf12345/artifacts/model).
<YOUR_BUCKET_NAME> – your bucket name; the bucket is mounted via gcsfuse to install your conda environment.
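For orientation, the core of a KServe InferenceService for an MLFlow model has roughly the shape below. The repository's deploy.yaml contains additional configuration (for example, the gcsfuse mount for the conda environment tarball), so treat this only as a sketch:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-2-9b-finetuned
spec:
  predictor:
    model:
      modelFormat:
        name: mlflow
      storageUri: "<PATH_TO_YOUR_MODEL>"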
Run the command below:
kubectl apply -f deploy.yaml
You can check the deployment status with this command:
kubectl get inferenceservices -w
The output should look like this:
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
gemma-2-9b-finetuned http://gemma-2-9b-finetuned-default.example.com True 4m34s
When the inference service reaches READY=True, you can invoke the fine-tuned model. Run these commands to set up access:
SERVICE_HOSTNAME=$(kubectl get inferenceservice gemma-2-9b-finetuned -o jsonpath='{.status.url}' | cut -d "/" -f 3)
INGRESS_HOST=127.0.0.1
INGRESS_PORT=8081
kubectl port-forward -n istio-system svc/istio-ingressgateway $INGRESS_PORT:80 &
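The curl command below reads its payload from input.json, which is provided in the directory. If you want to craft your own payload, a V2 (Open Inference Protocol) request for this model looks roughly like the following; the input name is illustrative and the prompt is the same example used in the Simple deployment section:
{
  "inputs": [
    {
      "name": "input-1",
      "shape": [1, 1],
      "datatype": "BYTES",
      "parameters": {"content_type": "str"},
      "data": ["Question: What is the total number of attendees with age over 30 at kubecon \"eu\"?\nContext: CREATE TABLE attendees (name VARCHAR, age INTEGER, kubecon VARCHAR)\n\nAnswer:"]
    }
  ]
}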
And the model call command:
curl -v \
-H "Host: ${SERVICE_HOSTNAME}" \
-H "Content-Type: application/json" \
-d @./input.json \
http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/gemma-2-9b-finetuned/infer | jq
The output should be like this:
{
"model_name": "gemma-2-9b-finetuned-1",
"id": "d0d248ea-1b4b-42a0-bf96-14b95819daac",
"parameters": {
"content_type": "str"
},
"outputs": [
{
"name": "output-1",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"SELECT COUNT(*) FROM attendees WHERE age > 30 AND kubecon = \"eu\""
]
}
]
}
Clean up
To clean up, you need to go to the terraform-gke-cluster
directory and run the following command:
terraform destroy -var-file=example_environment.tfvars
Remove all permissions that you gave during this tutorial:
gcloud storage buckets remove-iam-policy-binding gs://<BUCKET_NAME> \
--role=roles/storage.objectViewer \
--member=principal://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<PROJECT_ID>.svc.id.goog/subject/ns/default/sa/my-mlflow-release-tracking \
--condition=None
gcloud storage buckets remove-iam-policy-binding gs://<BUCKET_NAME> \
--role=roles/storage.objectUser \
--member=principal://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<PROJECT_ID>.svc.id.goog/subject/ns/default/sa/default \
--condition=None
gcloud artifacts repositories remove-iam-policy-binding gemma \
--project="${PROJECT_ID}" \
--location="us" \
--member="serviceAccount:<SERVICE_ACCOUNT>" \
--role="roles/artifactregistry.reader"
Finally, delete the GCS bucket where you stored everything during this guide, as well as the gemma and gemma-deployment Artifact Registry repositories.