By Vlado Djerek

Last modified April 18, 2025

Efficient GPU Resource Management for ML Workloads using SkyPilot and Kueue on GKE

This tutorial expands on the SkyPilot Tutorial by leveraging the Dynamic Workload Scheduler through an open-source project called Kueue. Unlike the SkyPilot tutorial, this guide shows how to use SkyPilot with Kueue on GKE to efficiently manage ML workloads with dynamic GPU provisioning.

Overview

This tutorial is designed for ML Platform engineers who plan to use SkyPilot to train or serve LLMs on Google Kubernetes Engine (GKE) while utilizing the Dynamic Workload Scheduler (DWS) to acquire GPU resources as they become available. It covers installing Kueue and SkyPilot, creating a GKE cluster with GPU node pools that have queued provisioning enabled, and deploying and running an LLM. This setup improves resource efficiency and reduces cost for ML workloads through dynamic GPU provisioning.

Before you begin

  1. Ensure you have a Google Cloud project with billing enabled and the GKE API activated. Learn how to enable billing and activate the GKE API. You can use the gcloud CLI to activate the GKE API:
gcloud services enable container.googleapis.com
  2. Ensure you have the following tools installed on your workstation

Setting up your GKE cluster with Terraform

We’ll use Terraform to provision the GKE cluster:

  1. Create your environment configuration (.tfvars) file based on example_environment.tfvars and edit it.
    project_id = "skypilot-project"
    cluster_name = "skypilot-tutorial"
    autopilot_cluster = true  # Set to false for Standard cluster
    
  2. (Optional) For Standard clusters: Configure GPU node pools in your environment .tfvars file by uncommenting and adjusting the gpu_pools block as needed.
    gpu_pools = [ {
      name                = "gpu-pool"
      queued_provisioning = true
      machine_type        = "g2-standard-24"
      disk_type           = "pd-balanced"
      autoscaling         = true
      min_count           = 0
      max_count           = 3
      initial_node_count  = 0
    } ]
    

Deployment

  1. Initialize the modules
    terraform init
    
  2. Apply while referencing the .tfvars file we created
    terraform apply -var-file=your_environment.tfvars
    
    And you should see your resources created:
    Apply complete! Resources: 24 added, 0 changed, 0 destroyed.
    
    Outputs:
    
    gke_cluster_location = "us-central1"
    gke_cluster_name = "skypilot-tutorial"
    kubernetes_namespace = "ai-on-gke"
    project_id = "skypilot-project"
    service_account = "tf-gke-skypilot-tutorial@skypilot-project.iam.gserviceaccount.com"
    
  3. Get Kubernetes credentials
    gcloud container clusters get-credentials $(terraform output -raw gke_cluster_name) --region $(terraform output -raw gke_cluster_location) --project $(terraform output -raw project_id)
    
  4. To verify your GKE cluster’s version, run:
    kubectl version
    

Make sure you meet the minimum version requirements (1.30.3-gke.1451000 or later for Autopilot, 1.28.3-gke.1098000 or later for Standard):

Server Version: v1.30.6-gke.1596000

If not, you can change the version in Terraform with the kubectl_version variable.
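
For example, in your .tfvars file (the variable name comes from this tutorial’s Terraform module; the value is illustrative):

    kubectl_version = "1.30"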

Install and configure Kueue

  1. Install Kueue from the official manifest.
    Note the --server-side flag. Without it, the client cannot render the CRDs because of annotation size limitations. For more configuration options, visit Kueue’s installation guide.
    VERSION=v0.10.2
    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
    
  2. Configure Kueue for pod provisioning by patching the Kueue configmap.
    # Extract and patch the config
    # This is required because SkyPilot creates and manages workloads as pods
    kubectl -n kueue-system get cm kueue-manager-config -o jsonpath={.data.controller_manager_config\\.yaml} | yq '.integrations.frameworks += ["pod"]' > /tmp/kueueconfig.yaml
    
  3. Apply the changes
    kubectl -n kueue-system create cm kueue-manager-config --from-file=controller_manager_config.yaml=/tmp/kueueconfig.yaml --dry-run=client -o yaml | kubectl -n kueue-system apply -f -
    
  4. Restart the kueue-controller-manager deployment with the following command
    kubectl -n kueue-system rollout restart deployment kueue-controller-manager
    # Wait for the restart to complete
    kubectl -n kueue-system rollout status deployment kueue-controller-manager
    
  5. Install Kueue resources using the provided kueue_resources.yaml.
    kubectl apply -f kueue_resources.yaml
    

Kueue should be up and running now.
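
For reference, kueue_resources.yaml wires Kueue to DWS by defining a ResourceFlavor, an AdmissionCheck that points at GKE’s queued-provisioning class via a ProvisioningRequestConfig, and a ClusterQueue/LocalQueue pair. The sketch below is a minimal example in the shape described by the GKE DWS documentation, not necessarily the exact file shipped with this tutorial; the quota values are placeholders, and the queue name matches the dws-local-queue referenced later:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io   # GKE's DWS class
  managedResources:
    - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: dws-cluster-queue
spec:
  namespaceSelector: {}        # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 10000      # placeholder
            - name: "memory"
              nominalQuota: 10000Gi    # placeholder
            - name: "nvidia.com/gpu"
              nominalQuota: 10000      # placeholder
  admissionChecks:
    - dws-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: dws-local-queue        # referenced by the SkyPilot task later
spec:
  clusterQueue: dws-cluster-queue

You can confirm the queues were created with kubectl get clusterqueues,localqueues.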

Install SkyPilot

  1. Create a Python virtual environment.
    cd ~
    python -m venv skypilot-test
    cd skypilot-test
    source bin/activate
    
  2. Install SkyPilot
    pip install -U "skypilot[kubernetes]"
    # Verify the installation
    sky -v
    
  3. Find the context names
    kubectl config get-contexts
    
    # Find the context name, for example: 
    # gke_${PROJECT_NAME}_us-central1-c_demo-us-central1
    

Create the SkyPilot configuration. Add autoscaler: gke to enable SkyPilot to work with GKE’s cluster autoscaling capabilities, allowing you to run workloads without pre-provisioned GPU nodes.

# Create and edit ~/.sky/config.yaml
# Change PROJECT_NAME, LOCATION and CLUSTER_NAME
allowed_clouds:
    - kubernetes
kubernetes:
    # Use the context's name
    allowed_contexts:
    - gke_${PROJECT_NAME}_${LOCATION}_${CLUSTER_NAME}
    autoscaler: gke

And verify again:

sky check

You should see the following output:

```
Kubernetes: enabled

To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html

Note: The following clouds were disabled because they were not included in allowed_clouds in ~/.sky/config.yaml: GCP, AWS, Azure, Cudo, Fluidstack, IBM, Lambda, OCI, Paperspace, RunPod, SCP, vSphere, Cloudflare (for R2 object store)

πŸŽ‰ Enabled clouds πŸŽ‰
  βœ” Kubernetes
```

Configure and Run SkyPilot Job

For SkyPilot to create pods with the necessary pod configuration, we need to add the following config to train_dws.yaml. The annotation sets the maximum run duration for the DWS request, and provision_timeout tells SkyPilot how long to wait for DWS to provision nodes:

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        metadata:
          annotations:
            provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600"
      provision_timeout: 900

And add the labels config to the resources section:

    labels:
      kueue.x-k8s.io/queue-name: dws-local-queue
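
Putting the pieces together, a minimal train_dws.yaml could look like the sketch below (the accelerator type and run command are illustrative; the labels, annotation, and provision_timeout are the parts DWS needs):

resources:
  accelerators: L4:1
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue

run: |
  nvidia-smi   # replace with your training command

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        metadata:
          annotations:
            provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600"
      provision_timeout: 900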

Launch the workload

sky launch -c skypilot-dws train_dws.yaml

SkyPilot will wait in Launching state until the node is provisioned.

βš™οΈŽ Launching on Kubernetes.

In another terminal, you can run kubectl get pods; the pod will be in the SchedulingGated state:

NAME                     READY   STATUS            RESTARTS   AGE
skypilot-dws-00b5-head   0/1     SchedulingGated   0          44s

If you run kubectl describe provisioningrequests, the Conditions section shows what is happening with the request:

    Conditions:
    Last Transition Time:  2024-12-20T11:40:46Z
    Message:               Provisioning Request was successfully queued.
    Observed Generation:   1
    Reason:                SuccessfullyQueued
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2024-12-20T11:40:47Z
    Message:               Waiting for resources. Currently there are not enough resources available to fulfill the request.
    Observed Generation:   1
    Reason:                ResourcePoolExhausted
    Status:                False
    Type:                  Provisioned
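
To follow the request as it transitions, you can watch it:

kubectl get provisioningrequests --watch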

When the requested resources become available, the provisioningrequest reflects that in its Conditions:

    Last Transition Time:  2024-12-20T11:42:55Z
    Message:               Provisioning Request was successfully provisioned.
    Observed Generation:   1
    Reason:                Provisioned
    Status:                True
    Type:                  Provisioned

Now the workload will be running:

NAME                     READY   STATUS    RESTARTS   AGE
skypilot-dws-00b5-head   1/1     Running   0          4m49s

And once the job completes:

βœ“ Job finished (status: SUCCEEDED).

πŸ“‹ Useful Commands
Job ID: 1
β”œβ”€β”€ To cancel the job:          sky cancel skypilot-dws 1
β”œβ”€β”€ To stream job logs:         sky logs skypilot-dws 1
└── To view job queue:          sky queue skypilot-dws

Cluster name: skypilot-dws
β”œβ”€β”€ To log into the head VM:    ssh skypilot-dws
β”œβ”€β”€ To submit a job:            sky exec skypilot-dws yaml_file
β”œβ”€β”€ To stop the cluster:        sky stop skypilot-dws
└── To teardown the cluster:    sky down skypilot-dws

You can now ssh into the pod, run different workloads and experiment.
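
For example, to confirm the GPU is visible from inside the pod (the hostname comes from the SSH config entry SkyPilot generates):

ssh skypilot-dws
nvidia-smi   # run inside the pod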

Fine-tune and Serve Gemma 2B on GKE

This section details how to fine-tune Gemma 2B for SQL generation on GKE Autopilot using SkyPilot. Model artifacts are stored in a Google Cloud Storage (GCS) bucket and shared across pods using gcsfuse. The workflow separates training and serving into distinct pods, managed through finetune.yaml and serve.yaml. We’ll use two SkyPilot commands for this workflow:

  • sky launch: For running the fine-tuning job
  • sky serve: For deploying the model as a persistent service

Prerequisites

  • A GKE cluster configured with SkyPilot
  • HuggingFace account with access to Gemma model

Fine-tuning Implementation

The finetune.py script uses QLoRA with 4-bit quantization to fine-tune Gemma 2B on SQL generation tasks.
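
As a rough sketch of what such a script does (an illustration of the QLoRA pattern with transformers, bitsandbytes, and peft, not the tutorial’s exact finetune.py):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NormalFloat (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters; only these are updated during training
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # a small fraction of the total weights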

Configure GCS Storage Access

The infrastructure Terraform configuration in main.tf includes Workload Identity and GCS bucket setup:

module "skypilot-workload-identity" {
    source              = "terraform-google-modules/kubernetes-engine/google//modules/    workload-identity"
    name                = "skypilot-service-account"
    namespace           = "default"
    project_id          = var.project_id
    roles               = ["roles/storage.admin", "roles/compute.admin"]
    cluster_name = module.infra[0].cluster_name
    location = var.cluster_location
    use_existing_gcp_sa = true
    gcp_sa_name = data.google_service_account.gke_service_account.email
    use_existing_k8s_sa = true
    annotate_k8s_sa = false
}
  1. Get project and service account details
    terraform output project_id
    terraform output service_account
    
  2. Configure Workload Identity
    Run the following command to bind the Google Cloud service account created by Terraform (which has Workload Identity Federation enabled) to the Kubernetes service account so that gcsfuse can authenticate.
    gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT \
              --role roles/iam.workloadIdentityUser \
              --member "serviceAccount:PROJECT_ID.svc.id.goog[default/skypilot-service-account]"
    
    This creates a policy binding that allows the Kubernetes service account to impersonate the Google service account. Note that [default/skypilot-service-account] is the Kubernetes namespace and service account name deployed by SkyPilot by default; change it if you customized the SkyPilot configuration or used another namespace. A scripted version of steps 2 and 3 appears after this list.
  3. Annotate Kubernetes service account
    kubectl annotate serviceaccount skypilot-service-account --namespace default iam.gke.io/gcp-service-account=SERVICE_ACCOUNT
    
  4. Get the bucket name
    terraform output model_bucket_name
    
  5. Update gcsfuse configuration in finetune.yaml and serve.yaml
    Replace BUCKET_NAME with the bucket name from the previous step.
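
For convenience, steps 2 and 3 can be driven directly from the Terraform outputs. A minimal sketch, assuming you run it from the Terraform directory:

SERVICE_ACCOUNT=$(terraform output -raw service_account)
PROJECT_ID=$(terraform output -raw project_id)

# Step 2: allow the Kubernetes SA to impersonate the Google SA
gcloud iam service-accounts add-iam-policy-binding "$SERVICE_ACCOUNT" \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/skypilot-service-account]"

# Step 3: annotate the Kubernetes SA with the Google SA
kubectl annotate serviceaccount skypilot-service-account --namespace default \
    iam.gke.io/gcp-service-account="$SERVICE_ACCOUNT"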

Fine-tune the Model

  1. Set up Hugging Face access: the fine-tuning script needs a Hugging Face token, and you must sign the license consent agreement.
    Follow the instructions at the following link: Get access to the model

    export HF_TOKEN=tokenvalue
    
  2. Launch a fine-tuning job:

    sky launch -c finetune finetune.yaml --retry-until-up --env HF_TOKEN=$HF_TOKEN
    

    After fine-tuning is finished you should see the following output:

    (gemma-finetune, pid=1837) 
    100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5000/5000 [12:49<00:00,  6.50it/s]
    (gemma-finetune, pid=1837) /home/sky/miniconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    (gemma-finetune, pid=1837)   warnings.warn(
    (gemma-finetune, pid=1837) 
    (gemma-finetune, pid=1837) Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
    Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:07<00:00,  3.93s/it]
    βœ“ Job finished (status: SUCCEEDED).
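
Before serving, you can optionally confirm that the fine-tuned artifacts landed in the GCS bucket (the exact layout depends on finetune.py):

gsutil ls -r gs://$(terraform output -raw model_bucket_name)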
    

Serve the Model

Next, serve the fine-tuned model with serve.yaml and the sky serve CLI:

sky serve up serve.yaml

When the serve pods are provisioned, you should see the following output:

```
βš™οΈŽ Launching serve controller on Kubernetes.
└── Pod is up.
βœ“ Cluster launched: sky-serve-controller-00b550a3.  View logs at: ~/sky_logs/sky-2025-01-08-19-25-40-242969/provision.log
βš™οΈŽ Mounting files.
  Syncing (to 1 node): /tmp/service-task-sky-service-7e46-jyxqvkgh -> ~/.sky/serve/sky_service_7e46/task.yaml.tmp
  Syncing (to 1 node): /tmp/tmpd_sj9qpw -> ~/.sky/serve/sky_service_7e46/config.yaml
βœ“ Files synced.  View logs at: ~/sky_logs/sky-2025-01-08-19-25-40-242969/file_mounts.log
βš™οΈŽ Running setup on serve controller.
  Check & install cloud dependencies on controller: done.                                        
βœ“ Setup completed.  View logs at: ~/sky_logs/sky-2025-01-08-19-25-40-242969/setup-*.log
βš™οΈŽ Service registered.

Service name: sky-service-7e46
Endpoint URL: 35.226.190.154:30002
πŸ“‹ Useful Commands
β”œβ”€β”€ To check service status:    sky serve status sky-service-7e46 [--endpoint]
β”œβ”€β”€ To teardown the service:    sky serve down sky-service-7e46
β”œβ”€β”€ To see replica logs:        sky serve logs sky-service-7e46 [REPLICA_ID]
β”œβ”€β”€ To see load balancer logs:  sky serve logs --load-balancer sky-service-7e46
β”œβ”€β”€ To see controller logs:     sky serve logs --controller sky-service-7e46
β”œβ”€β”€ To monitor the status:      watch -n10 sky serve status sky-service-7e46
└── To send a test request:     curl 35.226.190.154:30002

βœ“ Service is spinning up and replicas will be ready shortly.
```

Check if the serving API is ready by running:

sky status

And wait for the replica STATUS to change from PROVISIONING to READY:

Services
NAME              VERSION  UPTIME   STATUS      REPLICAS  ENDPOINT              
sky-service-7e46  -        -        NO_REPLICA  0/1       35.226.190.154:30002  


Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                  LAUNCHED     RESOURCES                     STATUS        REGION                                                   
sky-service-7e46  1   1        -                         1 min ago    1x Kubernetes({'A100': 1})    PROVISIONING  gke_skypilot_project_us-central1_-skypilot-test  

After that, take the URL from ENDPOINT and use curl to prompt the served model:

curl -X POST http://SKYPILOT_ADDRESS/generate \
        -H "Content-Type: application/json" \
        -d '{ "prompt": "Question: What is the total number of attendees with age over 30 at kubecon eu? Context: CREATE TABLE attendees (name VARCHAR, age INTEGER, kubecon VARCHAR) Answer:", "top_p": 1.0, "temperature": 0, "max_tokens": 128 }' \
        | jq

And you should see the reply:

Answer: SELECT COUNT(name) FROM attendees WHERE age > 30 AND kubecon = \"kubecon eu\"\

Cleanup

  1. Remove the SkyPilot cluster and serve endpoints:
    sky down skypilot-dws
    sky down finetune
    sky serve down --all
    sky serve down --all
    
  2. Finally, destroy the provisioned infrastructure.
    terraform destroy -var-file=your_environment.tfvars
    

Troubleshooting

  1. If the Kueue installation gives the error:

    the CustomResourceDefinition "workloads.kueue.x-k8s.io" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
    

    Make sure you include the --server-side argument in the kubectl apply command when installing Kueue. If you are repeating the step, delete the previous installation first.

  2. If you get an error with the kueue-webhook-service.

    Error from server (InternalError): error when creating "kueue_resources.yaml": Internal error occurred: failed calling webhook "mresourceflavor.kb.io": failed to call webhook: Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-kueue-x-k8s-io-v1beta1-resourceflavor?timeout=10s": no endpoints available for service "kueue-webhook-service"
    

    Wait for the endpoints of the kueue-webhook-service to be populated using the kubectl wait command:

    kubectl -n kueue-system wait endpoints/kueue-webhook-service --for=jsonpath={.subsets}
    
  3. If SkyPilot refuses to start the cluster because there are no nodes that satisfy the GPU requirement:

    Task from YAML spec: train_dws.yaml
    No resource satisfying Kubernetes({'L4': 1}) on Kubernetes.
    sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'L4': 1}).
    To fix: relax or change the resource requirements.
    
    Hint: `sky show-gpus --cloud kubernetes` to list available accelerators.
          `sky check` to check the enabled clouds.
    

    Make sure you added autoscaler: gke to the sky config in the Install SkyPilot step.

  4. Permission denied when trying to write to the mounted gcsfuse volume.

    Make sure you added uid=1000,gid=1000 to the mountOptions in the task YAML file. SkyPilot uses UID and GID 1000 by default.

    volumes:
      - name: gcsfuse-test
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: MODEL_BUCKET_NAME
            mountOptions: "implicit-dirs,uid=1000,gid=1000"
    
  5. Denied by autogke-gpu-limitation

    When running sky serve on an Autopilot cluster, GKE Warden rejects the pods:

    "kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook     \"warden-validating.common-webhooks.networking.gke.io\" denied the request: GKE Warden rejected     the request because it violates one or more constraints.\nViolations details: {\"[denied by     autogke-gpu-limitation]\":[\"The toleration with key 'nvidia.com/gpu' and operator 'Exists' cannot  be specified if the pod does not request to use GPU in Autopilot.\"
    

    Update SkyPilot to version 0.8.0 or later.

    More Details

