Tutorials & Notebooks

Dynamic Resource Allocation

Learn how to use Dynamic Resource Allocation (DRA) in Kubernetes to optimize the utilization of GPUs and TPUs.

Fine-tuning

Learn how to fine-tune machine learning and AI models for your specific use cases. This section covers best practices, step-by-step guides, and practical examples to help you adapt pre-trained models to your data and tasks, improving performance and achieving better results with custom fine-tuning workflows.

Frameworks & Pipelines

Explore leading frameworks and pipelines for building, training, and deploying machine learning and AI models. This section provides overviews, best practices, and hands-on guides for integrating tools like Metaflow, MLflow, LangChain, and LlamaIndex into your AI/ML workflows, enabling efficient experiment tracking, workflow automation, and scalable model management.

GPU/TPU

Discover how to leverage GPUs and TPUs to accelerate machine learning and AI workloads. This section covers setup guides, best practices, and practical examples for utilizing GPU and TPU resources, enabling faster training, efficient inference, and scalable deployment of advanced models.

Job Schedulers

Learn how to efficiently manage and automate machine learning and AI workloads with job schedulers. This section covers popular job scheduling tools, configuration tips, and practical examples to help you orchestrate complex workflows, optimize resource utilization, and streamline large-scale model training and deployment.

Storage

Providing persistent and high-performance storage solutions for AI/ML workloads running on Google Kubernetes Engine (GKE).

Workflow orchestration

Workflow orchestration in the ai-on-gke project involves managing and automating the execution of complex, multi-step processes, primarily for AI/ML workloads on Google Kubernetes Engine (GKE).

Inference servers

Deploying and managing servers dedicated to performing inference tasks for machine learning models.

Security

Continue reading:

Deploying MCP Servers on GKE

This guide provides instructions for deploying a **Ray cluster with the Agent Development Kit (ADK)** and a **custom Model Context Protocol (MCP) server** on **Google Kubernetes Engine (GKE)**. It covers setting up the infrastructure with Terraform, containerizing and deploying the Ray Serve application, deploying a custom MCP server for real-time weather data, and finally deploying an ADK agent that uses these components. The guide also includes steps for verifying deployments and cleaning up resources.

N8n with Agent and Tool example

This tutorial shows how to deploy and use [n8n](https://n8n.io/) on Google Kubernetes Engine (GKE) to build and operate AI applications using a low-code/no-code approach.

65k-nodes-benchmark

This guide outlines the process of benchmarking a 65,000-node Google Kubernetes Engine (GKE) cluster using CPU-only machines to simulate AI workloads and evaluate the Kubernetes control plane's performance. It details how to deploy the cluster with Terraform, run diverse simulated AI workloads (including training and inference) using ClusterLoader2, and collect performance metrics to assess scalability and stability. The benchmark results provide insights into pod state transitions, scheduling throughput, and API server latency under extreme load, allowing for a comprehensive evaluation of the control plane's capabilities.

Resource Management with SkyPilot

This tutorial expands on the [SkyPilot Tutorial](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/tutorials-and-examples/skypilot) by using [Dynamic Workload Scheduler](https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler) together with the open-source project [Kueue](https://kueue.sigs.k8s.io/).

DWS

This guide provides examples of how to use Dynamic Workload Scheduler (DWS) within Google Kubernetes Engine (GKE), leveraging Kueue for queue management and resource provisioning. It includes sample configurations for Kueue queues with DWS support (dws-queue.yaml) and a sample job definition (job.yaml) that demonstrates how to request resources and set a maximum run duration using DWS.
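As a rough illustration of what such a job definition carries, the following Python sketch builds a Job manifest as a plain dictionary and prints it as JSON. It is not the repository's job.yaml: the function name `build_dws_job`, the queue name `dws-local-queue`, the container image, and the GPU count are all hypothetical. The Kueue queue label and the maximum-run-duration annotation are written as they are commonly documented for Kueue and DWS; check them against the guide's own sample files before relying on them.

```python
import json


def build_dws_job(name, queue_name, image, gpu_count, max_run_seconds):
    """Return a batch/v1 Job manifest (as a dict) intended for a Kueue queue.

    All argument values are illustrative; the label and annotation keys are
    assumptions based on Kueue/DWS documentation, not taken from job.yaml.
    """
    gpus = str(gpu_count)
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Label that tells Kueue which LocalQueue should admit the job.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
            # Annotation assumed to cap how long the provisioned capacity
            # may be used (value is a string of seconds).
            "annotations": {
                "provreq.kueue.x-k8s.io/maxRunDurationSeconds": str(max_run_seconds)
            },
        },
        "spec": {
            # Created suspended so the queue controller decides when to start it.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "resources": {
                                "requests": {"nvidia.com/gpu": gpus},
                                "limits": {"nvidia.com/gpu": gpus},
                            },
                        }
                    ],
                }
            },
        },
    }


manifest = build_dws_job(
    name="sample-job",
    queue_name="dws-local-queue",  # hypothetical queue name
    image="busybox",               # placeholder image
    gpu_count=1,
    max_run_seconds=600,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and reviewed before being applied to a cluster; `kubectl apply -f` accepts JSON manifests as well as YAML.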

Models as OCI

This project lets you download a Hugging Face model and package it as a Docker image, which can then be pushed to Google Artifact Registry for deployment or distribution. Build time can be significant for large models, so models above 10 billion parameters are not recommended. For reference, an 8B model takes roughly 35 minutes to build and push with this Cloud Build config.