This tutorial guides you through fine-tuning the Gemma 3-1B-it language model on Google Kubernetes Engine (GKE) using an L4 GPU, leveraging Parameter-Efficient Fine-Tuning (PEFT) and LoRA. It covers setting up a GKE cluster, containerizing the fine-tuning code, running the fine-tuning job, and uploading the resulting model to Hugging Face. Finally, it demonstrates how to deploy and interact with the fine-tuned model using vLLM on GKE.
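A minimal sketch of the LoRA setup this kind of tutorial revolves around, using Hugging Face PEFT; the model ID, rank, and target modules here are illustrative assumptions rather than the tutorial's exact configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_id = "google/gemma-3-1b-it"  # assumed Hugging Face model ID
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# LoRA trains small low-rank adapter matrices instead of the full weights,
# which is what makes fine-tuning feasible on a single L4 GPU.
lora_config = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows the small trainable fraction
```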
Running Flyte on GKE
This guide illustrates the deployment of Flyte on Google Kubernetes Engine (GKE) using Helm, utilizing Google Cloud Storage for scalable data storage and Cloud SQL PostgreSQL for a reliable metadata store. By the end of this tutorial, you will have a fully functional Flyte instance on GKE that offers seamless integration with the GCP ecosystem, improved resource efficiency, and cost-effectiveness.
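For context, here is a minimal Flyte workflow written with flytekit that could be registered against such a deployment; the task logic is a hypothetical placeholder:

```python
from flytekit import task, workflow

@task
def greet(name: str) -> str:
    # Placeholder task; real workloads would do data prep, training, etc.
    return f"Hello, {name}!"

@workflow
def hello_wf(name: str = "GKE") -> str:
    # Flyte builds the execution graph from these typed task calls.
    return greet(name=name)
```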
Inference servers
Deploying and managing servers dedicated to performing inference tasks for machine learning models.
Deploying a Persistent Chatbot on Google Cloud Platform with LangChain, Streamlit, and IAP
In this tutorial, you will learn how to deploy a persistent chatbot application built with LangChain and Streamlit on Google Cloud Platform (GCP), secured behind Identity-Aware Proxy (IAP).
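As a rough sketch of how the two pieces fit together, here is a minimal Streamlit chat loop backed by a LangChain chat model; the Vertex AI model name is an assumption, and any LangChain chat model would slot in the same way:

```python
import streamlit as st
from langchain_google_vertexai import ChatVertexAI

st.title("GCP Chatbot")
llm = ChatVertexAI(model_name="gemini-1.5-flash")  # assumed model choice

# Streamlit reruns this script on each interaction; chat_input returns the
# user's message once one is submitted.
if prompt := st.chat_input("Ask me anything"):
    st.chat_message("user").write(prompt)
    response = llm.invoke(prompt)  # returns an AIMessage
    st.chat_message("assistant").write(response.content)
```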
LlamaIndex in a GKE cluster
This tutorial will guide you through creating a robust Retrieval-Augmented Generation (RAG) system using LlamaIndex and deploying it on Google Kubernetes Engine (GKE).
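The core RAG loop in LlamaIndex looks roughly like the sketch below; the data directory is a hypothetical placeholder, and it assumes default embedding and LLM settings:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data/").load_data()  # ingest local docs
index = VectorStoreIndex.from_documents(documents)      # embed and index them

# Queries retrieve relevant chunks and pass them to the LLM as context.
query_engine = index.as_query_engine()
print(query_engine.query("What do these documents cover?"))
```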
Fine-Tuning Gemma 2-9B on GKE using Metaflow and Argo Workflows
This tutorial provides instructions for deploying and using the Metaflow framework on Google Kubernetes Engine (GKE) and running AI/ML workloads with Argo Workflows.
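For orientation, here is a minimal Metaflow flow of the kind such a setup runs; the flow name and steps are hypothetical, and with Metaflow configured for Argo Workflows a flow like this can typically be deployed with `python flow.py argo-workflows create`:

```python
from metaflow import FlowSpec, step

class FineTuneFlow(FlowSpec):  # hypothetical flow name
    @step
    def start(self):
        self.message = "prepare data here"
        self.next(self.train)

    @step
    def train(self):
        # Real flows would launch the fine-tuning job in this step.
        print(self.message, "-> run fine-tuning here")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    FineTuneFlow()
```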
Fine-tune gemma-2-9b and track it as an experiment in MLflow
In this tutorial we will fine-tune gemma-2-9b with LoRA and track the run as an MLflow experiment. We will deploy MLflow on a GKE cluster and configure it to store artifacts in a GCS bucket. Finally, we will deploy the fine-tuned model using KServe.
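A minimal sketch of the tracking side, assuming a hypothetical in-cluster MLflow service URL and placeholder parameters and metrics:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # assumed GKE service URL
mlflow.set_experiment("gemma-2-9b-lora")

with mlflow.start_run():
    mlflow.log_params({"base_model": "google/gemma-2-9b", "lora_r": 8})
    mlflow.log_metric("train_loss", 1.23, step=1)  # placeholder value
    # With GCS configured as the artifact store, logged files land in the bucket.
    mlflow.log_artifact("adapter_config.json")     # placeholder file
```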
SkyPilot
In this tutorial, we will demonstrate how to leverage the open-source software SkyPilot to help GKE customers efficiently obtain accelerators across regions, ensuring workload continuity and optimized resource utilization.
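SkyPilot also exposes a Python API; a minimal sketch of requesting an accelerator that may be scheduled wherever capacity exists (the resource spec, command, and cluster name are illustrative):

```python
import sky

# Describe the workload and the accelerator it needs; SkyPilot finds capacity.
task = sky.Task(run="python train.py")
task.set_resources(sky.Resources(accelerators="L4:1"))

# Provision a cluster and run the task on it.
sky.launch(task, cluster_name="gemma-ft")
```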
vLLM GPU/TPU Fungibility
This tutorial shows you how to serve a large language model (LLM) using both Tensor Processing Units (TPUs) and GPUs on Google Kubernetes Engine (GKE) with the same vLLM deployment.
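For a feel of the serving layer, here is a minimal offline-inference sketch with vLLM; the model ID is illustrative, and whether it runs on GPUs or TPUs depends on the vLLM build and the node pool it lands on:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")  # assumed model ID
params = SamplingParams(temperature=0.7, max_tokens=64)

# generate() batches prompts through vLLM's paged-attention engine.
outputs = llm.generate(["What is GKE?"], params)
print(outputs[0].outputs[0].text)
```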