GPU/TPU

Discover how to leverage GPUs and TPUs to accelerate machine learning and AI workloads. This section covers setup guides, best practices, and practical examples for utilizing GPU and TPU resources, enabling faster training, efficient inference, and scalable deployment of advanced models.

Using TPUs with KubeRay on GKE

This guide provides instructions for deploying and managing Ray custom resources on Google Kubernetes Engine (GKE) with TPUs. It details how to install the KubeRay TPU webhook, an admission webhook which bootstraps required environment variables for TPU initialization and enables atomic scheduling of multi-host TPU workers on GKE nodepools. This guide also provides a sample workload to verify proper TPU initialization and links to more advanced workloads to run with TPUs and Ray on GKE.

vLLM GPU/TPU Fungibility

This tutorial shows you who to serve a large language model (LLM) using both Tensor Processing Units (TPUs) and GPUs on Google Kubernetes Engine (GKE) using the same deployment with vLLM

Continue reading:

Hugging Face to GCS

This guide provides instructions for how to hydrate GCS buckets with models from Hugging Face with a Kubernetes Job.

Fine-tuning Gemma 3-1B-it on L4

This tutorial guides you through fine-tuning the Gemma 3-1B-it language model on Google Kubernetes Engine (GKE) using L4 GPU, leveraging Parameter Efficient Fine Tuning (PEFT) and LoRA. It covers setting up a GKE cluster, containerizing the fine-tuning code, running the fine-tuning job, and uploading the resulting model to Hugging Face. Finally, it demonstrates how to deploy and interact with the fine-tuned model using vLLM on GKE.

Deploying MCP Servers on GKE

This guide provides instructions for deploying a **Ray cluster with the AI Device Kit (ADK)** and a **custom Model Context Protocol (MCP) server** on **Google Kubernetes Engine (GKE)**. It covers setting up the infrastructure with Terraform, containerizing and deploying the Ray Serve application, deploying a custom MCP server for real-time weather data, and finally deploying an ADK agent that utilizes these components. The guide also includes steps for verifying deployments and cleaning up resources.

Jupyter

This guide details how to deploy JupyterHub on Google Kubernetes Engine (GKE) using a provided Terraform template, including options for persistent storage and Identity-Aware Proxy (IAP) for secure access. It covers the necessary prerequisites, configuration steps, and installation process, emphasizing the use of Terraform for automation and IAP for authentication. The guide also provides instructions for accessing JupyterHub, setting up user access, and running an example notebook.

Llamaindex

This tutorial will guide you through creating a robust Retrieval-Augmented Generation (RAG) system using LlamaIndex and deploying it on Google Kubernetes Engine (GKE).

RAG

This tutorial demonstrates how to deploy a Retrieval Augmented Generation (RAG) application on Google Kubernetes Engine (GKE), integrating a Hugging Face TGI inference server, a Cloud SQL pgvector database, and a Ray cluster for generating vector embeddings. It walks you through setting up the infrastructure with Terraform, populating the vector database with embeddings from a sample dataset using a Jupyter notebook, and launching a frontend chat interface. The guide also covers optional configurations like using your own cluster or VPC, enabling authenticated access via IAP, and troubleshooting common issues.