Dynamic Resource Allocation

Learn how to use Dynamic Resource Allocation (DRA) in Kubernetes to optimize the utilization of GPUs and TPUs.

Dynamic Resource Allocation (DRA) is a Kubernetes feature designed to modernize how workloads request and share specialized hardware, such as GPUs and other attached accelerators. By providing an experience similar to how Kubernetes handles storage, DRA allows developers to claim the exact hardware they need without getting bogged down in the manual complexities of per-node device management.

Why DRA Matters

Historically, Kubernetes managed accelerators through the static Device Plugin model, which treated hardware as simple integer counts (e.g., “1 GPU”) and required platform teams to pre-configure rigid, dedicated node pools for every hardware variant or sharing configuration.

DRA shifts this paradigm by enabling:

Storage-Like Claims: Workloads use ResourceClaims to dynamically request hardware, decoupling the application requirements from the underlying node configuration.
Infrastructure Flexibility: The same physical hardware pool can be dynamically partitioned or shared (using Time-Slicing, MPS, or MIG) on the fly, depending on active workload requests.
Declarative Scheduling Constraints: Developers can use CEL (Common Expression Language) selectors to request specific hardware attributes (like memory sizes or interconnect topologies), ensuring the scheduler automatically matches the application with the most suitable equipment.
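As a rough illustration of the claim-based model described above, the following Python sketch builds a ResourceClaim with a CEL selector, plus a Pod that references it, and prints both as JSON that could be passed to `kubectl apply -f -`. It assumes the `resource.k8s.io/v1` API shape (older clusters expose beta versions with a different request layout). The driver name `driver.example.com`, the device class `example-gpu-class`, the `memory` capacity key, and the container image are hypothetical placeholders; the real values depend on the DRA driver installed in your cluster.

```python
import json

# Hypothetical names for illustration only; use the values published by
# the DRA driver that is actually installed in your cluster.
DRIVER = "driver.example.com"
DEVICE_CLASS = "example-gpu-class"


def resource_claim(name, min_memory):
    """Build a ResourceClaim asking for one device with at least min_memory."""
    # CEL expression evaluated against each candidate device. The "memory"
    # capacity key is an assumption; drivers define their own attributes.
    expression = (
        f'device.capacity["{DRIVER}"].memory'
        f'.compareTo(quantity("{min_memory}")) >= 0'
    )
    return {
        "apiVersion": "resource.k8s.io/v1",  # assumes the GA DRA API
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [
                    {
                        "name": "gpu",
                        "exactly": {
                            "deviceClassName": DEVICE_CLASS,
                            "selectors": [{"cel": {"expression": expression}}],
                        },
                    }
                ]
            }
        },
    }


def pod_using_claim(pod_name, claim_name, image):
    """Build a Pod whose single container consumes the named ResourceClaim."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            # Pod-level entry binds a local name ("gpu") to the claim object.
            "resourceClaims": [{"name": "gpu", "resourceClaimName": claim_name}],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Container-level entry refers to the pod-level local name.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
        },
    }


claim = resource_claim("single-gpu", "40Gi")
pod = pod_using_claim("dra-demo", "single-gpu", "registry.example.com/app:latest")

# A v1 List lets both objects be applied in one kubectl invocation.
manifests = {"apiVersion": "v1", "kind": "List", "items": [claim, pod]}
print(json.dumps(manifests, indent=2))
```

The workload states only what it needs (one device with at least 40Gi of memory), and the scheduler is left to find a node and device that satisfy the selector, in contrast to requesting a fixed integer count from a pre-configured node pool.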

Ultimately, DRA empowers developers to build high-performance applications more efficiently by providing a consistent, self-service, and scalable way to leverage specialized infrastructure across the entire cluster.

Resources

To learn more about the concepts, specifications, and architecture of DRA, refer to the official documentation:

GPU fungibility with DRA and Custom Compute Classes

This tutorial guides you through how to achieve GPU fungibility on GKE using Custom Compute Classes and Dynamic Resource Allocation (DRA).

Time slicing of GPUs with DRA

This tutorial guides you through how to do device sharing of NVIDIA GPUs with Dynamic Resource Allocation on Google Kubernetes Engine (GKE) using time-slicing mode.

Device Sharing of GPUs with DRA using Multi-Process Service (MPS)

This tutorial guides you through how to do device sharing of NVIDIA GPUs with Dynamic Resource Allocation on Google Kubernetes Engine (GKE) using Multi-Process Service (MPS) mode.

Device sharing of GPUs with DRA using MIG

This tutorial guides you through how to do device sharing of NVIDIA GPUs with Dynamic Resource Allocation on Google Kubernetes Engine (GKE) using Multi-Instance GPU (MIG) mode.

Device alignment of GPU, NIC, and CPU with DRA

Learn how to achieve optimal hardware alignment for GPUs, NICs, and exclusive CPUs on GKE using Dynamic Resource Allocation (DRA) to maximize performance.

Dynamic Resource Allocation of TPUs

This tutorial guides you through how to dynamically allocate Google Cloud TPUs using Dynamic Resource Allocation (DRA) on Google Kubernetes Engine (GKE).

Continue reading:

Resource Management with SkyPilot

This tutorial expands on the [SkyPilot Tutorial](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/tutorials-and-examples/skypilot) by using [Dynamic Workload Scheduler](https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler) together with the open-source project [Kueue](https://kueue.sigs.k8s.io/).

Deploying MCP Servers on GKE

This guide provides instructions for deploying a **Ray cluster with the Agent Development Kit (ADK)** and a **custom Model Context Protocol (MCP) server** on **Google Kubernetes Engine (GKE)**. It covers setting up the infrastructure with Terraform, containerizing and deploying the Ray Serve application, deploying a custom MCP server for real-time weather data, and finally deploying an ADK agent that uses these components. The guide also includes steps for verifying deployments and cleaning up resources.

Models as OCI

This project allows you to download a Hugging Face model and package it as a Docker image. The Docker image can then be pushed to Google Artifact Registry for deployment or distribution. Build time can be significant for large models, so it is recommended not to use models larger than 10 billion parameters. For reference, an 8B model takes roughly 35 minutes to build and push with this Cloud Build config.

KServe

This tutorial will guide you step by step through the process of installing KServe in a GKE Autopilot cluster.

LangChain Chatbot

In this tutorial, you will learn how to deploy a chatbot application using [LangChain](https://python.langchain.com/) and [Streamlit](https://streamlit.io/) on Google Cloud Platform (GCP).

Ray

This guide provides instructions and examples for deploying and managing Ray clusters on Google Kubernetes Engine (GKE) using KubeRay and Terraform. It covers setting up a GKE cluster, deploying a Ray cluster, submitting Ray jobs, and using the Ray Client for interactive sessions. The guide also points to various resources, including tutorials, best practices, and examples for running different types of Ray applications on GKE, such as serving LLMs, using TPUs, and integrating with GCS.