Time slicing of GPUs with DRA
Background
This tutorial shows you how to share NVIDIA GPU devices among workloads using Dynamic Resource Allocation (DRA) on Google Kubernetes Engine (GKE). This guide covers time slicing; separate guides cover the other device sharing modes, Multi-Instance GPU (MIG) and Multi-Process Service (MPS).
GPU Device Sharing Modes
When sharing a single physical GPU among multiple containers or pods, you typically choose from three primary mechanisms. Here is a quick summary of how they work and their tradeoffs:
- Time Slicing (Time Sharing):
  - How it works: The GPU’s compute resources are shared in time. The GPU context switches between different workloads.
  - Pros: Simple to configure, works on almost all modern GPUs, and has no memory overhead.
  - Cons: No memory isolation (one workload can consume all memory and OOM the other) and potential latency impact if one workload hogs the GPU.
  - Best for: Development, testing, or workloads with low and bursty utilization where hard isolation is not required.
- Multi-Process Service (MPS):
  - How it works: Multiple processes share the GPU compute and memory simultaneously (spatial sharing).
  - Pros: Higher utilization and lower overhead than time slicing. It allows concurrent execution of kernels from different processes.
  - Cons: Limited isolation compared to MIG. Memory limits can be set but are not as strictly enforced at the hardware level as with MIG.
  - Best for: Cooperative workloads, like running multiple inference servers that trust each other and benefit from high throughput.
- Multi-Instance GPU (MIG):
  - How it works: The GPU is physically partitioned into separate “instances,” each with its own dedicated compute and memory resources.
  - Pros: Hard isolation, guaranteed QoS (Quality of Service), and complete memory separation.
  - Cons: Rigid partition sizes and requires specific high-end GPUs (Ampere architecture or newer, e.g., A100, H100).
  - Best for: Production workloads requiring strict isolation, predictable performance, and multi-tenancy security.
Why Choose DRA for GPU Sharing?
While GPU sharing (Time-Slicing, MPS, and MIG) is available through both the standard GKE GPU Device Plugin and Dynamic Resource Allocation (DRA), DRA offers significant advantages in flexibility and management.
The Limitations of the Device Plugin
In the traditional Device Plugin model, GPU sharing is “baked into” the node pool. You must specify the sharing mode and the specific partitions or clients (e.g., specific MIG profiles like 1g.5gb) when you create the node pool. This creates several challenges:
- Infrastructure Rigidity: If a new workload requires a different partition size or sharing mode, you must create a new node pool.
- Resource Waste: A node pool configured for a specific sharing strategy cannot easily adapt to workloads that need a full, non-shared GPU without wasting the other configured slots.
- Manual Labeling: Developers must know which node labels correspond to which sharing modes and manually add nodeSelector entries to their manifests.
The DRA Advantage: Dynamic, Per-GPU Sharing
DRA shifts the sharing configuration to the workload manifest. Instead of targeting a pre-configured node pool, the pod requests a GPU with specific sharing parameters in its ResourceClaim.
- Dynamic Granularity: You can request Time-Slicing for a dev workload and MIG or MPS for a production workload, and GKE will dynamically allocate and configure the GPUs to match these requests.
- No Fixed Node Pools: You don’t need to pre-build node pools for every possible sharing ratio or partition size. The same underlying hardware can be partitioned differently for each claim.
- Developer-Centric: Developers define their resource needs directly in their YAML, and GKE handles the infrastructure orchestration to meet those needs.
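As a concrete illustration of this flexibility, the two claims sketched below could coexist in the same cluster: the first shares a GPU through time slicing, while the second requests a dedicated, unshared GPU. The claim names are made up for this example, and the sketch is illustrative only; the actual time-slicing claim used in this tutorial appears later in this guide.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-dev-gpu   # hypothetical name for illustration
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
    config:
    - requests: ["gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: dedicated-prod-gpu   # hypothetical name for illustration
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com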
This transition from infrastructure-level configuration to workload-level requests makes DRA a more flexible, scalable, and efficient solution for multi-tenant GPU environments.
Let’s get started and explore how to share GPUs with time slicing.
Prepare the Environment
To set up your environment with Cloud Shell, follow these steps:
- In the Google Cloud console, click the Activate Cloud Shell icon to launch a session in the bottom pane.
- Set the default environment variables:
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=gpu-vllm-timeslicing
export LOCATION=us-central1 # Choose a region that has NVIDIA A100 GPUs available
export ZONE=us-central1-c # Choose a zone within the region that has A100 GPUs available. Look at https://cloud.google.com/compute/docs/gpus/gpu-regions-zones for availability.
export HF_TOKEN=HUGGING_FACE_TOKEN # Replace with your actual Hugging Face token
export CLUSTER_VERSION="1.35.2-gke.1269001" # Must be 1.34 or later
export NAMESPACE=default
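Optionally, confirm that the project variables resolved before creating any resources:
echo "Project: ${PROJECT_ID} (${PROJECT_NUMBER}), cluster: ${CLUSTER_NAME}, zone: ${ZONE}"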
Create and configure Google Cloud Resources
Create a GKE Cluster
gcloud container clusters create $CLUSTER_NAME \
--location=$LOCATION \
--cluster-version=$CLUSTER_VERSION \
--project=$PROJECT_ID \
--num-nodes=1 \
--labels=created-by=ai-on-gke,guide=gpu-time-slicing
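Cluster creation takes a few minutes. Optionally, confirm that the control plane is running the requested version (1.34 or later is required for DRA):
gcloud container clusters describe ${CLUSTER_NAME} --location=${LOCATION} --format="value(currentMasterVersion)"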
Create a node pool with A100 GPUs
This creates a node pool with a single machine that has one A100 GPU. We disable the default GPU Device Plugin and automatic GPU driver installation, since we will install the driver ourselves and use the NVIDIA GPU DRA driver instead. We request Spot capacity here to increase the chance of getting a node quickly.
gcloud container node-pools create a100-pool \
--cluster=${CLUSTER_NAME} \
--location=${LOCATION} \
--node-locations=${ZONE} \
--machine-type="a2-highgpu-1g" \
--accelerator="type=nvidia-tesla-a100,count=1,gpu-driver-version=disabled" \
--num-nodes=1 \
--node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true,cloud.google.com/gke-nvidia-gpu-dra-driver=true \
--spot
Configure Kubectl to communicate with your cluster
To configure kubectl to communicate with your cluster, run the following command:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${LOCATION}
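Once the credentials are configured, you can confirm that the GPU node registered with the labels that disable the default device plugin and enable the DRA driver. The node pool label used in the selector below is applied automatically by GKE:
kubectl get nodes -l cloud.google.com/gke-nodepool=a100-pool --show-labels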
Create Kubernetes Secret for Hugging Face credentials
Make sure you have accepted the model license terms on Hugging Face for the google/gemma-3-1b-it model before proceeding. Your Hugging Face token must have access to this model.
To create a Kubernetes Secret that contains the Hugging Face token, run the following command:
kubectl create secret generic hf-secret --from-literal=hf_api_token=${HF_TOKEN} --namespace=${NAMESPACE}
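You can verify that the Secret was created:
kubectl get secret hf-secret --namespace=${NAMESPACE}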
Install the NVIDIA GPU driver
Since we set gpu-driver-version=disabled when creating the node pool, GKE does not install the GPU driver automatically, so we need to install the NVIDIA GPU driver manually.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
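Driver installation runs as a DaemonSet on the GPU node and can take a few minutes. You can watch its progress; the selector below assumes the manifest's default k8s-app=nvidia-driver-installer label:
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer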
Install the NVIDIA GPU DRA driver
We install the NVIDIA GPU DRA driver using a Helm chart. Make sure you have Helm installed; if not, follow the Helm documentation to install it. Time slicing is still a beta feature in the NVIDIA GPU DRA driver, so we need to enable it by setting featureGates.TimeSlicingSettings=true.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.12.0" --create-namespace --namespace=nvidia-dra-driver-gpu \
--set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
--set gpuResourcesEnabledOverride=true \
--set resources.computeDomains.enabled=false \
--set featureGates.TimeSlicingSettings=true \
--set kubeletPlugin.priorityClassName="" \
--set 'kubeletPlugin.tolerations[0].operator=Exists' # Needed to ensure the driver can run on tainted GPU nodes
Verify that the NVIDIA GPU DRA driver is working
Check that the NVIDIA GPU DRA driver is installed and working by inspecting the driver pod:
kubectl -n nvidia-dra-driver-gpu get pods
The pod should be in a Running state. If not, you can inspect the logs with:
kubectl -n nvidia-dra-driver-gpu logs -l app.kubernetes.io/name=nvidia-dra-driver-gpu -c gpus
Verify that the driver has published a ResourceSlice object that lists the GPU on the node. It might take a minute or two after installation for the driver to fully initialize and publish the ResourceSlice:
kubectl get resourceslices -o yaml
You should see the description of the GPU:
spec:
  driver: gpu.nvidia.com
  nodeName: gke-gpu-vllm-timeslicing-a100-pool-efc3ca7a-0w1g
  pool:
    generation: 1
    name: gke-gpu-vllm-timeslicing-a100-pool-efc3ca7a-0w1g
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      addressingMode:
        string: None
      architecture:
        string: Ampere
      brand:
        string: Nvidia
      cudaComputeCapability:
        version: 8.0.0
      cudaDriverVersion:
        version: 13.0.0
      driverVersion:
        version: 580.126.9
      productName:
        string: NVIDIA A100-SXM4-40GB
      resource.kubernetes.io/pciBusID:
        string: "0000:00:04.0"
      resource.kubernetes.io/pcieRoot:
        string: pci0000:00
      type:
        string: gpu
      uuid:
        string: GPU-83d857e2-7326-75a9-c355-4270d0605409
    capacity:
      memory:
        value: 40Gi
Create the DRA ResourceClaim
We will use a single GPU that is shared between two vllm replicas. Therefore we create a single ResourceClaim that will be referenced from the Pod spec in the Deployment, rather than using a ResourceClaimTemplate. By having multiple pods reference the exact same ResourceClaim, we explicitly tell Kubernetes to allocate the same GPU to all of those pods.
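For contrast, a ResourceClaimTemplate would create a separate claim, and therefore allocate a separate GPU, for each pod that references it. The sketch below is not used in this tutorial, and the name gpu-claim-template is purely illustrative:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template   # hypothetical, shown only for contrast
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
Because we want both replicas to land on the same physical GPU, we use a single shared ResourceClaim instead.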
Inspect the following claim.yaml. We use interval: Long in the time slicing configuration to set a long time slice interval, which is suitable for inference workloads to minimize context switching overhead.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
    config:
    - requests: ["gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
Apply the manifest
kubectl apply -f claim.yaml --namespace=${NAMESPACE}
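At this point the claim exists but no GPU has been allocated yet; allocation and time-slicing configuration happen when the first pod that references the claim is scheduled. You can check the claim with:
kubectl get resourceclaim gpu-claim --namespace=${NAMESPACE}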
Deploy the vllm workload
We are using the Gemma 3 models because they are small enough to run this tutorial on hardware (i.e., GPUs) that is more readily available than what would be needed for larger foundation models.
We create a Deployment that runs two replicas of vllm. We reference the ResourceClaim gpu-claim in the pod spec, so both pods use the same claim.
Notice the new DRA syntax in the Pod spec: spec.resourceClaims defines the claim to be used, and resources.claims in the container spec links the container to that defined claim.
We set the gpu-memory-utilization parameter for vllm to 0.42, which means that each
replica will use 42% of the GPU memory. This prevents the first replica from consuming
the entire GPU, leaving memory available for the second replica.
Inspect the following vllm.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-gpu
  template:
    metadata:
      labels:
        app: vllm-gpu
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      resourceClaims:
      - name: gpu
        resourceClaimName: gpu-claim
      containers:
      - name: vllm-gpu
        image: vllm/vllm-openai:v0.7.2
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --host=0.0.0.0
        - --port=8000
        - --model=google/gemma-3-1b-it
        - --gpu-memory-utilization=0.42
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        - containerPort: 8000
        resources:
          claims:
          - name: gpu
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
        volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-gpu
  type: LoadBalancer
  ports:
  - name: http
    protocol: TCP
    port: 8000
    targetPort: 8000
Apply the manifest
kubectl apply -f vllm.yaml --namespace=${NAMESPACE}
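Both replicas should reach the Running state. Because they reference the same ResourceClaim, they will be scheduled onto the same GPU node; you can confirm this with:
kubectl get pods -l app=vllm-gpu -o wide --namespace=${NAMESPACE}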
View the logs from the running model servers:
kubectl logs -l app=vllm-gpu --prefix -f
You should see something like this from both replicas:
(APIServer pid=1) INFO 04-25 21:39:25 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
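You can also confirm that both pods consume the same claim by inspecting its status; the status.reservedFor list of the ResourceClaim should show both vllm pods:
kubectl get resourceclaim gpu-claim -o yaml --namespace=${NAMESPACE}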
Generate traffic to the model
We will send requests to the model servers and then use the logs to verify that we are getting responses from both replicas.
First, get the external IP of the service.
Provisioning the external IP for the vllm-service LoadBalancer may take a few minutes. If the export command fails or returns empty, wait a moment and try again.
export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' -n ${NAMESPACE})
Send a series of requests to the model servers. We use a loop so that traffic hits both replicas.
for i in {1..5}; do
curl http://$vllm_service:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-1b-it",
"prompt": "Write a story about san francisco",
"max_tokens": 100,
"temperature": 0
}'
done
The output from both replicas should contain lines similar to this:
(APIServer pid=1) INFO: 10.0.3.1:47463 - "POST /v1/completions HTTP/1.1" 200 OK
Despite having only a single GPU, we were able to run two replicas of vllm because we enabled time slicing in the NVIDIA GPU DRA driver.
Understanding the DRA Benefit
This tutorial demonstrated one of the key benefits of using Dynamic Resource Allocation (DRA) for GPU sharing. With the traditional Device Plugin, you would have had to provision a node pool statically configured with time slicing enabled.
With DRA, we were able to run multiple replicas of a workload sharing the same physical A100 GPU without any node-level reconfiguration. Kubernetes and the NVIDIA DRA driver dynamically fulfilled the ResourceClaim requests defined in the workload manifests. This provides significant flexibility, maximizes hardware utilization, and allows developers to request exactly what they need at the application level.
Clean up
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command to delete the cluster:
gcloud container clusters delete ${CLUSTER_NAME} \
--location=${LOCATION}