Building Agents with Agent Development Kit (ADK) on GKE using Ray Serve for Self-Hosted LLMs
This tutorial demonstrates how to deploy the Llama-3.1-8B-Instruct model on Google Kubernetes Engine (GKE) using Ray Serve and vLLM for efficient inference. It also shows how to integrate an ADK agent that interacts with the model, supporting both basic chat completions and tool usage. The setup uses a GKE Autopilot cluster with GPU-enabled nodes to handle the computational requirements.
By the end of this tutorial, you will:
- Set up a GKE Autopilot cluster with GPU-enabled nodes.
- Deploy the Llama-3.1-8B-Instruct model using Ray Serve and vLLM.
- Deploy an ADK agent that communicates with the Ray Serve endpoint.
- Test the setup with basic chat completion and tool usage scenarios.
- Monitor and debug the deployment using Ray Dashboard and Kubernetes logs.
Prerequisites
- A terminal with kubectl, helm, gcloud, and terraform installed.
- A Hugging Face account with a token that has Read permission to access the Llama-3.1-8B-Instruct model.
- Sufficient GPU quota in your Google Cloud project. See About GPUs and Allocation quotas.
- Access to the code repository: https://github.com/ai-on-gke/tutorials-and-examples
Filesystem structure
ray-serve/
├── adk_agent/                 # ADK agent deployment and Cloud Build configuration
│   ├── example_agent/         # Agent code
│   │   ├── __init__.py
│   │   ├── agent-deployment.yaml
│   │   └── agent.py
│   ├── cloudbuild.yml
│   ├── Dockerfile
│   ├── main.py
│   └── requirements.txt
├── ray-serve-vllm/            # Ray image Cloud Build and service manifest
│   ├── cloudbuild.yml
│   ├── Dockerfile
│   ├── ray-service.yaml
│   ├── requirements.txt
│   └── serve-chat-completion.py
└── terraform/
    ├── artifact_registry.tf
    ├── default_env.tfvars
    ├── main.tf
    ├── network.tf
    ├── outputs.tf
    ├── providers.tf
    ├── variables.tf
    └── workload_identity.tf
Step 1: Set Up the Infrastructure with Terraform
Start by setting up the GKE cluster, service account, IAM roles, and Artifact Registry using Terraform.
Download the code and navigate to the tutorial directory:
git clone https://github.com/ai-on-gke/tutorials-and-examples.git
cd tutorials-and-examples/ray-serve/terraform
Set the environment variables, replacing <my-project-id> and <MY_HF_TOKEN> with your values:
gcloud config set project <my-project-id>
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export HF_TOKEN=<MY_HF_TOKEN>
export CLUSTER_NAME=llama-ray-cluster
Update default_env.tfvars with your own Google Cloud project ID.
Initialize Terraform and apply the configuration:
terraform init
terraform apply --var-file=./default_env.tfvars
Review the plan and type yes to confirm. This will create:
- A GKE Autopilot cluster named llama-ray-cluster.
- A service account adk-ray-agent-sa.
- An IAM role binding granting the service account roles/artifactregistry.reader.
- An Artifact Registry repository llama-ray.
Configure kubectl
to communicate with the cluster:
gcloud container clusters get-credentials $CLUSTER_NAME --region=$REGION --project $PROJECT_ID
Create a Kubernetes secret for the Hugging Face token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=$HF_TOKEN \
--dry-run=client -o yaml | kubectl apply -f -
Step 2: Containerize and Deploy the Ray Serve Application
Build and push the container image for the Ray Serve application to Artifact Registry.
Navigate to the ray-serve-vllm directory:
cd ../ray-serve-vllm
Build and push the container image:
gcloud builds submit . \
--project=$PROJECT_ID \
--region=$REGION
Deploy the Ray Serve application:
- Open the ray-service.yaml file in the ray-serve-vllm directory.
- Replace <PROJECT_ID> with your Google Cloud project name.
- Install the KubeRay operator via Helm:

# Add the Helm repo
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Confirm the repo exists
helm search repo kuberay --devel
# Install both CRDs and KubeRay operator v1.3.2
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.2
# Check the KubeRay operator Pod in the default namespace
kubectl get pods
# NAME                                READY   STATUS    RESTARTS   AGE
# kuberay-operator-6fcbb94f64-mbfnr   1/1     Running   0          17s

- Apply the manifest:

kubectl apply -f ray-service.yaml
Verify the deployment:
- Check the Ray service status:

kubectl get rayservice llama-31-8b -o yaml

The serviceStatus should be running.

- Check the RayCluster:

kubectl get raycluster

You should see output similar to this:

NAME                           DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
llama-31-8b-raycluster-qgzmk   1                 1                   10     33Gi     2      ready    58m
- Check the pods:

kubectl get pods

You should see two new pods: one Ray head Pod and one Ray worker Pod. Worker Pods run the LLM and scale out as request volume grows; the head Pod handles cluster management and should stay at a single replica.

NAME                                                  READY   STATUS    RESTARTS   AGE
kuberay-operator-bb8d4d9c4-7h2vg                      1/1     Running   0          9m33s
llama-31-8b-raycluster-w9jzw-gpu-group-worker-9l5zx   1/1     Running   0          9m7s
llama-31-8b-raycluster-w9jzw-head-45rx4               1/1     Running   0          9m8s
- Check the ConfigMap for the chat template:

kubectl get configmaps

This should display output similar to:

NAME                   DATA   AGE
kube-root-ca.crt       1      4h47m
llama-chat-templates   1      3h18m
Step 3: Test the Ray Serve Deployment
Test the model by setting up port forwarding and sending requests.
Set up port forwarding to the Ray Serve endpoint:
kubectl port-forward service/llama-31-8b-serve-svc 8000:8000
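Test: Basic Chat Completion

With port forwarding in place, you can first exercise a plain chat completion against the OpenAI-compatible endpoint. Here is a minimal Python sketch using only the standard library; the URL assumes the port-forward above is running, and the prompt text is illustrative:

```python
import json
from urllib.request import Request, urlopen

# OpenAI-compatible chat completion payload for the Ray Serve endpoint.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}

req = Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the port-forward is active:
# with urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

The same request can of course be sent with curl, as shown for the tool usage test that follows.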
Test: Tool Usage Capability
Send a weather query that requires tool usage:
curl http://127.0.0.1:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "What is the weather in New York today?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name"
}
},
"required": ["city"]
}
}
}
]
}'
The response should include a tool call to get_weather with the parameter city: "New York", and the finish_reason should be "tool_calls". The response body should look something like this:
{"id":"chatcmpl-5de32dca-853e-44c2-9f92-388a0ab954c7","object":"chat.completion","created":1747236975,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":null,"tool_calls":[{"id":"chatcmpl-tool-8dd7be12aefe48db82c95485622202cb","type":"function","function":{"name":"get_weather","arguments":"{\"city\": \"New York\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","stop_reason":128008}],"usage":{"prompt_tokens":254,"total_tokens":273,"completion_tokens":19,"prompt_tokens_details":null},"prompt_logprobs":null}
Step 4: Deploy the ADK Agent
Navigate to the ADK agent directory:
cd ../adk_agent
Build and push the ADK agent container image:
gcloud builds submit . \
--project=${PROJECT_ID} \
--region=$REGION
Update the ./example-agent/agent-deployment.yaml file (replace any project-specific placeholders, such as the container image path), then apply it:
kubectl apply -f ./example-agent/agent-deployment.yaml
Verify the deployment:
- Check the pods:

kubectl get pods

You should see three pods: the two Ray pods and the ADK agent pod.

NAME                                                  READY   STATUS    RESTARTS   AGE
adk-agent-6ddffd9b5b-m2txm                            1/1     Running   0          2m56s
kuberay-operator-bb8d4d9c4-7h2vg                      1/1     Running   0          14m
llama-31-8b-raycluster-w9jzw-gpu-group-worker-9l5zx   1/1     Running   0          14m
llama-31-8b-raycluster-w9jzw-head-45rx4               1/1     Running   0          14m
- Check the services:

kubectl get services

You should see six services, including the ADK agent LoadBalancer service.

NAME                                    TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                         AGE
adk-agent                               LoadBalancer   34.118.231.166   34.72.141.248   80:30319/TCP                                    2m4s
kuberay-operator                        ClusterIP      34.118.227.98    <none>          8080/TCP                                        14m
kubernetes                              ClusterIP      34.118.224.1    <none>          443/TCP                                         4h53m
llama-31-8b-head-svc                    ClusterIP      None             <none>          10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   95s
llama-31-8b-raycluster-w9jzw-head-svc   ClusterIP      None             <none>          10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   13m
llama-31-8b-serve-svc                   ClusterIP      34.118.226.99    <none>          8000/TCP                                        95s
Get the external IP of the ADK agent LoadBalancer and store it for the following steps:
export EXTERNAL_IP=$(kubectl get service adk-agent -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
Step 5: Test the ADK Agent
Send requests directly to the ADK agent LoadBalancer external IP.
Test: Example ADK Agent
The ADK agent framework uses sessions to manage state and context. The /run endpoint expects a session_id in the request payload.
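The two-step flow described above (create a session, then call /run with its id) can be sketched as a small Python client. The BASE address and the helper names are illustrative, not part of the ADK API:

```python
import json
from urllib.request import Request, urlopen

BASE = "http://EXTERNAL_IP:80"  # replace with your LoadBalancer IP
APP, USER = "example-agent", "user1"

def run_payload(session_id: str, text: str) -> dict:
    """Build the request body that /run expects for a given session."""
    return {
        "app_name": APP,
        "user_id": USER,
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}], "role": "user"},
    }

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)

def ask(text: str) -> list:
    # Step 1: create a session; /run needs its id.
    session = post_json(f"/apps/{APP}/users/{USER}/sessions", {})
    # Step 2: send the message within that session.
    return post_json("/run", run_payload(session["id"], text))

# events = ask("What is the weather like in Seattle?")
```

The curl commands below perform the same two steps by hand.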
- Create a session for a test user:

curl http://$EXTERNAL_IP:80/apps/example-agent/users/user1/sessions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{}'
The response is expected to be something like this:
{"id":"836c4e74-5ee3-4bc8-96c2-94d77f079e34","app_name":"example-agent","user_id":"user1","state":{},"events":[],"last_update_time":1747352210.0}
- Copy the session id and send the example request:

curl http://$EXTERNAL_IP:80/run \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "app_name": "example-agent",
    "user_id": "user1",
    "session_id": "836c4e74-5ee3-4bc8-96c2-94d77f079e34",
    "new_message": {
      "parts": [{"text": "What is the weather like in Seattle?"}],
      "role": "user"
    }
  }'
The response should be similar to the Ray Serve test. Example:
[{"content":{"parts":[{"functionCall":{"id":"chatcmpl-tool-63bff6965a04437eb9e16aa8e8e4786b","args":{"city":"Seattle"},"name":"get_current_weather"}}],"role":"model"},"partial":false,"invocation_id":"e-18006854-6c17-4d27-8a8f-342bd252106a","author":"weather_agent","actions":{"state_delta":{},"artifact_delta":{},"requested_auth_configs":{}},"long_running_tool_ids":[],"id":"YOqORD5M","timestamp":1747352916.229682},{"content":{"parts":[{"functionResponse":{"id":"chatcmpl-tool-63bff6965a04437eb9e16aa8e8e4786b","name":"get_current_weather","response":{"result":"The weather in Seattle is currently 12°C with rainy conditions."}}}],"role":"user"},"invocation_id":"e-18006854-6c17-4d27-8a8f-342bd252106a","author":"weather_agent","actions":{"state_delta":{},"artifact_delta":{},"requested_auth_configs":{}},"id":"x9zqpTMd","timestamp":1747352917.060611},{"content":{"parts":[{"text":"The function call returns the current weather in Seattle."}],"role":"model"},"partial":false,"invocation_id":"e-18006854-6c17-4d27-8a8f-342bd252106a","author":"weather_agent","actions":{"state_delta":{},"artifact_delta":{},"requested_auth_configs":{}},"id":"CbXReduJ","timestamp":1747352917.067558}]
The request path is: client → adk-agent LoadBalancer Service → ADK agent Pod → llama-31-8b-serve-svc → Ray Serve/vLLM on the Ray worker.
Step 6: Monitor and Debug
Use the Ray Dashboard for observability:
kubectl port-forward <raycluster-head-pod> 8265:8265
Access the dashboard at http://127.0.0.1:8265 to view logs and metrics and to debug Ray applications. This gives you a visual representation of how your Ray cluster is performing.
GPU Resource Utilization
The deployment uses one vLLM instance (num_replicas: 1) with tensor parallelism (TENSOR_PARALLELISM: 2), requiring 2 GPUs. The GKE cluster provides those 2 GPUs via one g2-standard-24 node (workerGroupSpecs.replicas: 1), so the Ray worker Pod uses its node's GPUs fully.
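The arithmetic behind the GPU requirement can be made explicit; the variable names below mirror the settings just mentioned:

```python
# Values from ray-service.yaml in this tutorial.
num_replicas = 1        # Ray Serve deployment replicas (num_replicas)
tensor_parallelism = 2  # GPUs per vLLM replica (TENSOR_PARALLELISM)

required_gpus = num_replicas * tensor_parallelism
print(required_gpus)  # 2, matching the GPUs on one g2-standard-24 node
```

Scaling num_replicas up would multiply the GPU requirement accordingly, which in turn requires more worker nodes.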
Clean Up
To avoid incurring charges, delete the GKE cluster and associated resources:
cd ../terraform
terraform destroy --var-file=./default_env.tfvars
Review the destruction plan and type yes to confirm.