# Enable Model Armor for vLLM deployment with Inference Gateway
## Overview

This guide shows how to secure LLM models hosted on a vLLM server on GKE by setting up Model Armor on top of the GKE Inference Gateway.
## Prepare the Terraform config directory

1. Clone the repository (if needed):

   ```bash
   git clone https://github.com/ai-on-gke/tutorials-and-examples.git
   ```

2. Change to the Model Armor tutorial directory:

   ```bash
   cd tutorials-and-examples/security/model-armor
   ```
## Prepare the cluster

This guide assumes you already have a GKE cluster serving a self-hosted vLLM deployment. If you don't, you can use one of our existing tutorials to create one, for example the ADK on vLLM tutorial.
### Install CRDs

1. To install the `InferencePool` and `InferenceModel` Custom Resource Definitions (CRDs) in your GKE cluster, run the following command:

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.5.1/manifests.yaml
   ```
> **Warning:** If you are using a GKE version earlier than `v1.32.2-gke.1182001` and you want to use Model Armor with GKE Inference Gateway, you must also install the traffic and routing extension CRDs:
>
> ```bash
> kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcptrafficextensions.yaml
> kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcproutingextensions.yaml
> ```
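Before moving on, you can sanity-check that the CRDs are registered (this check is an addition to the original steps):

```bash
# List any inference-related CRDs installed by the manifests above.
kubectl get crd | grep -i inference
```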
### [Optional] Expose Inference Gateway metrics

For better observability, the Inference Gateway deployment exposes its metrics through the `/metrics` endpoint, but RBAC has to be set up in order to access them. For more information, refer to the Inference Gateway Metrics & Observability documentation.
1. Specify the proper namespace for some of the RBAC resources. For an Autopilot cluster (assumed by default) the namespace has to be `gke-gmp-system`; for a Standard cluster it is `gmp-system`:

   ```bash
   export COLLECTOR_CLUSTER_ROLE_NAMESPACE="gke-gmp-system"
   ```
2. Create the RBAC resources:

   ```bash
   kubectl apply -f - <<EOF
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: inference-gateway-metrics-reader
   rules:
   - nonResourceURLs:
     - /metrics
     verbs:
     - get
   ---
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: inference-gateway-sa-metrics-reader
     namespace: default
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: inference-gateway-sa-metrics-reader-role-binding
     namespace: default
   subjects:
   - kind: ServiceAccount
     name: inference-gateway-sa-metrics-reader
     namespace: default
   roleRef:
     kind: ClusterRole
     name: inference-gateway-metrics-reader
     apiGroup: rbac.authorization.k8s.io
   ---
   apiVersion: v1
   kind: Secret
   metadata:
     name: inference-gateway-sa-metrics-reader-secret
     namespace: default
     annotations:
       kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
   type: kubernetes.io/service-account-token
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: inference-gateway-sa-metrics-reader-secret-read
   rules:
   - resources:
     - secrets
     apiGroups: [""]
     verbs: ["get", "list", "watch"]
     resourceNames: ["inference-gateway-sa-metrics-reader-secret"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: ${COLLECTOR_CLUSTER_ROLE_NAMESPACE}:collector:inference-gateway-sa-metrics-reader-secret-read
     namespace: default
   roleRef:
     name: inference-gateway-sa-metrics-reader-secret-read
     kind: ClusterRole
     apiGroup: rbac.authorization.k8s.io
   subjects:
   - name: collector
     namespace: ${COLLECTOR_CLUSTER_ROLE_NAMESPACE}
     kind: ServiceAccount
   EOF
   ```
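Once the resources are created, one way to sanity-check access is to read the generated token and call the `/metrics` endpoint with it. This is a sketch, not part of the original steps; the port-forward target and port are placeholders, so substitute the actual Inference Gateway pod or service in your cluster:

```bash
# Read the service account token from the secret created above.
TOKEN=$(kubectl get secret inference-gateway-sa-metrics-reader-secret -n default \
  -o jsonpath='{.data.token}' | base64 --decode)

# Port-forward to the Inference Gateway pod (name and port are placeholders) and scrape metrics.
kubectl port-forward <INFERENCE_GATEWAY_POD> 9090:9090 &
curl -H "Authorization: Bearer ${TOKEN}" http://localhost:9090/metrics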
## Deploy vLLM server with models

If you already have a vLLM server deployment in your cluster, you can skip this section; otherwise, use it as an example.

In this example we use the manifest from the GKE documentation with the base Llama3.2 model and two LoRA adapters:

- `food-review`
- `cad-fabricator`
1. Create a Kubernetes Secret to store your Hugging Face token. This token is used to access the base model and LoRA adapters. Replace the `<YOUR_HF_TOKEN>` placeholder with your Hugging Face token:

   ```bash
   kubectl create secret generic hf-token --from-literal=token=<YOUR_HF_TOKEN>
   ```
2. Apply the manifest that defines a Kubernetes Deployment with your model and model server and uses the `nvidia-l4` accelerator type:

   ```bash
   kubectl apply -f vllm-sample/vllm-llama3-8b-instruct.yaml
   ```
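The model server can take a while to pull images and download weights. One way to wait for it, assuming the Deployment from the sample manifest is named `vllm-llama3-8b-instruct` (check the actual name with `kubectl get deployments`):

```bash
# Block until the Deployment reports Available (the name is an assumption from the sample manifest).
kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct --timeout=20m
```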
## Set up Model Armor through Inference Gateway

### Kubernetes resources overview

Besides infrastructure objects, the Terraform config also creates some Kubernetes resources. The manifests for these resources are also generated by Terraform and located in the `gen` folder.
| Name | Description | Useful links |
|---|---|---|
| InferencePool Helm chart | Helm chart that creates an `InferencePool` object that references an existing vLLM deployment by using pod selectors. | Google Inference Gateway docs; the Helm chart repo |
| Gateway | Serves as an entry point for external traffic into the cluster. It defines the listeners that accept incoming connections. Manifest file: `gen/gateway.yaml` | Google Inference Gateway docs |
| HTTPRoute | Defines how the Gateway routes incoming HTTP requests to backend services, which in this context is the previously mentioned `InferencePool`. Manifest file: `gen/http-route.yaml` | Google Inference Gateway docs |
| GCPTrafficExtension | GKE's custom resource to create a Service Extension with the Model Armor chain. | Customize GKE Gateway traffic using Service Extensions; Configure a traffic extension to call the Model Armor service |
### Prepare a tfvars file

The file `terraform/example.tfvars` already has pre-defined variables for an example setup. To apply this setup, specify the following three variables and leave the others as-is:

- `project_id` - The project ID.
- `cluster_name` - Name of the target cluster.
- `cluster_location` - Location of the target cluster.
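For example, the top of your tfvars file might look like this (the values below are placeholders, not real settings):

```hcl
# Placeholder values -- replace with your own project, cluster, and location.
project_id       = "my-gcp-project"
cluster_name     = "my-vllm-cluster"
cluster_location = "us-central1"
```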
This example uses the models deployed in the Deploy vLLM server with models section. If you have your own vLLM server deployment, make sure to correctly set the following variables:
| Variable | Description |
|---|---|
| `inference_pool_name` | Name of the InferencePool to create. |
| `inference_pool_match_labels` | Selector labels for the InferencePool. Pods with matching labels will be taken under control by the InferencePool. |
| `inference_pool_target_port` | Port of the vLLM server in the vLLM deployment pods. |
| `inference_models` | List of models to be accessible from the InferencePool. |
| `model_armor_templates` | List of Model Armor templates to create. |
| `gcp_traffic_extension_model_armor_settings` | List of settings that link the models defined in the `inference_models` list with the Model Armor templates defined in the `model_armor_templates` list. |
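To make the relationship between these variables concrete, here is a sketch of how they might fit together. The field names inside the list entries are hypothetical; the authoritative schema is in `terraform/variables.tf` and the pre-filled `terraform/example.tfvars`:

```hcl
# Hypothetical shapes for illustration only -- consult variables.tf for the real schema.
inference_pool_name         = "vllm-llama3-8b-instruct"
inference_pool_match_labels = { app = "vllm-llama3-8b-instruct" } # must match your vLLM pod labels
inference_pool_target_port  = 8000                                # vLLM's default serving port

inference_models      = [{ name = "food-review" }, { name = "cad-fabricator" }]
model_armor_templates = [{ name = "basic-sanitization" }] # hypothetical template name

# Link a model to a template; models without an entry here stay unprotected.
gcp_traffic_extension_model_armor_settings = [
  { model = "food-review", model_armor_template = "basic-sanitization" },
]
```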
#### IP address

By default, Terraform reserves a new external static IP address. You can use an already existing address by specifying the following variables:

```hcl
create_ip_address = false
ip_address_name   = "<NAME_OF_EXISTING_IP_ADDRESS>"
```
Make sure the region of your IP matches the region of your cluster.
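If you are not sure which addresses already exist, you can list the reserved addresses in your project (an extra check, not part of the original steps; the filter expression may need adjusting for your setup):

```bash
# List static addresses reserved in the cluster's region.
gcloud compute addresses list --filter="region:<CLUSTER_REGION>"
```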
#### TLS encryption

This guide uses Certificate Manager to store and manage TLS certificates.

By default, TLS encryption is not enabled. It can be enabled by specifying the following variables:

```hcl
use_tls = true
domain  = "<YOUR_DOMAIN>"
```

The `domain` variable is a domain name under your control. When TLS is enabled, all requests to your model go through this domain name rather than the IP address.
The certificate itself can be configured in two ways:

- A new certificate created by Terraform:

  ```hcl
  create_tls_certificate = true
  ```

- An existing certificate:

  ```hcl
  create_tls_certificate = false
  tls_certificate_name   = "<EXISTING_CERTIFICATE_NAME>"
  ```
When using an existing certificate, make sure its region matches the region of your cluster.
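A quick way to confirm an existing certificate's location and state is the Certificate Manager CLI (this check is an addition to the original steps; drop the `--location` flag for global certificates):

```bash
# Inspect the existing certificate; its location should match your cluster's region.
gcloud certificate-manager certificates describe <EXISTING_CERTIFICATE_NAME> \
  --location=<CLUSTER_REGION>
```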
For information about other variables, refer to the `variables.tf` file.
### Applying the Terraform config

1. Change directory to `terraform/`:

   ```bash
   cd terraform
   ```

2. Initialize the Terraform config:

   ```bash
   terraform init
   ```

3. Apply the Terraform config with the tfvars file you prepared:

   ```bash
   terraform apply -var-file values.tfvars
   ```
4. All created resources still have to be initialized, so endpoints may respond with errors for some time. Retry a request until it succeeds (or use the loop sketch after this list); for example, request the models list:

   ```bash
   curl $(terraform output -raw url)/v1/models
   ```
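If you prefer not to retry by hand, a small shell loop (an addition to the original steps) works as well:

```bash
# Poll the models endpoint until it returns a successful response.
until curl -fsS "$(terraform output -raw url)/v1/models"; do
  echo "Endpoint not ready yet, retrying in 15s..."
  sleep 15
done
```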
## Set up DNS records for the domain name

This section is required only when TLS is enabled with a certificate created by Terraform (vars: `use_tls=true` and `create_tls_certificate=true`).
### Create a DNS CNAME record

Alongside the certificate resource itself, Terraform also creates a DNS Authorization resource that is responsible for proving ownership of the domain. When created, this resource has values that have to be specified in your domain's `CNAME` record.
1. Fetch the value of the `tls_certificate_dns_authorize_record_name` output from Terraform and specify it as the host (or name) field of a CNAME record for your domain:

   ```bash
   terraform output tls_certificate_dns_authorize_record_name
   ```

   Usually the format accepted by the domain provider is a subdomain prefix; for example, for `_acme-challenge_rbhqaerljlysefh4.example.com` the required prefix is `_acme-challenge_rbhqaerljlysefh4`.
2. Fetch the value of the `tls_certificate_dns_authorize_record_data` output from Terraform and specify it as the data (or value) field of the same CNAME record:

   ```bash
   terraform output tls_certificate_dns_authorize_record_data
   ```
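DNS propagation can take some time. Assuming `dig` is available, you can check when the record becomes visible (this check is not part of the original steps):

```bash
# The CNAME should resolve to the record data returned by Terraform.
dig +short CNAME "$(terraform output -raw tls_certificate_dns_authorize_record_name)"
```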
### Create a DNS A record

If you also created a new IP address (var: `create_ip_address=true`), make sure that it is also specified in your domain name's `A` record. You can get the IP address by fetching the Terraform output `ip_address`:

```bash
terraform output ip_address
```
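As with the CNAME record, you can verify the A record once it propagates (again assuming `dig` is available):

```bash
# The answer should match the address from `terraform output ip_address`.
dig +short A <YOUR_DOMAIN>
```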
## Testing

1. In this example, the Model Armor template is applied only to the `food-review` model, so first try a malicious prompt on a model that is not protected by Model Armor, `cad-fabricator`:

   ```bash
   curl -i -X POST $(terraform output -raw url)/v1/completions \
     -H 'Content-Type: application/json' \
     -d '{
       "model": "cad-fabricator",
       "prompt": "Can you remember my ITIN: 123-45-6789",
       "max_tokens": 1000,
       "temperature": 0
     }'
   ```

   Since there is no protection by Model Armor, the response code is `200`.
2. Now try prompting a protected model:

   ```bash
   curl -i -X POST $(terraform output -raw url)/v1/completions \
     -H 'Content-Type: application/json' \
     -d '{
       "model": "food-review",
       "prompt": "Can you remember my ITIN: 123-45-6789",
       "max_tokens": 1000,
       "temperature": 0
     }'
   ```

   This time the response should be `403`:

   ```
   HTTP/2 403
   content-length: 87
   content-type: text/plain
   date: Mon, 04 Aug 2025 05:40:18 GMT
   via: 1.1 google

   {"error":{"type":"bad_request_error","message":"Malicious trial","param":"","code":""}}
   ```
## Cleanup

```bash
terraform destroy -var-file values.tfvars
```