KServe
KServe is a highly scalable, standards-based platform for model inference on Kubernetes. Installing KServe on GKE Autopilot can be challenging due to the security policies enforced by Autopilot. This tutorial will guide you step by step through the process of installing KServe in a GKE Autopilot cluster.
Additionally, this tutorial includes an example of serving Gemma2 with vLLM in KServe, demonstrating how to utilize GPU resources in KServe on Google Kubernetes Engine (GKE).
Before you begin
- Ensure you have a GCP project with billing enabled and the GKE API enabled.
- Ensure you have the following tools installed on your workstation: the gcloud CLI, kubectl, and Helm.
Set up your GKE Cluster
- Set the default environment variables:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export CLUSTER_NAME=kserve-demo
- Create a GKE Autopilot cluster:
gcloud container clusters create-auto ${CLUSTER_NAME} \
  --location=$REGION \
  --project=$PROJECT_ID \
  --workload-policies=allow-net-admin \
  --labels=created-by=ai-on-gke,guide=kserve

# Get credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --region ${REGION} \
  --project ${PROJECT_ID}
If you’re using an existing cluster, ensure it is updated to allow net admin permissions. This is necessary for the installation of Istio later on:
gcloud container clusters update ${CLUSTER_NAME} \
  --region=${REGION} \
  --project=$PROJECT_ID \
  --workload-policies=allow-net-admin
Install KServe
KServe relies on Knative and requires a networking layer. In this tutorial, we will use Istio, the networking layer that integrates best with Knative.
You will see warnings that Autopilot mutated the CRDs during this tutorial. These warnings are safe to ignore.
- Install Knative:
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-core.yaml
- Install Istio:
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
kubectl create namespace istio-system
helm install istio-base istio/base -n istio-system --set defaultRevision=default
helm install istiod istio/istiod -n istio-system --wait
helm install istio-ingressgateway istio/gateway -n istio-system
- Verify the installation:
kubectl get deployments -n istio-system
You should see something similar to the following output:
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
istio-ingressgateway   1/1     1            1           17h
istiod                 1/1     1            1           20h
- Install Knative-Istio:
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.15.1/net-istio.yaml
- Verify the installation:
kubectl get pods -n knative-serving
You should see something similar to the following output:
NAME                                    READY   STATUS    RESTARTS      AGE
activator-749cf94f87-b7p9n              1/1     Running   0             17m
autoscaler-5c764b5f7d-m8zvk             1/1     Running   1 (14m ago)   17m
controller-5649f5bbb7-wvlmk             1/1     Running   4 (13m ago)   17m
net-istio-controller-7f8dfbddb7-d8cmq   1/1     Running   0             18s
net-istio-webhook-54ffc96585-cpgfl      2/2     Running   0             18s
webhook-64c67b4fc-smdtl                 1/1     Running   3 (13m ago)   17m
- Install DNS. In this tutorial, we use Magic DNS; to configure a real DNS instead, follow the steps here.
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-default-domain.yaml
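Magic DNS configures a default sslip.io domain based on the ingress gateway's external IP. To confirm the domain was set (assuming, as in upstream Knative, that the default-domain job writes it to the config-domain ConfigMap):

kubectl get configmap config-domain -n knative-serving -o jsonpath='{.data}'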
- Install cert-manager, which is required to provision webhook certs for a production-grade installation:
helm repo add jetstack https://charts.jetstack.io --force-update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.3 \
  --set crds.enabled=true \
  --set global.leaderElection.namespace=cert-manager
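Before continuing, you can check that the cert-manager pods are running in the namespace created by the install above:

kubectl get pods -n cert-manager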
- Install KServe and the KServe cluster serving runtimes:
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve.yaml

# Wait until kserve-controller-manager is ready
kubectl rollout status deployment kserve-controller-manager -n kserve

# Install cluster runtimes
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve-cluster-resources.yaml

# View these runtimes
kubectl get ClusterServingRuntimes -n kserve
- To request accelerators (GPUs) for your Google Kubernetes Engine (GKE) Autopilot workloads, the manifest uses a nodeSelector. We therefore enable the nodeSelector and tolerations feature flags in Knative, which are disabled by default:
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-nodeselector":"enabled", "kubernetes.podspec-tolerations":"enabled"}}'
Restart the Knative webhook to pick up the new configuration, for example:
kubectl get pods -n knative-serving

# Find the webhook pod and delete it to restart the pod.
kubectl delete pod webhook-64c67b4fc-nmzwt -n knative-serving
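Alternatively, assuming the webhook runs as a Deployment named webhook (as in the verification output earlier), you can restart it in one step without looking up the pod name:

kubectl rollout restart deployment webhook -n knative-serving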
After successfully installing KServe, you can explore various examples, such as a first inference service, canary rollout, inference batcher, and autoscaling. In the next step, we'll demonstrate how to deploy Gemma2 using vLLM in KServe with GKE Autopilot.
Deploy Gemma2 served with vLLM
- Generate a Hugging Face access token by following these steps. Specify a Name of your choice and a Role of at least Read.
- Make sure you have accepted the terms of use for Gemma2 on Hugging Face.
- Create a Kubernetes secret with the Hugging Face token:
kubectl create namespace kserve-test

# Specify your Hugging Face token.
export HF_TOKEN=XXX

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: kserve-test
type: Opaque
stringData:
  hf_api_token: ${HF_TOKEN}
EOF
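You can verify that the secret exists before creating the inference service:

kubectl get secret hf-secret -n kserve-test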
- Create the inference service:
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-gemma2
  namespace: kserve-test
spec:
  predictor:
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-l4
      cloud.google.com/gke-accelerator-count: "1"
    model:
      modelFormat:
        name: huggingface
      args:
        - --enable_docs_url=True
        - --model_name=gemma2
        - --model_id=google/gemma-2-2b
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
Wait for the service to be ready:
kubectl get inferenceservice huggingface-gemma2 -n kserve-test
kubectl get pods -n kserve-test

# Replace POD_NAME with the correct pod name.
kubectl events --for pod/POD_NAME -n kserve-test --watch
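If you prefer to block until the service reports ready rather than polling, here is a sketch using kubectl wait (assuming the InferenceService exposes the standard Ready condition):

kubectl wait --for=condition=Ready inferenceservice/huggingface-gemma2 \
  -n kserve-test --timeout=15m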
Test the Inference Service
- Find the URL returned by kubectl get inferenceservice:
URL=$(kubectl get inferenceservice huggingface-gemma2 -n kserve-test -o jsonpath='{.status.url}')
echo $URL
The URL should look like this:
http://huggingface-gemma2.kserve-test.34.121.87.225.sslip.io
- Open the Swagger UI at $URL/docs.
- Try the OpenAI chat API with the example input below. Click Execute to see the response.
{ "model": "gemma2", "messages": [ { "role": "system", "content": "You are an assistant that speaks like Shakespeare." }, { "role": "user", "content": "Write a poem about colors" } ], "max_tokens": 30, "stream": false }
Clean up
Delete the GKE cluster.
gcloud container clusters delete ${CLUSTER_NAME} \
--location=$REGION \
  --project=$PROJECT_ID
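If you want to keep the cluster and only remove the demo workload, you can instead delete the InferenceService and its namespace:

kubectl delete inferenceservice huggingface-gemma2 -n kserve-test
kubectl delete namespace kserve-test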