KServe

KServe is a highly scalable, standards-based platform for model inference on Kubernetes. Installing KServe on GKE Autopilot can be challenging due to the security policies enforced by Autopilot. This tutorial will guide you step by step through the process of installing KServe in a GKE Autopilot cluster.

Additionally, this tutorial includes an example of serving Gemma2 with vLLM in KServe, demonstrating how to utilize GPU resources in KServe on Google Kubernetes Engine (GKE).

Before you begin

  1. Ensure you have a Google Cloud project with billing enabled and the GKE API enabled.
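
    If the GKE API is not yet enabled, you can enable it with the gcloud CLI:

    gcloud services enable container.googleapis.com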

  2. Ensure you have the following tools installed on your workstation: the gcloud CLI, kubectl, and helm. All three are used throughout this tutorial.
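
    You can verify that the tools are available with:

    gcloud version
    kubectl version --client
    helm version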

Set up your GKE Cluster

  1. Set the default environment variables:

    export PROJECT_ID=$(gcloud config get project)
    export REGION=us-central1
    export CLUSTER_NAME=kserve-demo
    
  2. Create a GKE Autopilot cluster:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
      --location=$REGION \
      --project=$PROJECT_ID \
      --workload-policies=allow-net-admin \
      --labels=created-by=ai-on-gke,guide=kserve
    
    # Get credentials
    gcloud container clusters get-credentials ${CLUSTER_NAME} \
     --region ${REGION} \
     --project ${PROJECT_ID}
    

    If you’re using an existing cluster, ensure it is updated to allow net admin permissions. This is necessary for the installation of Istio later on:

    gcloud container clusters update ${CLUSTER_NAME} \
     --region=${REGION} \
     --project=$PROJECT_ID \
     --workload-policies=allow-net-admin
    

Install KServe

KServe relies on Knative and requires a networking layer. In this tutorial, we will use Istio, the networking layer that integrates best with Knative.

You will see warnings that Autopilot mutated the CRDs during this tutorial. These warnings are safe to ignore.

  1. Install Knative

    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-crds.yaml
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-core.yaml
    
  2. Install Istio

    helm repo add istio https://istio-release.storage.googleapis.com/charts
    helm repo update
    kubectl create namespace istio-system
    helm install istio-base istio/base -n istio-system --set defaultRevision=default
    helm install istiod istio/istiod -n istio-system --wait
    helm install istio-ingressgateway istio/gateway -n istio-system
    
  3. Verify the installation

    kubectl get deployments -n istio-system
    

    You should see something similar to the following output:

    NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
    istio-ingressgateway   1/1     1            1           17h
    istiod                 1/1     1            1           20h
    
  4. Install Knative-Istio

    kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.15.1/net-istio.yaml
    
  5. Verify the installation

    kubectl get pods -n knative-serving
    

    You should see something similar to the following output:

    NAME                                    READY   STATUS    RESTARTS      AGE
    activator-749cf94f87-b7p9n              1/1     Running   0             17m
    autoscaler-5c764b5f7d-m8zvk             1/1     Running   1 (14m ago)   17m
    controller-5649f5bbb7-wvlmk             1/1     Running   4 (13m ago)   17m
    net-istio-controller-7f8dfbddb7-d8cmq   1/1     Running   0             18s
    net-istio-webhook-54ffc96585-cpgfl      2/2     Running   0             18s
    webhook-64c67b4fc-smdtl                 1/1     Running   3 (13m ago)   17m
    
  6. Install DNS. In this tutorial, we use Magic DNS (sslip.io). To configure a real DNS instead, follow the steps in the Knative documentation.

    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-default-domain.yaml
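    
    # Optional: verify the configured default domain (it should contain sslip.io).
    kubectl get configmap config-domain -n knative-serving -o jsonpath='{.data}'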
    
  7. Install cert-manager, which is required to provision webhook certificates for a production-grade installation.

    helm repo add jetstack https://charts.jetstack.io --force-update
    helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.3 --set crds.enabled=true --set global.leaderElection.namespace=cert-manager
    
  8. Install KServe and the KServe cluster serving runtimes

    kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve.yaml
    
    # Wait until kserve-controller-manager is ready
    kubectl rollout status deployment kserve-controller-manager -n kserve
    
    # Install cluster runtimes
    kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve-cluster-resources.yaml
    
    # View these runtimes
    kubectl get ClusterServingRuntimes -n kserve
    
  9. To request accelerators (GPUs) for your Google Kubernetes Engine (GKE) Autopilot workloads, the manifest uses a nodeSelector. Therefore, we will enable the nodeSelector and tolerations features in Knative, which are disabled by default.

    kubectl patch configmap/config-features \
      --namespace knative-serving \
      --type merge \
      --patch '{"data":{"kubernetes.podspec-nodeselector":"enabled", "kubernetes.podspec-tolerations":"enabled"}}'
    

    Restart the Knative webhook so it picks up the new configuration. For example:

    kubectl get pods -n knative-serving
    
    # Find the webhook pod and delete it to restart the pod.
    kubectl delete pod webhook-64c67b4fc-nmzwt -n knative-serving
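    
    # Alternatively, restart via the Deployment (the webhook Deployment is named "webhook"):
    kubectl rollout restart deployment webhook -n knative-serving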
    

After successfully installing KServe, you can explore examples such as your first InferenceService, canary rollout, inference batcher, and autoscaling. In the next step, we'll demonstrate how to deploy Gemma2 using vLLM in KServe with GKE Autopilot.

Deploy Gemma2 served with vLLM

  1. Generate a Hugging Face access token by following these steps. Specify a Name of your choice and a Role of at least Read.

  2. Make sure you have accepted the terms of use for Gemma2 on Hugging Face.

  3. Create a namespace and a Kubernetes Secret containing the Hugging Face token

    kubectl create namespace kserve-test
    
    # Specify your hugging face token.
    export HF_TOKEN=XXX
    
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
        name: hf-secret
        namespace: kserve-test
    type: Opaque
    stringData:
        hf_api_token: ${HF_TOKEN}
    EOF
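    
    # Confirm the Secret was created.
    kubectl get secret hf-secret -n kserve-test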
    
  4. Create the inference service

    kubectl apply -f - <<EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-gemma2
      namespace: kserve-test
    spec:
      predictor:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-l4
          cloud.google.com/gke-accelerator-count: "1"
        model:
          modelFormat:
            name: huggingface
          args:
            - --enable_docs_url=True
            - --model_name=gemma2
            - --model_id=google/gemma-2-2b
          env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
          resources:
            limits:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
    EOF
    

    Wait for the service to be ready:

    kubectl get inferenceservice huggingface-gemma2 -n kserve-test
    kubectl get pods -n kserve-test
    
    # Replace POD_NAME with the name of the predictor pod.
    kubectl events --for pod/POD_NAME -n kserve-test --watch
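    
    # Optionally, block until the InferenceService reports Ready
    # (pulling the image and downloading the model can take several minutes).
    kubectl wait --for=condition=Ready inferenceservice/huggingface-gemma2 \
      -n kserve-test --timeout=30m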
    

Test the Inference Service

  1. Find the URL returned in kubectl get inferenceservice

    URL=$(kubectl get inferenceservice huggingface-gemma2 -n kserve-test -o jsonpath='{.status.url}')
    echo $URL
    

    The URL should look like this:

    http://huggingface-gemma2.kserve-test.34.121.87.225.sslip.io
    
  2. Open the Swagger UI at $URL/docs

  3. Try the OpenAI chat API with the example input below. Click Execute to see the response.

    {
        "model": "gemma2",
        "messages": [
            {
                "role": "system",
                "content": "You are an assistant that speaks like Shakespeare."
            },
            {
                "role": "user",
                "content": "Write a poem about colors"
            }
        ],
        "max_tokens": 30,
        "stream": false
    }
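    

    You can also call the endpoint from the command line. The sketch below assumes the OpenAI-compatible route is exposed under /openai/v1, as in recent KServe Hugging Face runtime releases:

    curl -s ${URL}/openai/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gemma2",
        "messages": [{"role": "user", "content": "Write a poem about colors"}],
        "max_tokens": 30,
        "stream": false
      }'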
    

Clean up

Delete the GKE cluster.

gcloud container clusters delete ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID
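
Alternatively, to keep the cluster and delete only the resources created in this tutorial:

kubectl delete inferenceservice huggingface-gemma2 -n kserve-test
kubectl delete namespace kserve-test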
