KServe

KServe is a highly scalable, standards-based platform for model inference on Kubernetes. Installing KServe on GKE Autopilot can be challenging due to the security policies enforced by Autopilot. This tutorial will guide you step by step through the process of installing KServe in a GKE Autopilot cluster.

Additionally, this tutorial includes an example of serving Gemma2 with vLLM in KServe, demonstrating how to utilize GPU resources in KServe on Google Kubernetes Engine (GKE).

Before you begin

  1. Ensure you have a Google Cloud project with billing enabled and the GKE API enabled.
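
    If the GKE API is not yet enabled, you can enable it with the gcloud CLI:

    gcloud services enable container.googleapis.com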

  2. Ensure you have the following tools installed on your workstation: the gcloud CLI, kubectl, and helm. All three are used throughout this tutorial.
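
    You can verify that the tools are available with:

    gcloud version
    kubectl version --client
    helm version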

Set up your GKE Cluster

  1. Set the default environment variables:

    export PROJECT_ID=$(gcloud config get project)
    export REGION=us-central1
    export CLUSTER_NAME=kserve-demo
    
  2. Create a GKE Autopilot cluster:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
      --location=$REGION \
      --project=$PROJECT_ID \
      --workload-policies=allow-net-admin \
      --labels=created-by=ai-on-gke,guide=kserve
    
    # Get credentials
    gcloud container clusters get-credentials ${CLUSTER_NAME} \
     --region ${REGION} \
     --project ${PROJECT_ID}
    

    If you’re using an existing cluster, ensure it is updated to allow net admin permissions. This is necessary for the installation of Istio later on:

    gcloud container clusters update ${CLUSTER_NAME} \
     --region=${REGION} \
     --project=$PROJECT_ID \
     --workload-policies=allow-net-admin
    

Install KServe

KServe relies on Knative and requires a networking layer. In this tutorial, we will use Istio, the networking layer that integrates best with Knative.

You will see warnings that Autopilot mutated the CRDs during this tutorial. These warnings are safe to ignore.

  1. Install Knative

    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-crds.yaml
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-core.yaml
    
  2. Install Istio

    helm repo add istio https://istio-release.storage.googleapis.com/charts
    helm repo update
    kubectl create namespace istio-system
    helm install istio-base istio/base -n istio-system --set defaultRevision=default
    helm install istiod istio/istiod -n istio-system --wait
    helm install istio-ingressgateway istio/gateway -n istio-system
    
  3. Verify the installation

    kubectl get deployments -n istio-system
    

    You should see something similar to the following output:

    NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
    istio-ingressgateway   1/1     1            1           17h
    istiod                 1/1     1            1           20h
    
  4. Install Knative-Istio

    kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.15.1/net-istio.yaml
    
  5. Verify the installation

    kubectl get pods -n knative-serving
    

    You should see something similar to the following output:

    NAME                                    READY   STATUS    RESTARTS      AGE
    activator-749cf94f87-b7p9n              1/1     Running   0             17m
    autoscaler-5c764b5f7d-m8zvk             1/1     Running   1 (14m ago)   17m
    controller-5649f5bbb7-wvlmk             1/1     Running   4 (13m ago)   17m
    net-istio-controller-7f8dfbddb7-d8cmq   1/1     Running   0             18s
    net-istio-webhook-54ffc96585-cpgfl      2/2     Running   0             18s
    webhook-64c67b4fc-smdtl                 1/1     Running   3 (13m ago)   17m
    
  6. Install DNS. In this tutorial, we use Magic DNS (sslip.io). To configure a real DNS instead, follow the steps in the Knative documentation.

    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-default-domain.yaml
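    
    # Optional: verify the configured default domain (it should contain sslip.io).
    kubectl get configmap config-domain -n knative-serving -o jsonpath='{.data}'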
    
  7. Install cert-manager, which is required to provision webhook certificates for a production-grade installation.

    helm repo add jetstack https://charts.jetstack.io --force-update
    helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.3 --set crds.enabled=true --set global.leaderElection.namespace=cert-manager
    
  8. Install KServe and the KServe cluster serving runtimes

    kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve.yaml
    
    # Wait until kserve-controller-manager is ready
    kubectl rollout status deployment kserve-controller-manager -n kserve
    
    # Install cluster runtimes
    kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve-cluster-resources.yaml
    
    # View these runtimes
    kubectl get ClusterServingRuntimes -n kserve
    
  9. To request accelerators (GPUs) for your Google Kubernetes Engine (GKE) Autopilot workloads, the manifest uses a nodeSelector. Therefore, we will enable the nodeSelector and tolerations features in Knative, which are disabled by default.

    kubectl patch configmap/config-features \
      --namespace knative-serving \
      --type merge \
      --patch '{"data":{"kubernetes.podspec-nodeselector":"enabled", "kubernetes.podspec-tolerations":"enabled"}}'
    

    Restart the Knative webhook so it picks up the new configuration. For example:

    kubectl get pods -n knative-serving
    
    # Find the webhook pod and delete it to restart the pod.
    kubectl delete pod webhook-64c67b4fc-nmzwt -n knative-serving
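    
    # Alternatively, restart via the Deployment (the webhook Deployment is named "webhook"):
    kubectl rollout restart deployment webhook -n knative-serving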
    

After successfully installing KServe, you can explore examples such as your first InferenceService, canary rollout, inference batcher, and autoscaling. In the next step, we'll demonstrate how to deploy Gemma2 using vLLM in KServe with GKE Autopilot.

Deploy Gemma2 served with vLLM

  1. Generate a Hugging Face access token by following these steps. Specify a Name of your choice and a Role of at least Read.

  2. Make sure you have accepted the terms of use for Gemma2 on Hugging Face.

  3. Create a namespace and a Kubernetes Secret containing the Hugging Face token

    kubectl create namespace kserve-test
    
    # Specify your hugging face token.
    export HF_TOKEN=XXX
    
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
        name: hf-secret
        namespace: kserve-test
    type: Opaque
    stringData:
        hf_api_token: ${HF_TOKEN}
    EOF
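    
    # Confirm the Secret was created.
    kubectl get secret hf-secret -n kserve-test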
    
  4. Create the inference service

    kubectl apply -f - <<EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-gemma2
      namespace: kserve-test
    spec:
      predictor:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-l4
          cloud.google.com/gke-accelerator-count: "1"
        model:
          modelFormat:
            name: huggingface
          args:
            - --enable_docs_url=True
            - --model_name=gemma2
            - --model_id=google/gemma-2-2b
          env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
          resources:
            limits:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
    EOF
    

    Wait for the service to be ready:

    kubectl get inferenceservice huggingface-gemma2 -n kserve-test
    kubectl get pods -n kserve-test
    
    # Replace POD_NAME with the name of the predictor pod.
    kubectl events --for pod/POD_NAME -n kserve-test --watch
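    
    # Optionally, block until the InferenceService reports Ready
    # (pulling the image and downloading the model can take several minutes).
    kubectl wait --for=condition=Ready inferenceservice/huggingface-gemma2 \
      -n kserve-test --timeout=30m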
    

Test the Inference Service

  1. Find the URL returned in kubectl get inferenceservice

    URL=$(kubectl get inferenceservice huggingface-gemma2 -n kserve-test -o jsonpath='{.status.url}')
    echo $URL
    

    The URL should look like this:

    http://huggingface-gemma2.kserve-test.34.121.87.225.sslip.io
    
  2. Open the Swagger UI at $URL/docs

  3. Try the OpenAI chat API with the example input below. Click Execute to see the response.

    {
        "model": "gemma2",
        "messages": [
            {
                "role": "system",
                "content": "You are an assistant that speaks like Shakespeare."
            },
            {
                "role": "user",
                "content": "Write a poem about colors"
            }
        ],
        "max_tokens": 30,
        "stream": false
    }
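    

    You can also call the endpoint from the command line. The sketch below assumes the OpenAI-compatible route is exposed under /openai/v1, as in recent KServe Hugging Face runtime releases:

    curl -s ${URL}/openai/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gemma2",
        "messages": [{"role": "user", "content": "Write a poem about colors"}],
        "max_tokens": 30,
        "stream": false
      }'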
    

Clean up

Delete the GKE cluster.

gcloud container clusters delete ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID
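
Alternatively, to keep the cluster and delete only the resources created in this tutorial:

kubectl delete inferenceservice huggingface-gemma2 -n kserve-test
kubectl delete namespace kserve-test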
