Inference servers


Creating Inference Checkpoints

An overview of how to convert your inference checkpoint for various model servers.

Hugging Face TGI

This guide demonstrates how to deploy a Hugging Face Text Generation Inference (TGI) server on Google Kubernetes Engine (GKE) using NVIDIA L4 GPUs, enabling you to serve large language models such as Mistral-7b-instruct. It walks you through creating a GKE cluster, deploying the TGI application, sending prompts to the model, and monitoring the service's performance with metrics, and it closes with instructions for cleaning up the cluster.
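
As a hedged illustration of the "sending prompts" step, the sketch below posts a request to TGI's /generate endpoint from Python; the localhost port and the service name in the comments are assumptions and should be adjusted to match your own Deployment and Service.

```python
# Minimal sketch: send a prompt to a TGI server that has been port-forwarded to
# localhost. The service name and port are assumptions; adjust them to your
# deployment, e.g.:
#   kubectl port-forward service/<your-tgi-service> 8080:8080
import requests

TGI_URL = "http://localhost:8080/generate"  # TGI's non-streaming generate endpoint

payload = {
    "inputs": "What is Kubernetes?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```

TGI also exposes a streaming variant of this endpoint (/generate_stream) if you prefer tokens to be returned incrementally.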

Serving an LLM with TPUs using JetStream

This tutorial shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with JetStream and MaxText.
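
If you expose JetStream through its HTTP frontend and port-forward it locally, you can exercise the model with a short Python client like the sketch below; the service name, port, endpoint path, and payload fields are all assumptions here, so check the tutorial for the exact request schema of your deployment.

```python
# Hypothetical sketch: query a JetStream HTTP frontend on GKE after
# port-forwarding it, e.g.:
#   kubectl port-forward service/<your-jetstream-service> 8000:8000
# The /generate path and the payload fields are assumptions; confirm them
# against the JetStream/MaxText deployment you created in the tutorial.
import requests

JETSTREAM_URL = "http://localhost:8000/generate"

payload = {
    "prompt": "Explain what a Tensor Processing Unit is.",
    "max_tokens": 128,
}

response = requests.post(JETSTREAM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```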

KServe

This tutorial guides you step by step through installing KServe in a GKE Autopilot cluster.
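
Once KServe is installed and you have deployed an InferenceService, you can call it over KServe's v1 predict protocol (POST /v1/models/<name>:predict with an "instances" payload). The sketch below shows such a request from Python; the model name, Host header, and port-forwarded ingress address are assumptions modeled on KServe's quickstart-style examples, so substitute the values from your own cluster.

```python
# Minimal sketch: query a KServe InferenceService via the v1 predict protocol.
# The model name ("sklearn-iris"), Host header, and localhost ingress address
# are assumptions; replace them with the values from your InferenceService and
# ingress gateway (e.g. after port-forwarding the gateway to localhost:8080).
import requests

INGRESS_URL = "http://localhost:8080/v1/models/sklearn-iris:predict"
HEADERS = {"Host": "sklearn-iris.default.example.com"}  # routes the request to the right InferenceService

payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}

response = requests.post(INGRESS_URL, json=payload, headers=HEADERS, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [...]}
```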