Inference servers


Creating Inference Checkpoints

An overview of how to convert your inference checkpoint for various model servers.

Hugging Face TGI

This guide demonstrates how to deploy a Hugging Face Text Generation Inference (TGI) server on Google Kubernetes Engine (GKE) using NVIDIA L4 GPUs, enabling you to serve large language models such as Mistral-7b-instruct. It walks you through creating a GKE cluster, deploying the TGI application, sending prompts to the model, and monitoring the service's performance with metrics, and it closes with instructions for cleaning up the cluster.
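
As a hedged illustration of the "sending prompts" step, the sketch below posts a request to TGI's /generate endpoint from Python; the localhost port and the service name in the comments are assumptions and should be adjusted to match your own Deployment and Service.

```python
# Minimal sketch: send a prompt to a TGI server that has been port-forwarded to
# localhost. The service name and port are assumptions; adjust them to your
# deployment, e.g.:
#   kubectl port-forward service/<your-tgi-service> 8080:8080
import requests

TGI_URL = "http://localhost:8080/generate"  # TGI's non-streaming generate endpoint

payload = {
    "inputs": "What is Kubernetes?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```

TGI also exposes a streaming variant of this endpoint (/generate_stream) if you prefer tokens to be returned incrementally.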

Serving an LLM with TPUs using JetStream

This tutorial shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with JetStream and MaxText.
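
If you expose JetStream through its HTTP frontend and port-forward it locally, you can exercise the model with a short Python client like the sketch below; the service name, port, endpoint path, and payload fields are all assumptions here, so check the tutorial for the exact request schema of your deployment.

```python
# Hypothetical sketch: query a JetStream HTTP frontend on GKE after
# port-forwarding it, e.g.:
#   kubectl port-forward service/<your-jetstream-service> 8000:8000
# The /generate path and the payload fields are assumptions; confirm them
# against the JetStream/MaxText deployment you created in the tutorial.
import requests

JETSTREAM_URL = "http://localhost:8000/generate"

payload = {
    "prompt": "Explain what a Tensor Processing Unit is.",
    "max_tokens": 128,
}

response = requests.post(JETSTREAM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```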

KServe

This tutorial guides you step by step through installing KServe in a GKE Autopilot cluster.
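
Once KServe is installed and you have deployed an InferenceService, you can call it over KServe's v1 predict protocol (POST /v1/models/<name>:predict with an "instances" payload). The sketch below shows such a request from Python; the model name, Host header, and port-forwarded ingress address are assumptions modeled on KServe's quickstart-style examples, so substitute the values from your own cluster.

```python
# Minimal sketch: query a KServe InferenceService via the v1 predict protocol.
# The model name ("sklearn-iris"), Host header, and localhost ingress address
# are assumptions; replace them with the values from your InferenceService and
# ingress gateway (e.g. after port-forwarding the gateway to localhost:8080).
import requests

INGRESS_URL = "http://localhost:8080/v1/models/sklearn-iris:predict"
HEADERS = {"Host": "sklearn-iris.default.example.com"}  # routes the request to the right InferenceService

payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}

response = requests.post(INGRESS_URL, json=payload, headers=HEADERS, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [...]}
```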