Creating Inference Checkpoints
Explains how to convert your inference checkpoint for various model servers.
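For example, one common conversion step is re-saving a checkpoint in the safetensors format, which many model servers can load directly. Below is a minimal sketch, assuming a Hugging Face-format checkpoint and the `transformers` library; the model ID and output directory are placeholders:

```python
# Minimal sketch: re-save a Hugging Face checkpoint in safetensors format.
# Assumes the `transformers` library is installed; the model ID below is a
# placeholder -- substitute the checkpoint you want to convert.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID
out_dir = "./converted-checkpoint"     # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# safe_serialization=True writes .safetensors weight files, a format most
# model servers (for example, vLLM) can load directly.
model.save_pretrained(out_dir, safe_serialization=True)
tokenizer.save_pretrained(out_dir)
```

The exact conversion steps differ per model server; for instance, JetStream with MaxText uses its own checkpoint-conversion scripts rather than the sketch above.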
Deploying and managing servers dedicated to running inference for machine learning models.
This tutorial shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with JetStream and MaxText.
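Once the JetStream server is running on GKE, you can send it prompts over HTTP. The sketch below is illustrative only: it assumes the server's HTTP endpoint has been port-forwarded to `localhost:8000` and accepts a JSON body with `prompt` and `max_tokens` fields; both the endpoint path and the payload schema are assumptions, not a confirmed API.

```python
# Minimal sketch: query a JetStream server deployed on GKE.
# Assumptions: the HTTP endpoint is port-forwarded to localhost:8000, and it
# accepts {"prompt": ..., "max_tokens": ...}; path and schema are assumptions.
import requests

response = requests.post(
    "http://localhost:8000/generate",  # assumed endpoint path
    json={"prompt": "What is Kubernetes?", "max_tokens": 200},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```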
This tutorial shows you how to serve a large language model (LLM) on both Tensor Processing Units (TPUs) and GPUs on Google Kubernetes Engine (GKE) using the same vLLM deployment.
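As a taste of what vLLM provides, here is a minimal offline-inference sketch using vLLM's Python API. The model ID is a placeholder, and the GKE tutorial deploys vLLM as a long-running server rather than calling it in-process as shown here:

```python
# Minimal sketch: offline inference with vLLM's Python API.
# The model ID is a placeholder; substitute any model vLLM supports.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model ID
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate a completion for a single prompt and print the text.
outputs = llm.generate(["What is Kubernetes?"], params)
for output in outputs:
    print(output.outputs[0].text)
```

The same engine backs vLLM's OpenAI-compatible server, which is what the GKE deployment exposes; the in-process API above is just the shortest way to see it run.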