This guide demonstrates how to deploy a Hugging Face Text Generation Inference (TGI) server on Google Kubernetes Engine (GKE) using NVIDIA L4 GPUs, so you can serve large language models such as Mistral-7B-Instruct. It walks you through creating a GKE cluster, deploying the TGI application, sending prompts to the model, and monitoring the service's performance with metrics, and it ends with instructions for cleaning up the cluster.
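For orientation, here is a minimal sketch of the kind of Deployment the tutorial builds toward. The object names, image tag, model revision, and the `hf-secret` Secret are illustrative assumptions, not the tutorial's exact manifest:

```bash
# Sketch only: serve Mistral-7B-Instruct with TGI on an L4 GPU node.
# Object names, image tag, and the hf-secret Secret are assumptions.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-mistral
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-mistral
  template:
    metadata:
      labels:
        app: tgi-mistral
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # schedule onto L4 GPU nodes
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:1.4
        args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2"]
        env:
        - name: HUGGING_FACE_HUB_TOKEN   # for gated or rate-limited model downloads
          valueFrom:
            secretKeyRef:
              name: hf-secret            # assumed to be created beforehand
              key: hf_api_token
        ports:
        - containerPort: 80              # TGI's default listen port
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```

Once the pod is ready, you could `kubectl port-forward` to it and POST a prompt to TGI's `/generate` endpoint to confirm the model responds.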
This tutorial expands on the [SkyPilot Tutorial](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/tutorials-and-examples/skypilot) by leveraging [Dynamic Workload Scheduler](https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler) with the help of an open-source project called [Kueue](https://kueue.sigs.k8s.io/).
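As a rough sketch of how the two connect: on GKE, Kueue admits workloads through a ProvisioningRequest that Dynamic Workload Scheduler fulfills. The object names below are illustrative assumptions:

```bash
# Sketch only: route Kueue admission through Dynamic Workload Scheduler.
# The dws-config and dws-check names are illustrative assumptions.
kubectl apply -f - <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io   # DWS queued provisioning on GKE
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-check
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
EOF
```

A ClusterQueue that lists `dws-check` under its admission checks would then hold jobs until DWS can provision the requested accelerators all at once.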
This guide shows you how to deploy [Slurm](https://slurm.schedmd.com/documentation.html) on a Google Kubernetes Engine (GKE) cluster.
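At a high level the flow follows the usual GKE pattern: create a cluster, then apply the Slurm manifests. The cluster name, zone, and manifest path below are hypothetical placeholders, not the guide's actual values:

```bash
# Sketch only: cluster name, zone, and manifest path are hypothetical placeholders.
gcloud container clusters create slurm-cluster \
  --zone=us-central1-a \
  --num-nodes=3

gcloud container clusters get-credentials slurm-cluster --zone=us-central1-a

# Apply the Slurm controller and worker manifests provided by the guide.
kubectl apply -f slurm-manifests/   # placeholder path
```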
Streamline your AI/ML workflows with GKE's orchestration capabilities: manage complex pipelines, schedule jobs, and automate resource allocation.
Ensure the quality and performance of your AI/ML models with evaluation infrastructure on GKE. Deploy evaluation services and dashboards to monitor key metrics and track model performance over time.
This tutorial shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with [JetStream](https://github.com/google/JetStream) and [MaxText](https://github.com/google/maxtext).
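To give a flavor of the TPU side, a single-host TPU v5e node pool on GKE might be provisioned like this; the cluster name, zone, pool name, and machine type are assumptions to adapt to your setup:

```bash
# Sketch only: provision a single-host TPU v5e slice (4 chips) for serving.
# Cluster name, zone, pool name, and machine type are assumptions.
gcloud container node-pools create tpu-v5e-pool \
  --cluster=jetstream-cluster \
  --zone=us-west4-a \
  --machine-type=ct5lp-hightpu-4t \
  --num-nodes=1
```

Serving pods then target the slice with nodeSelectors such as `cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice` and `cloud.google.com/gke-tpu-topology: 2x2`, and request `google.com/tpu: 4` in their resource limits.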