Hugging Face TGI

Overview

This guide demonstrates how to deploy a Hugging Face Text Generation Inference (TGI) server on Google Kubernetes Engine (GKE) using NVIDIA L4 GPUs, enabling you to serve large language models such as Mistral-7B-Instruct. It walks you through creating a GKE cluster, deploying the TGI application, sending prompts to the model, and monitoring the service's performance with metrics, and it finishes with instructions for cleaning up the cluster.

Prerequisites

Make sure you have:

  • A Google Cloud project with billing enabled.
  • Google Cloud SDK (gcloud CLI) installed and configured.
  • kubectl installed.
  • Terraform installed.
  • Sufficient Google Cloud quotas for NVIDIA L4 GPUs and G2 machines.
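
Optionally, you can run a quick sanity check before starting. The commands below are a minimal sketch (replace YOUR_PROJECT_ID with your actual project ID): they point gcloud at your project, enable the GKE API, and confirm the CLI tools are available.

  gcloud config set project YOUR_PROJECT_ID
  gcloud services enable container.googleapis.com
  kubectl version --client
  terraform version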

Installation

  1. Clone the ai-on-gke/tutorials-and-examples repository

    git clone https://github.com/ai-on-gke/tutorials-and-examples
    cd tutorials-and-examples/hugging-face-tgi/
    
  2. Set environment variables

    export REGION=us-central1
    export PROJECT_ID=$(gcloud config get project)
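
    Optionally confirm that both variables resolved; PROJECT_ID comes back empty if gcloud has no default project configured.

      echo "PROJECT_ID=${PROJECT_ID} REGION=${REGION}"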
    
  3. Create GKE cluster

    gcloud container clusters create l4-demo --location ${REGION} \
      --workload-pool ${PROJECT_ID}.svc.id.goog \
      --enable-image-streaming \
      --node-locations=$REGION-a \
      --addons GcsFuseCsiDriver \
      --machine-type n2d-standard-4 \
      --enable-autoscaling --num-nodes 1 --min-nodes 1 --max-nodes 5 \
      --ephemeral-storage-local-ssd=count=2 \
      --enable-ip-alias \
      --labels=created-by=ai-on-gke,guide=hf-tgi
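
    Cluster creation can take several minutes. As a quick check that it finished, list the cluster and its status (a small sketch using gcloud's filter and format flags):

      gcloud container clusters list --filter="name=l4-demo" --format="table(name,location,status)"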
    
  4. Get cluster credentials

    gcloud container clusters get-credentials l4-demo --location ${REGION}
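
    To confirm kubectl now points at the new cluster:

      kubectl config current-context
      kubectl get nodes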
    
  5. Create GPU node pool

    gcloud container node-pools create g2-standard-24 --cluster l4-demo \
      --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
      --machine-type g2-standard-24 \
      --ephemeral-storage-local-ssd=count=2 \
      --enable-image-streaming \
      --enable-autoscaling --num-nodes=1 --min-nodes=1 --max-nodes=2 \
      --node-locations $REGION-a,$REGION-b --region $REGION
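
    Once the pool is ready, you can check that the GPU nodes registered and expose the nvidia.com/gpu resource. This is a sketch that relies on the cloud.google.com/gke-accelerator node label GKE sets on accelerator nodes:

      kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4
      kubectl describe nodes -l cloud.google.com/gke-accelerator=nvidia-l4 | grep "nvidia.com/gpu"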
    
  6. Deploy the application using Terraform:

    • Set the project_id in workloads.tfvars (see the example file sketched below).
    • Apply the Terraform configuration:
      terraform init
      terraform apply -var-file=workloads.tfvars
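
    A minimal workloads.tfvars could look like the sketch below. Only project_id is mentioned in this guide, so check the variables defined by the example's Terraform configuration for anything else you may need to set.

      # workloads.tfvars (sketch; only project_id is shown in this guide)
      project_id = "YOUR_PROJECT_ID"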
      
  7. Check application status

     kubectl logs -l app=mistral-7b-instruct -n l4-demo
    

    Look for logs indicating the model has loaded successfully.
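
    You can also check that the pod is running and ready:

      kubectl get pods -n l4-demo -l app=mistral-7b-instruct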

  8. Set up port forward for testing

    kubectl port-forward deployment/mistral-7b-instruct 8080:8080 -n l4-demo &
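
    With the port forward running in the background, you can do a quick connectivity check. TGI servers typically expose /health and /info endpoints; treat the exact paths as an assumption and fall back to /metrics if they differ.

      curl -s -o /dev/null -w "health: %{http_code}\n" 127.0.0.1:8080/health
      curl -s 127.0.0.1:8080/info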
    

API Interaction and Service Monitoring

  1. Try a few prompts by sending a request to the TGI server using the forwarded port

    export USER_PROMPT="How to deploy a container on K8s?"
    
    curl 127.0.0.1:8080/generate -X POST \
        -H 'Content-Type: application/json' \
        --data-binary @- <<EOF
    {
        "inputs": "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n$USER_PROMPT[/INST]",
        "parameters": {"max_new_tokens": 400}
    }
    EOF
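
    For token-by-token output, TGI also provides a streaming endpoint. The sketch below assumes the /generate_stream route, which takes the same request body and returns server-sent events:

    curl 127.0.0.1:8080/generate_stream -X POST \
        -H 'Content-Type: application/json' \
        --data-binary @- <<EOF
    {
        "inputs": "[INST] $USER_PROMPT [/INST]",
        "parameters": {"max_new_tokens": 100}
    }
    EOF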
    
  2. Explore the /metrics endpoint of the service for performance information

    • Fetch the metrics and filter for tgi_request_count. This command retrieves all metrics and shows only the lines containing tgi_request_count.

      curl -s 127.0.0.1:8080/metrics | grep tgi_request_count
      

      Expected Output (similar to):

      # TYPE tgi_request_count counter
      tgi_request_count 1.0
      
    • To check tgi_batch_inference_count in the same way (the -s flag suppresses curl's progress meter):

      curl -s 127.0.0.1:8080/metrics | grep tgi_batch_inference_count
      

      And the output will be similar to:

      # TYPE tgi_batch_inference_count counter
      tgi_batch_inference_count{method="decode"} 292
      tgi_batch_inference_count{method="prefill"} 1
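
    • To list every TGI metric name the server exports (useful for discovering other counters and histograms to chart):

      curl -s 127.0.0.1:8080/metrics | grep -E '^tgi_' | cut -d'{' -f1 | cut -d' ' -f1 | sort -u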
      
  3. Go to Cloud Monitoring and search for one of these metrics, for example tgi_request_count or tgi_batch_inference_count. They should show up when you query for them with PromQL, for example in Metrics Explorer.

Clean up

Remove the cluster and deployment by running the following command:

gcloud container clusters delete l4-demo --location ${REGION} 
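
Deleting the cluster also removes the workloads running on it. If you also want to clean up the Terraform state for the resources applied earlier, you can additionally run the following from the same directory (a sketch, assuming the state was kept locally there):

terraform destroy -var-file=workloads.tfvars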
