Creating Inference Checkpoints
Overview
This document outlines the process for converting model checkpoints for use with various model servers, such as Jetstream with a MaxText or Pytorch/XLA backend. The core of this process is the `checkpoint_entrypoint.sh` script, packaged within a Docker container, which handles the specific conversion steps required by each server configuration. The goal is to prepare your trained model checkpoints for efficient deployment and inference serving.
Checkpoint creation
The `checkpoint_entrypoint.sh` script converts your inference checkpoint for the model server of your choice. Follow the steps below to build the conversion image and run it.
- Clone the `ai-on-gke/tutorials-and-examples` repository:

  ```bash
  git clone https://github.com/ai-on-gke/tutorials-and-examples
  cd tutorials-and-examples/inference-servers/checkpoints
  ```
- Build the Docker image that contains the conversion script and its dependencies, then tag the image and push it to a container registry (such as Google Container Registry, GCR) accessible from your execution environment (e.g., your Kubernetes cluster):

  ```bash
  docker build -t inference-checkpoint .
  docker tag inference-checkpoint gcr.io/${PROJECT_ID}/inference-checkpoint:latest
  docker push gcr.io/${PROJECT_ID}/inference-checkpoint:latest
  ```
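  The commands above assume the `PROJECT_ID` environment variable is set. One way to populate it from your active gcloud configuration (assuming the gcloud CLI is installed and authenticated):

  ```bash
  # Set PROJECT_ID from the currently active gcloud configuration.
  export PROJECT_ID=$(gcloud config get-value project)
  ```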
- Run the conversion as a containerized job, for example a Kubernetes Job. Configure the job to use the `gcr.io/${PROJECT_ID}/inference-checkpoint:latest` image and pass the arguments required by your target inference server and checkpoint, as listed below (a sample Job manifest is sketched after the argument lists).

Jetstream + MaxText
- `-s=INFERENCE_SERVER`
- `-b=BUCKET_NAME`
- `-m=MODEL_PATH`
- `-v=VERSION` (Optional)
Jetstream + Pytorch/XLA
- `-s=INFERENCE_SERVER`
- `-m=MODEL_PATH`
- `-n=MODEL_NAME`
- `-q=QUANTIZE_WEIGHTS` (Optional) (default=False)
- `-t=QUANTIZE_TYPE` (Optional) (default=int8_per_channel)
- `-v=VERSION` (Optional) (default=jetstream-v0.2.3)
- `-i=INPUT_DIRECTORY` (Optional)
- `-o=OUTPUT_DIRECTORY`
- `-h=HUGGINGFACE` (Optional) (default=False)
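As a concrete illustration, here is a minimal sketch of a Kubernetes Job for the Jetstream + Pytorch/XLA path. The job name, model name, and GCS paths below are placeholder assumptions, not values prescribed by this guide; substitute your own:

```bash
# Minimal sketch of a checkpoint-conversion Job (names and GCS paths are placeholders).
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-converter   # hypothetical name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: inference-checkpoint
        image: gcr.io/${PROJECT_ID}/inference-checkpoint:latest
        args:
        - -s=jetstream-pytorch
        - -m=gs://my-bucket/base-checkpoint/        # placeholder GCS path
        - -n=llama-2
        - -o=gs://my-bucket/converted-checkpoint/   # placeholder GCS path
EOF
```

Note that the pod must be able to read and write the GCS paths involved; on GKE this is typically granted through Workload Identity or a node service account with Storage access.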
Argument descriptions:
The following table details the arguments accepted by the `checkpoint_entrypoint.sh` script:
| Argument | Flag | Type | Description |
|---|---|---|---|
| `BUCKET_NAME` | `-b` | str | Google Cloud Storage (GCS) bucket name, without the `gs://` prefix. Required for certain server types, such as Jetstream + MaxText. |
| `INFERENCE_SERVER` | `-s` | str | Specifies the target inference server. Examples: `jetstream-maxtext`, `jetstream-pytorch`. |
| `MODEL_PATH` | `-m` | str | Path to the base model checkpoint. Its interpretation varies depending on the inference server and on whether it is a local path or a GCS path. |
| `MODEL_NAME` | `-n` | str | Name of the model architecture (e.g., `llama-2`, `llama-3`, `gemma`). Required for servers such as Jetstream + Pytorch/XLA. |
| `HUGGINGFACE` | `-h` | bool | (Optional) Set to `true` if the input checkpoint is in HuggingFace Hub format. Default: `false`. |
| `QUANTIZE_WEIGHTS` | `-q` | str | (Optional) Set to `true` to enable weight quantization. Default: `false`. |
| `QUANTIZE_TYPE` | `-t` | str | (Optional) Specifies the quantization type when `QUANTIZE_WEIGHTS` is `true`. Available types include `int8_per_channel` and `int4_blockwise`. Default: `int8_per_channel`. |
| `VERSION` | `-v` | str | (Optional) Overrides the default version of the inference server components being used (e.g., `jetstream-v0.2.2`, `jetstream-v0.2.3`). |
| `INPUT_DIRECTORY` | `-i` | str | (Optional) Explicitly specifies the input directory for the checkpoint, often a GCS path (`gs://...`). May override parts of `MODEL_PATH`. |
| `OUTPUT_DIRECTORY` | `-o` | str | Specifies the directory where the converted checkpoint should be saved, often a GCS path (`gs://...`). Required for servers such as Jetstream + Pytorch/XLA. |
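For a quick sanity check outside the cluster, the same image can be run directly with Docker. This is a sketch under the assumption that the image's entrypoint is `checkpoint_entrypoint.sh` (so extra `docker run` arguments are passed to it as flags) and that mounting your local gcloud credentials gives the container access to GCS; the bucket and paths are placeholders:

```bash
# Sketch of a local conversion run (bucket and paths are placeholders).
docker run --rm \
  -v "$HOME/.config/gcloud:/root/.config/gcloud" \
  gcr.io/${PROJECT_ID}/inference-checkpoint:latest \
  -s=jetstream-maxtext \
  -b=my-bucket \
  -m=gs://my-bucket/base-checkpoint/
```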