Blueprints

Jupyter on GKE

This guide details how to deploy JupyterHub on Google Kubernetes Engine (GKE) using a provided Terraform template, including options for persistent storage and Identity-Aware Proxy (IAP) for secure access. It covers the necessary prerequisites, configuration steps, and installation process, emphasizing the use of Terraform for automation and IAP for authentication. The guide also provides instructions for accessing JupyterHub, setting up user access, and running an example notebook.

NVIDIA BioNeMo

Deploying and managing servers dedicated to performing inference tasks for machine learning models.

NVIDIA NeMo

NVIDIA NeMo™ is an end-to-end platform for developing custom generative AI models anywhere. Designed for enterprise development, the NVIDIA NeMo framework uses NVIDIA’s state-of-the-art technology to support a complete workflow: automated distributed data processing, training of large-scale bespoke models with sophisticated 3D parallelism techniques, and deployment using retrieval-augmented generation for large-scale inference on an infrastructure of your choice, whether on-premises or in the cloud.

NVIDIA NIMs

These guides explain how to deploy NVIDIA NIM inference microservices on a Google Kubernetes Engine (GKE) cluster.

RAG on GKE

This tutorial demonstrates how to deploy a Retrieval-Augmented Generation (RAG) application on Google Kubernetes Engine (GKE), integrating a Hugging Face TGI inference server, a Cloud SQL pgvector database, and a Ray cluster for generating vector embeddings. It walks you through setting up the infrastructure with Terraform, populating the vector database with embeddings from a sample dataset using a Jupyter notebook, and launching a frontend chat interface. The guide also covers optional configurations, such as using your own cluster or VPC and enabling authenticated access via IAP, as well as troubleshooting common issues.

Continue reading:

Hugging Face to GCS

This guide explains how to hydrate GCS buckets with models from Hugging Face using a Kubernetes Job.

LlamaIndex

This tutorial will guide you through creating a robust Retrieval-Augmented Generation (RAG) system using LlamaIndex and deploying it on Google Kubernetes Engine (GKE).

Metaflow

This tutorial explains how to deploy and use the [Metaflow](https://docs.metaflow.org/) framework on Google Kubernetes Engine (GKE) and operate AI/ML workloads using [Argo Workflows](https://argo-workflows.readthedocs.io/en/latest/).

vLLM GPU/TPU Fungibility

This tutorial shows you how to serve a large language model (LLM) using both Tensor Processing Units (TPUs) and GPUs on Google Kubernetes Engine (GKE) using the same deployment with [vLLM](https://github.com/vllm-project/vllm).

LangChain Chatbot

In this tutorial, you will learn how to deploy a chatbot application using [LangChain](https://python.langchain.com/) and [Streamlit](https://streamlit.io/) on Google Cloud Platform (GCP).

Ray

This guide provides instructions and examples for deploying and managing Ray clusters on Google Kubernetes Engine (GKE) using KubeRay and Terraform. It covers setting up a GKE cluster, deploying a Ray cluster, submitting Ray jobs, and using the Ray Client for interactive sessions. The guide also points to various resources, including tutorials, best practices, and examples for running different types of Ray applications on GKE, such as serving LLMs, using TPUs, and integrating with GCS.