Load Hugging Face Models into Cloud Storage
Overview
This guide uses a Kubernetes Job to load the meta-llama/Meta-Llama-3-8B model weights hosted on Hugging Face into a Cloud Storage bucket, which is then used by a vLLM model deployment.
Before you begin
- Ensure you have a GCP project with billing enabled and have enabled the GKE and Cloud Storage APIs.
  - To learn how to enable billing for your project, see the Google Cloud billing documentation.
  - The GKE and Cloud Storage APIs can be enabled by running:
    gcloud services enable container.googleapis.com
    gcloud services enable storage.googleapis.com
- Ensure you have the following tools installed on your workstation:
  - Google Cloud CLI (gcloud)
  - kubectl
  - envsubst
- Configure access to Hugging Face models:
  - Create a Hugging Face account, if you don’t already have one.
  - To get access to the Llama models for deployment to GKE, you must first sign the meta-llama/Meta-Llama-3-8B license consent agreement.
  - You will also need to generate a Hugging Face access token. Make sure the token has Read permission.
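If you want to confirm these prerequisites before continuing, a quick check like the following can help. This is a minimal sketch: it assumes your gcloud project is already selected, and the filter expression is illustrative.
# Confirm the required APIs are enabled on the current project.
gcloud services list --enabled --filter="config.name:container.googleapis.com OR config.name:storage.googleapis.com"
# Confirm the CLI tools used in this guide are installed.
gcloud version
kubectl version --client
envsubst --version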
Set up your GKE Cluster
Let’s start by setting a few environment variables that will be used throughout this guide. Modify these variables to match your environment and needs.
Run the following commands to set the env variables, and make sure to replace <my-project-id>, <your-hf-token>, and <your-hf-username> with your own values:
gcloud config set project <my-project-id>
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export HF_TOKEN=<your-hf-token>
export HF_USER=<your-hf-username>
export CLUSTER_NAME=meta-llama-3-8b-cluster
You might have to rerun the export commands if your shell is reset and the variables are no longer set. This can happen, for example, when your Cloud Shell session disconnects.
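If you are unsure whether the variables are still set, a quick bash loop like the following prints their status without echoing the token value. This is a minimal sketch:
# Report whether each variable used in this guide is set; re-run the exports for any that are not.
for var in PROJECT_ID REGION HF_TOKEN HF_USER CLUSTER_NAME; do
  if [ -z "${!var}" ]; then echo "${var} is NOT set"; else echo "${var} is set"; fi
done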
Create a GKE Autopilot cluster by running the following command. If you choose to create a GKE Standard cluster instead, you will need to enable Workload Identity Federation for GKE and the Cloud Storage FUSE CSI driver on your cluster.
gcloud container clusters create-auto ${CLUSTER_NAME} \
--project=${PROJECT_ID} \
--region=${REGION} \
--labels=created-by=ai-on-gke,guide=hf-gcs-transfer
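Cluster creation can take several minutes. If you want to confirm the cluster is ready before continuing, a check like the following should eventually print RUNNING. This is a minimal sketch:
# Query the cluster status; it reports RUNNING once provisioning finishes.
gcloud container clusters describe ${CLUSTER_NAME} \
    --region=${REGION} \
    --format="value(status)"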
Create a Kubernetes secret for Hugging Face credentials
In your shell session, do the following:
- Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}
- Create a Kubernetes Secret that contains your Hugging Face token. This is only required for gated models:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -
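To verify the Secret was created with the expected key, a check like the following works; it shows the key names and sizes without printing the token itself. This is a minimal sketch:
# Confirm the Secret exists and lists the hf_api_token key.
kubectl describe secret hf-secret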
Create your Cloud Storage bucket
Now create the Cloud Storage bucket for the model weights by running the following commands.
Cloud Storage bucket names must be globally unique, and you must have the Storage Admin (roles/storage.admin) IAM role for the project where the bucket is created. See Create a bucket for details.
export BUCKET_NAME=${PROJECT_ID}-meta-llama-3-8b
export BUCKET_URI=gs://${BUCKET_NAME}
gcloud storage buckets create ${BUCKET_URI} --project=${PROJECT_ID}
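To confirm the bucket was created in your project, a check like the following can be used. This is a minimal sketch; the bucket will be empty at this point:
# Show the bucket's metadata and confirm it is reachable.
gcloud storage buckets describe ${BUCKET_URI}
# List the bucket's contents (expected to be empty for now).
gcloud storage ls ${BUCKET_URI}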
Deploy the Kubernetes Job to populate the Cloud Storage Bucket
- Configure access for the producer-job Job to the Cloud Storage bucket.
To make your Cloud Storage bucket accessible from your GKE cluster, authenticate using Workload Identity Federation for GKE.
Note: If you don’t have Workload Identity Federation for GKE enabled, follow the GKE documentation to enable it.
Grant the Storage Admin (roles/storage.admin) IAM role for Cloud Storage to the Kubernetes ServiceAccount by running the following commands. If you are using a custom workload identity pool, you will need to update the workload identity pool name in the command below. Custom workload identity pools are not supported in Autopilot clusters.
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export WORKLOAD_IDENTITY_POOL=${PROJECT_ID}.svc.id.goog
export NAMESPACE=default
export SERVICE_ACCOUNT=hf-sa
gcloud storage buckets add-iam-policy-binding ${BUCKET_URI} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${WORKLOAD_IDENTITY_POOL}/subject/ns/${NAMESPACE}/sa/${SERVICE_ACCOUNT}" \
    --role "roles/storage.admin"
- Save the following manifest to a file named producer-job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
  namespace: "${NAMESPACE}"
spec:
  template:
    spec:
      serviceAccountName: "${SERVICE_ACCOUNT}"
      # Without this, the job will run on an E2 machine, which results in a slower transfer time.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values:
                - "c3"
      initContainers:
      - name: download
        image: ubuntu:22.04
        resources:
          # Need a big enough machine that can fit the full model in RAM, with some buffer room.
          requests:
            memory: "30Gi"
          limits:
            memory: "30Gi"
        command: ["bash", "-c"]
        args:
        - |
          start=$(date +%s)
          apt-get update && apt-get install -y aria2 git
          # Get directory name from MODEL_ID (e.g., "meta-llama/Meta-Llama-3-8B" -> "Meta-Llama-3-8B")
          git_dir="${MODEL_ID##*/}"
          hf_endpoint="https://huggingface.co"
          download_url="https://$HF_USER:$HF_TOKEN@${hf_endpoint#https://}/$MODEL_ID"
          echo "INFO: Cloning model repository metadata into '$git_dir'..."
          GIT_LFS_SKIP_SMUDGE=1 git clone $download_url && cd $git_dir
          # remove files we don't want to upload
          rm -r -f .git
          rm -r -f .gitattributes
          rm -r -f original
          cd ..
          # Get the list of files.
          file_list=($(find "$git_dir/" -type f -name "[!.]*" -print))
          # Strip git dir path.
          files=()
          for file in "${file_list[@]}"; do
            trimmed_file="${file#$git_dir/}"
            files+=("$trimmed_file")
          done
          # Create a file that maps each URL to its desired relative filename.
          # This is needed because aria2c uses the file's content hash as the identifier.
          > download_list.txt  # Create or clear the file
          for file in "${files[@]}"; do
            url="$hf_endpoint/$MODEL_ID/resolve/main/$file"
            # Write the URL and the desired filename, separated by a space, on the same line
            echo "$url $file" >> download_list.txt
          done
          echo "--- Download List (URL and Filename) ---"
          cat download_list.txt
          echo "----------------------------------------"
          # Use xargs to read 2 arguments per line (-n 2): the URL ($1) and the filename ($2).
          # Then, use the -o option in aria2c to specify the output filename.
          cat download_list.txt | xargs -P 4 -n 2 sh -c '
            aria2c --header="Authorization: Bearer ${HF_TOKEN}" \
              --console-log-level=error \
              --file-allocation=none \
              --max-connection-per-server=16 \
              --split=16 \
              --min-split-size=3M \
              --max-concurrent-downloads=16 \
              -c \
              -d "${MODEL_DIR}" \
              -o "$2" \
              "$1"
          ' _
          end=$(date +%s)
          du -sh ${MODEL_DIR}
          echo "download took $((end-start)) seconds"
        env:
        - name: MODEL_ID
          value: "meta-llama/Meta-Llama-3-8B"
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        - name: MODEL_DIR
          value: "/data/model"
        volumeMounts:
        - mountPath: "/data"
          name: model-tmpfs
      containers:
      - name: gcloud-upload
        image: gcr.io/google.com/cloudsdktool/cloud-sdk:stable
        resources:
          # Need a big enough machine that can fit the full model in RAM, with some buffer room.
          requests:
            memory: "30Gi"
          limits:
            memory: "30Gi"
        command: ["bash", "-c"]
        args:
        - |
          start=$(date +%s)
          gcloud storage cp -r "${MODEL_DIR}" "${BUCKET_URI}"
          end=$(date +%s)
          echo "gcloud storage cp took $((end-start)) seconds"
        env:
        - name: MODEL_DIR
          value: "/data/model"
        volumeMounts:
        - name: model-tmpfs  # Mount the same volume as the download container
          mountPath: /data
      restartPolicy: Never
      volumes:
      - name: model-tmpfs
        emptyDir:
          medium: Memory
  parallelism: 1  # Run 1 Pods concurrently
  completions: 1  # Once 1 Pods complete successfully, the Job is done
  backoffLimit: 0  # Max retries on failure
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: "${SERVICE_ACCOUNT}"
  namespace: "${NAMESPACE}"
Note: If you are using a GKE Standard cluster with node auto-provisioning disabled, you will need to manually provision a C3 node pool with one node that has enough RAM to fit the model weights.
It might take a few minutes for the Job to schedule and finish copying data to the Cloud Storage bucket. When the Job completes, its status is marked “Complete”. After the Job completes, your Cloud Storage bucket should contain the meta-llama/Meta-Llama-3-8B files (except for the .gitattributes file and the original/ folder) within a model folder. You can list the bucket contents to confirm this, as shown after these steps.
- Deploy the Job in producer-job.yaml by running the following command. It uses envsubst to substitute the required environment variables.
envsubst '$NAMESPACE $SERVICE_ACCOUNT $HF_USER $BUCKET_URI' < producer-job.yaml | kubectl apply -f -
- Monitor the status of the transfer.
To check the status of your Job, run the following command:
kubectl get job producer-job --namespace ${NAMESPACE}
Note: Once the Job reports the “Complete” status, the transfer is done.
To see logs for the download container while it is running, run the following command:
kubectl logs jobs/producer-job -c download --namespace=$NAMESPACE
To see logs for the upload container while it is running, run the following command:
kubectl logs jobs/producer-job -c gcloud-upload --namespace=$NAMESPACE
- Once the Job completes, you can clean up the Job, ServiceAccount, and Secret by running these commands:
kubectl delete job producer-job --namespace ${NAMESPACE}
kubectl delete serviceaccount hf-sa --namespace ${NAMESPACE}
kubectl delete secret hf-secret --namespace ${NAMESPACE}
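As noted above, you can confirm that the weights landed in the bucket once the Job has finished. The following is a minimal sketch, assuming the model/ prefix written by the Job:
# List the transferred model files under the model/ prefix.
gcloud storage ls --recursive ${BUCKET_URI}/model/
# Report the total size of the transferred files.
gcloud storage du --summarize --readable-sizes ${BUCKET_URI}/model/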
Deploy the vLLM Model Server on GKE
Now that the meta-llama/Meta-Llama-3-8B model weights exist in your Cloud Storage bucket, you can deploy the vLLM model server on GKE and load the model weights from your Cloud Storage bucket to optimize model load time.
- Configure access for the model deployment to the Cloud Storage bucket.
Grant the Storage Object Viewer (roles/storage.objectViewer) IAM role to the llama3-8b-vllm-deployment-service-account ServiceAccount to allow the llama3-8b-vllm-deployment to load the model weights from the Cloud Storage bucket. This is a different IAM binding than the one used to grant the producer-job the access needed to populate the Cloud Storage bucket with the model weights.
export SERVICE_ACCOUNT=llama3-8b-vllm-deployment-service-account
gcloud storage buckets add-iam-policy-binding ${BUCKET_URI} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${WORKLOAD_IDENTITY_POOL}/subject/ns/${NAMESPACE}/sa/${SERVICE_ACCOUNT}" \
    --role "roles/storage.objectViewer"
- Deploy the following manifest to create a meta-llama/Meta-Llama-3-8B model deployment, which loads the model weights from your Cloud Storage bucket onto the GPU using Cloud Storage FUSE.
kubectl apply -f - <<EOF apiVersion: apps/v1 kind: Deployment metadata: labels: app: llama3-8b-vllm-inference-server name: llama3-8b-vllm-deployment namespace: "${NAMESPACE}" spec: replicas: 1 selector: matchLabels: app: llama3-8b-vllm-inference-server template: metadata: annotations: gke-gcsfuse/cpu-limit: "0" gke-gcsfuse/ephemeral-storage-limit: "0" gke-gcsfuse/memory-limit: "0" gke-gcsfuse/volumes: "true" labels: ai.gke.io/inference-server: vllm ai.gke.io/model: LLaMA3_8B app: llama3-8b-vllm-inference-server spec: containers: - args: - --model=/data/model command: - python3 - -m - vllm.entrypoints.openai.api_server image: vllm/vllm-openai:v0.7.2 name: inference-server ports: - containerPort: 8000 name: metrics readinessProbe: failureThreshold: 60 httpGet: path: /health port: 8000 periodSeconds: 10 resources: limits: nvidia.com/gpu: "1" requests: nvidia.com/gpu: "1" volumeMounts: - mountPath: /dev/shm name: dshm - mountPath: /data/model name: model-src nodeSelector: cloud.google.com/gke-accelerator: nvidia-l4 serviceAccountName: "${SERVICE_ACCOUNT}" volumes: - emptyDir: medium: Memory name: dshm - csi: driver: gcsfuse.csi.storage.gke.io volumeAttributes: bucketName: ${BUCKET_NAME} mountOptions: implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1,only-dir:model name: model-src --- apiVersion: v1 kind: Service metadata: labels: app: llama3-8b-vllm-inference-server name: llama3-8b-vllm-service namespace: "${NAMESPACE}" spec: ports: - port: 8000 protocol: TCP targetPort: 8000 selector: app: llama3-8b-vllm-inference-server type: ClusterIP --- apiVersion: v1 kind: ServiceAccount metadata: name: "${SERVICE_ACCOUNT}" namespace: "${NAMESPACE}" EOF
- Check the status of the model deployment by running:
kubectl get deployment llama3-8b-vllm-deployment --namespace ${NAMESPACE}
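The first rollout can take a while because vLLM loads the weights through the Cloud Storage FUSE mount. If you want to watch the rollout and the server logs, something like the following works; this is a minimal sketch, with the container name taken from the manifest above:
# Wait for the Deployment to become available.
kubectl rollout status deployment/llama3-8b-vllm-deployment --namespace ${NAMESPACE}
# Follow the vLLM server logs.
kubectl logs -f deployment/llama3-8b-vllm-deployment -c inference-server --namespace ${NAMESPACE}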
Once your model deployment is running, follow the vLLM documentation to build and send a request to your endpoint.
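For example, a minimal sketch of such a request uses kubectl port-forward and the OpenAI-compatible completions endpoint. The prompt and sampling parameters are illustrative, and the model name matches the --model path passed to vLLM in the Deployment above:
# Forward the Service to your workstation in the background.
kubectl port-forward service/llama3-8b-vllm-service 8000:8000 --namespace ${NAMESPACE} &
# Send a completion request to the OpenAI-compatible API.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "/data/model",
      "prompt": "San Francisco is a",
      "max_tokens": 32,
      "temperature": 0.7
    }'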
Cleanup
- Delete the model deployment, service, and service account by running:
kubectl delete deployment llama3-8b-vllm-deployment --namespace ${NAMESPACE}
kubectl delete service llama3-8b-vllm-service --namespace ${NAMESPACE}
kubectl delete serviceaccount ${SERVICE_ACCOUNT} --namespace ${NAMESPACE}
- Delete the Cloud Storage bucket and all of its contents by running:
gcloud storage rm --recursive ${BUCKET_URI}