By amacaskill
Last modified June 20, 2025

Load Hugging Face Models into Cloud Storage

Overview

This guide uses a Kubernetes Job to load the meta-llama/Meta-Llama-3-8B model weights hosted on Hugging Face into a Cloud Storage bucket, which is then used by a vLLM model deployment.

Before you begin

  1. Ensure you have a GCP project with billing enabled and have enabled the GKE and Cloud Storage APIs.

    • See the Google Cloud documentation for how to enable billing for your project.

    • The GKE and Cloud Storage APIs can be enabled by running:

      gcloud services enable container.googleapis.com
      gcloud services enable storage.googleapis.com
      
  2. Ensure you have the following tools installed on your workstation:

    • The Google Cloud CLI (gcloud)

    • kubectl

    • envsubst (used later to render the Job manifest)

  3. Configure access to Hugging Face models. You need a Hugging Face account and an access token, and you must accept the license for the gated meta-llama/Meta-Llama-3-8B model on Hugging Face; an optional token check is shown below.
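
    As an optional sanity check (this curl command is an illustration, not part of the original guide, and the exact status codes may vary), you can verify that your token can download files from the gated repository. The resolve URL below mirrors the download path used later in this guide; a 200 status generally means you have access, while 401 or 403 usually means the token is invalid or the license has not been accepted:

      curl -s -o /dev/null -w "%{http_code}\n" -L \
        -H "Authorization: Bearer <your-hf-token>" \
        https://huggingface.co/meta-llama/Meta-Llama-3-8B/resolve/main/config.json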

Set up your GKE Cluster

Let’s start by setting a few environment variables that will be used throughout this guide. You should modify these variables to match your environment and needs.

Run the following commands to set the environment variables, and make sure to replace <my-project-id>, <your-hf-token>, and <your-hf-username> with your own values:

gcloud config set project <my-project-id>
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export HF_TOKEN=<your-hf-token>
export HF_USER=<your-hf-username>
export CLUSTER_NAME=meta-llama-3-8b-cluster

Note: You might have to rerun the export commands if your shell is reset and the variables are no longer set, for example when your Cloud Shell session disconnects.

Create a GKE Autopilot cluster by running the following command. If you choose to create a GKE Standard cluster instead, you will need to enable Workload Identity Federation for GKE and the Cloud Storage FUSE CSI Driver on your cluster; a rough example of such a command is included after the Autopilot command below.

gcloud container clusters create-auto ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --labels=created-by=ai-on-gke,guide=hf-gcs-transfer 
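
As a rough sketch only (the machine type and node count here are placeholder assumptions, not requirements from this guide), a Standard cluster with both features enabled might be created like this:

gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --addons=GcsFuseCsiDriver \
  --machine-type=e2-standard-4 \
  --num-nodes=1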

Create a Kubernetes secret for Hugging Face credentials

In your shell session, do the following:

  1. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}
    
  2. Create a Kubernetes Secret that contains your Hugging Face token. This is only required for gated models:

    kubectl create secret generic hf-secret \
      --from-literal=hf_api_token=${HF_TOKEN} \
      --dry-run=client -o yaml | kubectl apply -f -
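
    Optionally, confirm that the Secret exists. This only prints the Secret's name, type, and age; it does not reveal the token value:

    kubectl get secret hf-secret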
    

Create your Cloud Storage bucket

Now, create the Cloud Storage bucket for the model weights by running the following command.


Note: Cloud Storage bucket names must be globally unique, and you must have the Storage Admin (roles/storage.admin) IAM role on the project where the bucket is created. See Create a Bucket for details.

export BUCKET_NAME=${PROJECT_ID}-meta-llama-3-8b
export BUCKET_URI=gs://${BUCKET_NAME}
gcloud storage buckets create ${BUCKET_URI} --project=${PROJECT_ID}
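
Optionally, confirm that the bucket was created in the expected project:

gcloud storage buckets describe ${BUCKET_URI}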

Deploy the Kubernetes Job to populate the Cloud Storage Bucket

  1. Configure access to the Cloud Storage bucket for the producer-job Job.

    To make your Cloud Storage bucket accessible from your GKE cluster, authenticate to it using Workload Identity Federation for GKE.


    Note: If you don’t have Workload Identity Federation for GKE enabled, see the GKE documentation for how to enable it.

    Grant the Kubernetes ServiceAccount the Storage Admin (roles/storage.admin) IAM role on the bucket by running the following commands. If you are using a custom workload identity pool, update the workload identity pool name in the command below; custom workload identity pools are not supported in Autopilot clusters.

    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export WORKLOAD_IDENTITY_POOL=${PROJECT_ID}.svc.id.goog
    export NAMESPACE=default
    export SERVICE_ACCOUNT=hf-sa
    gcloud storage buckets add-iam-policy-binding ${BUCKET_URI} \
      --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${WORKLOAD_IDENTITY_POOL}/subject/ns/${NAMESPACE}/sa/${SERVICE_ACCOUNT}" \
      --role "roles/storage.admin"
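
    To confirm the binding took effect, you can inspect the bucket's IAM policy; the principal:// member above should appear under roles/storage.admin:

    gcloud storage buckets get-iam-policy ${BUCKET_URI} --format=json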
    
  2. Save the following manifest to a file named producer-job.yaml:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: producer-job
      namespace: "${NAMESPACE}"
    spec:
      template:
        spec:
          serviceAccountName: "${SERVICE_ACCOUNT}"
          # Without this, the job will run on an E2 machine, which results in a slower transfer time.
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: cloud.google.com/machine-family
                    operator: In
                    values:
                    - "c3"
          initContainers:
          - name: download
            image: ubuntu:22.04
            resources:
              # Need a big enough machine that can fit the full model in RAM, with some buffer room.
              requests:
                memory: "30Gi"
              limits:
                memory: "30Gi"
            command: ["bash", "-c"]
            args:
            - |
              start=$(date +%s)
              apt-get update && apt-get install -y aria2 git
    
              # Get directory name from MODEL_ID (e.g., "meta-llama/Meta-Llama-3-8B" -> "Meta-Llama-3-8B")
              git_dir="${MODEL_ID##*/}"
    
              hf_endpoint="https://huggingface.co"
              download_url="https://$HF_USER:$HF_TOKEN@${hf_endpoint#https://}/$MODEL_ID"
              echo "INFO: Cloning model repository metadata into '$git_dir'..."
              GIT_LFS_SKIP_SMUDGE=1 git clone $download_url && cd $git_dir
    
              # remove files we don't want to upload
              rm -r -f .git
              rm -r -f .gitattributes
              rm -r -f original
              cd ..
    
              # Get the list of files.
              file_list=($(find "$git_dir/" -type f -name "[!.]*" -print))
    
              # Strip git dir path. 
              files=()
              for file in "${file_list[@]}"; do
                trimmed_file="${file#$git_dir/}"
                files+=("$trimmed_file")
              done
    
              # Create a file that maps each URL to its desired relative filename.
              # This is needed because aria2c uses the file's content hash as the identifier.
              > download_list.txt # Create or clear the file
    
              for file in "${files[@]}"; do
                  url="$hf_endpoint/$MODEL_ID/resolve/main/$file"
                  # Write the URL and the desired filename, separated by a space, on the same line
                  echo "$url $file" >> download_list.txt
              done
    
              echo "--- Download List (URL and Filename) ---"
              cat download_list.txt
              echo "----------------------------------------"
    
              # Use xargs to read 2 arguments per line (-n 2): the URL ($1) and the filename ($2).
              # Then, use the -o option in aria2c to specify the output filename.
              cat download_list.txt | xargs -P 4 -n 2 sh -c '
                aria2c --header="Authorization: Bearer ${HF_TOKEN}" \
                      --console-log-level=error \
                      --file-allocation=none \
                      --max-connection-per-server=16 \
                      --split=16 \
                      --min-split-size=3M \
                      --max-concurrent-downloads=16 \
                      -c \
                      -d "${MODEL_DIR}" \
                      -o "$2" \
                      "$1"
              ' _
    
              end=$(date +%s)
              du -sh ${MODEL_DIR}
              echo "download took $((end-start)) seconds"
            env:
            - name: MODEL_ID
              value: "meta-llama/Meta-Llama-3-8B"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: MODEL_DIR
              value: "/data/model"
            volumeMounts:
              - mountPath: "/data"
                name: model-tmpfs
          containers:
          - name: gcloud-upload
            image: gcr.io/google.com/cloudsdktool/cloud-sdk:stable
            resources:
              # Need a big enough machine that can fit the full model in RAM, with some buffer room.
              requests:
                memory: "30Gi"
              limits:
                memory: "30Gi"
            command: ["bash", "-c"]
            args:
            - |
              start=$(date +%s)
              gcloud storage cp -r "${MODEL_DIR}" "${BUCKET_URI}"
              end=$(date +%s)
              echo "gcloud storage cp took $((end-start)) seconds"
            env:
            - name: MODEL_DIR
              value: "/data/model"
            volumeMounts:
            - name: model-tmpfs  # Mount the same volume as the download container
              mountPath: /data
          restartPolicy: Never
          volumes:
            - name: model-tmpfs
              emptyDir:
                medium: Memory
      parallelism: 1         # Run 1 Pod at a time
      completions: 1         # The Job is done once 1 Pod completes successfully
      backoffLimit: 0        # Do not retry the Pod on failure
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: "${SERVICE_ACCOUNT}"
      namespace: "${NAMESPACE}"
    

    Note: If you are using a GKE Standard cluster with Node Autoprovisioning disabled, you will need to manually provision a C3 node pool with one node that has enough RAM to hold the model weights, plus some headroom.

    It might take a few minutes for the Job to schedule and finish copying data to the Cloud Storage bucket. When the Job completes, its status is marked “Complete”, and your Cloud Storage bucket should contain the meta-llama/Meta-Llama-3-8B files (except for the .gitattributes file and the original/ folder) inside a model folder.

  3. Deploy the Job in producer-job.yaml by running the following command. It uses envsubst to substitute the required environment variables.

    envsubst '$NAMESPACE $SERVICE_ACCOUNT $HF_USER $BUCKET_URI' < producer-job.yaml | kubectl apply -f -
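
    Optionally, you can first render the manifest and validate it with a client-side dry run (same substitution list, no changes applied to the cluster):

    envsubst '$NAMESPACE $SERVICE_ACCOUNT $HF_USER $BUCKET_URI' < producer-job.yaml | kubectl apply --dry-run=client -f -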
    
  4. Monitor the status of the transfer.

    To check the status of your Job, run the following command:

    kubectl get job producer-job --namespace ${NAMESPACE}
    

    Note: Once the Job shows the “Complete” status, the transfer is done.
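
    Instead of polling, you can also block until the Job finishes; the timeout below is an arbitrary example value, so adjust it for your environment:

    kubectl wait --for=condition=complete job/producer-job --namespace ${NAMESPACE} --timeout=30m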

    To see logs for the download container while it is running, run the following command:

    kubectl logs jobs/producer-job -c download --namespace=$NAMESPACE
    

    To see logs for the upload container while it is running, run the following command:

    kubectl logs jobs/producer-job -c gcloud-upload --namespace=$NAMESPACE
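
    Once the upload finishes, you can also list the bucket to confirm the weights landed under the model folder:

    gcloud storage ls ${BUCKET_URI}/model/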
    
  5. Once the Job completes, you can clean up the Job, its ServiceAccount, and the Hugging Face Secret by running these commands:

    kubectl delete job producer-job --namespace ${NAMESPACE}
    kubectl delete serviceaccount hf-sa --namespace ${NAMESPACE}
    kubectl delete secret hf-secret --namespace ${NAMESPACE}
    

Deploy the vLLM Model Server on GKE

Now that the meta-llama/Meta-Llama-3-8B model weights are in your Cloud Storage bucket, you can deploy the vLLM model server on GKE and load the model weights from the bucket to optimize model load time.

  1. Configure access to the Cloud Storage bucket for the model deployment.

    Grant the Storage Object Viewer (roles/storage.objectViewer) IAM role to the llama3-8b-vllm-deployment-service-account ServiceAccount so the llama3-8b-vllm-deployment can read the model weights from the Cloud Storage bucket. This is a different IAM binding from the one that granted producer-job the access it needed to populate the bucket.

    export SERVICE_ACCOUNT=llama3-8b-vllm-deployment-service-account
    gcloud storage buckets add-iam-policy-binding ${BUCKET_URI} \
      --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${WORKLOAD_IDENTITY_POOL}/subject/ns/${NAMESPACE}/sa/${SERVICE_ACCOUNT}" \
      --role "roles/storage.objectViewer"
    
  2. Deploy the following manifest to create a meta-llama/Meta-Llama-3-8B model Deployment, which loads the model weights from your Cloud Storage bucket onto the GPU using Cloud Storage FUSE.

    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llama3-8b-vllm-inference-server
      name: llama3-8b-vllm-deployment
      namespace: "${NAMESPACE}"
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llama3-8b-vllm-inference-server
      template:
        metadata:
          annotations:
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/volumes: "true"
          labels:
            ai.gke.io/inference-server: vllm
            ai.gke.io/model: LLaMA3_8B
            app: llama3-8b-vllm-inference-server
        spec:
          containers:
          - args:
            - --model=/data/model
            command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            image: vllm/vllm-openai:v0.7.2
            name: inference-server
            ports:
            - containerPort: 8000
              name: metrics
            readinessProbe:
              failureThreshold: 60
              httpGet:
                path: /health
                port: 8000
              periodSeconds: 10
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data/model
              name: model-src
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4
          serviceAccountName: "${SERVICE_ACCOUNT}"
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: ${BUCKET_NAME}
                mountOptions: implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1,only-dir:model
            name: model-src
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: llama3-8b-vllm-inference-server
      name: llama3-8b-vllm-service
      namespace: "${NAMESPACE}"
    spec:
      ports:
      - port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: llama3-8b-vllm-inference-server
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: "${SERVICE_ACCOUNT}"
      namespace: "${NAMESPACE}"
    EOF
    
  3. Check the status of the model deployment by running:

    kubectl get deployment llama3-8b-vllm-deployment --namespace ${NAMESPACE}
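
    If you prefer to block until the Deployment becomes available, you can wait on the rollout; the timeout here is an arbitrary example, since the first start includes pulling the image and loading the weights:

    kubectl rollout status deployment/llama3-8b-vllm-deployment --namespace ${NAMESPACE} --timeout=15m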
    

    Once your model deployment is running, follow the vLLM documentation to build and send a request to your endpoint.
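
    As a minimal sketch of such a request (assuming the vLLM OpenAI-compatible server and that the served model name defaults to the --model path, /data/model; adjust if you set --served-model-name), you can port-forward the Service and send a completion request:

    # Forward local port 8000 to the vLLM Service in the background
    kubectl port-forward service/llama3-8b-vllm-service 8000:8000 --namespace ${NAMESPACE} &

    # Send a simple completion request to the OpenAI-compatible endpoint
    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "/data/model", "prompt": "San Francisco is a", "max_tokens": 32}'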

Cleanup

  1. Delete the model deployment, service, and service account by running:

    kubectl delete deployment llama3-8b-vllm-deployment --namespace ${NAMESPACE}
    kubectl delete service llama3-8b-vllm-service --namespace ${NAMESPACE}
    kubectl delete serviceaccount ${SERVICE_ACCOUNT} --namespace ${NAMESPACE}
    
  2. Delete the Cloud Storage bucket and all of its contents by running:

    gcloud storage rm --recursive ${BUCKET_URI}
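
    Optionally, if you created the GKE cluster only for this guide, you can delete it as well (this removes the cluster and every workload running on it):

    gcloud container clusters delete ${CLUSTER_NAME} --region=${REGION} --project=${PROJECT_ID}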
    
