Ray on GKE

This directory contains examples, guides, and best practices for running Ray on Google Kubernetes Engine (GKE). Most examples use the ray-on-gke Terraform module to install KubeRay and deploy RayCluster resources.

Getting Started

We highly recommend using the infrastructure Terraform module to create your GKE cluster.

Create a RayCluster on a GKE cluster

  1. Clone the quick-start-guides repository.

    git clone https://github.com/ai-on-gke/quick-start-guides.git
    
  2. Edit ray-on-gke/workloads.tfvars with your environment-specific variables and configurations. The following variables require configuration:

    • project_id
    • cluster_name
    • cluster_location

    If you need a new cluster, set create_cluster = true in the same file.

  3. Run the following commands to install KubeRay and deploy a Ray cluster onto your existing cluster.

    cd ray-on-gke/
    terraform init
    terraform apply --var-file=workloads.tfvars
    
  4. Validate that the RayCluster is ready:

    $ kubectl get raycluster
    NAME                  DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    ray-cluster-kuberay   1                 1                   ready    3m41s
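
The readiness check in step 4 can also be scripted, for example in a CI pipeline. The following is a minimal sketch, not part of this repository. It assumes that the STATUS column printed by kubectl is backed by the `.status.state` field of the RayCluster resource; the function names `fetch_rayclusters` and `ready_clusters` are hypothetical.

```python
import json
import subprocess
from typing import Optional


def fetch_rayclusters(namespace: Optional[str] = None) -> dict:
    """Return the parsed output of `kubectl get raycluster -o json`."""
    cmd = ["kubectl", "get", "raycluster", "-o", "json"]
    if namespace:
        cmd += ["-n", namespace]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def ready_clusters(raycluster_list: dict) -> dict:
    """Map each RayCluster name to True if it reports the state 'ready'.

    Assumption: readiness is exposed as `.status.state == "ready"`.
    """
    result = {}
    for item in raycluster_list.get("items", []):
        name = item["metadata"]["name"]
        state = item.get("status", {}).get("state", "")
        result[name] = state == "ready"
    return result
```

Calling `ready_clusters(fetch_rayclusters())` returns a dictionary such as `{"ray-cluster-kuberay": True}` when the cluster shown above is ready. Pass a namespace to `fetch_rayclusters` if the RayCluster is not in your current namespace.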
    

See the tfvars examples to explore different configuration options for the Ray cluster using the Terraform templates.

Install Ray

Ensure Ray is installed in your environment. See Installing Ray for more details.

Submit a Ray job

  1. To submit a Ray job, first establish a connection to the Ray head. For this example we’ll use kubectl port-forward to connect to the Ray head via localhost.

    kubectl -n ai-on-gke port-forward service/ray-cluster-kuberay-head-svc 8265 &
    
  2. Submit a Ray job that prints resources available in your Ray cluster:

    $ ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
    Job submission server address: http://localhost:8265
    
    -------------------------------------------------------
    Job 'raysubmit_4JBD9mLhh9sjqm8g' submitted successfully
    -------------------------------------------------------
    
    Next steps
      Query the logs of the job:
        ray job logs raysubmit_4JBD9mLhh9sjqm8g
      Query the status of the job:
        ray job status raysubmit_4JBD9mLhh9sjqm8g
      Request the job to be stopped:
        ray job stop raysubmit_4JBD9mLhh9sjqm8g
    
    Tailing logs until the job exits (disable with --no-wait):
    2024-03-19 20:46:28,668 INFO worker.py:1405 -- Using address 10.80.0.19:6379 set in the environment variable RAY_ADDRESS
    2024-03-19 20:46:28,668 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.80.0.19:6379...
    2024-03-19 20:46:28,677 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 10.80.0.19:8265
    {'node:__internal_head__': 1.0, 'object_store_memory': 2295206707.0, 'memory': 8000000000.0, 'CPU': 4.0, 'node:10.80.0.19': 1.0}
    Handling connection for 8265
    
    ------------------------------------------
    Job 'raysubmit_4JBD9mLhh9sjqm8g' succeeded
    ------------------------------------------
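
The same job can be submitted from Python instead of the `ray job submit` CLI. The sketch below uses the `JobSubmissionClient` from Ray's job submission SDK and assumes the port-forward from step 1 is still running on localhost:8265. The helper names `build_entrypoint` and `submit_and_wait` are hypothetical and only serve the illustration.

```python
import shlex
import time


def build_entrypoint(python_snippet: str) -> str:
    """Wrap a Python one-liner as a shell entrypoint, quoting it safely."""
    return f"python -c {shlex.quote(python_snippet)}"


def submit_and_wait(address: str, entrypoint: str, poll_seconds: float = 2.0) -> str:
    """Submit a job, poll until it reaches a terminal state, and return its logs."""
    # Imported lazily so the helpers above can be used without Ray installed.
    from ray.job_submission import JobStatus, JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = client.submit_job(entrypoint=entrypoint)
    terminal = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while client.get_job_status(job_id) not in terminal:
        time.sleep(poll_seconds)
    return client.get_job_logs(job_id)
```

For example, `submit_and_wait("http://localhost:8265", build_entrypoint("import ray; ray.init(); print(ray.cluster_resources())"))` submits the job from step 2 and returns its logs once the job finishes.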
    

Ray Client for interactive sessions

The Ray Client API enables Python scripts to interactively connect to remote Ray clusters. See Ray Client for more details.

  1. To use the client, first establish a connection to the Ray head. For this example we’ll use kubectl port-forward to connect to the Ray head Service via localhost.

    kubectl -n ai-on-gke port-forward service/ray-cluster-kuberay-head-svc 10001 &
    
  2. Next, define a Python script containing remote code you want to run on your Ray cluster. Similar to the previous example, this remote function will print the resources available in the cluster:

    # cluster_resources.py
    import ray
    
    ray.init("ray://localhost:10001")
    
    @ray.remote
    def cluster_resources():
        return ray.cluster_resources()
    
    print(ray.get(cluster_resources.remote()))
    
  3. Run the Python script:

    $ python cluster_resources.py
    {'CPU': 4.0, 'node:__internal_head__': 1.0, 'object_store_memory': 2280821145.0, 'node:10.80.0.22': 1.0, 'memory': 8000000000.0}
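
Because `kubectl port-forward` is started in the background, a script may call `ray.init` before the local port is listening. One way to guard against that is to wait for the port first. This is a minimal sketch using only the Python standard library; the function name `wait_for_port` is hypothetical, and the check only confirms that the local port accepts TCP connections, not that the Ray head itself is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection to probe the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In cluster_resources.py, you could call `wait_for_port("localhost", 10001)` and only run `ray.init("ray://localhost:10001")` when it returns True.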
    

Guides & Tutorials

See the following guides and tutorials for running Ray applications on GKE:

Blogs & Best Practices

