Setting up a TPU Cluster

!!! note This document is a work in progress. It is not yet complete and may contain errors or omissions. Please check back later for updates, and ask in the Discord for help.

This guide will walk you through the steps to set up and manage a TPU cluster using Google Cloud Platform (GCP) and Ray. Currently, this is the only way to use multinode TPU (i.e. more than a v4-8) with Marin.

Prerequisites

Before you begin, ensure you have:

Completed the basic installation
Access to a Google Cloud project with TPU quota
Google Cloud SDK installed
Ray installed
A GCS bucket set up using Prepare a Storage Bucket, with soft delete disabled and permissions granted to your cluster's service account

TPU Cluster Architecture

A typical TPU cluster consists of:

Head Node: A persistent GCP VM that coordinates the cluster
Worker Nodes: Autoscaling TPU VMs that execute the actual computation

Cluster Configuration

The cluster configuration is defined in a YAML template file. This template allows you to customize various aspects of your cluster:

Machine types for head and worker nodes
Number of CPUs and amount of RAM
Disk size and type
TPU types and versions
Minimum and maximum worker counts
Docker images and initialization commands
Network configuration
Environment variables and secrets. We use GCP secrets manager to store secrets.

For our clusters, each config is in a separate file in the infra directory. These files are automatically generated by the scripts/ray/cluster.py script, which reads the infra/marin-cluster-template.yaml file. More information can be found in the infra/README.md.

Building and Uploading the Docker Image

We use docker images to run the jobs inside the ray cluster. We provide a convieninet make target make cluster_docker to build the docker image and upload it to GCP artifact registry for various different regions. This make target performs the following steps:

Build the docker image inside Dockerfile.cluster
Tag the docker image with the different region names
Push the docker images to respective GCP artifact registries

Setting Up the Cluster

Install Marin with TPU support:
```
uv sync --all-packages --extra=tpu
```

Set up your GCP environment:

# Configure Docker for GCP artifact registry
gcloud auth configure-docker <region>-docker.pkg.dev

Start the cluster (replace $REGION with your desired region, e.g., us-central2):
```
ray up -y infra/marin-$REGION.yaml
```

Executing a Job on the cluster

For running any meaningful job on the cluster, you will have to make a GCS bucket and also set the MARIN_PREFIX environment variable to the bucket name, e.g. gs://marin-us-central2. Follow the Prepare a Storage Bucket tutorial so soft delete is disabled and IAM defaults are correct. We use this bucket to store all the checkpoints, datasets, and other artifacts.

If you haven't setup your SSH keys yet, use

make dev_setup

to setup the required keys for accessing the Marin Ray clusters.

Using Ray Submit

Connect to the cluster using ray dashboard:
```
ray dashboard infra/marin-$REGION.yaml
```
In another terminal, submit a job using ray submit:
```
export RAY_ADDRESS=http://localhost:8265
ray job submit --working-dir . -- python experiments/tutorials/hello_world.py
```
Note that RAY_ADDRESS needs to use the http:// URL specifier when connecting via the dashboard.

Using Ray Run

Connect to the cluster using ray dashboard:
```
ray dashboard infra/marin-$REGION.yaml
```
We have made a thin wrapper on top of ray submit called ray-run which can be used to easily specify additional dependency extras/packages and environment variables (apart from those specified in the Dockerfile or cluster config). Ray run ensures the following:
- All the dependencies in pyproject.toml are installed
- Some necessary environment defaults are set directly in ray_run.py
- src directories of submodules are added to PYTHONPATH.

You can use --env_vars and --extra to specify additional environment variables and dependency extras/packages. For example, to run a job with an additional package (async-lru) and the environment variable WANDB_API_KEY, you can use:

uv run lib/marin/src/marin/run/ray_run.py --env_vars WANDB_API_KEY ${WANDB_API_KEY} --extra async-lru -- python experiments/tutorials/hello_world.py

Managing the Cluster

Using the cluster management script

Most cluster management for Ray & GCP is handled by scripts/ray/cluster.py. You can specify a path to either a config file or a region to operate on e.g. eu-west4.

# Basic cluster operations
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml start-cluster
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml stop-cluster
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml restart-cluster
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml list-workers

# List and update configurations
uv run scripts/ray/cluster.py list-configs
uv run scripts/ray/cluster.py update-configs

# SSH operations
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml ssh-head
uv run scripts/ray/cluster.py ssh-tpu 10.130.1.40

# Job operations
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml list-jobs
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml submit-job "python experiments/tutorials/hello_world.py"
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml wait-job <JOB_ID> --show-logs

# Monitoring and logs
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml show-logs --tail 100
uv run scripts/ray/cluster.py --config infra/marin-$REGION.yaml dashboard

Legacy Commands (still supported)

The original Ray commands still work but are less convenient:

# Monitor cluster status
ray status infra/marin-$REGION.yaml

# SSH into head node
ray attach infra/marin-$REGION.yaml

# View autoscaler logs
ray exec infra/marin-$REGION.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"

# Stop the cluster
ray down -y infra/marin-$REGION.yaml

Best Practices

Data Location: Use a bucket in the same region as your cluster to minimize data transfer costs.
Preemption Handling: If your compute is preemptible, meaning VMs can be shut down at any time. Design your jobs to:
- Be idempotent (can run multiple times safely)
- Handle interruptions gracefully
- Save checkpoints regularly
Resource Management:
- Avoid running heavy computation on the head node
- Break jobs into tasks that complete within 1-10 minutes
- Use appropriate TPU types for your workload (v4-8 for v4 clusters, v5e-1 for v5e clusters)

Troubleshooting

If you encounter issues:

Cluster Not Scaling: Check autoscaler logs for errors:

ray exec infra/marin-$REGION.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"

Worker Connection Issues: If workers aren't connecting, try:
- Checking GCP Console for VM/TPU status
- Reviewing Ray dashboard on the head node
- Restarting the cluster if needed
Cluster State Issues: If the cluster state becomes inconsistent:
- Bring down the cluster completely
- Delete any orphaned resources in GCP Console
- Bring the cluster back up

Next Steps

Try running a basic training job to verify your setup
Explore the Language Modeling Pipeline to understand how Marin uses TPUs for training
Read the Ray documentation for more details on cluster management
Levanter Ray autoscaling setup

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Setting up a TPU Cluster

Prerequisites

TPU Cluster Architecture

Cluster Configuration

Building and Uploading the Docker Image

Setting Up the Cluster

Executing a Job on the cluster

Using Ray Submit

Using Ray Run

Managing the Cluster

Using the cluster management script

Legacy Commands (still supported)

Best Practices

Troubleshooting

Next Steps

FilesExpand file tree

tpu-cluster-setup.md

Latest commit

History

tpu-cluster-setup.md

File metadata and controls

Setting up a TPU Cluster

Prerequisites

TPU Cluster Architecture

Cluster Configuration

Building and Uploading the Docker Image

Setting Up the Cluster

Executing a Job on the cluster

Using Ray Submit

Using Ray Run

Managing the Cluster

Using the cluster management script

Legacy Commands (still supported)

Best Practices

Troubleshooting

Next Steps