Many Marin and Levanter workflows expect a durable object store for checkpoints, dataset shards, logs, and executor outputs.
This tutorial walks through setting up a Google Cloud Storage (GCS) bucket that you can reference via `MARIN_PREFIX` or `trainer.checkpointer.base_path`.
- Running local GPU or TPU experiments that write checkpoints to `gs://...` paths.
- Launching TPU jobs with `scripts/ray/cluster.py` or Ray clusters, where every worker streams artifacts to a shared prefix.
- Hosting tokenized datasets or compilation caches that multiple jobs should reuse.
If you only run experiments locally with `local_store/`, you can skip this, but migrating to GCS early prevents churn later.
Pick a region that matches your compute (e.g., `us-central2` for v4/v5e TPUs or `us-west4` for west-coast GPUs). Using the same region keeps egress costs low and improves throughput. Bucket names are global, so choose something descriptive like `gs://marin-<team>-us-central2`.
For the storage class, decide between:
- Standard: Lowest latency and predictable performance; slightly higher cost but ideal if training jobs read/write checkpoints frequently.
- Autoclass: Google automatically moves objects to colder tiers if they sit idle, which can cut storage costs but occasionally delays reads when objects are thawed. Use this if you mostly archive checkpoints and don't mind rare rehydration pauses.
Marin will attempt to prevent cross-region egress by raising an error in training jobs that write to a different region than the compute, but it's best to avoid that situation entirely.
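The guard amounts to a location comparison before the job starts writing. The sketch below illustrates the idea only; it is not Marin's actual implementation, and the function name `same_region` is made up for this example:

```python
# Illustrative cross-region guard, NOT Marin's actual code.
def same_region(bucket_location: str, compute_region: str) -> bool:
    # GCS reports bucket locations in upper case (e.g., "US-CENTRAL2"),
    # so normalize case before comparing.
    return bucket_location.lower() == compute_region.lower()

print(same_region("US-CENTRAL2", "us-central2"))  # True
print(same_region("US-WEST4", "us-central2"))     # False
```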
!!! warning
    Avoid multi-region buckets (e.g., `us` or `eu`) because they incur higher costs and have more complex performance characteristics. Single-region buckets are cheaper and more predictable for Marin workloads.
```bash
PROJECT_ID=your-gcp-project
BUCKET=gs://marin-yourteam-us-central2
REGION=us-central2

# Create the bucket with uniform access and no public exposure.
gcloud storage buckets create "$BUCKET" \
  --project "$PROJECT_ID" \
  --location "$REGION" \
  --uniform-bucket-level-access \
  --default-storage-class=STANDARD  # add --enable-autoclass for automated tiering when you can tolerate slower cold reads

# Grant yourself (or a service account) Storage Object Admin if needed.
gcloud storage buckets add-iam-policy-binding "$BUCKET" \
  --member="user:you@example.com" \
  --role="roles/storage.objectAdmin"
```

Uniform bucket-level access ensures IAM policies apply consistently; keep the bucket private unless you intentionally publish checkpoints.
!!! warning
    Disabling soft delete is critical to avoid runaway storage costs. Marin creates many large, short-lived files that should be deleted immediately. Disabling soft delete also means you cannot recover deleted files, so consider lifecycle rules or replication for backups if needed.

GCS enables soft delete by default on new buckets. That feature retains deleted objects for at least seven days, which quickly inflates storage usage for Marin/Levanter workloads because training jobs constantly create and remove multi-gigabyte checkpoints and compilation caches. Disable soft delete immediately after creating the bucket:
```bash
# Permanently disable soft delete for this bucket.
gcloud storage buckets update "$BUCKET" --clear-soft-delete

# Optional: verify that the policy is cleared.
gcloud storage buckets describe "$BUCKET" \
  --format="value(soft_delete_policy)"
```

Clearing the policy ensures that once a training job deletes temporary files, they disappear immediately, preventing runaway storage bills. You can still enable backups via lifecycle rules or replication if you need recovery.
For intermediate checkpoints and other short-lived data, Marin provides dedicated scratch buckets named `marin-tmp-{region}` (one per region). These buckets have lifecycle rules that automatically delete objects based on a `ttl=Nd/` path prefix; for example, objects stored under `gs://marin-tmp-us-central2/ttl=3d/my-job/` are deleted after 3 days.
Supported TTLs: 1, 2, 3, 4, 5, 6, 7, 14, and 30 days.
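In job code it is easy to build these paths with a helper that rejects unsupported TTLs up front. This is a sketch: `scratch_path` is a hypothetical name for this tutorial, not a Marin API.

```python
# Hypothetical helper for building scratch-bucket paths; not part of Marin.
SUPPORTED_TTL_DAYS = {1, 2, 3, 4, 5, 6, 7, 14, 30}

def scratch_path(region: str, ttl_days: int, job: str) -> str:
    """Build a marin-tmp path under a supported ttl=Nd/ prefix."""
    if ttl_days not in SUPPORTED_TTL_DAYS:
        raise ValueError(f"unsupported TTL: {ttl_days}d")
    return f"gs://marin-tmp-{region}/ttl={ttl_days}d/{job}"

print(scratch_path("us-central2", 3, "my-job"))
# gs://marin-tmp-us-central2/ttl=3d/my-job
```

Failing fast on an unsupported TTL (say, `9`) is preferable to silently writing under a prefix no lifecycle rule ever cleans up.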
To provision or update all scratch buckets (create if missing, disable soft delete, apply lifecycle rules):

```bash
uv run infra/configure_temp_buckets.py

# Preview without applying changes:
uv run infra/configure_temp_buckets.py --dry-run

# Target a single bucket:
uv run infra/configure_temp_buckets.py --bucket marin-tmp-us-central2
```

For non-scratch buckets, you can still set up lifecycle rules manually. For example, delete files under a prefix after seven days:
```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
    }
  ]
}
```

Save this as `lifecycle.json` and apply it:

```bash
gcloud storage buckets update "$BUCKET" --lifecycle-file=lifecycle.json
```

Adjust prefixes to match how your experiments organize outputs.
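If you maintain several buckets, generating the policy from code keeps the rules consistent. A minimal sketch that writes the same `lifecycle.json` (the prefix and age are just the example values from above):

```python
import json

# Same policy as the hand-written example: delete objects under tmp/ after 7 days.
lifecycle = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"age": 7, "matchesPrefix": ["tmp/"]},
        }
    ]
}

with open("lifecycle.json", "w") as f:
    json.dump(lifecycle, f, indent=2)
```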
Set the bucket as your default prefix whenever you run tutorials:

```bash
export MARIN_PREFIX=$BUCKET
export WANDB_PROJECT=marin
export WANDB_ENTITY=your-entity
```

For Levanter configs, point the checkpointer to the same bucket:

```yaml
trainer:
  checkpointer:
    base_path: "$BUCKET/your-run"
```

Commit these defaults in `.levanter.yaml` or `.envrc` so every launch script uses the same location.
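Python scripts can resolve the same prefix from the environment. A sketch, assuming the `MARIN_PREFIX` convention described above; the `local_store` fallback and the run path layout are illustrative, not a Marin API:

```python
import os

# Fall back to a local directory when MARIN_PREFIX is unset
# (mirrors the local_store/ convention; fallback name is illustrative).
prefix = os.environ.get("MARIN_PREFIX", "local_store")
run_path = f"{prefix}/your-run/checkpoints"
print(run_path)
```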
- Re-run `gcloud storage buckets describe` monthly to confirm soft delete stays disabled.
- Use `gcloud storage ls --buckets --soft-deleted` to ensure no surprise buckets exist in soft-delete state.
- Monitor storage costs in Cloud Monitoring or set up alerts when the bucket exceeds an expected size.
With this setup you have a clean, low-overhead bucket tailor-made for Marin and Levanter experiments without the surprise bills that soft delete can cause.