Global domain archival URI is immutable across clusters, creating history-loss risk during multi-region failover #8010

@Ramkumar92

Description

In a multi-cluster / multi-region Cadence deployment using global domains, the domain-level archival URI is replicated across clusters and becomes effectively immutable after it is first set.

That means a global domain created in region A with history_uri=s3://bucket-a and visibility_uri=s3://bucket-a continues to use those same URIs after failover to region B. There is no supported way to switch the domain to a region-local archival bucket after failover.

This is especially risky for S3-based archival. If the original archival bucket or its regional access path is unavailable from the failover region, history archival can fail during the failover window. The current archival workflow then proceeds to delete workflow history even after archival upload has failed, which creates a real risk of permanent archival data loss.

Steps to Reproduce / How to Trigger

  1. Set up Cadence with at least two clusters in different AWS regions and enable global domains.
  2. Enable history archival with S3.
  3. Register a global domain with region A’s bucket:
cadence --address <frontend-a>:7933 --do test-global domain register \
  --global_domain true \
  --history_archival_status enabled \
  --history_uri "s3://bucket-region-a" \
  --visibility_archival_status enabled \
  --visibility_uri "s3://bucket-region-a"
  4. Confirm the domain replicates to region B.
  5. Fail over the global domain so region B becomes active.
  6. Simulate region A bucket unavailability from region B. For example:
    • block egress from region B to the original bucket endpoint, or
    • use an outage scenario where the original bucket / access path is unavailable.
  7. Close workflows so they enter the retention cleanup + archival path.
  8. Observe history archival attempts against the original URI from region B.

Expected Behavior

  • A global domain should have a safe multi-region archival path: after failover, the active region should still be able to archive to a target it can reach.
  • If Cadence expects a single URI for global domains, there should be a documented and supported multi-region mechanism that works for S3 failover scenarios.

Actual Behavior

  • Once the archival URI is set for a domain, it cannot be changed to a different bucket.
  • The same archival URI is replicated to all clusters for the global domain.
  • After failover, the new active region still attempts to archive to the original region’s bucket/URI.
  • If archival upload keeps failing, the archiver logs:
failed to archive history, will move on to deleting history without archiving
  • The workflow then proceeds to delete history, which can cause permanent loss of archival history for closed workflows during failover or bucket unavailability.

Additional attempted workaround:

We also tried using an S3 Multi-Region Access Point ARN as the domain archival URI:

cadence --address <frontend>:7933 --do domain-test domain register \
  --history_uri "s3://arn:aws:s3::710914175400:accesspoint/alias.mrap" \
  --visibility_uri "s3://arn:aws:s3::710914175400:accesspoint/alias.mrap" \
  --history_archival_status enabled \
  --visibility_archival_status enabled

This failed during URI parsing with:

parse "s3://arn:aws:s3::710914175400:accesspoint/m9bh5d4up9e61.mrap": invalid port ":accesspoint" after host

Why This Matters

  • Global domains are specifically used for cross-cluster / cross-region failover.
  • Because S3 archival URIs are immutable and global to the domain, regional failover is unsafe unless every region can always write to the same archival target.
  • During the exact scenario where failover matters most, archival can become unavailable.
  • Because failed history archival is followed by deletion, this is more than a missing feature; it is a durability bug for archival in multi-region failover scenarios.

Suggested Fix Directions

  • Allow per-cluster archival URI overrides for global domains.
  • Allow controlled archival URI migration after initial enablement.
  • Do not delete history when archival upload has failed.
  • If S3 MRAP is intended to be supported, accept MRAP-compatible URIs and avoid validation flows that assume standard bucket semantics only.
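
As one possible shape for the first suggestion, the sketch below resolves the history URI per cluster: it uses a cluster-specific override when one exists and falls back to the replicated default otherwise. The type, field names, cluster names, and the `s3://bucket-region-b` bucket are all hypothetical and are not part of Cadence's domain configuration today.

```go
package main

import "fmt"

// domainArchivalConfig is a hypothetical shape for illustration only: one
// replicated default URI plus optional per-cluster overrides.
type domainArchivalConfig struct {
	DefaultHistoryURI string
	ClusterOverrides  map[string]string // cluster name -> history URI
}

// historyURIFor returns the override for the given cluster if one is set,
// otherwise the replicated default.
func (c domainArchivalConfig) historyURIFor(cluster string) string {
	if uri, ok := c.ClusterOverrides[cluster]; ok && uri != "" {
		return uri
	}
	return c.DefaultHistoryURI
}

func main() {
	cfg := domainArchivalConfig{
		DefaultHistoryURI: "s3://bucket-region-a",
		ClusterOverrides:  map[string]string{"cluster-b": "s3://bucket-region-b"},
	}
	fmt.Println(cfg.historyURIFor("cluster-a"))
	fmt.Println(cfg.historyURIFor("cluster-b"))
}
```

Keeping the default as the fallback means existing domains with a single URI would behave exactly as before until an override is added.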

Environment

  • Cadence server version: v1.4.0
  • Cadence SDK language and version (if applicable): Java/4.0.0
  • Cadence web version (if applicable): 4.X
  • DB & version: Postgres 18
  • Scale: Multi-cluster / multi-region global domain deployment
  • Archival backend: S3
  • Topology: AWS multi-region failover setup

Metadata

  • Assignees: No one assigned
  • Labels: bug, needs-info (Needs additional information from the reporter)
