feat: Raft-based High Availability using Apache Ratis #3730

Summary

Redesign the ArcadeDB High Availability stack using Apache Ratis (Raft consensus protocol) to replace the current custom leader-follower replication. This provides stronger consistency guarantees, automatic leader election, and proven consensus semantics.

Motivation

The current HA implementation uses a custom replication protocol that has known limitations around split-brain prevention, quorum enforcement, and failure recovery. Raft is a well-understood consensus algorithm with formal correctness proofs, and Apache Ratis is a mature, Apache 2.0-licensed Java implementation used in production by Apache Ozone and other systems.

Design

New Module: ha-raft

A self-contained Maven module (arcadedb-ha-raft) plugs into the existing server via ServiceLoader discovery. Key components:

  • RaftHAPlugin: server plugin entry point, configuration validation, ServiceLoader registration
  • RaftHAServer: wraps the Ratis RaftServer and RaftClient; peer list parsing and leader election
  • RaftReplicatedDatabase: wraps LocalDatabase; intercepts transaction commits and schema changes and submits them to the Raft log
  • ArcadeStateMachine: the Ratis StateMachine that applies committed log entries (WAL replay and schema operations); see the sketch after this list
  • RaftLogEntryCodec: serializes TX (WAL delta) and SCHEMA entries for the Raft log
  • SnapshotManager: checksum-based incremental snapshots for bootstrapping new nodes
  • ClusterMonitor: replication lag tracking and health reporting
  • GetClusterHandler: HTTP endpoint (/api/v1/cluster) reporting cluster status
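
ArcadeStateMachine is where replication meets the storage engine: Ratis invokes it once for every committed log entry. The skeleton below is a minimal sketch of that hook, assuming the Ratis 2.x BaseStateMachine API; the decode-and-replay step is a hypothetical placeholder for what the real implementation would do with a WAL delta or schema operation.

    // Minimal Ratis state-machine skeleton (Ratis 2.x API assumed). The replay step
    // is a placeholder; the proposed ArcadeStateMachine would decode the entry with
    // RaftLogEntryCodec and apply it to the local database.
    import java.util.concurrent.CompletableFuture;

    import org.apache.ratis.proto.RaftProtos.LogEntryProto;
    import org.apache.ratis.protocol.Message;
    import org.apache.ratis.statemachine.TransactionContext;
    import org.apache.ratis.statemachine.impl.BaseStateMachine;

    public class ReplayStateMachineSketch extends BaseStateMachine {
      @Override
      public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
        final LogEntryProto entry = trx.getLogEntry();
        // Payload as written by the leader: a serialized TX (WAL delta) or SCHEMA entry.
        final byte[] payload = entry.getStateMachineLogEntry().getLogData().toByteArray();

        // Hypothetical: RaftLogEntryCodec.decode(payload).applyTo(localDatabase);

        // Tell Ratis which term/index this replica has applied (used for snapshots and reads).
        updateLastAppliedTermIndex(entry.getTerm(), entry.getIndex());
        return CompletableFuture.completedFuture(Message.valueOf("applied:" + entry.getIndex()));
      }
    }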

Configuration

New GlobalConfiguration properties (an illustrative example follows the list):

  • arcadedb.ha.implementation — selects HA backend (raft or legacy)
  • arcadedb.ha.raft.port — gRPC port for Raft inter-node communication (default: 2434)
  • arcadedb.ha.raft.persistStorage — persist Raft log to disk for crash recovery
  • arcadedb.ha.raft.snapshotThreshold — log entries before triggering snapshot compaction
  • arcadedb.ha.replicationLagWarning — threshold (ms) to warn about replication lag
  • arcadedb.ha.serverList — static peer list in format host:raftPort:httpPort[,...]
  • arcadedb.ha.quorum — quorum policy (majority, all, none)
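
For illustration only, the snippet below shows one way these properties could be supplied at startup, assuming they are read like other arcadedb.*-prefixed system properties; the values (a 3-node cluster on the default Raft port 2434 and HTTP port 2480) are examples, not recommendations.

    // Illustrative only: supplying the proposed HA properties as system properties
    // before server startup. Property keys come from the list above; values are examples.
    public final class RaftHAConfigExample {
      public static void main(String[] args) {
        System.setProperty("arcadedb.ha.implementation", "raft");
        System.setProperty("arcadedb.ha.raft.port", "2434");                // documented default
        System.setProperty("arcadedb.ha.raft.persistStorage", "true");      // keep the Raft log across restarts
        System.setProperty("arcadedb.ha.raft.snapshotThreshold", "100000"); // log entries before snapshot compaction
        System.setProperty("arcadedb.ha.replicationLagWarning", "1000");    // warn when lag exceeds 1 s
        System.setProperty("arcadedb.ha.serverList",
            "node1:2434:2480,node2:2434:2480,node3:2434:2480");             // host:raftPort:httpPort
        System.setProperty("arcadedb.ha.quorum", "majority");
        // ...then start the ArcadeDB server as usual so it picks up these settings.
      }
    }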

How It Works

  1. Startup: Each server starts a Ratis RaftServer with the configured peer list. Ratis handles leader election automatically.
  2. Writes: RaftReplicatedDatabase intercepts commit() and command() calls, serializes the WAL delta or schema operation, and submits it to the Raft log via RaftClient (see the sketch after this list).
  3. Replication: Ratis replicates the log entry to a majority of peers. Once committed, ArcadeStateMachine.applyTransaction() replays the entry on each node.
  4. Leader forwarding: Follower nodes forward write commands to the leader via authenticated HTTP, transparent to the client.
  5. Reads: Served locally from any node, so reads are eventually consistent.
  6. Failure recovery: Crashed nodes replay the Raft log on restart. Nodes that fall too far behind receive a snapshot via SnapshotManager.
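
Steps 2 and 3 make up the write path. The sketch below shows the shape of that submission against the Apache Ratis 2.x client API; the peer addresses, group UUID, and the placeholder payload standing in for a RaftLogEntryCodec-serialized WAL delta are illustrative, not ArcadeDB's actual code.

    // Illustrative write-path sketch using the Ratis 2.x blocking client API.
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.UUID;

    import org.apache.ratis.client.RaftClient;
    import org.apache.ratis.conf.RaftProperties;
    import org.apache.ratis.protocol.Message;
    import org.apache.ratis.protocol.RaftClientReply;
    import org.apache.ratis.protocol.RaftGroup;
    import org.apache.ratis.protocol.RaftGroupId;
    import org.apache.ratis.protocol.RaftPeer;
    import org.apache.ratis.thirdparty.com.google.protobuf.ByteString;

    public final class RaftWritePathSketch {
      public static void main(String[] args) throws Exception {
        // Static peer list, mirroring arcadedb.ha.serverList (host:raftPort:httpPort).
        List<RaftPeer> peers = List.of(
            RaftPeer.newBuilder().setId("node1").setAddress("node1:2434").build(),
            RaftPeer.newBuilder().setId("node2").setAddress("node2:2434").build(),
            RaftPeer.newBuilder().setId("node3").setAddress("node3:2434").build());

        // Every node must be configured with the same Raft group id for the cluster.
        RaftGroup group = RaftGroup.valueOf(
            RaftGroupId.valueOf(UUID.fromString("02511d47-d67c-49a3-9011-abb3109a44c1")), peers);

        try (RaftClient client = RaftClient.newBuilder()
            .setProperties(new RaftProperties())
            .setRaftGroup(group)
            .build()) {
          // Placeholder for a WAL delta serialized by RaftLogEntryCodec (step 2).
          byte[] walDelta = "serialized TX entry".getBytes(StandardCharsets.UTF_8);

          // Submit to the Raft log; the call returns once a majority has committed
          // the entry and the leader's state machine has applied it (step 3).
          RaftClientReply reply = client.io().send(Message.valueOf(ByteString.copyFrom(walDelta)));
          if (!reply.isSuccess())
            throw new IllegalStateException("Raft commit failed: " + reply.getException());
        }
      }
    }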

Quorum & Split-Brain Prevention

With quorum=majority, a 3-node cluster tolerates 1 failure. If the leader loses contact with the majority, it automatically steps down (Raft protocol guarantee). The majority partition elects a new leader. No split-brain is possible — Raft's term-based voting prevents two leaders in the same term.
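
The arithmetic behind that guarantee is simple, and the short sketch below (illustrative only) spells it out: a Raft majority is floor(n/2) + 1 votes, so an n-node cluster tolerates n minus that many failures.

    // Majority-quorum arithmetic: nodes=3 -> quorum 2, tolerates 1 failure;
    // nodes=5 -> quorum 3, tolerates 2; nodes=7 -> quorum 4, tolerates 3.
    public final class QuorumMath {
      static int majority(int nodes)  { return nodes / 2 + 1; }
      static int tolerated(int nodes) { return nodes - majority(nodes); }

      public static void main(String[] args) {
        for (int nodes : new int[] { 3, 5, 7 })
          System.out.printf("nodes=%d quorum=%d tolerated failures=%d%n",
              nodes, majority(nodes), tolerated(nodes));
      }
    }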

Implementation Status

Development is on the ha-redesign branch. Current state:

Core Implementation (complete)

  • Raft log entry serialization for TX and schema operations
  • State machine with WAL replay and schema application
  • Cluster monitor with lag tracking
  • Peer list parsing (supports host:raftPort:httpPort:priority format)
  • Snapshot manager with checksum-based incremental sync
  • Leader command forwarding via authenticated HTTP
  • Plugin discovery via ServiceLoader (see the sketch after this list)
  • Cluster status API (/api/v1/cluster)
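
The ServiceLoader-based discovery mentioned above works roughly as sketched below; the HAImplementation interface and the selection logic are hypothetical stand-ins, not ArcadeDB's actual plugin API.

    // Hypothetical sketch of ServiceLoader-based discovery. The HAImplementation SPI
    // below is illustrative; each backend jar (e.g. arcadedb-ha-raft) would list its
    // implementation class in META-INF/services/<fully.qualified.name.of.the.SPI>.
    import java.util.ServiceLoader;

    public final class HADiscoverySketch {
      public interface HAImplementation {
        String name();  // e.g. "raft" or "legacy", matched against arcadedb.ha.implementation
        void start();   // start replication for this backend
      }

      public static HAImplementation select(String configured) {
        for (HAImplementation impl : ServiceLoader.load(HAImplementation.class))
          if (impl.name().equals(configured))
            return impl;
        throw new IllegalArgumentException("no HA implementation named '" + configured + "' on the classpath");
      }
    }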

Test Coverage

Unit tests (25 in ha-raft):

  • RaftHAServerTest — peer list parsing, display names, cluster token, separator handling
  • ArcadeStateMachineTest — state machine entry application
  • RaftLogEntryCodecTest — serialization round-trips
  • ClusterMonitorTest — lag tracking
  • SnapshotManagerTest — snapshot lifecycle
  • RaftHAPluginTest — config validation
  • ConfigValidationTest — configuration edge cases

Integration tests (14 in ha-raft):

  • 2-node and 3-node replication
  • Schema replication across cluster
  • Leader failover and re-election
  • Replica crash and recovery (Raft log replay)
  • Leader crash and recovery with DatabaseComparator verification
  • Quorum loss detection
  • Split-brain prevention (3-node and 5-node clusters)
  • Full snapshot resync for lagging nodes

End-to-end container tests (9 in e2e-ha):

  • SimpleHaScenarioIT — 2-node basic replication
  • ThreeInstancesScenarioIT — 3-node cluster with data verification
  • LeaderFailoverIT — leader crash, re-election, data consistency
  • NetworkPartitionIT — leader partition, follower partition, no-quorum scenarios
  • NetworkPartitionRecoveryIT — partition healing with Raft log catch-up
  • SplitBrainIT — split-brain prevention verification
  • RollingRestartIT — zero-downtime rolling restarts with data persistence
  • NetworkDelayIT — cluster behavior under latency (via Toxiproxy)
  • PacketLossIT — cluster behavior under packet loss (via Toxiproxy)

Remaining Work

  • Performance benchmarking vs current HA implementation
  • Read consistency options (leader reads, follower reads with staleness bound)
  • Dynamic cluster membership (add/remove nodes without restart)
  • Metrics integration (Ratis metrics → Prometheus)
  • Documentation and migration guide from legacy HA
  • Studio UI updates for Raft cluster status
