feat: Raft-based High Availability using Apache Ratis #3730

Summary

Redesign the ArcadeDB High Availability stack using Apache Ratis (Raft consensus protocol) to replace the current custom leader-follower replication. This provides stronger consistency guarantees, automatic leader election, and proven consensus semantics.

Motivation

The current HA implementation uses a custom replication protocol that has known limitations around split-brain prevention, quorum enforcement, and failure recovery. Raft is a well-understood consensus algorithm with formal correctness proofs, and Apache Ratis is a mature, Apache 2.0-licensed Java implementation used in production by Apache Ozone and other systems.

Design

New Module: ha-raft

A self-contained Maven module (arcadedb-ha-raft) plugs into the existing server via ServiceLoader discovery. Key components:

  • RaftHAPlugin: server plugin entry point, configuration validation, ServiceLoader registration
  • RaftHAServer: wraps the Ratis RaftServer and RaftClient; peer list parsing and leader election
  • RaftReplicatedDatabase: wraps LocalDatabase; intercepts transaction commits and schema changes and submits them to the Raft log
  • ArcadeStateMachine: the Ratis StateMachine that applies committed log entries (WAL replay and schema operations); see the sketch after this list
  • RaftLogEntryCodec: serializes TX (WAL delta) and SCHEMA entries for the Raft log
  • SnapshotManager: checksum-based incremental snapshots for bootstrapping new nodes
  • ClusterMonitor: replication lag tracking and health reporting
  • GetClusterHandler: HTTP endpoint (/api/v1/cluster) reporting cluster status
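
ArcadeStateMachine is where replication meets the storage engine: Ratis invokes it once for every committed log entry. The skeleton below is a minimal sketch of that hook, assuming the Ratis 2.x BaseStateMachine API; the decode-and-replay step is a hypothetical placeholder for what the real implementation would do with a WAL delta or schema operation.

    // Minimal Ratis state-machine skeleton (Ratis 2.x API assumed). The replay step
    // is a placeholder; the proposed ArcadeStateMachine would decode the entry with
    // RaftLogEntryCodec and apply it to the local database.
    import java.util.concurrent.CompletableFuture;

    import org.apache.ratis.proto.RaftProtos.LogEntryProto;
    import org.apache.ratis.protocol.Message;
    import org.apache.ratis.statemachine.TransactionContext;
    import org.apache.ratis.statemachine.impl.BaseStateMachine;

    public class ReplayStateMachineSketch extends BaseStateMachine {
      @Override
      public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
        final LogEntryProto entry = trx.getLogEntry();
        // Payload as written by the leader: a serialized TX (WAL delta) or SCHEMA entry.
        final byte[] payload = entry.getStateMachineLogEntry().getLogData().toByteArray();

        // Hypothetical: RaftLogEntryCodec.decode(payload).applyTo(localDatabase);

        // Tell Ratis which term/index this replica has applied (used for snapshots and reads).
        updateLastAppliedTermIndex(entry.getTerm(), entry.getIndex());
        return CompletableFuture.completedFuture(Message.valueOf("applied:" + entry.getIndex()));
      }
    }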

Configuration

New GlobalConfiguration properties (an illustrative example follows the list):

  • arcadedb.ha.implementation — selects HA backend (raft or legacy)
  • arcadedb.ha.raft.port — gRPC port for Raft inter-node communication (default: 2434)
  • arcadedb.ha.raft.persistStorage — persist Raft log to disk for crash recovery
  • arcadedb.ha.raft.snapshotThreshold — log entries before triggering snapshot compaction
  • arcadedb.ha.replicationLagWarning — threshold (ms) to warn about replication lag
  • arcadedb.ha.serverList — static peer list in format host:raftPort:httpPort[,...]
  • arcadedb.ha.quorum — quorum policy (majority, all, none)
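
For illustration only, the snippet below shows one way these properties could be supplied at startup, assuming they are read like other arcadedb.*-prefixed system properties; the values (a 3-node cluster on the default Raft port 2434 and HTTP port 2480) are examples, not recommendations.

    // Illustrative only: supplying the proposed HA properties as system properties
    // before server startup. Property keys come from the list above; values are examples.
    public final class RaftHAConfigExample {
      public static void main(String[] args) {
        System.setProperty("arcadedb.ha.implementation", "raft");
        System.setProperty("arcadedb.ha.raft.port", "2434");                // documented default
        System.setProperty("arcadedb.ha.raft.persistStorage", "true");      // keep the Raft log across restarts
        System.setProperty("arcadedb.ha.raft.snapshotThreshold", "100000"); // log entries before snapshot compaction
        System.setProperty("arcadedb.ha.replicationLagWarning", "1000");    // warn when lag exceeds 1 s
        System.setProperty("arcadedb.ha.serverList",
            "node1:2434:2480,node2:2434:2480,node3:2434:2480");             // host:raftPort:httpPort
        System.setProperty("arcadedb.ha.quorum", "majority");
        // ...then start the ArcadeDB server as usual so it picks up these settings.
      }
    }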

How It Works

  1. Startup: Each server starts a Ratis RaftServer with the configured peer list. Ratis handles leader election automatically.
  2. Writes: RaftReplicatedDatabase intercepts commit() and command() calls, serializes the WAL delta or schema operation, and submits it to the Raft log via RaftClient (see the sketch after this list).
  3. Replication: Ratis replicates the log entry to a majority of peers. Once committed, ArcadeStateMachine.applyTransaction() replays the entry on each node.
  4. Leader forwarding: Follower nodes forward write commands to the leader via authenticated HTTP, transparent to the client.
  5. Reads: Served locally from any node, so reads are eventually consistent.
  6. Failure recovery: Crashed nodes replay the Raft log on restart. Nodes that fall too far behind receive a snapshot via SnapshotManager.
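
Steps 2 and 3 make up the write path. The sketch below shows the shape of that submission against the Apache Ratis 2.x client API; the peer addresses, group UUID, and the placeholder payload standing in for a RaftLogEntryCodec-serialized WAL delta are illustrative, not ArcadeDB's actual code.

    // Illustrative write-path sketch using the Ratis 2.x blocking client API.
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.UUID;

    import org.apache.ratis.client.RaftClient;
    import org.apache.ratis.conf.RaftProperties;
    import org.apache.ratis.protocol.Message;
    import org.apache.ratis.protocol.RaftClientReply;
    import org.apache.ratis.protocol.RaftGroup;
    import org.apache.ratis.protocol.RaftGroupId;
    import org.apache.ratis.protocol.RaftPeer;
    import org.apache.ratis.thirdparty.com.google.protobuf.ByteString;

    public final class RaftWritePathSketch {
      public static void main(String[] args) throws Exception {
        // Static peer list, mirroring arcadedb.ha.serverList (host:raftPort:httpPort).
        List<RaftPeer> peers = List.of(
            RaftPeer.newBuilder().setId("node1").setAddress("node1:2434").build(),
            RaftPeer.newBuilder().setId("node2").setAddress("node2:2434").build(),
            RaftPeer.newBuilder().setId("node3").setAddress("node3:2434").build());

        // Every node must be configured with the same Raft group id for the cluster.
        RaftGroup group = RaftGroup.valueOf(
            RaftGroupId.valueOf(UUID.fromString("02511d47-d67c-49a3-9011-abb3109a44c1")), peers);

        try (RaftClient client = RaftClient.newBuilder()
            .setProperties(new RaftProperties())
            .setRaftGroup(group)
            .build()) {
          // Placeholder for a WAL delta serialized by RaftLogEntryCodec (step 2).
          byte[] walDelta = "serialized TX entry".getBytes(StandardCharsets.UTF_8);

          // Submit to the Raft log; the call returns once a majority has committed
          // the entry and the leader's state machine has applied it (step 3).
          RaftClientReply reply = client.io().send(Message.valueOf(ByteString.copyFrom(walDelta)));
          if (!reply.isSuccess())
            throw new IllegalStateException("Raft commit failed: " + reply.getException());
        }
      }
    }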

Quorum & Split-Brain Prevention

With quorum=majority, a 3-node cluster tolerates 1 failure. If the leader loses contact with the majority, it automatically steps down (Raft protocol guarantee). The majority partition elects a new leader. No split-brain is possible — Raft's term-based voting prevents two leaders in the same term.
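
The arithmetic behind that guarantee is simple, and the short sketch below (illustrative only) spells it out: a Raft majority is floor(n/2) + 1 votes, so an n-node cluster tolerates n minus that many failures.

    // Majority-quorum arithmetic: nodes=3 -> quorum 2, tolerates 1 failure;
    // nodes=5 -> quorum 3, tolerates 2; nodes=7 -> quorum 4, tolerates 3.
    public final class QuorumMath {
      static int majority(int nodes)  { return nodes / 2 + 1; }
      static int tolerated(int nodes) { return nodes - majority(nodes); }

      public static void main(String[] args) {
        for (int nodes : new int[] { 3, 5, 7 })
          System.out.printf("nodes=%d quorum=%d tolerated failures=%d%n",
              nodes, majority(nodes), tolerated(nodes));
      }
    }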

Implementation Status

Development is on the ha-redesign branch. Current state:

Core Implementation (complete)

  • Raft log entry serialization for TX and schema operations
  • State machine with WAL replay and schema application
  • Cluster monitor with lag tracking
  • Peer list parsing (supports host:raftPort:httpPort:priority format)
  • Snapshot manager with checksum-based incremental sync
  • Leader command forwarding via authenticated HTTP
  • Plugin discovery via ServiceLoader (see the sketch after this list)
  • Cluster status API (/api/v1/cluster)
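
The ServiceLoader-based discovery mentioned above works roughly as sketched below; the HAImplementation interface and the selection logic are hypothetical stand-ins, not ArcadeDB's actual plugin API.

    // Hypothetical sketch of ServiceLoader-based discovery. The HAImplementation SPI
    // below is illustrative; each backend jar (e.g. arcadedb-ha-raft) would list its
    // implementation class in META-INF/services/<fully.qualified.name.of.the.SPI>.
    import java.util.ServiceLoader;

    public final class HADiscoverySketch {
      public interface HAImplementation {
        String name();  // e.g. "raft" or "legacy", matched against arcadedb.ha.implementation
        void start();   // start replication for this backend
      }

      public static HAImplementation select(String configured) {
        for (HAImplementation impl : ServiceLoader.load(HAImplementation.class))
          if (impl.name().equals(configured))
            return impl;
        throw new IllegalArgumentException("no HA implementation named '" + configured + "' on the classpath");
      }
    }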

Test Coverage

Unit tests (25 in ha-raft):

  • RaftHAServerTest — peer list parsing, display names, cluster token, separator handling
  • ArcadeStateMachineTest — state machine entry application
  • RaftLogEntryCodecTest — serialization round-trips
  • ClusterMonitorTest — lag tracking
  • SnapshotManagerTest — snapshot lifecycle
  • RaftHAPluginTest — config validation
  • ConfigValidationTest — configuration edge cases

Integration tests (14 in ha-raft):

  • 2-node and 3-node replication
  • Schema replication across cluster
  • Leader failover and re-election
  • Replica crash and recovery (Raft log replay)
  • Leader crash and recovery with DatabaseComparator verification
  • Quorum loss detection
  • Split-brain prevention (3-node and 5-node clusters)
  • Full snapshot resync for lagging nodes

End-to-end container tests (9 in e2e-ha):

  • SimpleHaScenarioIT — 2-node basic replication
  • ThreeInstancesScenarioIT — 3-node cluster with data verification
  • LeaderFailoverIT — leader crash, re-election, data consistency
  • NetworkPartitionIT — leader partition, follower partition, no-quorum scenarios
  • NetworkPartitionRecoveryIT — partition healing with Raft log catch-up
  • SplitBrainIT — split-brain prevention verification
  • RollingRestartIT — zero-downtime rolling restarts with data persistence
  • NetworkDelayIT — cluster behavior under latency (via Toxiproxy)
  • PacketLossIT — cluster behavior under packet loss (via Toxiproxy)

Remaining Work

  • Performance benchmarking vs current HA implementation
  • Read consistency options (leader reads, follower reads with staleness bound)
  • Dynamic cluster membership (add/remove nodes without restart)
  • Metrics integration (Ratis metrics → Prometheus)
  • Documentation and migration guide from legacy HA
  • Studio UI updates for Raft cluster status
