
Replace custom Raft protocol with Apache Ratis#3798

Open
lvca wants to merge 156 commits into main from apache-ratis



@lvca
Member

@lvca lvca commented Apr 5, 2026

This comes from @robfrank, who worked on improving HA for many months with poor results. The main reason was the Raft implementation itself, which I created years ago (my fault) and which was quite limited and poorly tested. Reaching the level of rock-solidity we want could take a year or more. So @robfrank had the idea to replace our internal Raft system with Apache Ratis 3.2.2, a battle-tested, formally correct implementation of the Raft consensus protocol used in production by Apache Ozone, IoTDB, and Alluxio.

After reviewing Apache Ratis's internal architecture, I started the integration from scratch, working in parallel with @robfrank's ha-redesign branch, from which I've taken many tests and some components.

This change is transparent to users - the HTTP API, database API, query languages, and client libraries remain unchanged. The internal HA protocol now uses gRPC (shaded by Ratis) instead of custom TCP binary messages.

92 files changed, +7,917 / -6,639 lines (net +1,278)

What was removed (34 files)

The entire custom HA stack: HAServer, Leader2ReplicaNetworkExecutor, Replica2LeaderNetworkExecutor, LeaderNetworkListener, ReplicationLogFile, ReplicationProtocol, 21 message classes (TxRequest, CommandForwardRequest, etc.), and related infrastructure.

What was added (22 new files)

Core HA (5 production files in server/ha/ratis/):

  • RaftHAServer - Ratis server lifecycle, gRPC transport, peer management, dynamic membership
  • ArcadeDBStateMachine - Ratis state machine for WAL replication with election tracking
  • RaftLogEntry - binary serialization for Raft log entries (TRANSACTION, TRANSACTION_FORWARD)
  • HALog - verbose logging utility (arcadedb.ha.logVerbose=0/1/2/3)
  • SnapshotHttpHandler - HTTP endpoint for database snapshot serving
  • ClusterMonitor - replication lag monitoring with configurable warning threshold
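To make the "binary serialization" bullet more concrete, here is a minimal Java sketch of a type-tagged log entry that round-trips through a byte array. The class name LogEntrySketch and the field layout (type byte, database name, length-prefixed payload) are assumptions for illustration only; this is not the actual RaftLogEntry wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical entry layout: [type:byte][database:modified UTF-8][payloadLength:int][payload].
// Illustrative only; NOT the real RaftLogEntry format.
final class LogEntrySketch {
  enum Type { TRANSACTION, TRANSACTION_FORWARD }

  final Type type;
  final String database;
  final byte[] payload;

  LogEntrySketch(final Type type, final String database, final byte[] payload) {
    this.type = type;
    this.database = database;
    this.payload = payload;
  }

  byte[] serialize() {
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeByte(type.ordinal());   // entry kind
      out.writeUTF(database);          // target database name
      out.writeInt(payload.length);    // length prefix so the reader knows how much to read
      out.write(payload);              // opaque transaction bytes
    } catch (final IOException e) {
      throw new UncheckedIOException(e); // not expected with an in-memory stream
    }
    return bytes.toByteArray();
  }

  static LogEntrySketch deserialize(final byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      final Type type = Type.values()[in.readUnsignedByte()];
      final String database = in.readUTF();
      final byte[] payload = new byte[in.readInt()];
      in.readFully(payload);
      return new LogEntrySketch(type, database, payload);
    } catch (final IOException e) {
      throw new UncheckedIOException("Truncated or corrupt log entry", e);
    }
  }
}
```

A round-trip unit test like RaftLogEntryTest would then simply serialize an entry, deserialize it, and compare each field.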

Tests (10 files):

  • RaftLogEntryTest (4 unit tests) - serialization round-trip
  • RaftHAServerIT (3 tests) - raw Ratis consensus
  • RaftReplicationIT (5 tests) - full cluster replication
  • RaftHAComprehensiveIT (17 tests) - data consistency, failover, concurrent writes, schema changes, rolling upgrade, large transactions
  • ReadConsistencyIT (3 tests) - EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE
  • ClusterTokenAuthIT (5 tests) - cluster-internal auth validation
  • ClusterMonitorTest (5 unit tests) - lag tracking

Docker e2e tests (7 files, tagged e2e-ha, require Docker):

  • HAReplicationE2ETest - basic replication, leader failover, follower proxy
  • HARollingRestartE2ETest - rolling restart with continuous writes
  • HANetworkPartitionE2ETest - follower isolation via Docker network disconnect
  • HANetworkDelayE2ETest - 200ms-2000ms latency injection via Toxiproxy
  • HAPacketLossE2ETest - 5%-50% packet loss injection via Toxiproxy

Key features

  • Apache Ratis consensus - pre-vote protocol, parallel voting, gRPC streaming, leader lease
  • Read consistency levels - EVENTUAL (read locally), READ_YOUR_WRITES (default, bookmark-based), LINEARIZABLE (wait for all committed writes). Configurable per-connection via RemoteDatabase.setReadConsistency() or globally via arcadedb.ha.readConsistency
  • Cluster token auth - deterministic shared secret derived from clusterName + rootPassword for inter-node HTTP forwarding. Replaces Basic auth forwarding for session-based auth
  • K8s automation - auto-join on scale-up (tryAutoJoinCluster), auto-leave on scale-down (leaveCluster in preStop hook)
  • HA management commands - ha add/remove peer, ha transfer leader, ha step down, ha leave, ha verify database
  • Studio cluster dashboard - Overview (node cards, health badge), Metrics (replication lag chart with warning threshold, commit index, election count, uptime), Management (peer management, leadership transfer, database verification)
  • RemoteDatabase client failover - automatic retry with cluster topology reload on leader change (HTTP 503 -> NeedRetryException)
  • ClusterMonitor - background replication lag monitoring with configurable warning threshold (arcadedb.ha.replicationLagWarning)
  • gRPC channel refresh - RaftClient recreated on leader change to force fresh DNS resolution after network partitions
  • Snapshot via HTTP - follower catch-up via ZIP download from leader
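The read consistency levels can be pictured as a follower-side gate on the locally applied Raft index. The sketch below assumes READ_YOUR_WRITES compares against the client's bookmark (the index of its last write) and LINEARIZABLE against a leader commit index obtained before the read; the class and method names are hypothetical and only illustrate the idea, not ArcadeDB's code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical follower-side gate; names are illustrative, not ArcadeDB's API.
final class ReadConsistencyGate {
  enum ReadConsistency { EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE }

  private long lastAppliedIndex; // guarded by this

  /** Called by the state machine after applying a log entry. */
  synchronized void onApplied(final long index) {
    lastAppliedIndex = Math.max(lastAppliedIndex, index);
    notifyAll(); // wake readers waiting for this index
  }

  /**
   * Returns true when a local read may proceed, false on timeout or interruption.
   * bookmark: index of the client's last write; leaderCommitIndex: assumed to be fetched from the leader.
   */
  synchronized boolean awaitReadable(final ReadConsistency level, final long bookmark,
      final long leaderCommitIndex, final long timeoutMs) {
    final long target;
    switch (level) {
      case EVENTUAL:
        target = Long.MIN_VALUE;     // read whatever is applied locally
        break;
      case READ_YOUR_WRITES:
        target = bookmark;           // at least the client's own last write
        break;
      default:
        target = leaderCommitIndex;  // LINEARIZABLE: everything committed when the read started
        break;
    }
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (lastAppliedIndex < target) {
      final long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs <= 0)
        return false;
      try {
        wait(remainingMs); // releases the monitor until onApplied() notifies or the time is up
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }
}
```

With EVENTUAL the call never blocks; LINEARIZABLE costs an extra round trip to learn the leader's commit index before the local wait even starts.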

New configuration settings

┌───────────────────────────────────┬──────────────────┬─────────────────────────────────────────────────────────┐
│              Setting              │     Default      │                       Description                       │
├───────────────────────────────────┼──────────────────┼─────────────────────────────────────────────────────────┤
│ arcadedb.ha.logVerbose            │ 0                │ HA verbose logging: 0=off, 1=basic, 2=detailed, 3=trace │
├───────────────────────────────────┼──────────────────┼─────────────────────────────────────────────────────────┤
│ arcadedb.ha.readConsistency       │ read_your_writes │ Default read consistency for follower reads             │
├───────────────────────────────────┼──────────────────┼─────────────────────────────────────────────────────────┤
│ arcadedb.ha.clusterToken          │ auto-derived     │ Shared secret for inter-node HTTP auth                  │
├───────────────────────────────────┼──────────────────┼─────────────────────────────────────────────────────────┤
│ arcadedb.ha.replicationLagWarning │ 1000             │ Raft log gap threshold for lag warnings (0=disabled)    │
└───────────────────────────────────┴──────────────────┴─────────────────────────────────────────────────────────┘
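For arcadedb.ha.replicationLagWarning, the table describes a Raft log gap threshold where 0 disables the check. A minimal sketch of that rule, assuming the gap is the leader commit index minus the follower's match index and that a warning fires only when the gap strictly exceeds the threshold (both are assumptions, not confirmed ClusterMonitor behavior):

```java
// Hypothetical helper; the exact comparison used by ClusterMonitor is an assumption.
final class LagCheck {
  /** Raft log gap between the leader's commit index and a follower's replicated (match) index. */
  static long lag(final long leaderCommitIndex, final long followerMatchIndex) {
    return Math.max(0L, leaderCommitIndex - followerMatchIndex); // never negative
  }

  /** threshold mirrors arcadedb.ha.replicationLagWarning; 0 (or less) disables warnings. */
  static boolean shouldWarn(final long leaderCommitIndex, final long followerMatchIndex, final long threshold) {
    return threshold > 0 && lag(leaderCommitIndex, followerMatchIndex) > threshold;
  }
}
```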

Test plan

  • RaftLogEntryTest - 4 unit tests
  • RaftHAServerIT - 3 pure Ratis consensus tests (3 nodes)
  • RaftReplicationIT - 5 cluster replication tests (3 nodes)
  • RaftHAComprehensiveIT - 17 comprehensive tests (3 nodes)
  • HTTP2ServersIT - 6 HTTP API tests (2 nodes)
  • ReplicationServerLeaderDownIT - leader failover with RemoteDatabase (3 nodes)
  • ReplicationServerLeaderChanges3TimesIT - 3 leader kill/restart cycles (3 nodes)
  • HASplitBrainIT - 5-node split brain (5 nodes)
  • ReadConsistencyIT - 3 read consistency level tests (2 nodes)
  • ClusterTokenAuthIT - 5 cluster auth tests (2 nodes)
  • ClusterMonitorTest - 5 unit tests
  • HAReplicationE2ETest - Docker e2e (requires Docker)
  • HANetworkDelayE2ETest - Toxiproxy latency (requires Docker)
  • HAPacketLossE2ETest - Toxiproxy packet loss (requires Docker)

@lvca lvca added this to the 26.4.1 milestone Apr 5, 2026
@lvca lvca self-assigned this Apr 5, 2026
Comment thread e2e/pom.xml Fixed
@codacy-production

codacy-production bot commented Apr 5, 2026

Not up to standards ⛔

🔴 Issues 6 medium · 26 minor

Alerts:
⚠ 32 issues (≤ 0 issues of at least minor severity)

Results:
32 new issues

Category      Results
BestPractice  6 medium, 2 minor
CodeStyle     24 minor


🟢 Metrics 572 complexity

Metric Results
Complexity 572



@claude

claude bot commented Apr 5, 2026

Code Review

This is a substantial and well-motivated change - replacing ~6,000 lines of ad-hoc Raft-like code with Apache Ratis 3.2.1 is a significant improvement in correctness guarantees. The new test suite (36 tests including 17 comprehensive ones) is thorough. Below are issues worth addressing before merge.


Bugs

1. System.out.println in production code (HALog.java)

HALog.log() writes to both System.out AND LogManager. The System.out call must be removed per project conventions. It also logs every message twice (once formatted, once via the normal logger).

// HALog.java - remove this line:
System.out.println("[HA-" + level + "] " + formatted);

2. proxyToLeader - missing connection timeout

The conn (HttpURLConnection) in proxyToLeader() has no connect or read timeout set. If the leader is unreachable, this will hang the thread indefinitely. The SnapshotHttpHandler correctly sets timeouts (setConnectTimeout(30_000), setReadTimeout(300_000)); apply the same pattern here.

3. HALog format string - first replace() is a no-op

final String formatted = String.format(message.replace("%s", "%s").replace("%d", "%s"), args);

message.replace("%s", "%s") does nothing. If the intent is to normalize %d to %s for String.format, the second replace achieves that, but the first replace is dead code and makes the logic confusing. Consider removing it.
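Putting items 1 and 3 together, a cleaned-up helper would log once, through the logger only, with a single %d to %s normalization. This sketch uses java.util.logging as a stand-in for ArcadeDB's LogManager; the class name and the verbosity field are hypothetical:

```java
import java.util.logging.Logger;

// Hypothetical stand-in for HALog: no System.out, no no-op replace, one log call per message.
final class HALogSketch {
  static final int BASIC = 1, DETAILED = 2, TRACE = 3;
  private static final Logger LOGGER = Logger.getLogger("arcadedb.ha");
  static volatile int verbose = 0; // would be read from arcadedb.ha.logVerbose (0 = off)

  /** Rewrites %d to %s so a non-numeric argument cannot raise IllegalFormatConversionException. */
  static String format(final String message, final Object... args) {
    return String.format(message.replace("%d", "%s"), args);
  }

  static void log(final Object caller, final int level, final String message, final Object... args) {
    if (level > verbose)
      return; // early exit: nothing is formatted when the level is disabled
    LOGGER.info(() -> "[HA-" + level + "] " + caller.getClass().getSimpleName() + ": " + format(message, args));
  }
}
```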

4. Spurious {} block in ServerIsNotTheLeaderException handler

In AbstractServerHttpHandler.handleRequest():

}
{        // <-- unnecessary anonymous block
  try {
    proxyToLeader(exchange, leaderAddr);
    return;
  } catch ...
}
sendErrorResponse(...)

The extra {...} block serves no purpose and looks like a copy-paste artefact.


Data Durability Concern

5. Leader commits locally before achieving quorum

replicateFromLeader() calls tx.commit2ndPhase(phase1) (durable local write) before raftHA.replicateTransaction(). If the leader crashes after local commit but before replication, the write is lost from the cluster's perspective even though the client received success.

Standard Raft commits only after quorum acknowledgement. This design is explicitly documented as intentional, but it weakens the durability guarantees. At a minimum, this should be prominently documented as a trade-off (the new docs/arcadedb-ha-26.4.1.md mentions it but could be clearer about the risk).
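The difference between the two orderings is easiest to see side by side. In this hypothetical sketch, replicateToQuorum stands in for the Raft round trip (raftHA.replicateTransaction() in the PR), and the event list records what became durable where when replication fails; neither method is ArcadeDB code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical commit paths contrasting the ordering under review with standard Raft ordering.
final class CommitOrdering {
  final List<String> events = new ArrayList<>();

  // Ordering described in the review: durable local write first, then replication.
  boolean commitThenReplicate(final BooleanSupplier replicateToQuorum) {
    events.add("local-commit");
    final boolean acked = replicateToQuorum.getAsBoolean();
    events.add(acked ? "quorum-ack" : "quorum-failed");
    return acked; // on failure the local write already exists and diverges from the cluster
  }

  // Standard Raft ordering: apply locally only after a quorum has accepted the entry.
  boolean replicateThenCommit(final BooleanSupplier replicateToQuorum) {
    if (!replicateToQuorum.getAsBoolean()) {
      events.add("quorum-failed");
      return false; // nothing was applied locally, so there is nothing to reconcile
    }
    events.add("quorum-ack");
    events.add("local-commit");
    return true;
  }
}
```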


Security

6. Timing-attack-vulnerable cluster token comparison

// validateClusterForwardedAuth:
!expectedToken.equals(providedToken)

String.equals() is not constant-time. Use MessageDigest.isEqual() to prevent timing attacks on the shared cluster secret:

!MessageDigest.isEqual(expectedToken.getBytes(StandardCharsets.UTF_8), providedToken.getBytes(StandardCharsets.UTF_8))
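A null-safe version of that check, as a small sketch (the class and method names are hypothetical). Rejecting a missing token up front also avoids a NullPointerException when a request arrives without the cluster token header:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical validator for the cluster-internal token header.
final class ClusterTokenCheck {
  static boolean isValid(final String expectedToken, final String providedToken) {
    if (expectedToken == null || providedToken == null)
      return false; // missing header: reject instead of throwing
    // MessageDigest.isEqual compares all bytes instead of stopping at the first mismatch.
    return MessageDigest.isEqual(expectedToken.getBytes(StandardCharsets.UTF_8),
        providedToken.getBytes(StandardCharsets.UTF_8));
  }
}
```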

7. User enumeration via error message

sendErrorResponse(exchange, 401, "Unknown forwarded user: " + forwardedUserValues.getFirst(), null, null);

Exposing the username in the response body of an auth failure enables user enumeration. Return a generic "Authentication failed" message instead.


Design / Maintainability

8. maxRetry silently mutated from caller's value

// RemoteHttpComponent:
if (maxRetry < 3)
  maxRetry = 3;

This overrides the caller's intent without signalling it. Callers passing maxRetry=1 for fast-fail scenarios will unexpectedly retry 3 times during elections. Consider using a local variable (effectiveRetry = Math.max(maxRetry, 3)) instead of mutating the parameter.
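A sketch of the suggested shape, where the caller's maxRetry stays untouched and the floor of 3 lives in a local variable. The Supplier-based execute() and the MIN_RETRY constant are hypothetical, not RemoteHttpComponent's real signature:

```java
import java.util.function.Supplier;

// Hypothetical retry helper illustrating effectiveRetry = Math.max(maxRetry, 3).
final class RetrySketch {
  static final int MIN_RETRY = 3; // floor that gives a leader election time to complete

  static <T> T execute(final Supplier<T> operation, final int maxRetry) {
    final int effectiveRetry = Math.max(maxRetry, MIN_RETRY); // the parameter is never reassigned
    RuntimeException lastError = null;
    for (int attempt = 1; attempt <= effectiveRetry; attempt++) {
      try {
        return operation.get();
      } catch (final RuntimeException e) {
        lastError = e; // e.g. a retryable error raised on HTTP 503 during an election
      }
    }
    throw lastError; // non-null: at least one attempt ran and failed
  }
}
```

This still retries a maxRetry=1 caller three times; the gain is that the override is explicit and local, and a true fast-fail path would need its own caller-visible option.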

9. ATTRIBUTIONS.md not updated

Per CLAUDE.md: "When adding a dependency, you MUST update ATTRIBUTIONS.md and, if Apache-licensed with a NOTICE file, incorporate required notices into the main NOTICE file." Eight Ratis dependencies are added but ATTRIBUTIONS.md does not appear in the diff. Apache Ratis does have a NOTICE file.

10. ## TODO in committed documentation

docs/arcadedb-ha-26.4.1.md contains a ## TODO section listing future features. This is fine as a draft, but should either be removed or converted to GitHub issues before merging into the main branch.


Minor

  • The @deprecated Javadoc tag on the getHA() / getRaftHA() alias in ArcadeDBServer is good, but the @Deprecated annotation (capital D) should also be added so that IDEs and the compiler flag remaining usages.
  • RaftClusterStarter.java (test utility) uses System.out.println extensively - this is fine for a developer utility, though a comment clarifying it is a local dev tool (not executed by CI) would help.
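For reference, the pairing usually looks like this: the Javadoc tag tells readers what to use instead, and the annotation makes the compiler emit deprecation warnings at call sites. The class is a hypothetical stand-in for ArcadeDBServer, and treating getHA() as the deprecated alias of getRaftHA() is an assumption:

```java
// Hypothetical stand-in for ArcadeDBServer; only the deprecation pattern matters here.
class ServerSketch {
  private final Object raftHA = new Object(); // placeholder for the RaftHAServer instance

  public Object getRaftHA() {
    return raftHA;
  }

  /**
   * Legacy alias kept for source compatibility.
   *
   * @deprecated use {@link #getRaftHA()} instead.
   */
  @Deprecated
  public Object getHA() {
    return getRaftHA();
  }
}
```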

Overall this is a high-quality implementation of a very complex change. The comprehensive test coverage and the transparent design documentation are appreciated. Addressing the data durability concern (#5) and the security issues (#6, #7) are the highest priority before merge.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the custom HA replication protocol with Apache Ratis, providing a formally verified Raft consensus implementation. The changes include a new Ratis state machine, snapshot management, and improved cluster monitoring. I have identified a flaw in the HTTP proxy's session-based authentication forwarding, a reliance on unreliable Thread.sleep in tests, and some unused code that should be cleaned up. Please address these issues to ensure robust authentication and reliable test execution.

Comment on lines +336 to +342
final String basicAuth = exchange.getAttachment(BASIC_AUTH_KEY);
if (basicAuth != null) {
  // Extract username from stored Basic auth
  final String decoded = new String(Base64.getDecoder().decode(
      basicAuth.substring(AUTHORIZATION_BASIC.length() + 1)), java.nio.charset.StandardCharsets.UTF_8);
  conn.setRequestProperty(HEADER_FORWARDED_USER, decoded.split(":")[0]);
}

high

There appears to be a flaw in the logic for forwarding session-based authentication. This code attempts to get the original Basic auth from an attachment on the current exchange. However, if the request was authenticated with a session Bearer token, this attachment will not be present, and the forwarded user will not be set.

To fix this, you should retrieve the session object from the session token and get the user's name from there. This would be more robust and would not require attaching the basic auth string to the exchange.

Here's a suggested implementation:

if (auth != null && auth.startsWith("Bearer AU-")) {
  // Session token: use cluster-internal auth headers instead
  conn.setRequestProperty(HEADER_CLUSTER_TOKEN, raftHA.getClusterToken());
  final String sessionToken = auth.substring(AUTHORIZATION_BEARER.length()).trim();
  final HttpAuthSession session = httpServer.getAuthSessionManager().getSession(sessionToken);
  if (session != null) {
    conn.setRequestProperty(HEADER_FORWARDED_USER, session.getUser().getName());
  } else {
    // Handle case where session is invalid or expired
  }
} else if (auth != null) {
  // ...

final JSONObject cluster = getClusterInfo(container);
if (cluster.has("isLeader") && cluster.getBoolean("isLeader"))
return true;
} catch (final Exception ignored) {}

medium

Empty catch blocks should be avoided, even in tests, as they can swallow exceptions and hide underlying problems. It would be better to at least log the exception at a DEBUG or TRACE level. This would make debugging test failures easier. This applies to other similar empty catch blocks in this file and other new test files.

            } catch (final Exception e) {
              // Ignored during polling for leader
            }
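Note that the suggestion above only adds a comment. A polling helper that actually records the failure at a debug-like level might look like this sketch, which uses java.util.logging's FINE level as a stand-in for DEBUG; pollUntil is a hypothetical name, not an existing test utility:

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical test helper: retries a probe and logs failures instead of swallowing them.
final class LeaderPolling {
  private static final Logger LOG = Logger.getLogger(LeaderPolling.class.getName());

  static boolean pollUntil(final Callable<Boolean> probe, final int attempts, final long sleepMs) {
    for (int i = 1; i <= attempts; i++) {
      try {
        if (Boolean.TRUE.equals(probe.call()))
          return true;
      } catch (final Exception e) {
        // Expected while the cluster is electing a leader, but kept visible for debugging.
        LOG.log(Level.FINE, "Probe attempt " + i + "/" + attempts + " failed", e);
      }
      if (i < attempts) {
        try {
          Thread.sleep(sleepMs);
        } catch (final InterruptedException ie) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return false;
  }
}
```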

Comment on lines +193 to +195
// Toxiproxy "slow_close" + "limit_data" simulate packet loss behavior.
// The "timeout" toxic with a probability achieves drop behavior.
// Using bandwidth toxic with very low rate to simulate packet-level loss.

medium

The comment here is a bit misleading. It mentions slow_close, limit_data, and timeout toxics to simulate packet loss, but the implementation uses the bandwidth toxic. While limiting bandwidth can simulate some effects of packet loss (like timeouts), it's not the same as randomly dropping packets. The comment should be updated to accurately reflect the implementation, for example, by stating that it simulates packet loss effects by severely limiting bandwidth.

Comment on lines +663 to +696
private ResultSet forwardCommandToLeader(final String language, final String query, final Map<String, Object> namedParams,
    final Object[] positionalParams) {
  HALog.log(this, HALog.DETAILED, "Forwarding command to leader: %s %s (db=%s)", language, query, getName());

  // Rollback the local transaction started by DatabaseAbstractHandler.transaction() wrapper.
  // The command executes on the leader, so no local changes should be committed.
  if (isTransactionActive())
    rollback();

  final RaftHAServer raftHA = server.getRaftHA();
  final byte[] resultBytes = raftHA.forwardCommand(getName(), language, query, namedParams, positionalParams);
  HALog.log(this, HALog.TRACE, "Command forwarded successfully: %d bytes result", resultBytes.length);

  // Wait for the leader's WAL changes to be applied locally on this follower.
  // Without this, a subsequent read on this server may not see the changes yet.
  raftHA.waitForLocalApply();

  // Check for error response
  if (resultBytes.length > 0 && resultBytes[0] == 'E') {
    final String error = new String(resultBytes, 1, resultBytes.length - 1);
    throw new com.arcadedb.exception.CommandExecutionException(error);
  }

  // Deserialize binary result into ResultSet
  final java.util.List<Map<String, Object>> rows =
      com.arcadedb.server.ha.ratis.RaftLogEntry.deserializeCommandResult(resultBytes);
  final com.arcadedb.query.sql.executor.InternalResultSet rs = new com.arcadedb.query.sql.executor.InternalResultSet();
  for (final Map<String, Object> row : rows) {
    final com.arcadedb.query.sql.executor.ResultInternal result = new com.arcadedb.query.sql.executor.ResultInternal(proxied);
    result.setPropertiesFromMap(row);
    rs.add(result);
  }
  return rs;
}

medium

The private method forwardCommandToLeader seems to be unused. The logic in command() now throws ServerIsNotTheLeaderException, and the forwarding is handled by the HTTP proxy layer. This method appears to be a remnant of a different implementation strategy. To improve maintainability and avoid confusion, it should be removed if it's no longer needed.

.atMost(5, TimeUnit.MINUTES)
.pollInterval(1, TimeUnit.SECONDS)
.until(() -> getServer(serverNumber).getHA().getMessagesInQueue() == 0);
// With Ratis, replication is handled internally. Wait for state machine application.

medium

Replacing the explicit wait for the replication queue to be empty with a fixed Thread.sleep() is unreliable and can lead to flaky tests. A better approach would be to implement a more deterministic wait. For example, you could get the leader's commit index and then poll the followers until their applied index matches the leader's commit index. This would ensure that replication is actually complete before proceeding with test assertions.

    // With Ratis, replication is handled internally. Await for state machine application on followers.
    final ArcadeDBServer leader = getLeaderServer();
    if (leader == null)
      return; // NO LEADER, CANNOT WAIT

    final long leaderCommitIndex = leader.getHA().getCommitIndex();

    for (int i = 0; i < getServerCount(); ++i) {
      final ArcadeDBServer server = getServer(i);
      if (server != leader)
        Awaitility.await().atMost(1, TimeUnit.MINUTES).until(() -> server.getHA().getLastAppliedIndex() >= leaderCommitIndex);
    }
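The same idea without Awaitility, as a self-contained sketch: each follower is represented by a LongSupplier returning its last applied index (a hypothetical abstraction over getHA().getLastAppliedIndex()), and the barrier succeeds once all of them reach the leader commit index captured at call time.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical barrier: poll follower applied indexes until they reach a target commit index.
final class ReplicationBarrier {
  static boolean awaitFollowers(final long leaderCommitIndex, final List<LongSupplier> followerAppliedIndexes,
      final long timeoutMs, final long pollMs) {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      // allMatch short-circuits on the first follower that is still behind
      final boolean caughtUp = followerAppliedIndexes.stream()
          .allMatch(applied -> applied.getAsLong() >= leaderCommitIndex);
      if (caughtUp)
        return true;
      if (System.nanoTime() - deadline >= 0)
        return false; // timed out: report it instead of asserting against stale followers
      try {
        Thread.sleep(pollMs);
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

Capturing the commit index once, before polling, matters: re-reading it inside the loop could keep chasing new writes and never converge under continuous load.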

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

Lists of 401 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.10.4 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)
     (BSD 3-Clause) API Common (com.google.api:api-common:2.53.0 - https://github.com/googleapis/sdk-platform-java)
     (Apache License 2.0) proto-google-common-protos (com.google.api.grpc:proto-google-common-protos:2.67.0 - https://github.com/googleapis/sdk-platform-java)

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

Lists of 400 third-party dependencies.
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)
     (BSD 3-Clause) API Common (com.google.api:api-common:2.53.0 - https://github.com/googleapis/sdk-platform-java)
     (Apache License 2.0) proto-google-common-protos (com.google.api.grpc:proto-google-common-protos:2.67.0 - https://github.com/googleapis/sdk-platform-java)

@lvca
Member Author

lvca commented Apr 5, 2026

Created the first HA benchmark with multiple servers, testing inserts, and the results are VERY POOR:

==========================================================================================
  ArcadeDB HA Insert Benchmark
==========================================================================================
  Records: 5,000 (warmup: 500)  |  Batch size: 100  |  Vertex type: Sensor
==========================================================================================

  Scenario                                          Throughput    Avg       P99       Max
  ------------------------------------------------  ----------  --------  --------  --------
  1 server (no HA) - embedded                       1,546 op/s    647 us   4,362 us   4,362 us
  3 servers (HA)   - embedded on leader                53 op/s 18,831 us  28,638 us  28,638 us
  5 servers (HA)   - embedded on leader                49 op/s 20,399 us  24,416 us  24,416 us
  3 servers (HA)   - remote via follower proxy         59 op/s 16,861 us  23,483 us  38,443 us
  3 servers (HA)   - concurrent (3 threads)            77 op/s 39,060 us  52,003 us 106,720 us
  5 servers (HA)   - concurrent (5 threads)            67 op/s 74,731 us 157,715 us 270,166 us

Key findings:

  • HA overhead: ~29x slower than no-HA (1,546 vs 53 ops/sec per batch of 100). Each batch must replicate via gRPC to majority before committing. This is expected for synchronous Raft consensus.
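  • 3 vs 5 servers: Minimal difference (53 vs 49 ops/sec). A MAJORITY quorum means 3 servers need 2 acks and 5 servers need 3 acks (counting the leader). The leader replicates to followers in parallel, so the extra ack means waiting for one more concurrent response, not a full extra sequential round-trip.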
  • Follower proxy: Actually slightly faster than direct-to-leader embedded (59 vs 53 ops/sec) because single-insert commands don't batch and Ratis can pipeline better with small payloads.
  • Concurrent throughput: 3 threads achieve 77 ops/sec total (vs 59 single-thread) - ~30% improvement from parallelism. 5 threads get 67 ops/sec, likely bottlenecked on the single leader processing all writes.
  • P99 latency: Scales with concurrency (23ms single -> 52ms 3-thread -> 157ms 5-thread) due to leader contention.
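For reference, the majority arithmetic behind the 3-vs-5 finding, as a minimal sketch. QuorumMath is a hypothetical helper, not an ArcadeDB or Ratis class, and it assumes the leader's own log write counts toward the majority:

```java
class QuorumMath {

  // Smallest number of servers (leader included) that forms a strict majority.
  static int majority(int clusterSize) {
    return clusterSize / 2 + 1;
  }

  // Follower acknowledgments the leader must collect before committing,
  // because the leader's own log write already counts toward the majority.
  static int followerAcksNeeded(int clusterSize) {
    return majority(clusterSize) - 1;
  }
}
```

With 3 servers the leader waits for 1 follower ack, with 5 servers for 2; since both are requested concurrently, the commit latency is set by the slowest ack still needed rather than by the number of acks.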

The benchmark is tagged @Tag("benchmark") so it's excluded from normal test runs. Run it with: mvn test -pl server -Dtest=HAInsertBenchmark -Dgroups=benchmark

We need to analyze this before merging.

@claude

claude bot commented Apr 5, 2026

Code review posted - see full review details in the next comment

@claude

claude bot commented Apr 5, 2026

Code Review - Replace custom Raft protocol with Apache Ratis

This is a major and well-structured change. Replacing a custom, ad-hoc Raft implementation with Apache Ratis is the right call for long-term reliability. The overall architecture is sound, the test coverage is impressive (36 tests + Docker e2e), and the documentation is thorough. Below are issues found during review.


CRITICAL

1. System.out.println in production code (HALog.java)

HALog.log() unconditionally calls System.out.println in addition to LogManager. Per project conventions (CLAUDE.md): remove any System.out you used for debug when you have finished. This will spam stdout in production when arcadedb.ha.logVerbose is nonzero.

Also, the .replace in the format call is a no-op, and building a String.format() string just for the println while also calling LogManager is redundant. The println should simply be removed.
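A minimal sketch of the fix: a verbosity-gated log method that goes only through the logging framework and never touches System.out. It uses java.util.logging as a stand-in for ArcadeDB's LogManager; HALogSketch, verboseLevel and the "log when verboseLevel >= level" semantics are assumptions standing in for HALog and arcadedb.ha.logVerbose:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class HALogSketch {
  private static final Logger LOGGER = Logger.getLogger("com.arcadedb.ha");

  // Hypothetical stand-in for the arcadedb.ha.logVerbose setting (0 = off).
  static volatile int verboseLevel = 0;

  private HALogSketch() {
  }

  // Logs only when the requested level is enabled and returns whether it did.
  // Output goes exclusively through the logger; there is no println.
  static boolean log(int level, String format, Object... args) {
    if (verboseLevel < level)
      return false;
    // Format once, inside the guard, so disabled calls cost almost nothing.
    LOGGER.log(Level.INFO, String.format(format, args));
    return true;
  }
}
```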


2. Platform-default charset in UUID.nameUUIDFromBytes(clusterName.getBytes()) in RaftHAServer.java

All nodes must agree on the same groupId or they will never form a cluster. Using .getBytes() without a charset argument is platform-dependent. This should use StandardCharsets.UTF_8 explicitly, as already done for the token derivation two lines later:

UUID.nameUUIDFromBytes(clusterName.getBytes(StandardCharsets.UTF_8))
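A self-contained version of the suggested derivation. ClusterGroupId and groupUuid are hypothetical names, and converting the UUID into a Ratis group id is left out:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class ClusterGroupId {
  // Every node derives the same (version 3) UUID from the same cluster name,
  // because the byte encoding is pinned to UTF-8 instead of the JVM default.
  static UUID groupUuid(String clusterName) {
    return UUID.nameUUIDFromBytes(clusterName.getBytes(StandardCharsets.UTF_8));
  }
}
```

On JDK 18+ the default charset is already UTF-8 (JEP 400), but it can be switched back to the legacy platform charset with -Dfile.encoding=COMPAT, so naming the charset explicitly remains the safe choice, especially for non-ASCII cluster names.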

HIGH

3. Busy-wait polling in waitForAppliedIndex() and waitForLocalApply()

Both methods spin with Thread.sleep(10) up to the quorum timeout. This adds up to 10ms latency on every READ_YOUR_WRITES and LINEARIZABLE read and holds a thread. Since ArcadeDBStateMachine already has lastAppliedIndex as an AtomicLong, a LockSupport-based condition or a CompletableFuture completed when the index advances would be lower-latency.
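A minimal sketch of notification-based waiting, assuming the state machine can call a hook each time it advances its applied index. AppliedIndexWaiter, advance and waitFor are hypothetical names; the sketch uses a plain monitor (wait/notifyAll) rather than the LockSupport or CompletableFuture variants mentioned above, which would work equally well:

```java
import java.util.concurrent.TimeUnit;

class AppliedIndexWaiter {
  private long lastAppliedIndex = -1; // guarded by this

  // Called by the state machine right after it applies the entry at appliedIndex.
  synchronized void advance(long appliedIndex) {
    if (appliedIndex > lastAppliedIndex) {
      lastAppliedIndex = appliedIndex;
      notifyAll(); // wake all readers; each re-checks its own target index
    }
  }

  // Blocks until `index` has been applied locally or `timeoutMs` elapses.
  // Returns true if the index was reached, false on timeout or interruption.
  synchronized boolean waitFor(long index, long timeoutMs) {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (lastAppliedIndex < index) {
      final long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs <= 0)
        return false; // wait(0) would block forever, so stop here
      try {
        wait(remainingMs); // releases the monitor until advance() or timeout
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }
}
```

Readers wake as soon as the index is applied instead of on the next 10 ms tick, and a read whose index is already applied returns without blocking.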


4. Dead code: forwardCommandToLeader() private method in ReplicatedDatabase.java

The method is defined but never called. All command paths now throw ServerIsNotTheLeaderException instead, relying on the HTTP proxy. The method should be removed to avoid confusion.


5. No connection/read timeout on the HTTP proxy in AbstractServerHttpHandler.proxyToLeader()

The call to new java.net.URI(targetUrl).toURL().openConnection() has no setConnectTimeout or setReadTimeout. If the leader is slow or unreachable, the follower thread blocks indefinitely. Explicit timeouts matching quorumTimeout should be set.
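A minimal sketch of the suggested timeouts, assuming quorumTimeoutMs holds the configured quorum timeout in milliseconds; LeaderProxyConnections and open are hypothetical names:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;

class LeaderProxyConnections {
  // Creates (but does not yet open) a connection to the leader with bounded
  // connect and read timeouts, so a slow or unreachable leader cannot block
  // the follower's handler thread forever.
  static HttpURLConnection open(String targetUrl, int quorumTimeoutMs)
      throws IOException, URISyntaxException {
    HttpURLConnection conn = (HttpURLConnection) new URI(targetUrl).toURL().openConnection();
    conn.setConnectTimeout(quorumTimeoutMs); // max time to establish the TCP connection
    conn.setReadTimeout(quorumTimeoutMs);    // max time to block on each read once connected
    return conn;
  }
}
```

Both timeouts default to 0 (infinite). Once set, an expired timeout surfaces as a java.net.SocketTimeoutException that the proxy can map to an error response.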


6. Fully qualified names in production code

CLAUDE.md says do not use fully qualified names if possible, always import the class. There are many FQN usages in production code in RaftHAServer.java and ReplicatedDatabase.java. For example: peerHttpAddresses is declared as new java.util.concurrent.ConcurrentHashMap, lagMonitorExecutor as java.util.concurrent.ScheduledExecutorService, and getFollowerStates returns java.util.List of java.util.Map. These should be imported at the top of the file.


MEDIUM

7. isCurrentNodeLeader() race in ArcadeDBStateMachine.applyTransaction()

The state machine skips apply on the leader (returns early if isCurrentNodeLeader()). During a leadership transition, a node that was leader when commit2ndPhase() ran could lose leadership before applyTransaction() is called, causing double-application of WAL page changes. Worth adding a comment explaining why this is safe in the Ratis model, or verifying it actually cannot happen.


8. Breaking change: connectCluster / disconnectCluster throw UnsupportedOperationException

Existing users calling POST /api/v1/server with command connect or disconnect will get a 500 response instead of a helpful message. These should return a proper HTTP 400 with a migration hint rather than an uncaught exception.


9. Spurious double-block in exception handler in AbstractServerHttpHandler.java

There is an extra unnecessary block wrapping the proxyToLeader try block in the ServerIsNotTheLeaderException catch handler. It is a no-op but visually confusing and should be removed.


LOW / NITS

10. Cluster token written back to GlobalConfiguration

initClusterToken() calls configuration.setValue(GlobalConfiguration.HA_CLUSTER_TOKEN, this.clusterToken). Since GlobalConfiguration values can appear in server info dumps and logs, the derived token (which is a function of the root password) could be inadvertently exposed. Consider keeping it only in the clusterToken field without writing it back to the global config.

11. getFollowerStates() index correlation assumption

The method assumes matchIndices[i] corresponds to the i-th non-self peer in raftGroup.getPeers(). This assumption depends on Ratis internal ordering that may not be guaranteed across versions. Should be validated against the Ratis API contract.

12. System.out.println in test infrastructure

RaftClusterStarter System.out usage is fine as CLI output, but actual test classes such as RaftHAComprehensiveIT should not emit to stdout to avoid cluttering CI output.


Summary

  Category    Count
  ----------  -----
  Critical    2
  High        4
  Medium      3
  Low/Nit     3

The architectural direction is excellent and the test suite is thorough. The System.out.println in HALog and the platform-default charset in the group ID derivation are the two that should be addressed before merge.

@codecov

codecov bot commented Apr 5, 2026

Codecov Report

❌ Patch coverage is 61.54073% with 1303 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.22%. Comparing base (e076207) to head (88cea6b).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...java/com/arcadedb/server/ha/raft/RaftHAServer.java 57.97% 171 Missing and 32 partials ⚠️
.../arcadedb/server/ha/raft/ArcadeDBStateMachine.java 51.68% 156 Missing and 30 partials ⚠️
...com/arcadedb/server/ha/raft/SnapshotInstaller.java 47.08% 133 Missing and 12 partials ⚠️
...om/arcadedb/server/ha/raft/ReplicatedDatabase.java 63.04% 128 Missing and 15 partials ⚠️
...m/arcadedb/server/ha/raft/SnapshotHttpHandler.java 49.13% 63 Missing and 25 partials ⚠️
...om/arcadedb/server/ha/raft/RaftClusterManager.java 38.98% 66 Missing and 6 partials ⚠️
...om/arcadedb/server/ha/raft/RaftGroupCommitter.java 52.70% 55 Missing and 15 partials ⚠️
...om/arcadedb/server/ha/raft/KubernetesAutoJoin.java 26.74% 62 Missing and 1 partial ⚠️
...dedb/server/ha/raft/PostVerifyDatabaseHandler.java 61.87% 39 Missing and 22 partials ⚠️
...java/com/arcadedb/server/ha/raft/RaftHAPlugin.java 54.32% 33 Missing and 4 partials ⚠️
... and 19 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3798      +/-   ##
==========================================
- Coverage   64.71%   62.22%   -2.50%     
==========================================
  Files        1581     1583       +2     
  Lines      117023   118043    +1020     
  Branches    24858    25075     +217     
==========================================
- Hits        75735    73451    -2284     
- Misses      30925    34346    +3421     
+ Partials    10363    10246     -117     


@github-actions
Contributor

github-actions bot commented Apr 5, 2026

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

List of 400 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.10.4 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)
     (BSD 3-Clause) API Common (com.google.api:api-common:2.53.0 - https://github.com/googleapis/sdk-platform-java)
     (Apache License 2.0) proto-google-common-protos (com.google.api.grpc:proto-google-common-protos:2.67.0 - https://github.com/googleapis/sdk-platform-java)

@claude

claude bot commented Apr 5, 2026

Code Review - PR 3798: Replace Custom Raft with Apache Ratis

This is a substantial, well-motivated architectural replacement. Replacing a hand-rolled Raft protocol with Apache Ratis is the right call - the PR description is honest about the limitations of the old implementation. The new code is well-structured and comes with an impressive test suite (17 comprehensive tests + Docker e2e tests with network fault injection). The documentation in docs/arcadedb-ha-26.4.1.md is thorough.

Below are the issues found, grouped by severity.


SECURITY - High

1. Weak cluster token derivation (RaftHAServer.initClusterToken())

UUID.nameUUIDFromBytes produces an RFC 4122 version 3 UUID, i.e. an unkeyed MD5 hash, and MD5 is no longer considered cryptographically secure. Given the cluster name is often visible via /api/v1/server, an attacker can enumerate common passwords offline and derive the token. The cluster token then bypasses per-request authentication in validateClusterForwardedAuth().

Recommendation: use HMAC-SHA256 (javax.crypto.Mac) which is in the JDK with no new dependencies:

Mac mac = Mac.getInstance("HmacSHA256");
mac.init(new SecretKeySpec(rootPassword.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
this.clusterToken = Base64.getUrlEncoder().withoutPadding().encodeToString(mac.doFinal(clusterName.getBytes(StandardCharsets.UTF_8)));
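The recommendation wrapped into a self-contained method; ClusterTokens and deriveToken are hypothetical names. Note that the key is still the root password: HMAC replaces the ad-hoc UUID/MD5 construction with a standard keyed MAC, but a weak root password can still be guessed offline, so a randomly generated cluster secret would be stronger if that threat matters.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class ClusterTokens {
  // HMAC-SHA256 over the cluster name, keyed with the root password, encoded as
  // URL-safe Base64 without padding (a 32-byte MAC becomes 43 characters).
  // An empty password makes SecretKeySpec throw IllegalArgumentException.
  static String deriveToken(String rootPassword, String clusterName) {
    try {
      Mac mac = Mac.getInstance("HmacSHA256");
      mac.init(new SecretKeySpec(rootPassword.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
      byte[] tag = mac.doFinal(clusterName.getBytes(StandardCharsets.UTF_8));
      return Base64.getUrlEncoder().withoutPadding().encodeToString(tag);
    } catch (GeneralSecurityException e) {
      // HmacSHA256 must be supported by every Java platform, so this is unexpected.
      throw new IllegalStateException("HmacSHA256 unavailable", e);
    }
  }
}
```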

2. SnapshotHttpHandler - unauthenticated database download surface

The snapshot endpoint is registered on basicRoutes, bypassing the standard AbstractServerHttpHandler security pipeline. It only checks Basic auth and silently returns HTTP 401 for any other credential type. Anyone with valid user credentials can download the full raw database. This is a significant data exfiltration surface. The endpoint should at minimum require the root role and ideally also accept the cluster token for inter-node use.


SECURITY - Medium

3. validateClusterForwardedAuth() - username blindly trusted from header

After validating the cluster token, the username from X-ArcadeDB-Forwarded-User is accepted without further verification. If the cluster token is compromised (per issue 1), an attacker can impersonate any user including root.

4. proxyToLeader() - silent privilege escalation to root

conn.setRequestProperty(HEADER_CLUSTER_TOKEN, raftHA.getClusterToken());
conn.setRequestProperty(HEADER_FORWARDED_USER, "root");

If a request arrives at a follower with no Authorization header (from a handler where isRequireAuthentication() returns false), the proxy silently escalates to root. This is unintended privilege escalation.
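A minimal sketch of the alternative: forward only the identity the follower actually authenticated, and forward none for anonymous requests. ProxyHeaders and build are hypothetical, and the cluster-token header name below is an assumption (only X-ArcadeDB-Forwarded-User appears in the review); the leader would then have to treat a token-only request as unauthenticated or reject it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class ProxyHeaders {
  // Forwarded-identity header as named in the review.
  static final String HEADER_FORWARDED_USER = "X-ArcadeDB-Forwarded-User";
  // Hypothetical value; the real HEADER_CLUSTER_TOKEN constant is not shown.
  static final String HEADER_CLUSTER_TOKEN = "X-ArcadeDB-Cluster-Token";

  // Builds the headers for a follower-to-leader proxy call. The forwarded user
  // is the identity the follower authenticated; an anonymous request forwards
  // no identity instead of silently defaulting to "root".
  static Map<String, String> build(String clusterToken, Optional<String> authenticatedUser) {
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put(HEADER_CLUSTER_TOKEN, clusterToken);
    authenticatedUser.ifPresent(user -> headers.put(HEADER_FORWARDED_USER, user));
    return headers;
  }
}
```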

5. Snapshot downloaded over plain HTTP with no integrity verification

installDatabaseSnapshot() uses "http://" with no HTTPS and no hash/signature verification of the downloaded ZIP. A MITM can serve a malicious database during follower catch-up. The zip-slip protection is correct but does not protect at the data level. At minimum, this should be listed under Known Limitations.
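A minimal sketch of a digest check on the downloaded ZIP. It assumes the expected SHA-256 reaches the follower through a channel the attacker cannot tamper with, which is an assumption and not something the PR is said to do; over plain HTTP a MITM could replace the digest along with the file. SnapshotDigest is a hypothetical name, and HexFormat requires Java 17+.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class SnapshotDigest {
  // Streams the file through SHA-256 without loading it into memory.
  static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // reading is enough: DigestInputStream updates md as bytes pass through
      }
    }
    return md.digest();
  }

  // True only if the file's digest equals the expected hex digest
  // (MessageDigest.isEqual compares in constant time).
  static boolean matches(Path file, String expectedHex) throws IOException, NoSuchAlgorithmException {
    byte[] expected = HexFormat.of().parseHex(expectedHex);
    return MessageDigest.isEqual(sha256(file), expected);
  }
}
```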


BUGS - High

6. System.out.println in production HALog.java

HALog.java is in src/main/java and contains a live System.out.println. This will print to stdout in every production deployment. CLAUDE.md explicitly prohibits System.out in finished code.

Additionally the format string mutation message.replace("%d", "%s") will corrupt format strings that contain %d as part of a literal string rather than a format specifier.

7. DDL replication inside the database write lock (ReplicatedDatabase.recordFileChanges())

replicateTransaction is called inside executeInWriteLock, which means it blocks for the full Raft round-trip (up to quorumTimeout ms) while holding the database-level write lock. This blocks all reads and writes on that database for the entire replication latency. The old code explicitly sent the command outside the exclusive lock. This regression can cause severe throughput degradation for DDL operations.


BUGS - Medium

8. Files.list() stream not closed in RaftHAServer.startService()

final boolean storageExists = java.nio.file.Files.exists(storagePath)
    && java.nio.file.Files.list(storagePath).findAny().isPresent();

Files.list() wraps a directory file descriptor and must be closed via try-with-resources. Leaving it open leaks a file descriptor.
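A minimal sketch of the leak-free check; StorageCheck is a hypothetical name. It uses Files.isDirectory instead of Files.exists, because Files.list throws NotDirectoryException when the path is a regular file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class StorageCheck {
  // True when the directory exists and contains at least one entry.
  // try-with-resources closes the directory stream and its file descriptor.
  static boolean storageExists(Path storagePath) throws IOException {
    if (!Files.isDirectory(storagePath))
      return false;
    try (Stream<Path> entries = Files.list(storagePath)) {
      return entries.findAny().isPresent();
    }
  }
}
```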

9. Potential double-apply for TRANSACTION_FORWARD entries

applyTransactionEntry() correctly skips WAL apply on the leader to avoid double-application. applyTransactionForwardEntry() has no equivalent guard. If the TRANSACTION_FORWARD path is used, the leader applies the WAL twice - a data corruption risk.

10. getFollowerStates() - array index alignment not guaranteed

The code assumes getFollowerMatchIndices() / getFollowerNextIndices() arrays align with getPeers() (excluding self) in iteration order. Ratis does not explicitly document this ordering guarantee. If misaligned, replication lag metrics in Studio will be attributed to the wrong peers.


PERFORMANCE

11. proxyToLeader() uses HttpURLConnection with no connection pool or read timeout

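Every proxied follower-to-leader request may open a new TCP connection (the JDK only reuses HttpURLConnection sockets through its keep-alive cache when response streams are fully read and closed), and the default read timeout is infinite (0). Under load this can create many TIME_WAIT sockets and risks blocking proxy threads indefinitely.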
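One possible replacement, sketched with the JDK's java.net.http.HttpClient, which pools connections and supports both connect and per-request timeouts. LeaderProxyClient and its methods are hypothetical, and only a GET is shown for brevity:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class LeaderProxyClient {
  // Build once and share: HttpClient pools connections internally, so repeated
  // proxy calls to the same leader can reuse open sockets.
  static HttpClient newClient(Duration connectTimeout) {
    return HttpClient.newBuilder()
        .version(HttpClient.Version.HTTP_1_1) // plain HTTP/1.1, no h2c upgrade attempt
        .connectTimeout(connectTimeout)        // bound on establishing the TCP connection
        .build();
  }

  // Per-request timeout: send() throws HttpTimeoutException if the leader's
  // response does not arrive in time, instead of blocking the thread forever.
  static HttpRequest get(String targetUrl, Duration requestTimeout) {
    return HttpRequest.newBuilder(URI.create(targetUrl))
        .timeout(requestTimeout)
        .GET()
        .build();
  }
}
```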

12. Busy-poll in waitForLocalApply() under LINEARIZABLE reads

The 10ms poll loop for applied index creates unnecessary scheduling overhead. Ratis provides RaftClient.io().watch() for notification-based waiting.


CODE QUALITY

13. Fully-qualified class names in production code

Multiple files use java.util.List, java.nio.file.Files, java.net.HttpURLConnection etc. inline instead of imports. CLAUDE.md: "don't use fully qualified names if possible, always import the class and just use the name."

14. Duplicate Javadoc block on waitForLocalApply()

Two consecutive /** */ blocks on the same method - the first is orphaned.

15. forwardCommandToLeader() in ReplicatedDatabase appears to be dead code

Writes now throw ServerIsNotTheLeaderException with HTTP proxy forwarding. This method should be removed or clearly documented.

16. Magic byte 'C' for state machine routing

ASCII 'C' = 67 is used as a routing discriminator without being defined as a constant or integrated into the EntryType enum (which already defines byte codes 1 and 2). This would silently collide with any future entry type byte value of 67.
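A sketch of folding the routing byte into the enum. TRANSACTION and TRANSACTION_FORWARD come from the PR description; COMMAND is a hypothetical name for whatever the bare 'C' byte currently routes, and fromCode is an assumed helper:

```java
enum EntryType {
  TRANSACTION((byte) 1),
  TRANSACTION_FORWARD((byte) 2),
  // Hypothetical name for the entry currently routed by the bare 'C' byte (67).
  COMMAND((byte) 'C');

  private final byte code;

  EntryType(byte code) {
    this.code = code;
  }

  byte code() {
    return code;
  }

  // Resolves a wire byte to its entry type, failing loudly on unknown codes
  // instead of silently falling through to a default branch.
  static EntryType fromCode(byte code) {
    for (EntryType type : values())
      if (type.code == code)
        return type;
    throw new IllegalArgumentException("Unknown Raft log entry type: " + code);
  }
}
```

Keeping every code in one enum makes a collision visible at the definition site rather than at runtime.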


Summary

  Severity             Count
  -------------------  -----
  Security - High      2
  Security - Medium    3
  Bug - High           2
  Bug - Medium         3
  Performance          2
  Code Quality         4

Must fix before merge:

  • Issue 1: Cluster token derivation (MD5/UUID -> HMAC-SHA256)
  • Issue 6: System.out.println in HALog.java
  • Issue 7: DDL replication inside the database write lock
  • Issue 8: Files.list() stream leak

The overall design and test coverage are excellent. The Ratis integration is architecturally sound. These are fixable issues that should be addressed before merging to main.

@github-actions
Contributor

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

List of 386 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.11.0 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB HA Raft (com.arcadedb:arcadedb-ha-raft:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-ha-raft/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB load tests (com.arcadedb:arcadedb-load-tests:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-load-tests/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)

@claude

claude bot commented Apr 17, 2026

Code Review - PR #3798 (follow-up)

There are already thorough reviews on this PR. This comment focuses on two areas not yet addressed in the review thread.


Performance: 29x HA overhead warrants documentation before merge

The benchmark posted by @lvca shows:

Scenario | Throughput
1 server (no HA) - embedded | 1,546 op/s
3 servers (HA) - embedded on leader | 53 op/s
3 servers (HA) - concurrent (3 threads) | 77 op/s

A 29x throughput drop is expected for synchronous Raft consensus (each batch requires a majority-quorum round-trip over gRPC), but the absolute numbers matter for users: 53 op/s at batch-size 100 means roughly 5,300 single-record writes per second before the leader becomes the bottleneck.
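Both derived figures follow directly from the benchmark table. As a quick check, using the batch size of 100 quoted in the text and nothing else:

```java
// Reproduces the overhead ratio and effective write rate from the benchmark figures above.
final class HaThroughputMath {

  /** How many times slower the HA setup is than the standalone one. */
  static double overheadRatio(final double standaloneOpsPerSec, final double haOpsPerSec) {
    return standaloneOpsPerSec / haOpsPerSec;
  }

  /** Records written per second when each committed operation carries batchSize records. */
  static long recordsPerSecond(final long opsPerSec, final int batchSize) {
    return opsPerSec * batchSize;
  }

  public static void main(final String[] args) {
    System.out.printf("HA overhead: %.1fx%n", overheadRatio(1546, 53));
    System.out.println("Records/s on the leader: " + recordsPerSecond(53, 100));
  }
}
```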

Before merge, the HA guide should document:

  1. What batch sizes are recommended for write-heavy workloads?
  2. Is Ratis write-ahead log sync configured as SYNC or ASYNC? Async Ratis log improves throughput at the cost of a small data-loss window on leader crash.
  3. Do the Ratis RaftClientConfigKeys.Rpc.TIMEOUT_KEY defaults interact with arcadedb.ha.quorumTimeout, or is there duplication/mismatch?

Test coverage gaps in critical paths

Codecov reports 47.48% patch coverage with 1,548 lines missing. The most concerning gaps:

File | Coverage | Missing Lines
KubernetesAutoJoin.java | 0% | 85
SnapshotInstaller.java | 3.04% | 222
RaftClusterManager.java | 16.94% | 97
RaftPeerAddressResolver.java | 40.80% | 87+16

SnapshotInstaller is the most critical: it is the crash-recovery path for followers that fall too far behind, and previous reviews have flagged directory-swap atomicity and zip-bomb concerns. At 3% coverage, most edge-case paths are untested. The Docker e2e tests cover the happy path but do not appear to exercise the .snapshot-pending recovery marker or the 10 GB limit enforcement.

KubernetesAutoJoin at 0% is also notable - auto-join logic on scale-up is an operational safety concern (a misconfigured pod joining the wrong cluster is a data-integrity risk) and should have at least basic happy-path coverage.


Remaining items from today's review thread still open

Per the 18:08 UTC review today, these items remain:

  1. SnapshotHttpHandler - non-atomic plainHttpWarned - two-line fix with AtomicBoolean.getAndSet()
  2. PostVerifyDatabaseHandler - sequential peer fan-out - should be parallel with a shared deadline during incident response
  3. Missing @Tag("slow") on RaftHAComprehensiveIT, RaftLoadConvergenceIT, crash/recovery tests per CLAUDE.md conventions
  4. LINEARIZABLE semantics on read-only follower sessions - falls back to READ_YOUR_WRITES; needs a Javadoc note on the limitation

Items 3 and 4 are low-risk changes that should be addressed before merge.
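For items 1 and 2, a minimal sketch in plain JDK Java. plainHttpWarned mirrors the field named in item 1; verifyCall is a hypothetical stand-in for the per-peer HTTP request PostVerifyDatabaseHandler makes, and the deadline value is up to the caller:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

final class OpenItemSketches {

  // Item 1: warn once. getAndSet(true) is atomic, so exactly one caller ever observes false.
  static final class PlainHttpWarner {
    private final AtomicBoolean plainHttpWarned = new AtomicBoolean(false);

    boolean warnOnce() {
      if (plainHttpWarned.getAndSet(true))
        return false; // another thread already logged the warning
      System.err.println("WARNING: snapshot endpoint is serving over plain HTTP");
      return true;
    }
  }

  // Item 2: contact all peers in parallel; every wait draws on one shared deadline.
  static List<String> verifyPeers(final List<String> peers, final Function<String, String> verifyCall,
      final long deadlineMs, final ExecutorService pool) {
    final List<CompletableFuture<String>> futures = new ArrayList<>();
    for (final String peer : peers)
      futures.add(CompletableFuture.supplyAsync(() -> verifyCall.apply(peer), pool));

    final long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(deadlineMs);
    final List<String> results = new ArrayList<>();
    for (int i = 0; i < peers.size(); i++) {
      final long remaining = Math.max(0, deadlineNanos - System.nanoTime());
      try {
        results.add(futures.get(i).get(remaining, TimeUnit.NANOSECONDS));
      } catch (final TimeoutException e) {
        results.add(peers.get(i) + ": TIMEOUT");
      } catch (final ExecutionException e) {
        results.add(peers.get(i) + ": ERROR " + e.getCause());
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        results.add(peers.get(i) + ": INTERRUPTED");
      }
    }
    return results;
  }
}
```

Because the deadline is computed once, a slow first peer does not extend the time budget of the others: the whole fan-out finishes within roughly deadlineMs.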

@claude

claude bot commented Apr 17, 2026

Code Review - Replace custom Raft protocol with Apache Ratis

This is a very substantial and well-architected change. Replacing the custom Raft implementation with Apache Ratis is the right call - battle-tested consensus implementations are notoriously difficult to get right, and using Ratis brings pre-vote, parallel voting, and gRPC streaming essentially for free.

Below are observations organized by priority. Most are edge cases; the overall design quality is high.


High Priority

ArcadeDBStateMachine - Leader crash between Raft commit and commit2ndPhase()

The origin-skip optimization (leader skips re-applying its own log entries because they were applied via commit2ndPhase()) is correct in the steady state, but there is a narrow crash window: if the leader crashes after Ratis commits the entry (making it durable in the Raft log) but before commit2ndPhase() fully executes, on restart isLeader() returns false and the state machine will apply the entry again as a follower, causing a double-apply of WAL pages. Given WAL apply is documented as idempotent this may be safe in practice, but the comment acknowledging it says "could cause divergence" - it would be worth adding a specific test or stress scenario that kills the leader at this exact point and verifies the recovered state matches followers.
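The suggested stress scenario can be prototyped as a toy model before wiring it into a real crash test. This is not ArcadeDB code: it assumes, as a simplification, that idempotency comes from per-page version numbers (a page write is applied only if its version is newer than the stored page), and all names are hypothetical. Replaying the whole log after a prefix was already applied must produce the same pages as a clean apply:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model, NOT ArcadeDB code: log entries carry page images tagged with version numbers. */
final class ReplayModel {
  record PageWrite(int pageId, int version, String content) {}
  record LogEntry(long index, List<PageWrite> writes) {}
  record Page(int version, String content) {}

  final Map<Integer, Page> pages = new HashMap<>();

  void apply(final LogEntry entry) {
    for (final PageWrite w : entry.writes()) {
      final Page current = pages.get(w.pageId());
      if (current == null || current.version() < w.version())
        pages.put(w.pageId(), new Page(w.version(), w.content()));
      // else: this version is already present (e.g. written by commit2ndPhase() before the crash), so skip
    }
  }

  /** Simulates the crash window: the first 'alreadyApplied' entries are applied, then the whole log is applied again. */
  static ReplayModel crashAndReplay(final List<LogEntry> log, final int alreadyApplied) {
    final ReplayModel node = new ReplayModel();
    for (int i = 0; i < alreadyApplied; i++)
      node.apply(log.get(i)); // applied locally as leader before the crash
    for (final LogEntry e : log)
      node.apply(e);          // after restart as follower, committed entries are applied again
    return node;
  }
}
```

In the real test, the applied prefix would come from killing the leader between the Raft commit and commit2ndPhase(), and the comparison would run against the followers' database files.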

ArcadeDBStateMachine - Snapshot retry on quiet cluster

When applyTransaction() throws ReplicationException it sets needsSnapshotDownload and returns. But if the cluster goes idle after the failure, no further applyTransaction() calls occur to re-check the flag, so the follower stays in a stuck state until HealthMonitor fires. The comment says to rely on HealthMonitor, but the watchdog interval is electionTimeoutMax * WATCHDOG_MULTIPLIER which could be tens of seconds. Consider calling scheduleSnapshotDownload() directly from the catch block rather than depending on the watchdog.
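A sketch of what scheduleSnapshotDownload() could look like when called straight from the catch block, with a retry loop so a quiet cluster does not leave the follower stuck. downloadSnapshot is a hypothetical stand-in for the real installer, and the backoff values are illustrative:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Sketch: start (and retry) the snapshot download from the failure path itself. */
final class SnapshotResync {
  private static final long INITIAL_BACKOFF_MS = 100;

  private final AtomicBoolean needsSnapshotDownload = new AtomicBoolean(false);
  private final ScheduledExecutorService executor;
  private final BooleanSupplier downloadSnapshot; // hypothetical: true when the install succeeded
  private final long maxBackoffMs;

  SnapshotResync(final ScheduledExecutorService executor, final BooleanSupplier downloadSnapshot, final long maxBackoffMs) {
    this.executor = executor;
    this.downloadSnapshot = downloadSnapshot;
    this.maxBackoffMs = maxBackoffMs;
  }

  /** Called from the catch (ReplicationException) block in applyTransaction(). */
  void onApplyFailure() {
    if (needsSnapshotDownload.compareAndSet(false, true)) // only one download chain at a time
      attempt(0, Math.min(INITIAL_BACKOFF_MS, maxBackoffMs));
  }

  boolean isPending() {
    return needsSnapshotDownload.get();
  }

  private void attempt(final long delayMs, final long nextDelayMs) {
    executor.schedule(() -> {
      boolean ok;
      try {
        ok = downloadSnapshot.getAsBoolean();
      } catch (final RuntimeException e) {
        ok = false; // treat unexpected errors like a failed download
      }
      if (ok)
        needsSnapshotDownload.set(false);
      else // retry with capped exponential backoff, whether or not new writes arrive
        attempt(nextDelayMs, Math.min(nextDelayMs * 2, maxBackoffMs));
    }, delayMs, TimeUnit.MILLISECONDS);
  }
}
```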


Medium Priority

SnapshotInstaller - Hard-coded 10 GB per-entry limit

final long maxBytes = 10L * 1024 * 1024 * 1024; // line ~434

For clusters storing large graphs this will be hit. A 10 GB cap on a single database file will cause follower sync to fail silently. This should be exposed as a GlobalConfiguration setting (similar to HA_SNAPSHOT_DOWNLOAD_TIMEOUT) or removed entirely - the existing compression-ratio check already guards against decompression bombs.
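A sketch of the configurable variant that keeps today's 10 GB as the default and fails loudly with the entry name. The property name is hypothetical; in ArcadeDB this would be a GlobalConfiguration entry rather than a raw system property:

```java
/** Sketch: per-entry cap read from configuration, failing loudly with the entry name. */
final class SnapshotEntryLimits {
  // Hypothetical property name; in ArcadeDB this would be a GlobalConfiguration entry.
  static final String MAX_ENTRY_BYTES_PROP = "arcadedb.ha.snapshotMaxEntryBytes";
  static final long DEFAULT_MAX_ENTRY_BYTES = 10L * 1024 * 1024 * 1024; // today's hard-coded 10 GB

  /** Configured cap; a value <= 0 disables it (the compression-ratio check still applies). */
  static long maxEntryBytes() {
    final long configured = Long.getLong(MAX_ENTRY_BYTES_PROP, DEFAULT_MAX_ENTRY_BYTES);
    return configured > 0 ? configured : Long.MAX_VALUE;
  }

  /** Call after each extracted chunk; throws a descriptive error instead of letting the sync fail silently. */
  static void checkEntrySize(final String entryName, final long bytesExtracted, final long maxBytes) {
    if (bytesExtracted > maxBytes)
      throw new IllegalStateException("Snapshot entry '" + entryName + "' is larger than the configured limit of "
          + maxBytes + " bytes; raise " + MAX_ENTRY_BYTES_PROP + " or set it to 0 to disable the cap");
  }
}
```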

SnapshotHttpHandler - Watchdog teardown without flushing ZIP

The watchdog thread closes the underlying OutputStream to interrupt a stalled transfer, but does not call zipOut.finish() first. The receiving SnapshotInstaller will get a truncated or structurally invalid ZIP. Since SnapshotInstaller already validates the completion marker file, the follower will roll back correctly, but it would be cleaner to set a poison flag that the writer loop detects and calls finish() on, letting the ZIP close cleanly before the connection drops.
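A minimal sketch of the poison-flag approach using java.util.zip. The class and field names are hypothetical, and files are passed as in-memory byte arrays only to keep the example self-contained:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Sketch: the watchdog raises a flag; the writer loop sees it and finishes the ZIP before returning. */
final class AbortableZipWriter {
  final AtomicBoolean abortRequested = new AtomicBoolean(false); // set by the watchdog thread

  /** Returns true if every file was written, false if aborted; the archive is well-formed either way. */
  boolean write(final Map<String, byte[]> files, final OutputStream out) {
    final ZipOutputStream zipOut = new ZipOutputStream(out);
    boolean completed = true;
    try {
      for (final Map.Entry<String, byte[]> file : files.entrySet()) {
        if (abortRequested.get()) {
          completed = false; // stop between entries instead of dropping the connection mid-archive
          break;
        }
        zipOut.putNextEntry(new ZipEntry(file.getKey()));
        zipOut.write(file.getValue());
        zipOut.closeEntry();
      }
      zipOut.finish(); // writes the central directory, so the reader gets a structurally valid archive
      zipOut.flush();
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
    return completed;
  }
}
```

The flag is only checked between entries, so it helps when the writer is still making progress. If the writer thread is blocked inside a socket write it never reaches the check, and closing the stream from the watchdog remains the only way out. Either way, the completion marker is still what tells SnapshotInstaller that the archive is incomplete.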

RaftGroupCommitter - No backpressure on queue full

When pendingQueue is at capacity the code throws ReplicationQueueFullException immediately. This surfaces as a transaction failure to the client. Under a write spike a brief adaptive wait (e.g. 100-500ms with a second attempt before failing) would smooth over transient queue saturation without increasing steady-state latency.
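One way to add the brief wait, sketched with the JDK's BlockingQueue.offer(e, timeout, unit) in place of an explicit sleep-and-retry. Capacity and wait values are illustrative, and IllegalStateException stands in for ReplicationQueueFullException:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Sketch: wait briefly for queue space before rejecting, instead of failing immediately. */
final class BackpressureQueue<T> {
  private final BlockingQueue<T> pendingQueue;
  private final long maxWaitMs;

  BackpressureQueue(final int capacity, final long maxWaitMs) {
    this.pendingQueue = new ArrayBlockingQueue<>(capacity);
    this.maxWaitMs = maxWaitMs;
  }

  void submit(final T entry) {
    if (pendingQueue.offer(entry))
      return; // fast path: no waiting when there is room
    try {
      if (pendingQueue.offer(entry, maxWaitMs, TimeUnit.MILLISECONDS))
        return; // a slot freed up within the grace period
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    // Stand-in for ReplicationQueueFullException
    throw new IllegalStateException("Replication queue still full after waiting " + maxWaitMs + " ms");
  }

  T poll() {
    return pendingQueue.poll();
  }

  int size() {
    return pendingQueue.size();
  }
}
```

The non-blocking offer() keeps steady-state latency unchanged; only writers that actually hit a full queue pay the wait.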

RaftLogEntryCodec - No wire-format version byte

If the serialization format ever needs to change (new entry type, field reordering) there is no version tag to distinguish old from new entries in a mixed-version rolling upgrade window. Adding a single version byte immediately after the type byte would make future format evolution safe.
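A sketch of where the version byte would sit. The field layout (type, version, 4-byte payload length, payload) is hypothetical and not RaftLogEntryCodec's actual format:

```java
import java.nio.ByteBuffer;

/** Sketch of a [type][version][payload length][payload] header; the layout is hypothetical. */
final class VersionedEntryCodec {
  static final byte CURRENT_VERSION = 1;

  record Decoded(byte type, byte version, byte[] payload) {}

  static byte[] encode(final byte type, final byte[] payload) {
    final ByteBuffer buf = ByteBuffer.allocate(1 + 1 + 4 + payload.length);
    buf.put(type).put(CURRENT_VERSION).putInt(payload.length).put(payload);
    return buf.array();
  }

  static Decoded decode(final byte[] bytes) {
    final ByteBuffer buf = ByteBuffer.wrap(bytes);
    final byte type = buf.get();
    final byte version = buf.get();
    // A node must refuse entries written in a format newer than it understands, not misparse them
    if (version > CURRENT_VERSION)
      throw new IllegalArgumentException("Unsupported log entry version " + version + " (max " + CURRENT_VERSION + ")");
    final byte[] payload = new byte[buf.getInt()];
    buf.get(payload);
    return new Decoded(type, version, payload);
  }
}
```

Since this PR introduces the format, adding the byte before merge avoids ever having to handle unversioned entries in released Raft logs.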


Low Priority / Observations

SnapshotInstaller - URLConnection missing connect timeout

setReadTimeout is set, but setConnectTimeout is not. HttpURLConnection has no connect timeout unless one is set, so connect() (and getInputStream(), which connects implicitly) can block until the OS-level TCP connect timeout expires, which can be very long. Set setConnectTimeout explicitly, and consider wrapping the stream open in its own deadline so a stalled leader does not block the follower indefinitely before setReadTimeout kicks in.
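A sketch of opening the snapshot connection with both timeouts in place before any network I/O; the URL and timeout values are hypothetical. openConnection() does not connect, so the timeouts apply to the later connect() or getInputStream() call:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class SnapshotConnection {
  /** Opens (but does not connect) an HTTP GET with both timeouts configured. */
  static HttpURLConnection open(final String url, final int connectTimeoutMs, final int readTimeoutMs) {
    try {
      final HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
      conn.setRequestMethod("GET");
      conn.setConnectTimeout(connectTimeoutMs); // bounds the TCP connect; 0 would mean wait forever
      conn.setReadTimeout(readTimeoutMs);       // bounds each blocking read once connected
      return conn; // the caller's getInputStream() triggers the actual connect
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Neither timeout bounds the total transfer time; an overall deadline still needs a separate mechanism.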

ClusterTokenProvider - Weak root password leaks to auto-derived token

The token is derived from clusterName + rootPassword via PBKDF2. The cryptographic construction is correct (100k iterations, domain separation), but the overall security of the token is bounded by the entropy of rootPassword. The existing warning for the default cluster name is good; consider also warning (not blocking) when the root password is short (e.g. < 16 chars) in server mode.
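For reference, a sketch of the derivation with the suggested short-password warning, using the JDK's PBKDF2WithHmacSHA256. The salt layout (a fixed domain-separation prefix plus the cluster name) and the 16-character threshold are assumptions, not ClusterTokenProvider's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

final class ClusterTokenSketch {
  static final int ITERATIONS = 100_000;
  static final int MIN_RECOMMENDED_PASSWORD_LENGTH = 16; // threshold suggested in the review

  static String deriveToken(final String clusterName, final char[] rootPassword) {
    if (rootPassword.length < MIN_RECOMMENDED_PASSWORD_LENGTH)
      System.err.println("WARNING: root password is shorter than " + MIN_RECOMMENDED_PASSWORD_LENGTH
          + " chars; the derived cluster token is only as strong as the password");
    // Hypothetical domain-separation prefix: keeps this token distinct from other uses of the same password
    final byte[] salt = ("arcadedb-cluster-token:" + clusterName).getBytes(StandardCharsets.UTF_8);
    final PBEKeySpec spec = new PBEKeySpec(rootPassword, salt, ITERATIONS, 256);
    try {
      final byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded();
      return HexFormat.of().formatHex(key);
    } catch (final GeneralSecurityException e) {
      throw new IllegalStateException(e);
    } finally {
      spec.clearPassword(); // PBEKeySpec keeps its own copy of the password; wipe it
    }
  }

  /** Constant-time comparison of two tokens. */
  static boolean tokensMatch(final String expected, final String presented) {
    return MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8), presented.getBytes(StandardCharsets.UTF_8));
  }
}
```

The caller remains responsible for zeroing its own rootPassword array after the call.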

PeerAddressAllowlistFilter - DNS-based allowlist and planned mTLS

The comment already notes this is "NOT a substitute for mTLS" and references #3890. For visibility: until mTLS lands, any host that can spoof DNS or sit in the same DNS zone could potentially bypass the IP-based allowlist. The correct mitigation is the referenced mTLS work, and the current approach is an acceptable interim measure as long as the limitation is documented in the HA guide (it is - good).

KubernetesAutoJoin - tryAutoJoin() returns true even when no peer was reached

If all pods in the resolved DNS set are unreachable (e.g. during a full cluster cold start) the loop completes without a successful addPeer call but there is no signal to the caller. The node starts as a single-node cluster rather than blocking until at least one peer is reachable. This is probably intentional for bootstrapping, but it is worth a comment or a distinct return value so callers can distinguish "joined existing cluster" from "started fresh".
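A sketch of the distinct return value. The enum, method and parameter names are hypothetical, and tryAddPeer stands in for the real addPeer call:

```java
import java.util.List;
import java.util.function.Predicate;

/** Sketch: distinguish "joined an existing cluster" from "no peer reachable". Names are hypothetical. */
final class AutoJoinSketch {
  enum AutoJoinResult { JOINED_EXISTING_CLUSTER, NO_PEER_REACHABLE }

  /** tryAddPeer stands in for the real addPeer call and returns true on success. */
  static AutoJoinResult tryAutoJoin(final List<String> resolvedPeers, final Predicate<String> tryAddPeer) {
    for (final String peer : resolvedPeers)
      if (tryAddPeer.test(peer))
        return AutoJoinResult.JOINED_EXISTING_CLUSTER;
    // Every peer was unreachable (e.g. full-cluster cold start): the caller decides whether to bootstrap or wait
    return AutoJoinResult.NO_PEER_REACHABLE;
  }
}
```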


Positive Highlights

  • The Zip Slip and decompression bomb defenses in SnapshotInstaller are thorough and layered (path normalization, boundary checks, symlink detection, per-entry ratio limit).
  • PeerAddressAllowlistFilter uses AtomicReference<Set<String>> with rate-limited re-resolution - clean, efficient, and correct.
  • The crash-safety story in SnapshotInstaller using pending/complete marker files with rollback is well-thought-out.
  • ClusterTokenProvider uses PBKDF2-HMAC-SHA256 with 100k iterations and proper array zeroing - good cryptographic hygiene.
  • All new dependencies are Apache 2.0 licensed and properly reflected in ATTRIBUTIONS.md and NOTICE.
  • The @Tag("benchmark") and @Tag("slow") annotations are correctly applied to the new test classes.
  • The final keyword is used consistently throughout the new code, in line with the GC-pressure guidance in CLAUDE.md.

Overall this is a high-quality implementation of a genuinely hard problem. The issues above are mostly edge cases in failure modes, not steady-state correctness bugs. The test coverage (unit + integration + Docker e2e) is comprehensive for a change of this scope.

@claude

claude bot commented Apr 17, 2026

Code Review - Replace custom Raft protocol with Apache Ratis

This is a major, well-motivated architectural improvement. Replacing the custom RAFT implementation with Apache Ratis is the right call - a formally correct, battle-tested consensus library is far more reliable than a homegrown protocol. The implementation is thorough, well-documented, and includes impressive test coverage (unit, integration, Docker E2E with Toxiproxy). Here is my detailed feedback.


Highlights worth calling out

  • ClusterTokenProvider: PBKDF2-HMAC-SHA256 with 100k iterations, domain separation, char-array zeroing, and constant-time comparison in SnapshotHttpHandler - this is textbook secure token handling.
  • RaftGroupCommitter: The single-CAS STATE_PENDING -> STATE_DISPATCHED transition (replacing the old two-step check-then-set) correctly closes the race window. The two-phase deadline (timeoutMs + quorumTimeout) is the right approach for entries dispatched just before the first timeout fires.
  • Crash-safety analysis in applyTransactionEntry(): The (a)/(b)/(c) crash-point breakdown (lines 480-503) is exactly the level of documentation this kind of non-trivial ordering requirement deserves.
  • ArcadeDBStateMachine error handling: The three-tier catch (ReplicationException, unknown Throwable) with separate recovery paths (snapshot resync vs. emergency stop) is well-designed.
  • License compliance: Apache Ratis is Apache 2.0, ATTRIBUTIONS.md and NOTICE updated correctly.

Issues to fix

1. ReplicatedDatabase - fields should be private final

// Current (lines 91-93):
protected final ArcadeDBServer server;
protected final LocalDatabase proxied;
protected final long timeout;

These fields are set in the constructor and never modified, yet they are protected, which exposes them to every class in the package as well as to subclasses. Use private final unless subclasses genuinely need direct access; if they do, expose protected (or package-private) accessor methods instead.


2. SnapshotHttpHandler - semaphore not released on early exit between acquire and try block

Looking at lines 189-200:

if (!snapshotSemaphore.tryAcquire()) {
  // ...return 503...
}

ScheduledFuture<?> watchdog = null;
try {
  // ... rest of logic ...
} finally {
  if (watchdog != null) watchdog.cancel(false);
  snapshotSemaphore.release();
}

The only statement between the acquire and the try is ScheduledFuture<?> watchdog = null; which cannot throw, so this is actually safe. However, for clarity and future-proofing (e.g., if someone adds code between acquire and try), consider moving the acquire inside the try block or using a different pattern:

ScheduledFuture<?> watchdog = null;
boolean acquired = false;
try {
  acquired = snapshotSemaphore.tryAcquire();
  if (!acquired) {
    // ...respond 503...
    return;
  }
  // ... rest of logic ...
} finally {
  if (watchdog != null) watchdog.cancel(false);
  if (acquired) snapshotSemaphore.release();
}

3. RaftHAServer - fully qualified type name in public API

// Line 285:
public org.apache.ratis.util.LifeCycle.State getRaftLifeCycleState() {

The org.apache.ratis.util.LifeCycle.State type appears inline in the method signature rather than as an import. The same pattern recurs in restartRatisIfNeeded(). Adding import org.apache.ratis.util.LifeCycle; and using LifeCycle.State throughout would be cleaner and consistent with the project's "don't use fully qualified names if possible" guideline from CLAUDE.md.


4. ArcadeDBStateMachine.notifyInstallSnapshotFromLeader() uses ForkJoinPool.commonPool()

return CompletableFuture.supplyAsync(() -> {
  // ... heavy HTTP download + db.close() + db.reopen() ...
});

CompletableFuture.supplyAsync() without an explicit executor runs on ForkJoinPool.commonPool(). Snapshot downloads can be very large (potentially gigabytes; CLAUDE.md mentions large-snapshot tests), and a blocking download occupies a common-pool worker for its whole duration. Under concurrent load, when the common pool is busy, this task could be delayed significantly. Consider using the lifecycleExecutor that is already on the state machine, or a dedicated thread, so it runs independently of the common pool.
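The fix is the two-argument CompletableFuture.supplyAsync(supplier, executor). A sketch, assuming lifecycleExecutor is an ExecutorService (the review names it but not its type); the daemon single-thread executor here is illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class SnapshotInstallExecutor {
  // One daemon thread: installs run one at a time and never occupy common-pool workers.
  private final ExecutorService lifecycleExecutor = Executors.newSingleThreadExecutor(runnable -> {
    final Thread thread = new Thread(runnable, "snapshot-install");
    thread.setDaemon(true);
    return thread;
  });

  <T> CompletableFuture<T> submitInstall(final Supplier<T> install) {
    // Two-argument overload: runs on lifecycleExecutor, not ForkJoinPool.commonPool()
    return CompletableFuture.supplyAsync(install, lifecycleExecutor);
  }

  void shutdown() {
    lifecycleExecutor.shutdown();
  }
}
```

A single thread also serializes installs, so two of them cannot close and reopen the same database concurrently.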


5. Stale follower risk after failed snapshot download (acknowledged, but worth flagging)

In ArcadeDBStateMachine.applyTransaction() (lines 327-340), after a failed apply:

needsSnapshotDownload.set(true);
lifecycleExecutor.submit(() -> {
  if (needsSnapshotDownload.compareAndSet(true, false)) {
    snapshotInstaller.installDatabasesFromLeader();
    // If this fails, the flag stays false
  }
});

The comment correctly notes: "If the download itself fails, the flag stays false until the next apply failure re-arms it." On a quiet cluster with no new writes, the follower remains permanently stale. Consider adding a periodic health check in HealthMonitor that detects when lastAppliedIndex is significantly behind commitIndex and re-triggers the snapshot download if needsSnapshotDownload is false.
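A sketch of that periodic check. Because the snippet above clears the flag with compareAndSet(true, false) as soon as the download task starts, the check also skips while a download is running, otherwise it would queue a second one. lastAppliedIndex, commitIndex, downloadInProgress and the lag threshold are hypothetical accessors and values:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Sketch of a HealthMonitor-style check that re-arms the snapshot download for a stale follower. */
final class LagWatchdog {
  private final LongSupplier lastAppliedIndex;       // hypothetical accessor into the state machine
  private final LongSupplier commitIndex;            // hypothetical accessor into the Raft server
  private final long maxLag;                         // illustrative threshold, in log entries
  private final AtomicBoolean needsSnapshotDownload; // the flag from the snippet above
  private final BooleanSupplier downloadInProgress;  // hypothetical: true while an install is running
  private final Runnable scheduleSnapshotDownload;   // submits the download task

  LagWatchdog(final LongSupplier lastAppliedIndex, final LongSupplier commitIndex, final long maxLag,
      final AtomicBoolean needsSnapshotDownload, final BooleanSupplier downloadInProgress,
      final Runnable scheduleSnapshotDownload) {
    this.lastAppliedIndex = lastAppliedIndex;
    this.commitIndex = commitIndex;
    this.maxLag = maxLag;
    this.needsSnapshotDownload = needsSnapshotDownload;
    this.downloadInProgress = downloadInProgress;
    this.scheduleSnapshotDownload = scheduleSnapshotDownload;
  }

  /** One check; returns true if it re-armed the download. */
  boolean checkOnce() {
    if (downloadInProgress.getAsBoolean())
      return false;
    final long lag = commitIndex.getAsLong() - lastAppliedIndex.getAsLong();
    if (lag > maxLag && needsSnapshotDownload.compareAndSet(false, true)) {
      scheduleSnapshotDownload.run();
      return true;
    }
    return false;
  }

  void start(final ScheduledExecutorService scheduler, final long periodMs) {
    scheduler.scheduleAtFixedRate(() -> {
      try {
        checkOnce();
      } catch (final RuntimeException e) {
        // an exception escaping the task would cancel all future runs; log it in real code
      }
    }, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }
}
```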


6. @Deprecated method with no removal plan

// ArcadeDBStateMachine.java lines 934-940:
@Deprecated
static WALFile.WALTransaction parseWalTransaction(final Binary buffer) {
  return RaftLogEntryCodec.parseWalTransaction(buffer);
}

This deprecated method was added in the same PR that introduced its replacement. If no callers exist outside of this PR, remove it entirely rather than deprecating it immediately.


7. RaftGroupCommitter.submitAndWait() - implicit caller contract

The Javadoc comment states: "all current callers pass timeoutMs == quorumTimeout". This is an implicit coupling that future callers could violate. Either enforce it in the method (assert or validate), or rename the parameter to make its relationship to quorumTimeout clearer.
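A sketch of the validating variant. The class and method names are hypothetical, and the returned budget is the two-phase timeoutMs + quorumTimeout deadline described in the highlights:

```java
/** Sketch: make the implicit "timeoutMs == quorumTimeout" contract explicit. Names are hypothetical. */
final class GroupCommitterContract {
  private final long quorumTimeoutMs;

  GroupCommitterContract(final long quorumTimeoutMs) {
    this.quorumTimeoutMs = quorumTimeoutMs;
  }

  /** Returns the two-phase deadline budget after rejecting callers that break the contract. */
  long deadlineBudgetMs(final long timeoutMs) {
    if (timeoutMs != quorumTimeoutMs)
      throw new IllegalArgumentException(
          "timeoutMs (" + timeoutMs + ") must equal quorumTimeout (" + quorumTimeoutMs + ")");
    return timeoutMs + quorumTimeoutMs;
  }
}
```

If no caller ever needs a different value, dropping the parameter and reading quorumTimeout internally removes the coupling entirely.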


8. SnapshotInstaller uses HttpURLConnection for potentially-gigabyte downloads

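In the JDK implementation, HttpURLConnection streams the response body through getInputStream() rather than buffering a download in memory. setChunkedStreamingMode() and setFixedLengthStreamingMode() only affect the request body (which is otherwise buffered before sending), so they do not apply here. For large databases the OOM risk comes from reading the whole response into memory, e.g. via readAllBytes() or a ByteArrayOutputStream. Verify that SnapshotInstaller copies the stream to disk in fixed-size chunks, using InputStream-based streaming throughout without materializing the full response.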
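A sketch of the chunked copy to disk. The 64 KiB buffer size is an arbitrary choice, and InputStream.transferTo() would work equally well if no per-chunk checks (such as a size cap) are needed:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class StreamingDownload {
  /** Copies the stream to disk in 64 KiB chunks; memory use stays constant regardless of snapshot size. */
  static long copyToFile(final InputStream in, final Path target) {
    final byte[] buffer = new byte[64 * 1024];
    long total = 0;
    try (OutputStream out = Files.newOutputStream(target)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
        total += read; // a per-chunk size limit check would go here
      }
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
    return total;
  }
}
```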


Minor / non-blocking observations

  • RaftHAServer.getMessagesInQueue() returns 0 with a comment "Ratis manages its own replication queue internally". This is fine, but consider removing the method if it's only there for test infrastructure compatibility - it misleads monitoring tooling.
  • In ArcadeDBStateMachine.applyTransaction(), the switch statement (lines 272-279) has no default branch. Since unrecognized types are handled above via the type == null check, this is correct. However, RaftLogEntryType.fromCode() returning null for known-but-unimplemented types (as opposed to truly unknown types) could silently skip entries. Document this contract in RaftLogEntryType.fromCode().
  • ArcadeDBStateMachine line 104-105: the comment says currentNodeIsLeader is "non-critical" and raftHA.isLeader() is used for correctness. The field name might be clearer as cachedIsLeader to make the caching intent obvious.
  • RaftPropertiesBuilder is not shown in this review but verify that all Ratis storage paths use the peer-specific localPeerId.toString() subdirectory to prevent in-JVM test collisions (same pattern used in getAppliedIndexFile()).
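For the fromCode() point, one possible shape for the documented contract. The entry names come from the PR description, but the enum name and byte codes are hypothetical (the real RaftLogEntryType is not shown in this review), and the Javadoc wording is only a suggestion:

```java
/** Sketch: document what fromCode() returns. Enum name and byte codes are hypothetical. */
enum RaftLogEntryTypeSketch {
  TRANSACTION((byte) 1),
  TRANSACTION_FORWARD((byte) 2);

  private final byte code;

  RaftLogEntryTypeSketch(final byte code) {
    this.code = code;
  }

  byte code() {
    return code;
  }

  /**
   * Maps a wire code to its entry type.
   *
   * @return the matching type, or {@code null} if this node does not recognize the code
   *         (for example, an entry written by a newer version). Callers should treat
   *         {@code null} as "cannot apply", not as "safe to skip".
   */
  static RaftLogEntryTypeSketch fromCode(final byte code) {
    for (final RaftLogEntryTypeSketch type : values())
      if (type.code == code)
        return type;
    return null;
  }
}
```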

Test coverage

Excellent. The combination of:

  • 4 unit tests for serialization round-trips
  • Integration tests for raw consensus, replication, comprehensive scenarios
  • ReadConsistencyIT covering all three consistency levels
  • Docker E2E tests with Toxiproxy for latency/packet-loss injection
  • HASplitBrainIT for 5-node split brain

...is significantly better coverage than most HA implementations. The use of @Tag("slow") on long-running tests is also correct per CLAUDE.md.


Overall this is high-quality work. The critical issues (#1, #3) are easy fixes. Issue #4 (ForkJoinPool) and #5 (stale follower re-arm) are worth addressing before merge given their potential impact in production. The rest are improvements or minor clarifications.

🤖 Generated with Claude Code

@github-actions
Contributor

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

Lists of 386 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.11.0 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB HA Raft (com.arcadedb:arcadedb-ha-raft:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-ha-raft/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB load tests (com.arcadedb:arcadedb-load-tests:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-load-tests/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)

     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)

@codacy-production

codacy-production bot commented Apr 17, 2026

Not up to standards ⛔

🔴 Issues 2 high · 12 medium · 33 minor

Alerts:
⚠ 47 issues (≤ 0 issues of at least minor severity)

Results:
47 new issues

Category Results
BestPractice 11 medium
Documentation 11 minor
ErrorProne 2 high
CodeStyle 22 minor
Complexity 1 medium

🟢 Metrics 1261 complexity

Metric Results
Complexity 1261

TIP This summary will be updated as you push new changes. Give us feedback

@claude

claude bot commented Apr 17, 2026

Code Review - Apache Ratis HA Integration

This is a significant and well-motivated architectural improvement. Replacing a bespoke, under-tested Raft implementation with Apache Ratis (battle-tested in Apache Ozone, IoTDB, Alluxio) is the right call. The test suite depth - especially the Toxiproxy-based chaos tests - is excellent. Below are issues grouped by severity.


Critical / Data Safety

1. Leader commits locally before achieving quorum (ReplicatedDatabase.java)

The PR description states that tx.commit2ndPhase(phase1) is called on the leader before raftHA.replicateTransaction() succeeds. This breaks the fundamental Raft guarantee: a transaction is only considered committed after a quorum of nodes acknowledges it. Under the current design, a leader crash between local commit and quorum ack causes the write to be visible to the client (HTTP 200) but absent from any follower. The data is permanently lost.

This pattern may be intentional for performance reasons, but it needs to be prominently documented as a durability trade-off, not just a performance one. The new docs/arcadedb-ha-26.4.1.md should explicitly warn: "Under the default commit mode, confirmed writes may be lost if the leader crashes before replication completes." Consider adding an arcadedb.ha.strictCommit option that defers client confirmation until quorum ack.

2. Ratis shading conflict with arcadedb-grpcw module

Ratis ships with a shaded (relocated) gRPC under org.apache.ratis.thirdparty. The arcadedb-grpcw module also includes gRPC. At runtime, two copies of gRPC therefore coexist: the relocated one used by Ratis internals and the unshaded one used by ArcadeDB's gRPC wire protocol. Because the relocated classes live in a different package, any process-wide state (e.g. ManagedChannelRegistry, NameResolverRegistry) exists once per copy instead of being shared, which can cause subtle failures that are hard to reproduce. Verify this is not an issue, or document why it is safe, especially in the e2e tests that run both HA and gRPC simultaneously.


Security

3. ClusterTokenProvider - constant-time comparison

Token comparison must use MessageDigest.isEqual() (constant-time) rather than String.equals(). String.equals() returns as soon as it finds a length mismatch or the first differing character, leaking token length and prefix information via a timing side channel. This is particularly important since the cluster token is derived from rootPassword: a timing oracle on token comparison becomes a partial oracle on the root password.

// Vulnerable:
if (!expectedToken.equals(providedToken)) { ... }

// Safe:
if (!MessageDigest.isEqual(expectedToken.getBytes(StandardCharsets.UTF_8),
                           providedToken.getBytes(StandardCharsets.UTF_8))) { ... }

4. ClusterTokenProvider - weak key derivation

Deriving a shared secret as hash(clusterName + rootPassword) is a weak KDF:

  • No salt means rainbow-table attacks are possible if the token leaks.
  • The derivation algorithm (SHA-256 alone, or HMAC?) is not visible in the description; if it is just SHA256(clusterName + rootPassword), it is susceptible to length-extension attacks.
  • The rootPassword in the token derivation creates a coupling: changing the root password silently rotates the cluster token and breaks all in-flight inter-node requests until every node restarts.

Use HMAC-SHA256(key=rootPassword, data=clusterName + ":" + timestamp-epoch-day) or at minimum add a cluster-specific salt stored on disk. This also allows key rotation without a full cluster restart.
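A minimal sketch of the suggested derivation, assuming the token is the hex-encoded HMAC-SHA256 of clusterName + ":" + epochDay keyed with the root password. The class and method names are hypothetical and do not describe ArcadeDB's ClusterTokenProvider; only javax.crypto.Mac and java.util.HexFormat (Java 17+) are standard APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper illustrating the suggested KDF; not ArcadeDB's ClusterTokenProvider.
final class ClusterTokenSketch {
  private ClusterTokenSketch() {
  }

  /** Hex-encoded HMAC-SHA256(key = rootPassword, data = clusterName + ":" + epochDay). */
  static String deriveToken(final String rootPassword, final String clusterName, final long epochDay) {
    try {
      final Mac mac = Mac.getInstance("HmacSHA256");
      // SecretKeySpec rejects an empty key, so an empty rootPassword fails fast here.
      mac.init(new SecretKeySpec(rootPassword.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
      final byte[] tag = mac.doFinal((clusterName + ":" + epochDay).getBytes(StandardCharsets.UTF_8));
      return HexFormat.of().formatHex(tag); // 32-byte tag -> 64 hex characters
    } catch (final GeneralSecurityException e) {
      // HmacSHA256 is a mandatory JCA algorithm, so reaching this means a broken runtime.
      throw new IllegalStateException("HmacSHA256 unavailable", e);
    }
  }

  /** Token for the current UTC day; it changes at midnight UTC. */
  static String currentToken(final String rootPassword, final String clusterName) {
    return deriveToken(rootPassword, clusterName, LocalDate.now(ZoneOffset.UTC).toEpochDay());
  }
}
```

Because the epoch-day component rotates the token daily, a verifier built this way would probably need to accept both the current and the previous day's token to tolerate requests that cross midnight UTC.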

5. PeerAddressAllowlistFilter - IP-only authentication

Filtering by IP address provides a very shallow security boundary - IPs can be spoofed and containers often share NAT. Without mTLS (tracked in issue #3890), peer identity is entirely unverified. This is acceptable as a known gap for the initial integration, but a security warning should be added to the documentation and the configuration.

6. KubernetesAutoJoin - trust boundary

The auto-join logic queries the Kubernetes API to discover peers. Please verify:

  • It uses a namespaced ServiceAccount token with minimal RBAC (list pods in the HA namespace only).
  • The discovered peer list is validated against PeerAddressAllowlistFilter before joining.
  • KubernetesAutoJoin is disabled (or no-ops) when arcadedb.ha.server.list is explicitly configured, to prevent accidental cluster merges.

Correctness / Bugs

7. Missing HTTP timeout in proxyToLeader()

HttpURLConnection has no connectTimeout or readTimeout set in the leader-proxy path. If the leader becomes unreachable mid-election, follower threads will block indefinitely, exhausting the HTTP handler thread pool. SnapshotHttpHandler correctly sets setConnectTimeout(30_000) and setReadTimeout(300_000) - apply the same pattern to proxyToLeader().
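A minimal sketch of the same pattern applied to the proxy path. The helper name is hypothetical; the timeout values are the SnapshotHttpHandler ones quoted above, and a proxied query may well deserve a shorter read timeout than a snapshot download.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for the follower-to-leader proxy path; names are illustrative.
final class LeaderProxySketch {
  static final int CONNECT_TIMEOUT_MS = 30_000;
  static final int READ_TIMEOUT_MS = 300_000;

  private LeaderProxySketch() {
  }

  /** Opens (but does not yet connect) a connection whose connect and read phases are both bounded. */
  static HttpURLConnection openToLeader(final String leaderUrl) {
    try {
      final HttpURLConnection conn = (HttpURLConnection) URI.create(leaderUrl).toURL().openConnection();
      conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // SocketTimeoutException if the leader cannot be reached
      conn.setReadTimeout(READ_TIMEOUT_MS);       // bounds each blocking read once connected
      return conn;
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```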

8. HALog.java - System.out.println in production code

HALog.log() writes to System.out in addition to the SLF4J logger. This violates project conventions (CLAUDE.md: "remove any System.out you used for debug when you have finished"). It also logs every message twice: once via System.out (not captured by log aggregators) and once via the logger.

9. HALog.java - no-op String.replace call

String.format(message.replace("%s", "%s").replace("%d", "%s"), args)

message.replace("%s", "%s") is a no-op. The intent appears to be normalising %d to %s so that String.format works with any argument type, but the first replace does nothing and the logic is confusing. Remove the first replace.
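A sketch covering #8 and #9 together: a single logging sink and only the meaningful %d-to-%s rewrite. java.util.logging stands in for the project's logger here, and HALogSketch is a hypothetical name, not the real HALog API.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for HALog; java.util.logging replaces the project's logger.
final class HALogSketch {
  private static final Logger LOGGER = Logger.getLogger(HALogSketch.class.getName());

  private HALogSketch() {
  }

  /** Rewrites %d to %s so any argument type formats; the no-op replace("%s", "%s") is gone. */
  static String format(final String message, final Object... args) {
    // With no arguments the message is returned untouched, so a literal '%' cannot break formatting.
    return args.length == 0 ? message : String.format(message.replace("%d", "%s"), args);
  }

  static void log(final Level level, final String message, final Object... args) {
    // One sink only: no System.out, so each message is emitted once and reaches log aggregators.
    if (LOGGER.isLoggable(level))
      LOGGER.log(level, format(message, args));
  }
}
```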

10. ArcadeDBStateMachine.applyTransaction() - missing default in switch

The switch on RaftLogEntryType has no default branch. If RaftLogEntryType.fromCode() is extended in a future release with a new type and a node is upgraded asymmetrically (rolling upgrade scenario), the follower will silently skip unrecognised entries without any warning or error. Add a default case that logs a warning and increments an error counter.
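A sketch of the suggested default branch. EntryType, its byte codes, and ApplySketch are hypothetical stand-ins for RaftLogEntryType and the apply path; the sketch assumes fromCode() returns null for an unknown code, which the real method may not do.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for RaftLogEntryType; the codes are invented.
enum EntryType {
  TRANSACTION((byte) 1),
  TRANSACTION_FORWARD((byte) 2);

  final byte code;

  EntryType(final byte code) {
    this.code = code;
  }

  /** Returns null for a code this node does not know, e.g. one written by a newer version. */
  static EntryType fromCode(final byte code) {
    for (final EntryType type : values())
      if (type.code == code)
        return type;
    return null;
  }
}

// Hypothetical apply path; returns a description instead of applying, for illustration.
final class ApplySketch {
  private static final Logger LOGGER = Logger.getLogger(ApplySketch.class.getName());
  final AtomicLong unknownEntries = new AtomicLong();

  String apply(final byte code) {
    final EntryType type = EntryType.fromCode(code);
    if (type == null) // switching on a null enum would throw NullPointerException
      return skipUnknown(code);
    return switch (type) {
      case TRANSACTION -> "applied transaction";
      case TRANSACTION_FORWARD -> "applied forwarded transaction";
      default -> skipUnknown(code); // only reachable if a constant is added without a case
    };
  }

  private String skipUnknown(final byte code) {
    unknownEntries.incrementAndGet();
    LOGGER.log(Level.WARNING, "Unknown Raft log entry type code {0}; entry skipped", code);
    return "skipped";
  }
}
```

Logging and counting makes the skip visible, but a follower that skips an entry its peers applied no longer holds the same data; stopping the node instead is the stricter alternative.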

11. e2e-ha module duplication

There are now two e2e test modules: e2e/ (modified) and e2e-ha/ (new). They appear to cover overlapping HA scenarios. This risks test drift where the same scenario is tested with different configurations in two places. Consolidate or add a comment explaining why two separate modules are needed (different container images? different Maven lifecycle phases?).


Performance

12. SnapshotInstaller - HttpURLConnection may buffer full response in memory

HttpURLConnection on some JVM implementations buffers the entire response before making it available to the InputStream. For large databases this could cause an OutOfMemoryError in the follower catch-up path. Verify that the server streams the snapshot (for example with chunked transfer encoding; note that setChunkedStreamingMode() is a client-side HttpURLConnection setting that only affects the request body) and that the installer reads the stream incrementally. Consider using a fixed-size byte buffer copy loop (IOUtils.copyLarge or manual 64KB chunks) rather than reading the full content at once.
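The manual 64KB copy loop suggested above can be sketched as follows; StreamCopySketch is a hypothetical name. On Java 9+, InputStream.transferTo(OutputStream) also copies in chunks with an internal buffer and is a reasonable alternative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: bounded-memory copy of a snapshot stream to its destination.
final class StreamCopySketch {
  static final int BUFFER_SIZE = 64 * 1024; // 64 KB

  private StreamCopySketch() {
  }

  /** Copies in fixed-size chunks, so heap use stays at BUFFER_SIZE whatever the snapshot size. */
  static long copy(final InputStream in, final OutputStream out) throws IOException {
    final byte[] buffer = new byte[BUFFER_SIZE];
    long total = 0;
    int read;
    while ((read = in.read(buffer)) != -1) {
      out.write(buffer, 0, read);
      total += read;
    }
    return total;
  }
}
```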

13. RaftGroupCommitter - verify no busy-wait on quorum timeout

submitAndWait() blocks the calling thread until quorum ack or timeout. Confirm that the wait uses LockSupport.parkNanos / CompletableFuture.get(timeout) and not a Thread.sleep / polling loop. The latter wastes a thread for the entire quorum timeout window under high concurrency.
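A sketch of a wait that parks instead of polling, assuming the quorum acknowledgement is exposed as a CompletableFuture (an assumption; the RaftGroupCommitter internals are not shown in this review). CompletableFuture.get(timeout, unit) blocks the caller without consuming CPU and throws TimeoutException when the time is up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical shape of submitAndWait(); not the actual RaftGroupCommitter code.
final class QuorumWaitSketch {
  private QuorumWaitSketch() {
  }

  /** Parks the caller until the ack arrives or the timeout elapses; no sleep/poll loop. */
  static <T> T awaitQuorum(final CompletableFuture<T> ack, final long timeoutMs) throws TimeoutException {
    try {
      return ack.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      throw new IllegalStateException("Interrupted while waiting for quorum", e);
    } catch (final ExecutionException e) {
      throw new IllegalStateException("Replication failed", e.getCause());
    }
  }
}
```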


Module Structure

14. ha-raft dependency scope

Per CLAUDE.md:

Wire protocol modules must use provided scope for arcadedb-server dependency

Verify that ha-raft/pom.xml declares arcadedb-server with scope=provided and arcadedb-server:test-jar with scope=test. The ha-raft module is a plugin, not a wire protocol module, but the same principle applies: server is the assembly point and must not be pulled in transitively.


Minor / Non-blocking

  • RaftHAServer.getMessagesInQueue() always returns 0 with a comment that Ratis manages its own queue. If this method only exists for API compatibility, consider deprecating it or removing it - misleading metrics are worse than no metrics.
  • docs/arcadedb-ha-26.4.1.md is committed in the PR. Docs in docs/ typically track the released version. Confirm this is intentional (pre-release docs) and will not be overwritten by the standard release process.
  • The PR title says "Apache Ratis" but the description mentions both "3.2.2" and "3.2.1". Pin the exact version in the description to avoid confusion.
  • Several benchmark test files get @Tag("benchmark") annotations added - this is correct and appreciated, consistent with CLAUDE.md conventions.

Summary

Priority Issue
Critical #1 Leader-before-quorum durability gap needs clear documentation + opt-in strict mode
Critical #2 Ratis shaded gRPC conflict with arcadedb-grpcw - needs validation
High #3 Timing-attack-vulnerable token comparison
High #7 Missing HTTP timeout in proxyToLeader
High #8 System.out.println in HALog (must fix per CLAUDE.md)
Medium #4 Weak KDF for cluster token
Medium #10 Silent skip of unknown log entry types
Medium #12 SnapshotInstaller potential OOM
Low #9, #11, #13, #14 Various code quality items

The test coverage - particularly the Toxiproxy chaos tests and split-brain scenarios - is genuinely excellent. Issues #3, #7, and #8 are straightforward to fix. Issue #1 (durability semantics) is the most architecturally significant and deserves explicit documentation even if the current behavior is intentional.

🤖 Generated with Claude Code

@github-actions
Contributor

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

Lists of 386 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.11.0 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB HA Raft (com.arcadedb:arcadedb-ha-raft:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-ha-raft/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB load tests (com.arcadedb:arcadedb-load-tests:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-load-tests/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)

@claude

claude bot commented Apr 17, 2026

Code Review

This is an excellent, well-engineered replacement of the custom Raft implementation with Apache Ratis. The code quality is very high: thorough Javadoc, clear crash-safety analysis, strong security posture, and a well-factored module structure. The following observations are in rough priority order.


Bugs / Correctness

1. Emergency-stop thread is not a daemon thread (ArcadeDBStateMachine.java:356-363)

In the catch (Throwable) branch of applyTransaction(), the stop thread is started without setDaemon(true):

final Thread stopThread = new Thread(() -> {
    try { server.stop(); } catch (final Throwable t) { ... }
}, "arcadedb-emergency-stop");
stopThread.start(); // no setDaemon(true)

Every other background stop/restart thread in the module calls setDaemon(true) (e.g., RaftHAServer.restartRatisIfNeeded():333, ArcadeDBStateMachine.notifyServerShutdown():919). Without it, if server.stop() hangs in the emergency path, the JVM will also hang rather than exiting cleanly. This is the most dangerous code path in the file.
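A sketch of the fix, with the emergency-stop logic extracted into a hypothetical helper where stopAction stands in for server.stop(). setDaemon(true) must be called before start(), otherwise Thread throws IllegalThreadStateException.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical extraction of the emergency-stop path; stopAction stands in for server.stop().
final class EmergencyStopSketch {
  private static final Logger LOGGER = Logger.getLogger(EmergencyStopSketch.class.getName());

  private EmergencyStopSketch() {
  }

  static Thread startEmergencyStop(final Runnable stopAction) {
    final Thread stopThread = new Thread(() -> {
      try {
        stopAction.run();
      } catch (final Throwable t) {
        LOGGER.log(Level.SEVERE, "Emergency stop failed", t);
      }
    }, "arcadedb-emergency-stop");
    stopThread.setDaemon(true); // before start(): a hung stop no longer keeps the JVM alive
    stopThread.start();
    return stopThread;
  }
}
```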

2. DatabaseContext.INSTANCE.init(db) called without cleanup in createNewFiles and removeDroppedFiles (ArcadeDBStateMachine.java:667,691)

Both methods call DatabaseContext.INSTANCE.init(db) on the Ratis apply thread but never reset or remove the context after use. Ratis reuses its apply thread across entries. Other callers in the engine rely on the same pattern without reset, so this may be intentional - but it is worth confirming that the Ratis apply thread running with a stale database context between entries is safe. If the thread ever serves a different database after the apply (e.g., during a multi-database write storm), the stale context could be visible.
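If the context turns out to need cleanup, the usual shape is a try/finally that restores the previous binding. The sketch below uses a plain ThreadLocal as a hypothetical stand-in for DatabaseContext, whose actual API (including whether it offers a remove or reset call) is not shown in this review.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a thread-bound context such as DatabaseContext.
final class ApplyContextSketch {
  private static final ThreadLocal<String> CURRENT_DB = new ThreadLocal<>();

  private ApplyContextSketch() {
  }

  static String current() {
    return CURRENT_DB.get();
  }

  /** Binds dbName for one apply, then restores the previous state so the reused apply thread carries no stale context. */
  static <T> T withDatabase(final String dbName, final Supplier<T> work) {
    final String previous = CURRENT_DB.get();
    CURRENT_DB.set(dbName);
    try {
      return work.get();
    } finally {
      if (previous == null)
        CURRENT_DB.remove();
      else
        CURRENT_DB.set(previous);
    }
  }
}
```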

3. Misleading comment in RaftHAPlugin.configure() (RaftHAPlugin.java:69-72)

The comment reads:

// RaftHAServer.configure() called server.setHA(raftServer); override it so that
// server.getHA() returns this plugin (the HAPlugin contract), not the internal raftServer.

RaftHAServer.configure() never calls server.setHA() - the comment describes a scenario that does not happen. The intent is valid, but the explanation is incorrect and could mislead future maintainers.


Security

4. constantTimeTokenEquals length-leaks on mismatched tokens (SnapshotHttpHandler.java:362-364)

private static boolean constantTimeTokenEquals(final String expected, final String provided) {
    return MessageDigest.isEqual(
        expected.getBytes(StandardCharsets.UTF_8),
        provided.getBytes(StandardCharsets.UTF_8));
}

Depending on the JDK release, MessageDigest.isEqual may return early when the array lengths differ (older releases do; recent ones document a running time that depends only on the length of the first argument), which could leak whether the attacker's token has the correct byte length. In practice both tokens are always 64-char hex strings, so this is not currently exploitable. However, the invariant is implicit: an assertion or comment documenting that both sides must always be 64 bytes would prevent a silent regression if the token format ever changes.
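One way to remove the invariant instead of documenting it (an alternative, not what the PR does): hash both sides to fixed-length SHA-256 digests before comparing, so isEqual always sees two 32-byte arrays. Hashing still takes time proportional to the provided token's length, but the caller already knows that length. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical length-independent variant of constantTimeTokenEquals.
final class TokenCompareSketch {
  private TokenCompareSketch() {
  }

  static boolean constantTimeTokenEquals(final String expected, final String provided) {
    if (expected == null || provided == null)
      return false;
    try {
      final MessageDigest sha = MessageDigest.getInstance("SHA-256");
      // digest(byte[]) resets the instance afterwards, so it can be reused for the second input.
      final byte[] a = sha.digest(expected.getBytes(StandardCharsets.UTF_8));
      final byte[] b = sha.digest(provided.getBytes(StandardCharsets.UTF_8));
      return MessageDigest.isEqual(a, b); // always 32 vs 32 bytes
    } catch (final NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is required on every Java platform", e);
    }
  }
}
```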

5. gRPC port authentication gap in K8s mode (logged but no startup guard)

The startService() warning correctly flags that the gRPC Raft port has no authentication in K8s mode. The only mitigation is an operator-applied NetworkPolicy. Consider whether a startup validation (e.g., refusing to start unless SSL is on or the inbound host is not 0.0.0.0) would be a better default than a log-only warning. A misconfigured cluster would silently admit unauthenticated Raft traffic.
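A sketch of such a startup guard. The parameter names are hypothetical and do not correspond to actual ArcadeDB settings; a real implementation might also offer an explicit opt-out for operators who rely on a NetworkPolicy.

```java
// Hypothetical startup check; parameter names are illustrative, not ArcadeDB settings.
final class RaftPortGuardSketch {
  private RaftPortGuardSketch() {
  }

  static void validate(final boolean kubernetesMode, final boolean sslEnabled, final String raftBindHost) {
    final boolean wildcardBind = "0.0.0.0".equals(raftBindHost) || "::".equals(raftBindHost);
    if (kubernetesMode && !sslEnabled && wildcardBind)
      throw new IllegalStateException("Refusing to start: the Raft gRPC port would accept unauthenticated traffic on "
          + raftBindHost + ". Enable SSL or bind to a specific interface.");
  }
}
```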


Read Consistency

6. waitForLocalApply() times out silently, degrading READ_YOUR_WRITES to EVENTUAL (RaftHAServer.java:639)

if (remaining <= 0) {
    HALog.log(this, HALog.DETAILED, "waitForLocalApply timed out...");
    return; // silent
}

On timeout the method returns without throwing, so waitForReadConsistency() in ReplicatedDatabase proceeds to serve the read with potentially stale data without any error signal to the caller. This may be intentional (degraded availability over hard failure), but the degradation at least deserves a higher-priority log level (FINE or WARNING) so it is visible in production logs.
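A sketch of a wait that reports the timeout to its caller instead of returning silently. The tracker and its method names are hypothetical, not taken from RaftHAServer; returning false lets the read path choose between throwing and serving with an explicit degradation signal.

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical apply-index tracker; names do not come from RaftHAServer.
final class ApplyWaiterSketch {
  private static final Logger LOGGER = Logger.getLogger(ApplyWaiterSketch.class.getName());
  private long lastApplied;

  /** Called by the state machine after each entry is applied. */
  synchronized void onApplied(final long index) {
    if (index > lastApplied) {
      lastApplied = index;
      notifyAll(); // wake readers waiting for this or an earlier index
    }
  }

  /** Returns false, after logging at WARNING, when index was not applied before the timeout. */
  synchronized boolean waitForLocalApply(final long index, final long timeoutMs) throws InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (lastApplied < index) {
      final long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        LOGGER.log(Level.WARNING, "waitForLocalApply timed out (applied={0}, required={1}); read may be stale",
            new Object[] { lastApplied, index });
        return false;
      }
      TimeUnit.NANOSECONDS.timedWait(this, remaining); // releases the monitor while waiting
    }
    return true;
  }
}
```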

7. LINEARIZABLE on a follower without a bookmark silently degrades to READ_YOUR_WRITES

The Javadoc in ReplicatedDatabase.waitForReadConsistency() documents this clearly as a "KNOWN LIMITATION", which is good. However, an HTTP client that sends X-ArcadeDB-Read-Consistency: LINEARIZABLE to a follower without a bookmark gets READ_YOUR_WRITES semantics without any indication. Consider returning an explicit error or a response header so callers can detect the degradation rather than silently accepting weaker guarantees.
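A sketch of the response-header option. The enum mirrors the consistency levels named above, while the helper class and the X-ArcadeDB-Read-Consistency-Served header are invented for illustration only.

```java
import java.util.Map;

enum ReadConsistency { EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE }

// Hypothetical follower-side helper; the header name is invented for illustration.
final class FollowerReadSketch {
  static final String SERVED_HEADER = "X-ArcadeDB-Read-Consistency-Served";

  private FollowerReadSketch() {
  }

  /** Level a follower can actually honor: LINEARIZABLE without a bookmark becomes READ_YOUR_WRITES. */
  static ReadConsistency effective(final ReadConsistency requested, final boolean hasBookmark) {
    return requested == ReadConsistency.LINEARIZABLE && !hasBookmark ? ReadConsistency.READ_YOUR_WRITES : requested;
  }

  /** Extra response headers: empty when the requested level was honored. */
  static Map<String, String> responseHeaders(final ReadConsistency requested, final boolean hasBookmark) {
    final ReadConsistency served = effective(requested, hasBookmark);
    return served == requested ? Map.of() : Map.of(SERVED_HEADER, served.name());
  }
}
```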


Performance / Operational

8. Schema reload blocks the Ratis apply thread (ArcadeDBStateMachine.java:530-535)

db.getSchema().getEmbedded().load(...) and initComponents() are called synchronously on the Ratis state machine apply thread. For large schemas, these calls may take tens of milliseconds. Since Ratis serialises apply callbacks, this blocks all subsequent applies for the duration. A deferred reload queued off the apply thread (similar to how leader catch-up is handled on the lifecycleExecutor) would reduce pipeline stalls under schema-heavy workloads.
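A sketch of a deferred reload, assuming the reload can be wrapped in a Runnable (the class name is hypothetical). A single-thread executor keeps reloads in submission order. The trade-off is that a later entry depending on the new schema could be applied before the reload finishes, so such entries would need to wait on the returned Future or the reloads would need coalescing.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical: moves slow, order-sensitive work off the apply thread onto one dedicated thread.
final class DeferredReloadSketch implements AutoCloseable {
  // One worker thread preserves submission order, as the apply thread would.
  private final ExecutorService reloadExecutor = Executors.newSingleThreadExecutor(r -> {
    final Thread t = new Thread(r, "schema-reload");
    t.setDaemon(true);
    return t;
  });

  /** Returns immediately so the caller (the apply thread) can move on to the next entry. */
  Future<?> scheduleReload(final Runnable reload) {
    return reloadExecutor.submit(reload);
  }

  @Override
  public void close() {
    reloadExecutor.shutdown();
  }
}
```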

9. Confirm quorumTimeout is sized for the shared batch deadline under ALL quorum

In RaftGroupCommitter.flushBatch(), both the send deadline and the watch deadline are set to now + quorumTimeout, computed once for the whole batch. This is the correct design: a shared deadline prevents N entries from costing N x quorumTimeout. Just confirm that the default quorumTimeout is sized for the largest expected batch, since a slow follower under high concurrency could exhaust the shared deadline for every entry in a flush.
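The shared-deadline pattern can be sketched with CompletableFuture. The deadline is computed once, and each wait only gets the time that remains. SharedDeadline.awaitAll is a hypothetical helper, not the RaftGroupCommitter API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class SharedDeadline {
  /**
   * Waits for every future in the batch against one deadline computed up front, so the
   * whole batch costs at most quorumTimeoutMs rather than N x quorumTimeoutMs.
   * Returns how many futures completed successfully in time.
   */
  public static int awaitAll(List<? extends CompletableFuture<?>> batch, long quorumTimeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(quorumTimeoutMs);
    int completed = 0;
    for (CompletableFuture<?> f : batch) {
      long remaining = Math.max(0, deadline - System.nanoTime());
      try {
        f.get(remaining, TimeUnit.NANOSECONDS); // zero remaining: only already-done futures succeed
        completed++;
      } catch (TimeoutException | ExecutionException e) {
        // Timed out or failed: the caller decides how to report this entry.
      }
    }
    return completed;
  }
}
```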


Minor

10. Missing @SuppressWarnings("unchecked") in ReplicatedDatabase.recordFileChanges()

return (RET) result.get();
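The following sketch shows the narrowest placement, assuming a simplified, hypothetical signature for recordFileChanges(). Putting the annotation on a local variable suppresses only this cast, not every unchecked operation in the method.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

public final class UncheckedCastExample {
  /** Hypothetical shape of a generic wrapper whose result is held as Object. */
  public static <RET> RET recordFileChanges(Callable<?> callback) throws Exception {
    final AtomicReference<Object> result = new AtomicReference<>();
    result.set(callback.call()); // in the real method this would run inside the recording scope
    @SuppressWarnings("unchecked") // safe by contract: the callback produces the caller's RET
    final RET typed = (RET) result.get();
    return typed;
  }
}
```

Because RET is erased, the method itself does not catch a wrong type. The ClassCastException surfaces instead at the caller's assignment.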

11. leaderReady = true on RejectedExecutionException deserves a Javadoc note

In notifyLeaderChanged, when the lifecycle executor rejects the catch-up task, leaderReady is reset to true to avoid blocking reads indefinitely. The inline comment explains this, but it is worth noting in the method's Javadoc that a newly-elected leader whose lifecycle executor is concurrently shutting down will serve reads without the normal freshness guarantee.
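The Javadoc note could read roughly as in the sketch below. The class, constructor, and Runnable parameter are hypothetical. Only notifyLeaderChanged, leaderReady, and the lifecycle executor come from the review.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

public final class LeaderChangeHandler {
  private volatile boolean leaderReady = true;
  private final ExecutorService lifecycleExecutor;

  public LeaderChangeHandler(ExecutorService lifecycleExecutor) {
    this.lifecycleExecutor = lifecycleExecutor;
  }

  /**
   * Called when this node becomes leader. Schedules the catch-up task, which sets
   * leaderReady once the state machine has applied all committed entries.
   * <p>
   * Note: if the lifecycle executor rejects the task (for example because it is shutting
   * down), leaderReady is set back to true so reads are not blocked indefinitely. In that
   * case a newly elected leader serves reads WITHOUT the normal freshness guarantee.
   */
  public void notifyLeaderChanged(Runnable catchUp) {
    leaderReady = false;
    try {
      lifecycleExecutor.execute(() -> {
        try {
          catchUp.run();
        } finally {
          leaderReady = true;
        }
      });
    } catch (RejectedExecutionException e) {
      leaderReady = true; // fail open: availability over freshness, see Javadoc
    }
  }

  public boolean isLeaderReady() {
    return leaderReady;
  }
}
```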


What's Well Done

  • Security throughout: constant-time token comparison, PBKDF2 token derivation, zip-slip/symlink/decompression-bomb protection, path traversal defence in the snapshot handler.
  • Crash-safety: the per-phase crash analysis in applyTransactionEntry(), atomic-rename snapshot swaps, and the completion-marker pattern in SnapshotInstaller are all solid.
  • The MajorityCommittedAllFailedException contract (phase-2 must always run after MAJORITY commit) and recoverLeadershipAfterPhase2Failure step-down retry loop are carefully reasoned.
  • RaftGroupCommitter correctly handles the DISPATCHED-vs-CANCELLED race with a single CAS (EntryState enum).
  • ATTRIBUTIONS.md and NOTICE updated correctly for Apache Ratis.
  • Test coverage is extensive: unit, integration, and Docker e2e tests for chaos scenarios (network partition, packet loss, rolling restart).

Overall this is a very solid PR. The identified issues are mostly minor or documentation-level, with the daemon-thread omission (#1) being the most important to address before merge.

@github-actions
Contributor

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

Lists of 386 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.11.0 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB HA Raft (com.arcadedb:arcadedb-ha-raft:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-ha-raft/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB load tests (com.arcadedb:arcadedb-load-tests:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-load-tests/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)

@claude

claude bot commented Apr 17, 2026

Code Review - Update (apache-ratis branch, be3ba69)

This is a follow-up review of the updated branch. The issues called out in the April 5 review have been addressed:

  • System.out.println in HALog.java - fixed; the class now uses only LogManager
  • Missing proxy timeout - fixed; LeaderProxy now sets a connect timeout (PROXY_CONNECT_TIMEOUT_MS = 5_000) and applies a read timeout from HA_PROXY_READ_TIMEOUT
  • Session-based auth forwarding - fixed; LeaderProxy now converts Bearer tokens to cluster-token + forwarded-user, with the leader side validating via validateClusterForwardedAuth() using a constant-time comparison
  • Data durability concern - well addressed; the two-phase commit design with explicit commit1stPhase / commit2ndPhase, origin-skip, and MajorityCommittedAllFailedException is sound and thoroughly documented

The overall code quality is high. The security additions (PBKDF2 token derivation with 100k iterations, constant-time comparisons, zip-slip + symlink + decompression-bomb protection in SnapshotInstaller, IP allowlisting in PeerAddressAllowlistFilter) are solid.
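The following sketch illustrates the two primitives with the standard JCA APIs: PBKDF2 derivation and a constant-time comparison. Using the cluster name as the salt is an assumption made here to illustrate domain separation, and ClusterTokenProvider may derive its token differently. The 100,000 iteration count is the figure quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public final class ClusterTokenSketch {
  static final int ITERATIONS = 100_000;
  static final int KEY_BITS = 256;

  /** Derives a token from a shared secret; the cluster name as salt is an illustrative assumption. */
  public static String derive(char[] secret, String clusterName) throws GeneralSecurityException {
    PBEKeySpec spec = new PBEKeySpec(secret, clusterName.getBytes(StandardCharsets.UTF_8), ITERATIONS, KEY_BITS);
    try {
      byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded();
      return Base64.getEncoder().encodeToString(key);
    } finally {
      spec.clearPassword(); // wipes the spec's internal copy of the secret
    }
  }

  /** MessageDigest.isEqual does not stop at the first mismatching byte, unlike String.equals. */
  public static boolean matches(String expected, String presented) {
    if (presented == null)
      return false;
    return MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8), presented.getBytes(StandardCharsets.UTF_8));
  }
}
```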


Remaining items worth addressing

1. waitForLeaderReady() fails open on timeout (silent stale reads)

RaftHAServer.java:560-561:

if (!leaderReady)
    LogManager.instance().log(... "waitForLeaderReady timed out after %dms - proceeding with potentially stale leader state" ...);

After a leader election, if the state machine catch-up task does not complete within quorumTimeout, waitForLeaderReady() returns without throwing and the leader starts serving reads from a potentially stale state. For LINEARIZABLE reads this is a correctness concern: the client asked for linearizability and silently gets stale data instead of an error. Consider throwing a ServerIsNotTheLeaderException or a dedicated "leader not ready" error so the client can retry against another node.
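A fail-closed version of the wait might look like the sketch below. LeaderReadiness and LeaderNotReadyException are hypothetical names. A real implementation could throw ServerIsNotTheLeaderException instead so that existing client retry logic applies.

```java
import java.util.concurrent.TimeUnit;

public final class LeaderReadiness {
  /** Hypothetical error telling the client to retry, possibly against another node. */
  public static final class LeaderNotReadyException extends RuntimeException {
    public LeaderNotReadyException(String msg) {
      super(msg);
    }
  }

  private boolean ready; // guarded by this

  public synchronized void markNotReady() {
    ready = false;
  }

  public synchronized void markReady() {
    ready = true;
    notifyAll();
  }

  /** Fails closed: throws rather than serving reads from a leader that has not caught up. */
  public synchronized void waitForLeaderReady(long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!ready) {
      long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs <= 0)
        throw new LeaderNotReadyException("Leader catch-up not finished after " + timeoutMs + "ms; retry");
      wait(remainingMs);
    }
  }
}
```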

2. PeerAddressAllowlistFilter - concurrent DNS re-resolution not guarded

PeerAddressAllowlistFilter.java:87-91:

if (System.currentTimeMillis() - lastResolveMs >= refreshIntervalMs) {
    resolveNow();
    if (allowedIps.get().contains(ip))
        return attrs;
}

lastResolveMs is a plain volatile long and resolveNow() updates allowedIps (AtomicReference) and lastResolveMs separately without synchronization. Under high connection rates from unknown IPs, many threads can pass the time check simultaneously and all call resolveNow() in parallel. This is benign for correctness (DNS lookups are idempotent), but in a K8s environment with many simultaneous pod restarts it can cause a burst of DNS queries. A synchronized(this) around the re-resolution block, or a compareAndSet on the time, would prevent the pile-on.
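The following sketches the compareAndSet variant. Only the thread that wins the CAS on the timestamp re-resolves, while concurrent callers check against the current set. RefreshingAllowlist is a hypothetical class, and the Supplier stands in for the real DNS resolution (for example, InetAddress.getAllByName over the peer host names).

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public final class RefreshingAllowlist {
  private final Supplier<Set<String>> resolver; // stands in for DNS resolution of peer hosts
  private final long refreshIntervalMs;
  private final AtomicReference<Set<String>> allowedIps;
  private final AtomicLong lastResolveMs;

  public RefreshingAllowlist(Supplier<Set<String>> resolver, long refreshIntervalMs) {
    this.resolver = resolver;
    this.refreshIntervalMs = refreshIntervalMs;
    this.allowedIps = new AtomicReference<>(resolver.get());
    this.lastResolveMs = new AtomicLong(System.currentTimeMillis());
  }

  public boolean isAllowed(String ip) {
    if (allowedIps.get().contains(ip))
      return true;
    long last = lastResolveMs.get();
    long now = System.currentTimeMillis();
    // Only the CAS winner re-resolves; losers fall through and use the current set.
    if (now - last >= refreshIntervalMs && lastResolveMs.compareAndSet(last, now))
      allowedIps.set(resolver.get());
    return allowedIps.get().contains(ip);
  }
}
```

Because the timestamp is updated before resolving, a slow DNS lookup does not let a second wave of threads through.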

3. ClusterTokenProvider default-name warning not shown outside production mode

ClusterTokenProvider.java:88-92:

if ("production".equals(configuration.getValueAsString(GlobalConfiguration.SERVER_MODE))
    && "arcadedb".equalsIgnoreCase(clusterName))
    LogManager.instance().log(this, Level.WARNING, ...);

The warning about using the default cluster name (weaker token domain separation) fires only in production mode. A staging or pre-production cluster using the default name gets no feedback. The production-mode gate for the rotating-token warning is appropriate, but the default-name warning should probably fire at all modes since the risk is the same regardless of SERVER_MODE.

4. Bare Thread.sleep() calls in distributed-systems tests

Several integration test files use bare Thread.sleep() to wait for cluster state changes (e.g., RaftQuorumLostIT.java:94 Thread.sleep(3000), BaseRaftHATest has one too). These are the most common source of flaky CI failures in distributed-systems test suites. Where possible, replace these with polling loops that break early when the expected condition becomes true (e.g., awaitCondition(() -> raftServer.getElectionStatus().equals("LEADER"), 5000)). The existing BaseRaftHATest.waitForLeader() pattern is a good model - use it consistently.
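The following is a minimal polling helper along the lines of the awaitCondition(...) example above. The name and signature are assumptions, not an existing ArcadeDB test utility. It returns as soon as the condition holds and backs off between polls.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public final class Await {
  /**
   * Polls until the condition holds or the timeout expires. Returns early on success and
   * throws AssertionError on timeout so the test fails with a clear message.
   */
  public static void awaitCondition(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + Duration.ofMillis(timeoutMs).toNanos();
    long pollMs = 10;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0)
        throw new AssertionError("Condition not met within " + timeoutMs + "ms");
      Thread.sleep(pollMs);
      pollMs = Math.min(pollMs * 2, 200); // back off so slow clusters are not hammered
    }
  }
}
```

The Awaitility library provides the same pattern if an extra test dependency is acceptable.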

5. Minor - constant definitions placed after first use

ReplicatedDatabase.java:342-343 defines STEP_DOWN_MAX_ATTEMPTS = 3 and STEP_DOWN_RETRY_DELAY_MS = 250L near the bottom of the class, after the methods that reference them. Java allows methods to reference fields declared later in the class, but by convention static constants are grouped at the top of the class, before instance fields. This is a readability nit.


Security note (already documented but worth reiterating)

The PR description and PeerAddressAllowlistFilter Javadoc both reference GitHub issue #3890 for mTLS support. Until mTLS or equivalent transport encryption is in place, the cluster-token and snapshot data travel in plaintext over the inter-node HTTP channel. The existing warnings at startup (SECURITY: gRPC Raft port is bound to all interfaces) and during snapshot serving are appropriate; just make sure users see them prominently in the documentation.


Summary

The implementation is significantly more mature than the previous review cycle. Items 1-3 above are the most actionable before merge; the rest are improvement suggestions for follow-up. The Apache Ratis integration, snapshot machinery, and two-phase commit protocol are well-designed and thoroughly tested with a comprehensive suite of unit, integration, and Docker e2e tests.


@claude

claude bot commented Apr 17, 2026

Code Review - Follow-up (Apache Ratis integration)

Thank you for the significant iteration - 30+ fix commits implementing prior review recommendations show real responsiveness. This follow-up focuses on the benchmark concern and remaining open items.


PERFORMANCE - 29x overhead warrants investigation before merge

@lvca noted 53 ops/sec (HA, 3-node) vs 1,546 ops/sec (no-HA) with batch=100. The P99 latency of ~28ms per batch on what is likely a single-machine test suggests suboptimal Ratis configuration rather than pure network cost.

Suggested investigation:

  • raft.server.log.force.sync.num - controls how many entries are batched before an fsync. Raising this (e.g. to 64-128) improves throughput with no quorum-durability cost since the quorum requirement is independent.
  • raft.server.log.segment.size.max - default 8MB causes frequent rotation/checkpoint flushes; 64-128MB reduces overhead.
  • raft.server.log.appender.buffer.byte-limit - confirm pipeline batching is enabled.
  • The benchmark note 'concurrent (3 threads) -> 77 ops/sec' (vs 59 single-thread) confirms Ratis batching works. This is a good sign, but 29x vs no-HA is still worth documenting as the known trade-off before publicising the feature.

If the benchmark ran all 3 nodes on one machine: real multi-machine RTT (1-5ms) typically gives 5-10x better throughput. Please note the hardware setup in the benchmark comment so others can calibrate.


CI FAILURES

Three checks are currently failing on this push:

  • Meterian client scan - flags new CVEs in deps pulled by Apache Ratis (gRPC, protobuf, shaded Netty). Please review and either fix, suppress with justification, or document that shaded Netty is not reachable from network-exposed code paths.
  • Frontend Security Audit + Dependency Review - same root cause (new transitive Ratis deps). Confirm all are Apache 2.0 compatible and no high-severity npm/Maven CVEs affect ArcadeDB's attack surface.

TEST COVERAGE GAPS

Codecov shows critical-path files at very low coverage:

  • SnapshotInstaller.java: 3% (222 lines uncovered) - this is the follower catch-up path. Error paths (download failure, ZIP corruption, partial extraction) are untested. This is the most likely path to cause silent data corruption during a real cluster recovery.
  • KubernetesAutoJoin.java: 0% (85 lines) - K8s auto-join/leave untested.
  • RaftClusterManager.java: 17% (97 lines) - membership change paths largely untested.

Not asking for 80% coverage, but at least one integration test exercising SnapshotInstaller's happy path would materially reduce the risk of shipping a broken recovery path.


STILL OPEN: Snapshot partial-failure cleanup

Commit 18ced57 acknowledged this as 'valid but requires significant refactoring'. The risk: if ZIP extraction fails mid-way in SnapshotInstaller, the partially extracted database dir is left behind and corrupts the node on the next restart. An atomic swap (extract to db.tmp, then rename) fixes this. Please open a GitHub issue to track this before merging so it is not lost.
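The following sketches the extract-then-swap approach using only java.util.zip and java.nio.file. AtomicSnapshotExtractor and the .tmp/.old sibling names are hypothetical. Each rename is atomic on a single file system, but the swap as a whole takes two renames. A crash between them leaves the .old directory behind for startup recovery to restore.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public final class AtomicSnapshotExtractor {

  /** Extracts into a sibling ".tmp" dir, then swaps it in; the live dir is untouched on failure. */
  public static void extractAndSwap(InputStream zipStream, Path targetDir) throws IOException {
    Path target = targetDir.toAbsolutePath().normalize();
    Path tmpDir = target.resolveSibling(target.getFileName() + ".tmp");
    Path oldDir = target.resolveSibling(target.getFileName() + ".old");
    deleteRecursively(tmpDir); // leftover from an earlier failed attempt
    Files.createDirectories(tmpDir);
    try (ZipInputStream zis = new ZipInputStream(zipStream)) {
      int entries = 0;
      ZipEntry e;
      while ((e = zis.getNextEntry()) != null) {
        Path out = tmpDir.resolve(e.getName()).normalize();
        if (!out.startsWith(tmpDir)) // zip-slip guard
          throw new IOException("Entry escapes snapshot dir: " + e.getName());
        if (e.isDirectory()) {
          Files.createDirectories(out);
        } else {
          Files.createDirectories(out.getParent());
          Files.copy(zis, out, StandardCopyOption.REPLACE_EXISTING); // copies the current entry only
        }
        entries++;
      }
      if (entries == 0) // ZipInputStream may return no entries, instead of throwing, on non-ZIP input
        throw new IOException("Snapshot archive is empty or not a ZIP");
    } catch (IOException | RuntimeException ex) {
      deleteRecursively(tmpDir); // never leave a half-extracted directory behind
      throw ex;
    }
    deleteRecursively(oldDir);
    if (Files.exists(target))
      Files.move(target, oldDir, StandardCopyOption.ATOMIC_MOVE);
    Files.move(tmpDir, target, StandardCopyOption.ATOMIC_MOVE);
    deleteRecursively(oldDir);
  }

  static void deleteRecursively(Path p) throws IOException {
    if (!Files.exists(p))
      return;
    try (Stream<Path> walk = Files.walk(p)) {
      List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList()); // children first
      for (Path q : paths)
        Files.delete(q);
    }
  }
}
```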


RESOLVED from prior reviews (good work)

Platform-default charset in group ID derivation, timing-attack on cluster token comparison, System.out.println in HALog, dead code (COMMAND_FORWARD), missing proxy timeouts, ATTRIBUTIONS.md update - all appear addressed based on commit messages.


Summary

The architectural direction is correct and the implementation has matured substantially through multiple fix rounds. Before merge:

  1. Investigate and document the performance overhead (or tune Ratis config to narrow the gap)
  2. Address/suppress CI security failures
  3. Open a GitHub issue for SnapshotInstaller atomic extraction

The 36-test suite plus Docker e2e coverage and the detailed documentation are appreciated for a change of this scope.


@claude

claude bot commented Apr 17, 2026

Code Review - Replace custom Raft protocol with Apache Ratis

This is a substantial and well-motivated change. Replacing a bespoke Raft implementation with Apache Ratis is the right architectural call - Ratis is battle-tested at scale (Apache Ozone, IoTDB) and provides correctness guarantees that would take years to replicate manually. The test coverage is commendable: unit tests, integration tests, and Docker-based E2E tests with network partition and packet-loss injection are exactly what this kind of infrastructure change needs.

Below are observations grouped by severity.


Bugs / Correctness Issues

1. System.out.println in production code (HALog.java)

HALog.java appears to use System.out.println directly instead of the ArcadeDB logger. The project convention (per CLAUDE.md) explicitly requires removing any System.out before finishing. These should be replaced with LogManager.instance().log(...) calls (the same pattern used elsewhere in the server module).

2. Spurious no-op String.replace() in HALog format string

The first replace() call in the format helper is a no-op: it replaces a pattern with itself before the second replacement runs. It changes nothing but confuses readers. Remove it.

3. Missing HTTP timeout in proxyToLeader()

The follower-to-leader proxy path opens an HttpURLConnection (or similar) without calling setConnectTimeout / setReadTimeout. With the JDK defaults (a timeout of 0, which means infinite), the read can block forever, tying up a server thread indefinitely during a leader restart or network partition. The HA_PROXY_READ_TIMEOUT config exists - make sure it is actually applied to the connection object.
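A minimal sketch of applying both timeouts before the proxied request is sent, assuming the proxy uses java.net.HttpURLConnection and that the configured values are available in milliseconds. The class and method names are hypothetical, not the PR's actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper (not the PR's code): opens a proxy connection to the leader with bounded timeouts.
final class LeaderProxyConnection {
  private LeaderProxyConnection() {
  }

  static HttpURLConnection open(final String leaderUrl, final int connectTimeoutMs, final int readTimeoutMs) {
    // HttpURLConnection treats 0 as "no timeout", which is exactly the hang we want to avoid
    if (connectTimeoutMs <= 0 || readTimeoutMs <= 0)
      throw new IllegalArgumentException(
          "Proxy timeouts must be positive, got connect=" + connectTimeoutMs + "ms read=" + readTimeoutMs + "ms");
    try {
      // openConnection() performs no network I/O; the timeouts take effect on connect()/getInputStream()
      final HttpURLConnection conn = (HttpURLConnection) URI.create(leaderUrl).toURL().openConnection();
      conn.setConnectTimeout(connectTimeoutMs); // bounds the TCP connect to the leader
      conn.setReadTimeout(readTimeoutMs);       // bounds each blocking read of the leader's response
      return conn;
    } catch (final IOException e) {
      throw new UncheckedIOException("Cannot open connection to leader " + leaderUrl, e);
    }
  }
}
```

When either limit is exceeded the JDK throws java.net.SocketTimeoutException, which the proxy can translate into an error response for the client (for example a 503 or 504) instead of holding the server thread.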

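4. maxRetry parameter reassigned inside the retry loop

In the retry loop, the caller-supplied maxRetry value is decremented in place. Java passes arguments by value, so the caller's variable is not affected, but from that point on the parameter no longer holds the configured limit (for example in a final "failed after N retries" message), which makes the method harder to reason about. Use a local counter instead: int remaining = maxRetry; and decrement remaining.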
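A sketch of the local-counter version. The helper name and the use of a Supplier are hypothetical, and maxRetry is assumed to mean the total number of attempts. Declaring the parameter final lets the compiler reject any future in-place decrement:

```java
import java.util.function.Supplier;

// Hypothetical retry helper: maxRetry is treated as the total number of attempts and is never reassigned.
final class RetryLoop {
  private RetryLoop() {
  }

  static <T> T withRetry(final Supplier<T> action, final int maxRetry) {
    if (maxRetry < 1)
      throw new IllegalArgumentException("maxRetry must be at least 1, got " + maxRetry);

    RuntimeException lastFailure = null;
    // 'remaining' is the loop's own counter, so maxRetry still holds the configured limit below
    for (int remaining = maxRetry; remaining > 0; remaining--) {
      try {
        return action.get();
      } catch (final RuntimeException e) {
        lastFailure = e; // remember the most recent failure and try again
      }
    }
    throw new IllegalStateException("Operation failed after " + maxRetry + " attempts", lastFailure);
  }
}
```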


Security Issues

5. Timing-attack vulnerability in cluster token comparison (ClusterTokenProvider.java)

Token equality is checked with String.equals() (or ==/Arrays.equals() on raw bytes), which returns as soon as it finds the first mismatching character. For any security-sensitive secret comparison, use MessageDigest.isEqual(expectedBytes, actualBytes) instead. Its running time does not depend on where the inputs differ, which prevents timing side-channel attacks on the cluster shared secret.

Example fix:

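import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Instead of (String.equals() returns at the first differing character):
if (!expected.equals(provided)) { ... }

// Use (running time does not depend on where the tokens differ):
if (!MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8),
                          provided.getBytes(StandardCharsets.UTF_8))) { ... }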

6. User enumeration in auth error messages

Some auth failure responses include the username in the error message (e.g. "User 'foo' not found"). This allows attackers to enumerate valid usernames by comparing response bodies. Return a generic "Invalid credentials" message for all authentication failures - distinguish the cause only in server-side logs.
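One way to structure this is to log the precise cause server-side and map every cause to the same client message. The enum values and class name below are hypothetical, and java.util.logging stands in for the server's LogManager only to keep the sketch self-contained:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: every authentication failure cause maps to the same client-facing message.
final class AuthErrors {
  // Assumed set of failure causes, for illustration only
  enum Cause { UNKNOWN_USER, WRONG_PASSWORD, USER_DISABLED }

  static final String CLIENT_MESSAGE = "Invalid credentials";
  private static final Logger LOG = Logger.getLogger(AuthErrors.class.getName());

  private AuthErrors() {
  }

  static String clientMessage(final Cause cause, final String username) {
    // The precise cause and the username go to the server-side log only
    LOG.log(Level.INFO, "Authentication failed for user ''{0}'': {1}", new Object[] { username, cause });
    // The response body is identical for every cause, so it cannot reveal which usernames exist
    return CLIENT_MESSAGE;
  }
}
```

The HTTP status code should be the same for every cause as well, otherwise it leaks the same information the message no longer does.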


Data Durability Concern

7. Leader applies locally before replication quorum

In ReplicatedDatabase, the current sequence is: (1) prepare the WAL delta, (2) submit it to Raft, (3) apply it locally. The concern is whether the local apply can happen before Ratis has confirmed that a majority has committed the entry. If the leader crashes after applying locally but before a majority of followers have persisted the entry, the old leader's files contain a transaction the rest of the cluster never committed: the new leader does not have it, so the old leader diverges from the cluster when it rejoins and the write is lost for any client that observed it.

Please verify (and document in a comment) that the leader's local apply happens only inside ArcadeDBStateMachine.applyTransaction(), i.e. the Ratis StateMachine.applyTransaction() callback, which Ratis invokes only after the entry has been committed by a majority. If the origin-skip optimization bypasses this by applying on the leader before the callback fires, that path needs careful review.


Code Quality / Maintainability

8. 39 new GlobalConfiguration entries with minimal validation

Many new HA_* config keys accept raw strings or integers with no range checks (e.g. HA_ELECTION_TIMEOUT_MIN > HA_ELECTION_TIMEOUT_MAX is silently accepted, HA_QUORUM accepts arbitrary strings). Consider adding a validate() call (following the pattern used by other modules) that catches obviously invalid combinations at startup.
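A sketch of such a startup check for the two examples above. The method signature and the set of accepted quorum values are assumptions for illustration; the real set should match what HA_QUORUM actually supports:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical startup check; parameter names mirror the HA_* settings mentioned above.
final class HaConfigValidator {
  // Assumption: the accepted quorum values; adjust to whatever HA_QUORUM actually supports
  static final List<String> ALLOWED_QUORUMS = List.of("majority", "all");

  private HaConfigValidator() {
  }

  /** Returns every problem found, so an operator can fix them all in one restart. */
  static List<String> validate(final long electionTimeoutMinMs, final long electionTimeoutMaxMs, final String quorum) {
    final List<String> errors = new ArrayList<>();
    if (electionTimeoutMinMs <= 0)
      errors.add("HA_ELECTION_TIMEOUT_MIN must be positive, got " + electionTimeoutMinMs);
    if (electionTimeoutMaxMs < electionTimeoutMinMs)
      errors.add("HA_ELECTION_TIMEOUT_MAX (" + electionTimeoutMaxMs + ") must not be lower than HA_ELECTION_TIMEOUT_MIN ("
          + electionTimeoutMinMs + ")");
    if (quorum == null || !ALLOWED_QUORUMS.contains(quorum.trim().toLowerCase(Locale.ROOT)))
      errors.add("HA_QUORUM must be one of " + ALLOWED_QUORUMS + ", got '" + quorum + "'");
    return errors;
  }
}
```

At startup the server would refuse to start when the list is non-empty, e.g. by throwing an IllegalStateException built from String.join("; ", errors).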

9. SnapshotHttpHandler - unauthenticated endpoint risk

The snapshot download endpoint must be protected by the cluster token. Verify it checks ClusterTokenProvider.validate() before streaming the ZIP. If a follower's snapshot endpoint is reachable from outside the cluster network (e.g. via a misconfigured load balancer), an unauthenticated attacker could download a full database ZIP. Add a clear comment in the handler documenting this requirement.

10. RaftGroupCommitter background thread not named

The background flusher Thread should be named (e.g. "arcadedb-raft-group-committer") to make heap dumps and thread dumps readable. Use Thread t = new Thread(runnable, "arcadedb-raft-group-committer") or a ThreadFactory.
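A small ThreadFactory that produces named threads, as a sketch. The class name is illustrative, and the numeric suffix is an extra choice that keeps names unique if the flusher is ever recreated:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Gives every thread a recognizable name so thread dumps show who owns it.
final class NamedThreadFactory implements ThreadFactory {
  private final String prefix;
  private final AtomicInteger sequence = new AtomicInteger();

  NamedThreadFactory(final String prefix) {
    this.prefix = prefix;
  }

  @Override
  public Thread newThread(final Runnable task) {
    // e.g. "arcadedb-raft-group-committer-1"; the counter keeps names unique across recreations
    return new Thread(task, prefix + "-" + sequence.incrementAndGet());
  }
}
```

Usage: Executors.newSingleThreadExecutor(new NamedThreadFactory("arcadedb-raft-group-committer")) gives the flusher a name that shows up directly in jstack output.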

11. TODO section in docs/arcadedb-ha-26.4.1.md

The documentation contains a TODO section. Per project conventions, open items should be tracked as GitHub issues rather than left in committed docs. Convert these to issues before merging.


Attribution / Licensing

12. ATTRIBUTIONS.md not updated for Apache Ratis

Per the project's CLAUDE.md, every new dependency must have an entry in ATTRIBUTIONS.md, and if the dependency ships a NOTICE file (Apache Ratis does), those notices must be incorporated into the main NOTICE file. Apache Ratis 3.2.2 pulls in a substantial transitive graph (gRPC, Netty, Protobuf shaded, etc.) - please audit the full transitive closure and update both files accordingly.


Minor / Nits

  • HALog logging levels (0-3) are not documented in the main configuration reference - add a note alongside the HA_LOG_VERBOSE entry in GlobalConfiguration.java.
  • HealthMonitor should log at WARNING (not INFO) when it triggers a Ratis restart, so operators notice the event in production log streams.
  • KubernetesAutoJoin - consider what happens when the DNS suffix is empty (HA_K8S_DNS_SUFFIX = ""): does the code degrade gracefully or throw a confusing exception?
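For the KubernetesAutoJoin point, one option is to fail fast with an actionable message. How the PR actually builds peer addresses is not shown here, so this sketch assumes the peer host is the pod name followed by HA_K8S_DNS_SUFFIX; the helper name is hypothetical, and falling back to the bare pod name (relying on the resolver search path) would be the alternative "degrade gracefully" choice:

```java
// Hypothetical helper: assumes the peer host is podName + "." + HA_K8S_DNS_SUFFIX.
final class K8sPeerAddress {
  private K8sPeerAddress() {
  }

  static String peerHost(final String podName, final String dnsSuffix) {
    if (podName == null || podName.isBlank())
      throw new IllegalArgumentException("Pod name is empty; cannot build the HA peer address");

    final String suffix = dnsSuffix == null ? "" : dnsSuffix.strip();
    if (suffix.isEmpty())
      // Fail fast with an actionable message instead of producing "pod-0." or a later UnknownHostException
      throw new IllegalStateException(
          "HA_K8S_DNS_SUFFIX is empty: set it to the headless service domain (e.g. 'arcadedb.default.svc.cluster.local')");

    // Accept the suffix with or without a leading dot
    final String normalized = suffix.startsWith(".") ? suffix.substring(1) : suffix;
    return podName.strip() + "." + normalized;
  }
}
```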

Overall this is a strong PR. The security issues (5, 6, 9) and the data durability question (7) are the most important to address before merge. The remaining items are improvements that can follow in a subsequent PR if needed.

@codacy-production

codacy-production bot commented Apr 17, 2026

Not up to standards ⛔

🔴 Issues 2 high · 12 medium · 34 minor

Alerts:
⚠ 48 issues (≤ 0 issues of at least minor severity)

Results:
48 new issues

Category | Results
BestPractice | 11 medium
Documentation | 11 minor
ErrorProne | 2 high
CodeStyle | 23 minor
Complexity | 1 medium


🟢 Metrics 1261 complexity

Metric | Results
Complexity | 1261


🟢 Coverage 64.65% diff coverage · -10.46% coverage variation

Metric | Results
Coverage variation | -10.46%
Diff coverage | 64.65%


Coverage variation details
Commit | Coverable lines | Covered lines | Coverage
Common ancestor commit (e076207) | 117023 | 86128 | 73.60%
Head commit (88cea6b) | 149282 (+32259) | 94262 (+8134) | 63.14% (-10.46%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Scope | Coverable lines | Covered lines | Diff coverage
Pull request (#3798) | 3972 | 2568 | 64.65%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


@github-actions

📜 License Compliance Check

✅ License check passed. See artifacts for full report.

License Summary (first 50 lines)

Lists of 386 third-party dependencies.
     (Apache License 2.0) LZ4 Java Compression (at.yawk.lz4:lz4-java:1.11.0 - https://github.com/yawkat/lz4-java)
     (EPL 2.0) (GNU Lesser General Public License) Logback Classic Module (ch.qos.logback:logback-classic:1.5.32 - http://logback.qos.ch/logback-classic)
     (EPL 2.0) (GNU Lesser General Public License) Logback Core Module (ch.qos.logback:logback-core:1.5.32 - http://logback.qos.ch/logback-core)
     (Apache 2) ArcadeDB BOLT Protocol (com.arcadedb:arcadedb-bolt:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-bolt/)
     (Apache 2) ArcadeDB Console (com.arcadedb:arcadedb-console:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-console/)
     (Apache 2) ArcadeDB Engine (com.arcadedb:arcadedb-engine:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-engine/)
     (Apache 2) ArcadeDB GraphQL (com.arcadedb:arcadedb-graphql:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-graphql/)
     (Apache 2) ArcadeDB Gremlin (com.arcadedb:arcadedb-gremlin:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-gremlin/)
     (Apache 2) ArcadeDB gRPC Stubs (com.arcadedb:arcadedb-grpc:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc/)
     (Apache 2) ArcadeDB gRPC Client (com.arcadedb:arcadedb-grpc-client:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpc-client/)
     (Apache 2) ArcadeDB gRpcW (com.arcadedb:arcadedb-grpcw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-grpcw/)
     (Apache 2) ArcadeDB HA Raft (com.arcadedb:arcadedb-ha-raft:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-ha-raft/)
     (Apache 2) ArcadeDB Integration (com.arcadedb:arcadedb-integration:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-integration/)
     (Apache 2) ArcadeDB load tests (com.arcadedb:arcadedb-load-tests:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-load-tests/)
     (Apache 2) ArcadeDB Metrics (com.arcadedb:arcadedb-metrics:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-metrics/)
     (Apache 2) ArcadeDB MongoDB Wire Protocol (com.arcadedb:arcadedb-mongodbw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-mongodbw/)
     (Apache 2) ArcadeDB Network (com.arcadedb:arcadedb-network:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-network/)
     (Apache 2) ArcadeDB PostgresW (com.arcadedb:arcadedb-postgresw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-postgresw/)
     (Apache 2) ArcadeDB RedisW (com.arcadedb:arcadedb-redisw:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-redisw/)
     (Apache 2) ArcadeDB Server (com.arcadedb:arcadedb-server:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-server/)
     (Apache 2) ArcadeDB Studio (com.arcadedb:arcadedb-studio:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-studio/)
     (Apache 2) ArcadeDB Test Utils (com.arcadedb:arcadedb-test-utils:26.4.1-SNAPSHOT - https://arcadedata.com/arcadedb-test-utils/)
     (Apache License 2.0) HPPC Collections (com.carrotsearch:hppc:0.7.1 - http://labs.carrotsearch.com/hppc.html/hppc)
     (Apache License 2.0) Metrics Core (com.codahale.metrics:metrics-core:3.0.2 - http://metrics.codahale.com/metrics-core/)
     (The Apache License, Version 2.0) com.conversantmedia:disruptor (com.conversantmedia:disruptor:1.2.21 - https://github.com/conversant/disruptor)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.20 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.21 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.1 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) Jackson-core (com.fasterxml.jackson.core:jackson-core:2.21.2 - https://github.com/FasterXML/jackson-core)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.1 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.21.2 - https://github.com/FasterXML/jackson)
     (Apache License 2.0) Jackson-dataformat-YAML (com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.21.1 - https://github.com/FasterXML/jackson-dataformats-text)
     (Apache License 2.0) Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.21.1 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
     (Apache License 2.0) Caffeine cache (com.github.ben-manes.caffeine:caffeine:2.3.1 - https://github.com/ben-manes/caffeine)
     (Apache License 2.0) docker-java-api (com.github.docker-java:docker-java-api:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport (com.github.docker-java:docker-java-transport:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache License 2.0) docker-java-transport-zerodep (com.github.docker-java:docker-java-transport-zerodep:3.7.1 - https://github.com/docker-java/docker-java)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) btf (com.github.java-json-tools:btf:1.3 - https://github.com/java-json-tools/btf)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils (com.github.java-json-tools:jackson-coreutils:2.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) jackson-coreutils-equivalence (com.github.java-json-tools:jackson-coreutils-equivalence:1.0 - https://github.com/java-json-tools/jackson-coreutils)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-patch (com.github.java-json-tools:json-patch:1.13 - https://github.com/java-json-tools/json-patch)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-core (com.github.java-json-tools:json-schema-core:1.2.14 - https://github.com/java-json-tools/json-schema-core)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) json-schema-validator (com.github.java-json-tools:json-schema-validator:2.2.14 - https://github.com/java-json-tools/json-schema-validator)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) msg-simple (com.github.java-json-tools:msg-simple:1.2 - https://github.com/java-json-tools/msg-simple)
     (Apache Software License, version 2.0) (Lesser General Public License, version 3 or greater) uri-template (com.github.java-json-tools:uri-template:0.10 - https://github.com/java-json-tools/uri-template)
     (Apache License 2.0) (GNU Lesser General Public License) javaparser-core (com.github.javaparser:javaparser-core:3.26.3 - https://github.com/javaparser/javaparser-core)
     (Apache License 2.0) JCIP Annotations under Apache License (com.github.stephenc.jcip:jcip-annotations:1.0-1 - http://stephenc.github.com/jcip-annotations)
     (Apache License 2.0) Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/)

@claude

claude bot commented Apr 17, 2026

Code Review - Follow-up (2026-04-17, latest commit: d52f160)

This is a follow-up review pass after the extensive prior iteration. The PR has come a long way and most of the earlier concerns have been addressed. Here is the current status.

Confirmed Fixed (latest commits)

  • Unmanaged watchdog thread: fixed in d52f160 ("managed watchdog in a registered thread")
  • READ_CONSISTENCY_CONTEXT ThreadLocal leak: fixed in 159f775
  • Snapshot recovery race condition: fixed in 73a5b2b
  • IPv6 parsing: fixed in fda4b8b
  • Many prior code review recommendations applied across multiple commits

Still Open

1. TOCTOU race in SnapshotHttpHandler (NPE risk)

The existsDatabase() guard at the start of handleRequestInternal() is checked before exchange.startBlocking() and before response headers are sent. By the time server.getDatabase(databaseName) is called after those operations, the database could have been dropped, making getDatabase() return null. The subsequent db.getEmbedded() call then throws NPE with no meaningful error response.

Fix: add a null guard before unwrapping:

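final DatabaseInternal db = server.getDatabase(databaseName);
// Resolve and null-check the database before the response is started:
// once the headers have been sent, the status code can no longer be changed.
if (db == null) {
  exchange.setStatusCode(404);
  return;
}
db.getEmbedded()...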

2. arcadedb.ha.replicationLagWarning - unit not documented

The configuration description says "Raft log gap threshold for lag warnings (0=disabled)" but does not state the unit of N. An operator who assumes milliseconds might set it to 1000 and rarely, if ever, see a warning, because a replica 1000 log entries behind is already far behind. Suggest appending: "A value of N means the replica is N committed log entries behind the leader."

3. ClusterMonitor.getReplicaLags() - negative values possible on startup

currentCommitIndex - entry.getValue() can be negative during startup when the local commit index has not yet caught up to the peer's last-seen index. This produces negative lag values in the Studio dashboard and could trigger the warning threshold incorrectly. A Math.max(0, ...) guard is sufficient.
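A sketch of the clamped computation, together with a threshold check that follows the "0=disabled" convention from issue 2. The class, method and map names are hypothetical; lag is expressed in committed log entries:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the clamped lag computation; lag is measured in committed log entries.
final class ReplicaLags {
  private ReplicaLags() {
  }

  static Map<String, Long> compute(final long currentCommitIndex, final Map<String, Long> lastSeenIndexByPeer) {
    final Map<String, Long> lags = new LinkedHashMap<>();
    for (final Map.Entry<String, Long> entry : lastSeenIndexByPeer.entrySet())
      // During startup the local commit index can trail a peer's last-seen index: report 0, not a negative lag
      lags.put(entry.getKey(), Math.max(0L, currentCommitIndex - entry.getValue()));
    return lags;
  }

  /** A threshold of 0 disables the warning, matching the "0=disabled" convention of replicationLagWarning. */
  static boolean shouldWarn(final long lagEntries, final long thresholdEntries) {
    return thresholdEntries > 0 && lagEntries > thresholdEntries;
  }
}
```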

4. printClusterConfiguration() - String.hashCode() used for change detection

Using hash codes to deduplicate WARN log output is unreliable due to hash collisions. Two different cluster configurations could have the same hash code, suppressing a legitimate warning. For a safety-critical log, cache the last-printed string and compare with Objects.equals() instead.
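A sketch of the string-based deduplication. The class and method names are hypothetical, the method is synchronized on the assumption that it may be called from several threads, and java.util.logging stands in for the server logger. The strings "Aa" and "BB" share the hash code 2112, so a hashCode()-based check would wrongly suppress the second one:

```java
import java.util.Objects;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: remember the exact text last logged instead of its hashCode().
final class ClusterConfigLogger {
  private static final Logger LOG = Logger.getLogger(ClusterConfigLogger.class.getName());
  private String lastPrinted; // exact text of the last configuration logged, or null if none yet

  /** Logs the configuration only when its text differs from the last one logged; returns true if it logged. */
  synchronized boolean printIfChanged(final String configuration) {
    if (Objects.equals(configuration, lastPrinted))
      return false; // same text as last time: suppress the duplicate warning
    lastPrinted = configuration;
    LOG.log(Level.WARNING, "Cluster configuration: {0}", configuration);
    return true;
  }
}
```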

Overall Assessment

The PR is in good shape. The core architecture (three-phase commit, group commit batching, read consistency levels, K8s auto-join, crash-safe snapshot swap) is sound. The security fundamentals (PBKDF2 cluster token, constant-time comparison, zip-slip protection, decompression bomb guards) are well-implemented. The test matrix - unit, integration, Docker e2e - is excellent.

Issues 3 and 4 are minor. Issue 1 is a correctness concern that should be fixed before merge. Issue 2 is a documentation improvement that would help operators.

Once issue 1 is addressed, this is ready to merge.
