feat(feature-store): add AI/ML Feature Store use case #60
Conversation
Unified ML feature store serving three domains (fraud scoring, product recommendations, predictive maintenance) from a single ArcadeDB instance. Demonstrates graph, vector, time-series, and document features with 11 query patterns across curl, Java, and JavaScript (PostgreSQL protocol).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
**Summary of Changes** (Gemini Code Assist)

This pull request introduces a significant new AI/ML Feature Store use case, showcasing ArcadeDB's multi-model capabilities as a consolidated platform for machine learning feature management. It demonstrates how a single database can efficiently handle the diverse data requirements of fraud detection, product recommendations, and predictive maintenance. The implementation includes comprehensive examples with various query patterns and client integrations, supported by detailed documentation and automated CI configuration.
**Code Review — feat(feature-store): AI/ML Feature Store use case**

Overall this is a well-structured addition that follows the repo conventions closely. The schema design, data seeding, and multi-runner CI setup are all solid. A few issues worth addressing before merge:

**Bugs / Correctness**

1. **Query 9 is not recursive — only fetches direct dependents**

```sql
MATCH {type: Equipment, where: (equipmentId = 'eq1')}
  .in('DEPENDS_ON'){as: dep}
```

There is no `while:` condition, so the traversal stops at direct dependents instead of walking the full chain.

2. **Query 8 hardcodes the preference vector — not actually personalized**

The design doc specified a Cypher query that joins the user's stored preference vector; the implementation hardcodes it:

```sql
ORDER BY vectorNeighbors('Product[embedding]', [0.9, 0.1, 0.1, 0.1], 20) DESC
```

This only works correctly for u1, whose stored vector happens to match the hardcoded one.

3. **Misleading comment**

```
// ── Query 8: Personalized Ranking (Cypher via {cypher} prefix) ──────────────
```

The actual implementation uses plain SQL (no `{cypher}` prefix).

4. **Query 11 creates duplicate FeatureSnapshot rows**

The INSERT in step 4 has no guard against re-insertion:

```sql
INSERT INTO FeatureSnapshot SET entityId = 'a4', ...
```

Running the demo more than once accumulates duplicate snapshot rows.

**Minor Issues**

5. **`command()` shadows the bash builtin of the same name**

```shell
command() {
  local lang="$1" cmd="$2"
  ...
}
```

6. **No package-lock.json**

The npm cache key hashes a lockfile that is not committed.

7. **CI action version comment may be incorrect**

```yaml
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
```

The version comments may not match the pinned SHAs; worth verifying against the actions' release tags.

**Nits**

**Summary**

The core architecture, SQL patterns, and CI structure are all consistent with the rest of the repo. The two correctness issues (#1 and #2) are the main things to fix before merge.

🤖 Generated with Claude Code
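The shadowing in item 5 is easy to reproduce; a minimal bash sketch (the function body here is hypothetical):

```shell
#!/usr/bin/env bash
# Defining a function named command() shadows the regular builtin of the
# same name, so any later use of `command -v` silently invokes the function.
command() { echo "shadowed"; }
command -v ls          # invokes the function: prints "shadowed"

unset -f command       # remove the function to restore the builtin
command -v ls          # now prints the resolved path, e.g. /bin/ls
```

Renaming the function (e.g. to `send_command()`) avoids the collision entirely.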
Code Review
This pull request introduces a comprehensive and well-structured AI/ML Feature Store use case, complete with data, queries, and demo applications in shell, Java, and JavaScript. The addition of detailed design and implementation documents is commendable. My review identified a recurring correctness issue in one of the queries (Query 9) across all three implementations, where a property was being incorrectly accessed from a vertex instead of an edge. I've provided fixes for this. Additionally, I found a bug in the JavaScript implementation related to an unescaped variable in a query string and suggested a security enhancement for the docker-compose.yml file to avoid hardcoding credentials.
```sql
SELECT name, failureRate, criticality
FROM (
  MATCH {type: Equipment, where: (equipmentId = 'eq1')}
    .in('DEPENDS_ON'){as: dep}
  RETURN dep.name AS name, dep.failureRate AS failureRate,
         dep.out('DEPENDS_ON')[0].criticality AS criticality
)""";
```
This query to find the equipment dependency chain is incorrect. The expression dep.out('DEPENDS_ON')[0].criticality attempts to get the criticality property from a vertex, but this property is on the DEPENDS_ON edge. This will return null for criticality. Furthermore, the query only finds direct dependencies, while the description implies finding all downstream equipment. A traversal is needed. The query should be updated to perform a traversal and correctly retrieve criticality from the edge. The printing logic will also need to be updated to handle the new depth column.
```sql
SELECT
  dep.name AS name,
  dep.failureRate AS failureRate,
  e.criticality AS criticality,
  $depth AS depth
FROM (
  MATCH {type: Equipment, where: (equipmentId = 'eq1')}
    .inE('DEPENDS_ON'){while: ($depth < 5), as: e}
    .outV(){as: dep}
  RETURN dep, e, $depth
)
ORDER BY depth ASC
```

```sql
  .both('TRANSFERRED'){while: ($depth < 4), as: hop}
  {type: Account, where: (flagged = true), as: flagged}
  RETURN flagged.accountId AS flaggedId, $depth AS depth
```
The $depth variable in this MATCH query needs to be escaped in the JavaScript template literal. Without escaping, it will be interpreted as a variable for interpolation, which will likely cause an error or unexpected behavior. It should be written as \$depth.
Suggested change:

```diff
-  .both('TRANSFERRED'){while: ($depth < 4), as: hop}
-  {type: Account, where: (flagged = true), as: flagged}
-  RETURN flagged.accountId AS flaggedId, $depth AS depth
+  .both('TRANSFERRED'){while: (\$depth < 4), as: hop}
+  {type: Account, where: (flagged = true), as: flagged}
+  RETURN flagged.accountId AS flaggedId, \$depth AS depth
```
```javascript
const sql = `
  SELECT name, failureRate, criticality
  FROM (
    MATCH {type: Equipment, where: (equipmentId = 'eq1')}
      .in('DEPENDS_ON'){as: dep}
    RETURN dep.name AS name, dep.failureRate AS failureRate,
           dep.out('DEPENDS_ON')[0].criticality AS criticality
  )`;

const res = await client.query(sql);
for (const row of res.rows) {
  console.log(`  ${String(row.name).padEnd(20)} | failureRate: ${row.failurerate} | criticality: ${row.criticality}`);
}
```
This query to find the equipment dependency chain is incorrect. The expression dep.out('DEPENDS_ON')[0].criticality attempts to get the criticality property from a vertex, but this property is on the DEPENDS_ON edge. This will return null for criticality. Furthermore, the query only finds direct dependencies, while the description implies finding all downstream equipment. A traversal is needed. The suggested code fixes the query to perform a traversal and correctly retrieve criticality from the edge, and also updates the printing logic. Note that $depth must be escaped as \$depth in template literals.
```javascript
const sql = `
  SELECT
    dep.name AS name,
    dep.failureRate AS failureRate,
    e.criticality AS criticality,
    \$depth AS depth
  FROM (
    MATCH {type: Equipment, where: (equipmentId = 'eq1')}
      .inE('DEPENDS_ON'){while: (\$depth < 5), as: e}
      .outV(){as: dep}
    RETURN dep, e, \$depth
  )
  ORDER BY depth ASC`;
const res = await client.query(sql);
for (const row of res.rows) {
  console.log(`  ${String(row.name).padEnd(20)} | failureRate: ${row.failurerate} | criticality: ${row.criticality} | depth: ${row.depth}`);
}
```

```sql
SELECT name, failureRate, criticality
FROM (
  MATCH {type: Equipment, where: (equipmentId = 'eq1')}
    .in('DEPENDS_ON'){as: dep}
  RETURN dep.name AS name, dep.failureRate AS failureRate,
         dep.out('DEPENDS_ON')[0].criticality AS criticality
)
```
This query to find the equipment dependency chain is incorrect. The expression dep.out('DEPENDS_ON')[0].criticality attempts to get the criticality property from a vertex, but this property is on the DEPENDS_ON edge. This will return null for criticality. Furthermore, the query only finds direct dependencies, while the description implies finding all downstream equipment. A traversal is needed. Here is a corrected query that performs a traversal and correctly retrieves the criticality from the edge.
Suggested change:

```diff
-SELECT name, failureRate, criticality
-FROM (
-  MATCH {type: Equipment, where: (equipmentId = 'eq1')}
-    .in('DEPENDS_ON'){as: dep}
-  RETURN dep.name AS name, dep.failureRate AS failureRate,
-         dep.out('DEPENDS_ON')[0].criticality AS criticality
-)
+SELECT
+  dep.name AS name,
+  dep.failureRate AS failureRate,
+  e.criticality AS criticality,
+  \$depth AS depth
+FROM (
+  MATCH {type: Equipment, where: (equipmentId = 'eq1')}
+    .inE('DEPENDS_ON'){while: (\$depth < 5), as: e}
+    .outV(){as: dep}
+  RETURN dep, e, \$depth
+)
+ORDER BY depth ASC
```
```yaml
schedule:
  interval: weekly
  day: "sunday"
open-pull-requests-limit: 20
```
The open-pull-requests-limit for Maven is set to 20, which is higher than the limit of 10 for other package ecosystems in this file. While this is not necessarily an issue, it's worth ensuring this is intentional. A higher limit can sometimes lead to a large number of open PRs from dependabot, which might be overwhelming to manage. If the number of Maven dependencies is large and they are updated frequently, this might be justified.
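If the higher limit does prove noisy, one way to tame it without lowering the cap is Dependabot's `groups` feature, which batches related bumps into a single PR. A sketch — the directory path and group name are assumptions, not this repo's actual values:

```yaml
# Sketch: batch Maven minor/patch bumps into one dependabot PR.
- package-ecosystem: maven
  directory: /feature-store/java     # assumed path
  schedule:
    interval: weekly
    day: "sunday"
  open-pull-requests-limit: 20
  groups:
    maven-minor-patch:               # hypothetical group name
      update-types: ["minor", "patch"]
```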
```sql
SELECT flaggedId, depth
FROM (
  MATCH {type: Account, where: (accountId = 'a4')}
    .both('TRANSFERRED'){while: ($depth < 4), as: hop}
    {type: Account, where: (flagged = true), as: flagged}
  RETURN flagged.accountId AS flaggedId, $depth AS depth
)
ORDER BY depth ASC
LIMIT 1
```
The current MATCH query is a valid way to find the shortest path, but using the dedicated shortestPath() function is generally more idiomatic and can be more performant for this specific task. It's designed to efficiently find the shortest path between nodes. Consider using it to better demonstrate best practices.
Suggested change:

```diff
-SELECT flaggedId, depth
-FROM (
-  MATCH {type: Account, where: (accountId = 'a4')}
-    .both('TRANSFERRED'){while: ($depth < 4), as: hop}
-    {type: Account, where: (flagged = true), as: flagged}
-  RETURN flagged.accountId AS flaggedId, $depth AS depth
-)
-ORDER BY depth ASC
-LIMIT 1
+SELECT path.size() - 1 AS depth, path[path.size()-1].accountId AS flaggedId
+FROM (
+  SELECT shortestPath(
+    (SELECT FROM Account WHERE accountId = 'a4'),
+    (SELECT FROM Account WHERE flagged = true),
+    'BOTH',
+    'TRANSFERRED'
+  ) AS path
+)
+WHERE path IS NOT NULL
+LIMIT 1
```
```sql
CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a4') TO (SELECT FROM Account WHERE accountId = 'a5') SET amount = 1500.00, recordedAt = '2026-03-11 04:00:00';
CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a6') TO (SELECT FROM Account WHERE accountId = 'a4') SET amount = 4000.00, recordedAt = '2026-03-12 01:00:00';
CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a3') TO (SELECT FROM Account WHERE accountId = 'a1') SET amount = 100.00, recordedAt = '2026-03-05 14:00:00';
CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a2') TO (SELECT FROM Account WHERE accountId = 'a1') SET amount = 250.00, recordedAt = '2026-03-06 16:00:00';
CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a5') TO (SELECT FROM Account WHERE accountId = 'a6') SET amount = 1800.00, recordedAt = '2026-03-13 05:00:00';
CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a1') TO (SELECT FROM Account WHERE accountId = 'a4') SET amount = 300.00, recordedAt = '2026-03-08 12:00:00';
-- LINKED_DEVICE edges (a4 and a5 share devices with flagged a6)
CREATE EDGE LINKED_DEVICE FROM (SELECT FROM Account WHERE accountId = 'a4') TO (SELECT FROM Account WHERE accountId = 'a6') SET deviceId = 'dev-001';
```
The query for finding the equipment dependency chain is incorrect. The expression dep.out('DEPENDS_ON')[0].criticality attempts to get the criticality property from a vertex, but it's defined on the DEPENDS_ON edge. This will result in a null value. The query should be updated to correctly retrieve criticality from the edge.
Suggested change:

```diff
-CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a4') TO (SELECT FROM Account WHERE accountId = 'a5') SET amount = 1500.00, recordedAt = '2026-03-11 04:00:00';
-CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a6') TO (SELECT FROM Account WHERE accountId = 'a4') SET amount = 4000.00, recordedAt = '2026-03-12 01:00:00';
-CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a3') TO (SELECT FROM Account WHERE accountId = 'a1') SET amount = 100.00, recordedAt = '2026-03-05 14:00:00';
-CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a2') TO (SELECT FROM Account WHERE accountId = 'a1') SET amount = 250.00, recordedAt = '2026-03-06 16:00:00';
-CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a5') TO (SELECT FROM Account WHERE accountId = 'a6') SET amount = 1800.00, recordedAt = '2026-03-13 05:00:00';
-CREATE EDGE TRANSFERRED FROM (SELECT FROM Account WHERE accountId = 'a1') TO (SELECT FROM Account WHERE accountId = 'a4') SET amount = 300.00, recordedAt = '2026-03-08 12:00:00';
--- LINKED_DEVICE edges (a4 and a5 share devices with flagged a6)
-CREATE EDGE LINKED_DEVICE FROM (SELECT FROM Account WHERE accountId = 'a4') TO (SELECT FROM Account WHERE accountId = 'a6') SET deviceId = 'dev-001';
+SELECT dep.name as name, dep.failureRate as failureRate, e.criticality as criticality, $depth as depth
+FROM (
+  MATCH {type: Equipment, where: (equipmentId = 'eq1')}
+    .inE('DEPENDS_ON'){while: ($depth < 5), as: e}
+    .outV(){as: dep}
+  RETURN dep, e, $depth
+)
+ORDER BY depth ASC
```
```yaml
  - "5432:5432"
environment:
  JAVA_OPTS: >-
    -Darcadedb.server.rootPassword=arcadedb
```
The root password for ArcadeDB is hardcoded in the docker-compose.yml file. For better security, it's recommended to source secrets from environment variables, which can be managed outside of version control (e.g., in a .env file). This prevents accidental exposure of credentials.
```yaml
    -Darcadedb.server.rootPassword=${ARCADEDB_ROOT_PASSWORD:-arcadedb}
```

- Query 2: restructure MATCH to filter flagged in outer WHERE (terminal node after while: traversal is invalid ArcadeDB syntax)
- Query 9: add while: for recursive traversal, use .inE()/.outV() to access edge criticality (was reading from vertex, returning null)
- Query 8: rename "Personalized Ranking" to "Category Vector Search" (vector is hardcoded for u1, not dynamically read)
- Rename command() to send_command() in queries.sh (shadows bash built-in)
- Add package-lock.json, switch CI to npm ci for reproducible builds

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
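The `${VAR:-default}` form in the docker-compose suggestion above mirrors standard shell parameter expansion: the default is substituted when the variable is unset or empty. A quick bash sketch (the `s3cret` value is purely illustrative):

```shell
#!/usr/bin/env bash
# ${VAR:-default} falls back to the default when VAR is unset or empty,
# which is how the compose file would default to "arcadedb".
unset ARCADEDB_ROOT_PASSWORD
echo "${ARCADEDB_ROOT_PASSWORD:-arcadedb}"   # prints: arcadedb

ARCADEDB_ROOT_PASSWORD=s3cret
echo "${ARCADEDB_ROOT_PASSWORD:-arcadedb}"   # prints: s3cret
```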
**Review response**

Fixed in 327cd20:
Not implementing (with reasoning):
Code Review: feat(feature-store). Overall clean and well-structured.

Key findings:

1. Query 2 materialises all accounts within 4 hops before the outer `WHERE flagged = true` filter — the design doc had a more efficient MATCH-internal filter.
2. Query 11 Step 4 INSERT is non-idempotent; it accumulates FeatureSnapshot rows on re-runs.
3. JS Queries 5 and 6 use a double fallback for Cypher column names that silently swallows undefined.
4. Merchant has no unique index on merchantId, unlike Account/User/Equipment.
5. The Java price cast in runQuery7/8 throws on null — tryRun catches it but skips silently.
6. The design doc's Query Language Mapping table says Query 8 is Cypher, but the implementation uses SQL — the doc needs updating.

Nits: send_command in queries.sh deserves a comment on the command vs. query endpoint; package.json is missing an engines field; port 2424 is omitted intentionally.

What is done well: one-statement-per-line SQL, IF NOT EXISTS guards, a three-runner CI matrix with fail-fast false, the correct Cypher prefix for the pg wire protocol, lowercased SQL column names in JS, action SHAs matching supply-chain.yml, and LSM_VECTOR index syntax and the vectorNeighbors pattern exactly right per documented ArcadeDB quirks.

Generated with Claude Code
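For finding (2), one option is ArcadeDB's `UPDATE ... UPSERT`, which updates a matching record or creates it when none exists, making re-runs idempotent. A sketch only — the `snapshotDate` property is a hypothetical stand-in for the PR's actual snapshot key, not its real schema:

```sql
-- Sketch: idempotent snapshot write. Re-running this statement updates the
-- matching record instead of inserting a duplicate. `snapshotDate` is a
-- hypothetical property name for illustration.
UPDATE FeatureSnapshot
SET entityId = 'a4', snapshotDate = '2026-03-23'
UPSERT WHERE entityId = 'a4' AND snapshotDate = '2026-03-23';
```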
Summary
- New `feature-store/` use case demonstrating ArcadeDB as a unified ML feature store
- Runners in shell (`queries.sh`), Java (`arcadedb-network`), and JavaScript (`pg` driver via the PostgreSQL protocol)
- CI workflow with a `[curl, java, js]` matrix, dependabot entries (maven, npm, docker-compose), design doc, and implementation plan

**Files (16 new/modified)**

- `feature-store/` — full use case directory (docker-compose, setup.sh, sql/, queries/, java/, js/, README)
- `.github/workflows/feature-store.yml` — CI with `[curl, java, js]` matrix
- `.github/dependabot.yml` — 3 new entries (maven, npm, docker-compose)
- `docs/plans/2026-03-23-feature-store-design.md` — design document
- `docs/plans/2026-03-23-feature-store.md` — implementation plan
- `README.md` + `CLAUDE.md` — updated use cases tables

**Test plan**

- `docker compose up -d` starts ArcadeDB with the PostgreSQL plugin (ports 2480 + 5432)
- `./setup.sh` creates the FeatureStore database and loads schema + data without errors
- `./queries/queries.sh` returns non-empty results for all 11 queries
- `mvn package && java -jar target/feature-store.jar` runs all 11 queries
- `npm install && node feature-store.js` runs all 11 queries via the PostgreSQL protocol

🤖 Generated with Claude Code