
Commit 33c0ed1

Authored by amitdeshmukh and claude
feat: graph discovery, composite FK/PK support, and cross-schema joins (#560)
* feat: add schema discovery documents and real-time subscription support

  Add automatic generation of schema discovery documents ("Schema Bible") that provide rich metadata about database tables, relationships, and query patterns. Documents are generated at startup and regenerated on schema changes.

  - Add GenerateDiscovery() and SubscribeDiscovery() to core API
  - Add OnSchemaChange callback for schema change notifications
  - Register discovery as MCP resource and REST endpoint
  - Support WebSocket subscriptions for real-time discovery updates
  - Fire callbacks on startup, reload, and DB watcher schema changes

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add query syntax reference to discovery docs and fix templates

  The generated Schema Bible had no DSL reference, causing agents to guess syntax incorrectly (using group_by instead of distinct, the wrong in operator format, etc.). Add a Query Syntax Reference section covering filter operators, aggregation functions, grouping with distinct, pagination, ordering, relationships, and common mistakes. Also fix query templates to use distinct:[col] for time-series and breakdown grouping.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: auto-discover and resolve tables across all database schemas

  GraphJin previously required tables to be in the default schema (e.g., "public" for PostgreSQL) or explicit schema configuration. This made it fail silently on databases like AdventureWorks where tables live in non-public schemas (production, sales, etc.).

  Add a name-only secondary index (nameIndex) to DBSchema that enables cross-schema table resolution as a fallback when the exact schema:name lookup fails. Resolution order: exact match → single cross-schema match → default schema preference → ambiguity error with schema list.
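The resolution order above can be sketched as a small lookup routine. This is a hedged, simplified model, not GraphJin's actual code: the `table`, `resolver`, and `resolve` names are hypothetical, assuming only an exact `schema:name` map plus the name-only `nameIndex` fallback the commit describes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// table is a minimal stand-in for a discovered table (illustrative only).
type table struct {
	Schema, Name string
}

// resolver models an exact "schema:name" index plus a name-only
// secondary index, as the commit describes for DBSchema.
type resolver struct {
	exact         map[string]table
	nameIndex     map[string][]table
	defaultSchema string
}

func newResolver(defaultSchema string, tables []table) *resolver {
	r := &resolver{
		exact:         map[string]table{},
		nameIndex:     map[string][]table{},
		defaultSchema: defaultSchema,
	}
	for _, t := range tables {
		r.exact[t.Schema+":"+t.Name] = t
		r.nameIndex[t.Name] = append(r.nameIndex[t.Name], t)
	}
	return r
}

// resolve applies the commit's fallback order: exact match, then a
// single cross-schema match, then default-schema preference, then an
// ambiguity error listing the candidate schemas.
func (r *resolver) resolve(schema, name string) (table, error) {
	if schema != "" {
		if t, ok := r.exact[schema+":"+name]; ok {
			return t, nil // 1. exact schema:name match
		}
	}
	candidates := r.nameIndex[name]
	if len(candidates) == 1 {
		return candidates[0], nil // 2. unique across all schemas
	}
	for _, t := range candidates {
		if t.Schema == r.defaultSchema {
			return t, nil // 3. prefer the default schema
		}
	}
	if len(candidates) == 0 {
		return table{}, fmt.Errorf("table not found: %s", name)
	}
	// 4. ambiguous: report every schema that defines the table
	schemas := make([]string, 0, len(candidates))
	for _, t := range candidates {
		schemas = append(schemas, t.Schema)
	}
	sort.Strings(schemas)
	return table{}, fmt.Errorf("ambiguous table %q: found in schemas %s",
		name, strings.Join(schemas, ", "))
}

func main() {
	r := newResolver("public", []table{
		{"public", "product"}, {"sales", "product"},
		{"sales", "orders"}, {"production", "orders"},
		{"person", "businessentity"},
	})
	t, _ := r.resolve("", "businessentity")
	fmt.Println(t.Schema, t.Name) // unique cross-schema match
	_, err := r.resolve("", "orders")
	fmt.Println(err) // in two schemas, neither the default: ambiguous
}
```

The key design point is that the name-only index is consulted only after the exact lookup misses, so fully qualified queries keep their old behavior.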
  Also fix the hardcoded "public" default in NewCompiler to use the discovered schema for all database dialects (MSSQL→dbo, MySQL→db name, etc.), and use the resolved schema in AddRole keys so role permissions work correctly for tables in non-default schemas.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add table of contents to discovery docs and configurable workflow timeout

  - Discovery documents now include a navigable Table of Contents section with anchor links to all sections and individual tables
  - The workflow script timeout is now configurable via mcp.workflow_timeout (in seconds), defaulting to 5s when not set
  - The timeout value is exposed in the get_js_runtime_api response so LLM agents can plan workflow strategies based on the available headroom

* feat: add composite FK support, cross-schema JOIN fix, and AdventureWorks integration tests

  - Fix Postgres composite FK discovery: change confkey[1] to confkey[array_position(co.conkey, f.attnum)] so each local FK column maps to its correct referenced column by position
  - Add DiscoverCompositeFKs() to detect multi-column FK constraints and merge them into single graph edges with ExtraPairs
  - Add ColPair/CompositeFKInfo types; propagate ExtraPairs through TEdge → TPath → DBRel → buildFilter → OpAnd expressions
  - Fix renderJoin to schema-qualify intermediate JOIN table names (e.g., INNER JOIN "person"."businessentity" instead of INNER JOIN businessentity)
  - Add AdventureWorks as an integration test database (full 760K-row dataset) with 24 business-scenario tests verified against SQL ground truth
  - Add a make test-adventureworks target (23/24 tests passing)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent FK columns from being misinterpreted as relationship joins in WHERE filters

  FK columns like customer_id or territoryid were being treated as nested relationship references (triggering EXISTS subqueries) instead of simple column filters.
  Now processNestedTable checks whether the field name matches a column on the current table before attempting path resolution.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: split discovery document into granular MCP resources

  Split the monolithic Schema Bible into focused sections (overview, syntax, tables, full_tables, insights) so MCP agents can load only what they need without exceeding context limits. Add server instructions for MCP.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update go.work.sum

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add composite FK extra pair columns to parent subquery SELECT list

  When a composite FK join references extra columns (e.g., specialofferid in the salesorderdetail → specialofferproduct join), those columns must be included in the parent subquery's SELECT list. Without this, the aliased subquery (e.g., salesorderdetail_2) doesn't expose the column, causing "column does not exist" errors.

  Also fix the CustomerGeography test to filter for customers with personid (B2B store-only customers have a NULL personid, causing empty joins). 24/24 AdventureWorks tests now pass.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: strip person.password sample data from AdventureWorks dump

  Remove hashed password + salt values from the test data fixture to resolve a GitHub secret scanning alert. The person.password table contains AdventureWorks demo data (not real credentials) but triggers automated secret detection. No tests depend on this table.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add composite FK discovery for MySQL, MariaDB, SQLite, Oracle, MSSQL, Snowflake

  Extend DiscoverCompositeFKs() with per-database SQL queries to detect multi-column foreign key constraints.
  Each DB uses its native system catalog (information_schema, pragma_foreign_key_list, all_constraints, sys.foreign_key_columns, _gj_fk_metadata) with GROUP BY + HAVING COUNT to identify composite FKs. The downstream machinery (edge merging, ExtraPairs propagation, AND filter generation) is already DB-agnostic from the Postgres implementation.

  Includes unit tests for query constants, CSV parsing, normalization per DB type (Oracle/MSSQL/Snowflake: snake_case + lowercase), and the unsupported-DB fallback.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move AdventureWorks test data to tests-large/ and enable git LFS

  Large SQL fixtures (a 75MB data dump) moved out of tests/ into tests-large/ to keep the main test directory lean. The 75MB data file is now tracked via git LFS. Updated init script paths in dbint_test.go and added a test-large Makefile target.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add composite FK integration tests for all 6 database types

  Add product_variants + order_items tables with a composite FK (product_id, variant_id) to the Postgres, MySQL, MariaDB, SQLite, MSSQL, and Oracle test schemas. Integration tests verify:

  - Forward join: order_items → product_variants (the correct variant is matched)
  - All-rows match: every order_item joins to its correct variant
  - Reverse join: product_variants → order_items

  All 18 tests pass (3 tests × 6 databases). Also adds unit tests for composite FK query constants, CSV parsing with per-DB normalization, and the unsupported-DB fallback.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add composite primary key support across all 6 database dialects

  Tables with multi-column primary keys (e.g., PRIMARY KEY (user_id, session_id)) now work correctly throughout GraphJin. Previously only the first PK column was recognized and the rest were silently dropped.
  Core changes:

  - Add PrimaryCols []DBColumn to DBTable with HasCompositePK/PKColNames/IsPKCol helpers
  - compileArgID accepts a composite PK as an object: id: {col1: val1, col2: val2}
  - orderByIDCol adds all PK columns to ORDER BY
  - Mutation helpers generate multi-column variable declarations and WHERE clauses

  Dialect updates (150 PrimaryCol refs across 18 files):

  - Postgres: multi-column ON CONFLICT
  - SQLite: JSON-encoded composite keys in _gj_ids, multi-col ON CONFLICT/RETURNING
  - MySQL: multi-col JSON_TABLE, PK detection via IsPKCol
  - MSSQL: multi-col MERGE ON, OPENJSON columns
  - Oracle: multi-col ORDER BY, RETURNING INTO, JSON_TABLE
  - Snowflake: multi-col identity updates, PK detection

  Tested: 4 unit tests + 15 integration tests (3 tests × 5 DBs) all passing; the full Postgres and MySQL regression suites are clean.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: don't use JSON_TABLE for connect/disconnect mutations on MySQL/MariaDB

  Connect and disconnect mutations use a compiled WHERE filter (e.g., id IN (1,2,3)), not a JSON record set. Passing the connect data through JSON_TABLE caused MySQL Error 3666 ("Can't store an array in scalar JSON_TABLE column") because the filter value is an array, not a scalar.

  The fix removes the unnecessary RenderMutateToRecordSet calls from RenderLinearConnect and RenderLinearDisconnect for both the MySQL and MariaDB dialects. The WHERE filter rendered by renderFilter() is sufficient.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update test documentation for all 9 database targets

  Rewrites tests/TESTS.md to cover the full test infrastructure:

  - All 9 database targets (PG, MySQL, MariaDB, SQLite, Oracle, MSSQL, Snowflake, MongoDB, AdventureWorks) with container images and make targets
  - Composite FK and PK test sections with per-DB compatibility
  - Ground-truth verification pattern documentation
  - Full compatibility matrix and schema file listing
  - Known issues section for SQLite, MariaDB, MongoDB, Snowflake

  Adds tests-large/TESTS.md explaining the large-scale test strategy, AdventureWorks database stats, and how to add new large-scale fixtures.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
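To make the composite-PK SQL shapes from the message above concrete, here is a hedged sketch of two fragments: a multi-column ON CONFLICT target built from the PK column names, and the multi-column WHERE built from an id object such as id: {user_id: 1, session_id: 2}. The function and type names (renderOnConflict, renderPKWhere, colVal) are hypothetical illustrations, not GraphJin's own helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// colVal is one primary key column with its value, kept in an ordered
// slice (not a map) so the rendered SQL is deterministic.
type colVal struct {
	Col string
	Val any
}

// renderOnConflict renders a Postgres-style multi-column conflict
// target from the PK column names.
func renderOnConflict(pkCols []string) string {
	quoted := make([]string, len(pkCols))
	for i, c := range pkCols {
		quoted[i] = `"` + c + `"`
	}
	return "ON CONFLICT (" + strings.Join(quoted, ", ") + ")"
}

// renderPKWhere renders the AND-joined WHERE clause a mutation helper
// needs when the id argument carries one value per PK column.
// Values are interpolated directly only for illustration; real code
// would emit bind parameters instead.
func renderPKWhere(tbl string, id []colVal) string {
	conds := make([]string, len(id))
	for i, cv := range id {
		conds[i] = fmt.Sprintf(`"%s"."%s" = %v`, tbl, cv.Col, cv.Val)
	}
	return "WHERE " + strings.Join(conds, " AND ")
}

func main() {
	fmt.Println(renderOnConflict([]string{"user_id", "session_id"}))
	fmt.Println(renderPKWhere("sessions", []colVal{
		{"user_id", 1}, {"session_id", 2},
	}))
}
```

The same ordered-columns idea extends to the other dialect sites the commit lists (MERGE ON for MSSQL, RETURNING INTO for Oracle): each simply iterates all PK columns instead of the first one.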
1 parent 3dc3f88 commit 33c0ed1


57 files changed (+14515, −583 lines)

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+tests-large/02_adventureworks_data.sql filter=lfs diff=lfs merge=lfs -text

Makefile

Lines changed: 6 additions & 0 deletions
@@ -57,6 +57,12 @@ test-mongodb:
 	@echo "Running MongoDB tests..."
 	@cd tests; go test -v -timeout 30m -race -db=mongodb .
 
+test-adventureworks:
+	@echo "Running AdventureWorks tests..."
+	@cd tests; go test -v -timeout 60m -race -db=adventureworks -run TestAdventureWorks .
+
+test-large: test-adventureworks
+
 BIN_DIR := $(GOPATH)/bin
 WEB_BUILD_DIR := ./serv/web/build/manifest.json

core/api.go

Lines changed: 105 additions & 1 deletion
@@ -120,10 +120,99 @@ type GraphJin struct {
 	done     chan bool
 	stopOnce sync.Once
 	reloadMu sync.Mutex // serializes reload operations
+
+	// Schema change callbacks
+	schemaCallbacks []func(dbName string, hash string)
+	callbackMu      sync.RWMutex
+
+	// Discovery document cache
+	discovery sync.Map // map[string]*DiscoveryDocument
 }
 
 type Option func(*graphjinEngine) error
 
+// OnSchemaChange registers a callback that fires when the database schema changes.
+// The callback receives the database name and a hex-encoded hash of the schema.
+// Callbacks also fire once at startup after initial schema discovery.
+func (g *GraphJin) OnSchemaChange(fn func(dbName string, hash string)) {
+	g.callbackMu.Lock()
+	defer g.callbackMu.Unlock()
+	g.schemaCallbacks = append(g.schemaCallbacks, fn)
+}
+
+// fireSchemaCallbacks invokes all registered schema change callbacks.
+// Runs each callback in a goroutine to avoid blocking the caller (which may hold reloadMu).
+func (g *GraphJin) fireSchemaCallbacks(dbName string, hash string) {
+	g.callbackMu.RLock()
+	callbacks := make([]func(string, string), len(g.schemaCallbacks))
+	copy(callbacks, g.schemaCallbacks)
+	g.callbackMu.RUnlock()
+
+	for _, fn := range callbacks {
+		fn := fn
+		go fn(dbName, hash)
+	}
+}
+
+// DefaultDatabase returns the name of the default (primary) database.
+func (g *GraphJin) DefaultDatabase() string {
+	gj, err := g.getEngine()
+	if err != nil {
+		return ""
+	}
+	return gj.defaultDB
+}
+
+// DatabaseNames returns the names of all configured databases.
+func (g *GraphJin) DatabaseNames() []string {
+	gj, err := g.getEngine()
+	if err != nil {
+		return nil
+	}
+	return gj.sortedDatabaseNames()
+}
+
+// fireAllSchemaCallbacks fires schema change callbacks for all databases with initialized schemas.
+func (g *GraphJin) fireAllSchemaCallbacks() {
+	gj, err := g.getEngine()
+	if err != nil {
+		return
+	}
+	for name, ctx := range gj.databases {
+		if ctx.dbinfo != nil {
+			g.fireSchemaCallbacks(name, fmt.Sprintf("%x", ctx.dbinfo.Hash()))
+		}
+	}
+}
+
+// generateAllDiscovery generates discovery documents for all databases with initialized schemas.
+// Called at startup and after schema changes.
+func (g *GraphJin) generateAllDiscovery() {
+	gj, err := g.getEngine()
+	if err != nil {
+		return
+	}
+	ctx := context.Background()
+	for name, dbCtx := range gj.databases {
+		if dbCtx.schema == nil {
+			continue
+		}
+		start := time.Now()
+		doc, err := g.GenerateDiscovery(ctx, name)
+		if err != nil {
+			gj.log.Printf("ERR discovery: %s: %v", name, err)
+			continue
+		}
+		dbLabel := name
+		if dbLabel == "" {
+			dbLabel = "(default)"
+		}
+		tables := len(dbCtx.schema.GetTables())
+		gj.log.Printf("INF discovery: %s — %d tables, %d bytes, hash %s (%s)",
+			dbLabel, tables, len(doc.Markdown), doc.Hash, time.Since(start).Round(time.Millisecond))
+	}
+}
+
 // NewGraphJin creates the GraphJin struct, this involves querying the database to learn its
 // schemas and relationships
 func NewGraphJin(conf *Config, db *sql.DB, options ...Option) (g *GraphJin, err error) {
@@ -142,6 +231,9 @@ func NewGraphJin(conf *Config, db *sql.DB, options ...Option) (g *GraphJin, err
 		g = nil
 		return
 	}
+
+	g.generateAllDiscovery()
+	g.fireAllSchemaCallbacks()
 	return
 }
 
@@ -157,6 +249,9 @@ func NewGraphJinWithFS(conf *Config, db *sql.DB, fs FS, options ...Option) (g *G
 		g = nil
 		return
 	}
+
+	g.generateAllDiscovery()
+	g.fireAllSchemaCallbacks()
 	return
 }
 
@@ -657,7 +752,12 @@ func (g *GraphJin) Reload() error {
 	if pdb := gj.primaryDB(); pdb != nil {
 		db = pdb.db
 	}
-	return g.newGraphJin(gj.conf, db, nil, gj.fs, gj.opts...)
+	if err := g.newGraphJin(gj.conf, db, nil, gj.fs, gj.opts...); err != nil {
+		return err
+	}
+	g.generateAllDiscovery()
+	g.fireAllSchemaCallbacks()
+	return nil
 }
 
 // ReloadWithDB redoes database discover with a new primary DB connection.
@@ -803,6 +903,7 @@ type TableSchema struct {
 	Type        string   `json:"type"`
 	Comment     string   `json:"comment,omitempty"`
 	PrimaryKey  string   `json:"primary_key,omitempty"`
+	PrimaryKeys []string `json:"primary_keys,omitempty"`
 	Columns     []ColumnInfo `json:"columns"`
 	Relationships struct {
 		Outgoing []RelationInfo `json:"outgoing"` // Tables this table references
@@ -939,6 +1040,9 @@ func (gj *graphjinEngine) buildTableSchema(dbSchema *sdata.DBSchema, dbName, tab
 	if t.PrimaryCol.Name != "" {
 		schema.PrimaryKey = t.PrimaryCol.Name
 	}
+	if len(t.PrimaryCols) > 1 {
+		schema.PrimaryKeys = t.PKColNames()
+	}
 
 	// Add columns
 	for _, col := range t.Columns {
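The callback plumbing in the core/api.go diff above (OnSchemaChange registering under a lock, fireSchemaCallbacks snapshotting the slice and firing each callback in a goroutine) can be exercised with a standalone sketch. The notifier type here is a hypothetical stand-in for the pattern, not the GraphJin struct itself.

```go
package main

import (
	"fmt"
	"sync"
)

// notifier mirrors the pattern in the diff: registration guarded by a
// RWMutex, and fan-out that copies the callback slice under the read
// lock and runs each callback asynchronously so a slow callback cannot
// block the caller.
type notifier struct {
	mu        sync.RWMutex
	callbacks []func(dbName, hash string)
}

func (n *notifier) OnSchemaChange(fn func(dbName, hash string)) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.callbacks = append(n.callbacks, fn)
}

func (n *notifier) fire(dbName, hash string) {
	n.mu.RLock()
	cbs := make([]func(string, string), len(n.callbacks))
	copy(cbs, n.callbacks) // snapshot under the read lock
	n.mu.RUnlock()
	for _, fn := range cbs {
		go fn(dbName, hash) // async, like fireSchemaCallbacks
	}
}

func main() {
	done := make(chan string, 1)
	n := &notifier{}
	n.OnSchemaChange(func(db, hash string) {
		done <- db + "@" + hash
	})
	n.fire("maindb", "a1b2c3")
	fmt.Println(<-done)
}
```

Copying the slice before firing matters: a callback may itself call OnSchemaChange, and iterating the live slice while it is appended to would race.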
