Nexus

Distributed search engine — AST code indexing, hybrid BM25 + vector scoring, gossip-based PageRank, ε-differential privacy. Written in Rust + TypeScript.

Architecture

nexus/
├── core/        Rust engine  — indexing, scoring, gossip, PageRank
├── gateway/     TypeScript   — HTTP/WebSocket API (Elysia + Bun)
└── tools/       Python scripts — benchmarking (benchmark.py) + model training (train_alpha.py)

Nodes communicate over QUIC. State is propagated via gossip (no leader, no consensus).

Stack

Layer	Tech
Transport	QUIC (quinn) + ed25519 node identity
Index	Inverted index, BP128-compressed posting lists, HNSW vectors
Scoring	BM25F multi-field + dense hybrid (BAAI/bge-small-en-v1.5 via fastembed), QPP-adaptive alpha, distributed IDF via HyperLogLog
PageRank	Local power iteration + gossip aggregation
Privacy	Laplace ε-DP on gossiped HLL sketches
Gateway	Bun + Elysia, msgpack, WebSocket streaming, RAG via Claude

Run

# Core node
cargo run -p nexus-core --release

# Gateway
cd gateway && bun run index.ts

Tools

Script	Purpose
`tools/train_alpha.py`	Fit BM25/vector fusion weights from 100 labeled queries (no API). Outputs `WEIGHTS`/`BIAS` constants for `query_features.rs`.
`tools/benchmark.py`	Benchmark Nexus vs Tantivy on the crates.io corpus (MRR@10, Hits@1, P50/P95/P99 latency, QPS). Two suites: named-entity retrieval (crate name = query) and 78 hand-labeled natural-language queries.
`tools/sweep_dp_epsilon.py`	Privacy-utility tradeoff curve: sweep `dp_epsilon` over [0.05 → ∞], spin up a 2-node gossip cluster per value, measure NL MRR@10. Quantifies the cost of ε-DP on search quality.

# Train QPP weights
python3 tools/train_alpha.py

# Benchmark — Tantivy only (no Nexus node required)
python3 tools/benchmark.py --no-nexus

# Benchmark — Nexus + Tantivy (node running on localhost:3001)
# After indexing, rebuild the vector index first:
curl -X POST http://localhost:3001/rebuild-vector
python3 tools/benchmark.py --skip-fetch --nexus-url http://localhost:3001

# Smaller run for quick iteration
python3 tools/benchmark.py --no-nexus --corpus-size 500 --query-size 50

# ef_search recall-latency sweep (requires Nexus node + vector index)
python3 tools/benchmark.py --skip-fetch --nexus-url http://localhost:3001 \
  --ef-search 10 20 40 80 160 320

Dependencies: pip install requests tantivy tqdm

Benchmark

crates.io corpus (top 2,000 crates), 200 named-entity queries + 92 hand-labeled natural-language queries.

Relevance

Two configurations measured: BM25F only (no vector index) and Hybrid (BM25F + fastembed dense vectors, BAAI/bge-small-en-v1.5 384-dim).

Suite	Nexus Hybrid	Nexus BM25F	Tantivy
NE MRR@10	0.9618	0.9714	0.9422
NE Hits@1	0.9300	0.9500	0.9100
NL MRR@10	0.4074	0.3766	0.3865
NL Hits@1	0.2609	0.2935	0.3043

BM25F (multi-field with name boost ×3) dominates named-entity retrieval (+2.9 pts MRR vs Tantivy). Dense hybrid improves NL MRR@10 by +2.1 pts over Tantivy — semantic embeddings help when query wording differs from indexed text. The hybrid trades Hits@1 for MRR@10: it surfaces the right answer in top-5 more often even when it misses the top slot.

Latency (Nexus HTTP loopback, single node, release build)

Suite	Config	P50	P95	P99
NE (200 queries)	BM25F only	0.85 ms	1.02 ms	1.12 ms
NL (92 queries)	BM25F only	0.84 ms	0.98 ms	1.17 ms
NE (200 queries)	Hybrid	5.61 ms	6.98 ms	7.34 ms
NL (92 queries)	Hybrid	5.85 ms	6.87 ms	7.44 ms

Tantivy runs in-process (no network). Nexus latency includes HTTP loopback + Rust server. Hybrid overhead (~5 ms) is fastembed ONNX inference on a single CPU core — expected to improve significantly on multi-core hardware or with batched query serving.

Privacy-utility tradeoff (ε-DP sweep)

2-node gossip cluster, corpus split 50/50, 35 s gossip convergence wait, NL MRR@10 measured on node A.

ε	MRR@10	Δ vs ε=1.0
0.05	—	—
0.1	—	—
1.0	0.408	baseline
10.0	—	—
∞ (no noise)	—	—

Result on 2,000-doc corpus: MRR@10 range across all ε values ≈ 0.007 — statistically negligible. The DP noise perturbs estimated_global_terms (gossip-propagated N used for IDF), but BM25 is robust to small N perturbations at this corpus size. The effect grows with corpus size and number of peers; expect more visible degradation at low ε on 50k+ doc deployments.

Run: cargo build -p nexus-core --release && python3 tools/sweep_dp_epsilon.py

HNSW ef_search — recall vs latency

ef_search controls the beam width during HNSW graph traversal: higher values explore more candidate nodes, improving recall at the cost of latency. The default is max(limit × 4, 50).

Run the sweep (requires node with vector index built):

curl -X POST http://localhost:3001/rebuild-vector
python3 tools/benchmark.py --skip-fetch --nexus-url http://localhost:3001 \
  --ef-search 10 20 40 80 160 320

ef_search	NL MRR@10	P50 ms
auto (limit×4)	0.4193	8.62
10	0.4193	8.72
20	0.4193	8.77
40	0.4193	8.76
80	0.4193	8.60
160	0.4193	8.67
320	0.4193	8.67

Result on 2,000-doc corpus: MRR@10 is flat across all ef_search values — even ef=10 achieves full recall. On a corpus this small the HNSW graph is shallow (few layers) and greedy search reaches the same neighbours regardless of beam width. Latency (~8.7 ms) is dominated by fastembed ONNX inference, not graph traversal. The ef_search tradeoff becomes visible at 100k+ docs where multi-layer graphs can miss neighbours with a narrow beam.

Docs

CONSISTENCY.md — consistency model, CRDT proofs, PageRank convergence

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.github/workflows		.github/workflows
core		core
gateway		gateway
proto		proto
tools		tools
.dockerignore		.dockerignore
.gitignore		.gitignore
CONSISTENCY.md		CONSISTENCY.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
bun.lock		bun.lock
docker-compose.yml		docker-compose.yml
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Nexus

Architecture

Stack

Run

Tools

Benchmark

Relevance

Latency (Nexus HTTP loopback, single node, release build)

Privacy-utility tradeoff (ε-DP sweep)

HNSW ef_search — recall vs latency

Docs

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Nexus

Architecture

Stack

Run

Tools

Benchmark

Relevance

Latency (Nexus HTTP loopback, single node, release build)

Privacy-utility tradeoff (ε-DP sweep)

HNSW ef_search — recall vs latency

Docs

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages