Commit 7dd709d

add new rag data set and evaluation runner via cli component

1 parent b543089 · 11 files changed · 1510 additions & 0 deletions

Lines changed: 129 additions & 0 deletions
# RAG Evaluation Runner

The RAG Evaluation Runner provides a small, framework-agnostic pipeline
for benchmarking retrieval quality across different chunking and
embedding configurations using simple JSONL datasets.

It is designed for **Python power users** who want to:

- Point the runner at a corpus and queries file.
- Configure chunking and embedding variants.
- Run matrix experiments and collect metrics.
## Dataset format

Both corpus and queries are JSONL files. Blank lines and lines starting
with `#` are ignored.

### Corpus JSONL

Each line is a JSON object with the following fields:

- `id` (str, required)
- `text` (str, required)
- `source_uri` (str, optional)
- `metadata` (object, optional)

Example:

```json
{"id": "doc-1", "text": "Hello world", "source_uri": "memory://", "metadata": {"topic": "greeting"}}
```

### Queries JSONL

Each line is a JSON object with the following fields:

- `id` (str, required)
- `query` (str, required)
- `relevant_ids` (list[str], required) – chunk ids considered relevant
- `metadata` (object, optional)

Example:

```json
{"id": "q1", "query": "hello", "relevant_ids": ["doc-1:0"]}
```
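The parsing rules above (one JSON object per line, blank lines and `#` comments skipped) can be sketched as follows. Note that `load_jsonl` is an illustrative helper, not the component's actual `DatasetLoader` API:

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file, skipping blank lines and `#` comment lines."""
    records: list[dict] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are ignored
        records.append(json.loads(stripped))
    return records
```

The real loader additionally validates required fields such as `id` and `text` and raises `DatasetFormatError` on malformed records.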
## Experiment matrix

An evaluation run describes a matrix of experiments:

- `chunk_variants` – different `ChunkingConfig` settings.
- `embedder_variants` – logical embedders (for example `"fake"`, `"openai"`).
- `top_k_values` – list of cut-off ranks.

The runner expands these into experiments:

```text
experiments = chunk_variants × embedder_variants × top_k_values
```

Each experiment produces:

- Aggregate metrics per `k`.
- Optional per-query breakdown.
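The expansion is a plain Cartesian product, which can be sketched with `itertools.product`; the dict and string variants below are stand-ins for the runner's actual `ChunkingConfig` and `EmbedderVariant` objects:

```python
from itertools import product

# Illustrative variant lists (placeholders for the real config objects).
chunk_variants = [
    {"chunk_size": 500, "chunk_overlap": 100},
    {"chunk_size": 1000, "chunk_overlap": 200},
]
embedder_variants = ["fake", "openai"]
top_k_values = [3, 5, 10]

# Cartesian product in a stable, deterministic order:
# 2 chunk variants x 2 embedders x 3 cut-offs = 12 experiments.
experiments = list(product(chunk_variants, embedder_variants, top_k_values))
```

Because `product` iterates its inputs in order, the expansion order is stable across runs, which matters for reproducible experiment ids.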
## Metrics

The runner computes the following metrics for each `k`:

- Hit rate@k – fraction of queries with at least one relevant chunk in
  the top-`k` results.
- Precision@k – macro-averaged precision.
- Recall@k – macro-averaged recall.
- MRR@k – mean reciprocal rank.

Metrics are computed using the deterministic utilities from
`electripy.ai.rag.evaluation` plus a local MRR implementation.
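For a single query, these metrics reduce to simple set and rank arithmetic. The sketch below is illustrative (it is not the `electripy.ai.rag.evaluation` code) and assumes the common convention of dividing precision by `k` itself; the macro-averaging over queries happens in the caller:

```python
def metrics_at_k(retrieved: list[str], relevant: set[str], k: int) -> dict[str, float]:
    """Per-query hit rate, precision, recall, and reciprocal rank at cut-off k."""
    top = retrieved[:k]
    hits = [cid for cid in top if cid in relevant]
    hit_rate = 1.0 if hits else 0.0
    precision = len(hits) / k
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Reciprocal rank: 1 / position of the first relevant result, else 0.
    rr = 0.0
    for rank, cid in enumerate(top, start=1):
        if cid in relevant:
            rr = 1.0 / rank
            break
    return {"hit_rate": hit_rate, "precision": precision, "recall": recall, "mrr": rr}
```

Averaging `hit_rate`, `precision`, `recall`, and `mrr` over all queries yields the macro-averaged aggregates reported per experiment.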
## CLI usage
81+
82+
A Typer-based CLI command is exposed as:
83+
84+
```bash
85+
electripy rag eval --corpus corpus.jsonl --queries queries.jsonl \
86+
--top-k 3,5,10 --chunk-size 500 --chunk-overlap 100 --embedder fake \
87+
--report-json out.json --report-csv out.csv
88+
```
89+
90+
Key options:
91+
92+
- `--corpus PATH` – corpus JSONL file.
93+
- `--queries PATH` – queries JSONL file.
94+
- `--top-k 3,5,10` – comma-separated list of cut-offs.
95+
- `--chunk-size` / `--chunk-overlap` – basic chunking config.
96+
- `--chunker-config PATH` – optional JSON file for advanced chunking
97+
configs; takes precedence over `--chunk-size` / `--chunk-overlap`.
98+
- `--embedder` – one or more embedders (for example `"fake"`),
99+
optionally as a comma-separated list.
100+
- `--report-json` / `--report-csv` – report output paths.
101+
- `--fail-under` – thresholds such as `hit_rate@5=0.85`.
102+
103+
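Parsing a comma-separated option such as `--top-k 3,5,10` can be sketched as below; whether the real CLI deduplicates or sorts the values is an assumption of this sketch:

```python
def parse_csv_ints(raw: str) -> list[int]:
    """Turn a `--top-k`-style value like "3,5,10" into a sorted list of ints.

    Duplicates are collapsed and empty segments ignored, so "10,3,3,"
    parses the same as "3,10".
    """
    return sorted({int(part) for part in raw.split(",") if part.strip()})
```

The same shape works for `--embedder fake,openai`, with `str.strip` instead of `int` per segment.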
## Determinism and reproducibility

- Fake embeddings are deterministic functions of the input text.
- The in-memory vector store uses cosine similarity with deterministic
  tie-breaking on chunk id.
- Experiments are expanded in a stable order.
- Experiment ids are computed as SHA-256 hashes of the configuration.
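One way to derive such a stable id is to hash a canonical JSON serialisation of the configuration; the exact serialisation used by the runner is an assumption here, but sorted keys are the essential ingredient, since they make the id independent of dict insertion order:

```python
import hashlib
import json


def experiment_id(config: dict) -> str:
    """Stable id: SHA-256 over the canonical JSON form of the config.

    `sort_keys=True` and fixed separators remove any dependence on key
    order or whitespace, so equal configs always hash to the same id.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two logically identical configs built in different orders therefore share one experiment id, which keeps report files and caches stable across runs.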
## Extensibility

To plug in a custom chunker or embedder, implement the existing RAG
ports (`ChunkerPort`, `EmbeddingPort`, `VectorStorePort`) and wire them
into your own orchestration, or extend the helpers in
`electripy.ai.rag_eval_runner.services`.

In particular:

- Add a new embedder via `EmbedderVariant` and extend the
  `_build_default_embedding_port` helper.
- Swap the vector store by providing an implementation of
  `VectorStorePort` instead of `InMemoryVectorStoreAdapter`.
## CI gating with `--fail-under`

The CLI exposes `--fail-under <metric@k=value>` to make evaluations
suitable for CI. Thresholds must be met by **all** experiments; otherwise
an error is raised and the process exits with a non-zero status.
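The threshold check described above can be sketched as follows; `parse_threshold` and `check_thresholds` are illustrative helpers, and the flat `"metric@k"` keying of results is an assumption of this sketch:

```python
def parse_threshold(spec: str) -> tuple[str, int, float]:
    """Parse a `metric@k=value` spec such as "hit_rate@5=0.85"."""
    metric_at_k, value = spec.split("=", 1)
    metric, k = metric_at_k.split("@", 1)
    return metric, int(k), float(value)


def check_thresholds(
    results: dict[str, dict[str, float]],
    thresholds: list[tuple[str, int, float]],
) -> list[str]:
    """Return a failure message per experiment/threshold that is not met.

    Every experiment must satisfy every threshold; a missing metric
    counts as a failure rather than a silent pass.
    """
    failures: list[str] = []
    for exp_id, metrics in results.items():
        for metric, k, minimum in thresholds:
            actual = metrics.get(f"{metric}@{k}")
            if actual is None or actual < minimum:
                failures.append(f"{exp_id}: {metric}@{k}={actual} < {minimum}")
    return failures
```

A CI wrapper would exit non-zero when the returned list is non-empty, matching the gating behaviour described above.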
Lines changed: 25 additions & 0 deletions
```python
"""RAG evaluation runner component.

High-level exports for dataset models, experiment configuration, and
orchestration services.
"""

from __future__ import annotations

from .domain import CorpusRecord, ExperimentConfig, QueryRecord
from .errors import DatasetFormatError, EvalRunnerError, ExperimentConfigError, RagEvalError
from .services import DatasetLoader, Evaluator, IndexBuilder, ReportWriter

__all__ = [
    "CorpusRecord",
    "QueryRecord",
    "ExperimentConfig",
    "DatasetLoader",
    "IndexBuilder",
    "Evaluator",
    "ReportWriter",
    "RagEvalError",
    "DatasetFormatError",
    "ExperimentConfigError",
    "EvalRunnerError",
]
```
Lines changed: 120 additions & 0 deletions
```python
"""Adapters and fakes for the RAG evaluation runner.

This module provides:

- ``FakeEmbeddingAdapter`` – deterministic, stateless embeddings derived
  from text hashing, suitable for tests and offline runs.
- ``InMemoryVectorStoreAdapter`` – simple in-memory vector store using
  cosine similarity with deterministic tie-breaking.

Both adapters implement the ports defined in :mod:`electripy.ai.rag` and
are intentionally minimal to keep dependencies small and behaviour
predictable.
"""

from __future__ import annotations

import hashlib
import math
from collections.abc import Mapping, Sequence

from electripy.ai.rag.domain import Chunk
from electripy.ai.rag.ports import EmbeddingPort, VectorStorePort


class FakeEmbeddingAdapter(EmbeddingPort):
    """Deterministic embedding adapter based on SHA-256 hashing.

    The adapter produces fixed-size embedding vectors whose components
    are derived from the SHA-256 digest of the input text. The mapping
    is purely functional and does not involve any randomness, making it
    suitable for reproducible tests.

    Example:
        >>> adapter = FakeEmbeddingAdapter()
        >>> vectors = adapter.embed_texts(["hello", "world"])
        >>> len(vectors) == 2
        True
    """

    def __init__(self, *, dim: int = 16) -> None:
        if dim <= 0:
            raise ValueError("dim must be positive")
        self._dim = dim

    def embed_texts(self, texts: Sequence[str]) -> list[list[float]]:
        if not texts:
            return []
        return [self._embed_single(text) for text in texts]

    def _embed_single(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Use bytes from the digest to populate the vector deterministically.
        values: list[float] = []
        for i in range(self._dim):
            # Wrap around the digest if needed.
            b = digest[i % len(digest)]
            # Map each byte into [-0.5, 0.5].
            values.append((float(b) / 255.0) - 0.5)
        # L2-normalise to keep cosine similarity well-behaved.
        norm = math.sqrt(sum(v * v for v in values)) or 1.0
        return [v / norm for v in values]


class InMemoryVectorStoreAdapter(VectorStorePort):
    """In-memory vector store implementing :class:`VectorStorePort`.

    Notes:
        - Stores vectors in process memory only; suitable for tests and
          local evaluation runs.
        - Uses cosine similarity for ranking and breaks ties
          deterministically by chunk id.
    """

    def __init__(self) -> None:
        self._store: dict[str, tuple[Chunk, list[float]]] = {}

    def upsert(self, chunks: Sequence[Chunk], vectors: Sequence[list[float]]) -> None:
        if len(chunks) != len(vectors):
            raise ValueError("chunks and vectors must have the same length")
        for chunk, vector in zip(chunks, vectors):
            self._store[chunk.id] = (chunk, list(vector))

    def query(
        self,
        vector: Sequence[float],
        *,
        top_k: int,
        filters: Mapping[str, object] | None = None,
    ) -> list[tuple[Chunk, float]]:
        if top_k <= 0:
            raise ValueError("top_k must be positive")
        if not self._store:
            return []

        # For now, filters are ignored; they are present to satisfy the
        # protocol and keep a future extension point.
        del filters

        norm_q = math.sqrt(sum(float(v) * float(v) for v in vector)) or 1.0
        scores: list[tuple[Chunk, float]] = []
        for _chunk_id, (chunk, stored_vec) in self._store.items():
            dot = 0.0
            norm_v = 0.0
            for a, b in zip(vector, stored_vec):
                fa = float(a)
                fb = float(b)
                dot += fa * fb
                norm_v += fb * fb
            norm_v = math.sqrt(norm_v) or 1.0
            score = dot / (norm_q * norm_v)
            scores.append((chunk, score))

        # Deterministic ordering: sort by descending score, then chunk id.
        scores.sort(key=lambda item: (-item[1], item[0].id))
        return scores[:top_k]

    def delete_by_document(self, document_id: str) -> None:
        to_delete = [
            cid
            for cid, (chunk, _) in self._store.items()
            if chunk.document_id == document_id
        ]
        for cid in to_delete:
            self._store.pop(cid, None)
```
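The deterministic tie-breaking in `query` can be shown in isolation: sorting on `(-score, id)` means equal cosine scores fall back to lexicographic id order, so repeated runs always return the same top-k. The ids and scores below are hypothetical:

```python
# (chunk_id, cosine_score) pairs with a tie between the first two.
scored = [("doc-1:1", 0.9), ("doc-1:0", 0.9), ("doc-2:0", 0.7)]

# Same key as InMemoryVectorStoreAdapter.query: descending score, then id.
ranked = sorted(scored, key=lambda item: (-item[1], item[0]))
```

Without the id component in the sort key, the order of the two tied entries would depend on insertion order, breaking run-to-run reproducibility.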
