
Commit fe504e4

Merge pull request #68 from martipath/tagging
Implement per-page tagging and JSON writer enhancements
2 parents bbb2c6c + 223ec0f commit fe504e4

6 files changed

Lines changed: 240 additions & 59 deletions


docs/readme/indexer-skills.md

Lines changed: 48 additions & 8 deletions
@@ -40,14 +40,23 @@ This document describes all available skills that can be used in the indexer pipeline
     2. An `embedding` to generate embeddings from the Q&A content.
     3. A `vector-store` to store the embeddings.

+6. Want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a `writer` (`json-writer`) skill as a change gate:
+
+    1. A `file-scanner` (or `exporter`) to locate/export your source documents.
+    2. A `file-reader` to read their content.
+    3. A `splitter` to split the documents into chunks.
+    4. A `writer` (`json-writer`) with `checksum_path` set — it computes a SHA-256 checksum of each **chunk's content** individually (keyed by `document_id`); only chunks whose content has changed (or are new) pass downstream, so unchanged chunks are stripped and their embedding and indexing are skipped automatically.
+    5. An `embedding` to generate embeddings (skipped when content is unchanged).
+    6. A `vector-store` to store the embeddings (skipped when content is unchanged).
+

 # Available Skills

 <details><summary>Exporter Skills</summary>
 Export data from one source to another. For example export a confluence page to a markdown file.

 ### Scroll Word Exporter
-Exports a confluence page to Microsoft Word document
+Exports Confluence pages to Microsoft Word documents. Each entry in `page_urls` and `page_ids` supports an optional inline `tag`. Entries without a tag fall back to the top-level `tag` param.

 ```yaml
 - skill: &Exporter
@@ -58,12 +67,19 @@ Exports a confluence page to Microsoft Word document
     auth_token: env.SWE_AUTH_TOKEN # Scroll Word API token - can be obtained in Confluence
     poll_interval: 20 # Interval in seconds to check the status of the export
     export_folder: ~/Downloads/sw_export_temp # Path where the exported file(s) should be saved
-    scope: current # Possible values: [current | descendants]. `current` exports just the current page, where `descendants` include all the descendants of the current page
-    page_ids: # List all page IDs that you'd like to export
-      - 1774209540
-    page_urls: # List all page URLs that you'd like to export
-      - https://your/corporate/confluence/prefix/wiki/spaces/your/confluence/space
-    confluence_prefix: https://your/corporate/confluence/prefix # Your corporate Confluence URL
+    scope: current # Possible values: [current | descendants]
+    confluence_prefix: https://your/corporate/confluence/prefix
+    tag: generic # Optional: default tag for all pages (fallback)
+    page_urls:
+      - url: https://your/confluence/spaces/SPACE/pages/123/Page+Title
+        tag: my-tag # Optional: overrides top-level tag for this page
+      - url: https://your/confluence/spaces/SPACE/pages/456/Another+Page
+        # no tag — falls back to top-level tag
+    page_ids:
+      - id: 1774209540
+        tag: my-tag # Optional
+      - id: 1234567890
+        # no tag — falls back to top-level tag
 ```
 </details>

@@ -136,13 +152,15 @@ Loads data from Jira issues
 ### Teams Q&A Loader
 Loads enriched Q&A pairs from a JSON file produced by the FAQ enrichment pipeline. Each Q&A pair becomes a single document with one chunk. The skill prefers rephrased questions/answers when available, falling back to originals.

+Each Q&A object in the JSON can optionally include a `tag` field that overrides the skill-level `tag` for that specific chunk, allowing fine-grained tagging within a single file.
+
 ```yaml
 - skill: &TeamsQnALoader
   type: loader
   name: teams-qna-loader
   params:
     file_path: data/processed_output/enriched_qna.json # Required: path to enriched Q&A JSON file
-    tag: teams-faq # Optional: tag for chunks (default: "enriched-qna")
+    tag: teams-faq # Optional: default tag for chunks (default: "enriched-qna"); can be overridden per Q&A object via a "tag" field in the JSON
 ```
 </details>

@@ -180,6 +198,8 @@ Splits text by grouping semantically equivalent chunks together. A bit more advanced
 ### Confluence FAQ Splitter
 Extracts Q&A pairs directly from FAQ `.docx` files exported from Confluence. Each heading that contains a `?` or starts with a problem/question pattern (e.g. "How do I", "I cannot") is treated as a question, and the body content below it becomes the answer. Each Q&A pair is produced as a single atomic chunk. No `file-reader` is needed — this skill reads `.docx` files directly via `python-docx`.

+Each chunk's `document_id` is a SHA-256 hash of the **question text only**, so the ID stays stable even when the answer is updated. This makes it a reliable unique key for Azure AI Search upserts — changed Q&A pairs are re-indexed in place without creating duplicates, and pairs whose answers haven't changed are skipped by the `json-writer` change gate.
+
 All parameters are optional with sensible defaults.

 ```yaml
@@ -201,6 +221,25 @@ All parameters are optional with sensible defaults.
 ```
 </details>

+<details><summary>Writer Skills</summary>
+Capture and optionally gate intermediate pipeline state to a file.
+
+### JSON Writer
+Extracts text content from all chunks and writes it as a sorted JSON array to a file. Useful for inspecting intermediate pipeline state (e.g. after splitting) and as a **per-chunk change-detection gate**: when `checksum_path` is configured, the skill computes a SHA-256 checksum of each **chunk's content** individually and stores the results in a JSON map keyed by `document_id`. On subsequent runs, only chunks whose content has changed (or are new) are passed downstream — unchanged chunks are stripped from their documents, so embedding and indexing are skipped for those chunks only.
+
+This works well with Azure AI Search's key-based upsert — changed documents are re-indexed in place without creating duplicates.
+
+```yaml
+- skill: &JSONWriter
+  type: writer
+  name: json-writer
+  params:
+    output_path: data/pipeline_output.json # Path to the combined output JSON file (default: "data/pipeline_output.json")
+    checksum_path: data/checksums.json # Optional: path to a JSON file storing per-chunk SHA-256 checksums keyed by document_id. Enables per-chunk change detection.
+    skip_downstream_if_unchanged: true # Optional: if true (default) and checksum_path is set, strips unchanged chunks from their documents, skipping their embedding/indexing
+```
+</details>
+
 <details><summary>Embedding</summary>
 Generate embeddings from text. Embeddings is a vector representation of your text data.

@@ -250,6 +289,7 @@ Stores embeddings in an Azure AI Search index.
     document_name: document_name
     embedding: embedding
     overwrite_index: true # true - before storing data, it will remove all the documents from your index. false - will append documents to your index
+    batch_size: 50 # Optional: number of documents uploaded per API call (default: 50, max: 50)
 ```

 ### Chroma
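For reference, a minimal sketch of the checksum map that `json-writer` maintains when `checksum_path` is set. The question and content below are invented for illustration; the keys are chunk `document_id`s (for FAQ chunks, the SHA-256 of the question) and the values are SHA-256 checksums of chunk content:

```python
import hashlib
import json

# Hypothetical chunk keyed by document_id, mirroring what the
# confluence-faq-splitter produces (document_id = sha256(question)).
chunks = {
    hashlib.sha256("How do I reset my password?".encode()).hexdigest():
        "Q: How do I reset my password?\n\nA: Use the self-service portal.",
}

# checksums.json maps document_id -> sha256 of the chunk's content;
# on the next run, a matching checksum means the chunk is stripped.
checksums = {
    doc_id: hashlib.sha256(content.encode("utf-8")).hexdigest()
    for doc_id, content in chunks.items()
}
print(json.dumps(checksums, indent=2))
```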

src/docs2vecs/subcommands/indexer/config/config_schema.yaml

Lines changed: 29 additions & 3 deletions
@@ -36,13 +36,29 @@ definitions:
       required: False
     page_ids:
       type: list
+      required: False
       schema:
-        type: ['string', 'integer']
+        type: dict
+        schema:
+          id:
+            type: ['string', 'integer']
+            required: True
+          tag:
+            type: string
+            required: False
     page_urls:
       type: list
+      required: False
       schema:
-        type: string
-        regex: '^http.*'
+        type: dict
+        schema:
+          url:
+            type: string
+            regex: '^http.*'
+            required: True
+          tag:
+            type: string
+            required: False
     confluence_prefix:
       type: string
       regex: '^http.*'
@@ -109,6 +125,12 @@ definitions:
     output_path:
       type: string
       required: False
+    checksum_path:
+      type: string
+      required: False
+    skip_downstream_if_unchanged:
+      type: boolean
+      required: False
     # ConfluenceFAQSplitter params
     min_heading_level:
       type: integer
@@ -183,6 +205,10 @@ definitions:
       required: False
     overwrite_index:
       type: boolean
+    batch_size:
+      type: integer
+      required: False
+      min: 1
     jql_query:
       type: string
       required: False
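The rule names here (`type`, `schema`, `required`, `regex`, `min`) match Cerberus conventions. Assuming Cerberus is indeed the validator behind this schema (an assumption, not confirmed by the diff), the new tagged-entry shape can be sanity-checked with a few lines; the sample IDs are taken from the README example:

```python
from cerberus import Validator

# Fragment of the updated page_ids rules, transcribed from the YAML above.
page_ids_schema = {
    "page_ids": {
        "type": "list",
        "required": False,
        "schema": {
            "type": "dict",
            "schema": {
                "id": {"type": ["string", "integer"], "required": True},
                "tag": {"type": "string", "required": False},
            },
        },
    }
}

v = Validator(page_ids_schema)
print(v.validate({"page_ids": [{"id": 1774209540, "tag": "my-tag"}, {"id": "1234567890"}]}))  # True
print(v.validate({"page_ids": [1774209540]}))  # False: bare IDs no longer validate
print(v.errors)
```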

src/docs2vecs/subcommands/indexer/skills/confluence_faq_splitter_skill.py

Lines changed: 5 additions & 1 deletion
@@ -122,7 +122,11 @@ def run(self, input: Optional[List[Document]] = None) -> List[Document]:
         combined_text = f"Q: {question}\n\nA: {answer}{links_text}"

         chunk = Chunk()
-        chunk.document_id = hashlib.sha256(combined_text.encode()).hexdigest()
+        # Hash document_id from question only — the question is the
+        # stable identity of a Q&A pair, so the ID stays the same
+        # even when the answer is updated. This makes it a reliable
+        # unique key for Azure AI Search upserts.
+        chunk.document_id = hashlib.sha256(question.encode()).hexdigest()
         chunk.document_name = Path(doc.filename).name
         chunk.tag = doc.tag
         chunk.content = combined_text  # Full Q&A for retrieval
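A small standalone illustration of the property this change buys (the Q&A text is made up; the `Q:/A:` layout mirrors `combined_text` above): the ID stays fixed across answer edits, while the content checksum used by `json-writer` still changes, so the updated pair is re-embedded and upserted under the same key.

```python
import hashlib

question = "How do I reset my password?"
old_answer = "Use the self-service portal."
new_answer = "Use the self-service portal or contact IT."

# document_id depends only on the question, so editing the answer
# leaves the Azure AI Search key unchanged (update in place, no duplicate).
doc_id_before = hashlib.sha256(question.encode()).hexdigest()
doc_id_after = hashlib.sha256(question.encode()).hexdigest()
assert doc_id_before == doc_id_after

# The per-chunk content checksum does change, so the edited pair
# is not stripped by the change gate and gets re-embedded.
old_checksum = hashlib.sha256(f"Q: {question}\n\nA: {old_answer}".encode()).hexdigest()
new_checksum = hashlib.sha256(f"Q: {question}\n\nA: {new_answer}".encode()).hexdigest()
assert old_checksum != new_checksum
```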
Lines changed: 104 additions & 22 deletions
@@ -1,14 +1,12 @@
-"""Skill that extracts chunk content from Documents and writes it to a JSON file.
+"""Writes chunk content to a JSON file with optional per-document change detection.

-Use this skill at any point in a pipeline to capture intermediate state,
-e.g. after a splitter, so the output can be checksummed for change detection
-without running expensive downstream skills like embedding and indexing.
-
-Only the chunk text content is written as a sorted JSON array of strings —
-volatile metadata like filenames, document IDs, and timestamps are excluded
-so the checksum remains stable when the underlying text hasn't changed.
+Outputs a sorted JSON array of chunk text strings (metadata excluded).
+When ``checksum_path`` is set, per-chunk SHA-256 checksums (keyed by
+``document_id``) gate downstream processing — only changed or new chunks
+are kept; unchanged chunks are stripped from their documents.
 """

+import hashlib
 import json
 import os
 from typing import List, Optional
@@ -19,47 +17,131 @@


 class JSONWriterSkill(IndexerSkill):
-    """Extract text content from all chunks and write it as a sorted JSON array.
-
-    The output is a flat list of strings (one per non-empty chunk), sorted
-    alphabetically for deterministic checksumming. Documents are passed
-    through unchanged for downstream skills.
+    """Write chunk text as a sorted JSON array with per-chunk change gating.

     Config params:
-        output_path (str): Path to the output JSON file (default:
-                           ``data/pipeline_output.json``). Parent
-                           directories are created automatically.
+        output_path (str): Output JSON path (default: ``data/pipeline_output.json``).
+        checksum_path (str, optional): JSON file for per-chunk SHA-256 checksums
+            keyed by ``document_id``.
+        skip_downstream_if_unchanged (bool, optional): Strip unchanged chunks
+            so downstream skills skip them (default: true).
     """

     def __init__(self, skill_config: dict, global_config: Config) -> None:
         super().__init__(skill_config, global_config)
         self._output_path = self._config.get("output_path", "data/pipeline_output.json")
+        self._checksum_path = self._config.get("checksum_path", None)
+        self._skip_if_unchanged = self._config.get("skip_downstream_if_unchanged", True)
+
+    def _compute_checksum(self, content_bytes: bytes) -> str:
+        return hashlib.sha256(content_bytes).hexdigest()
+
+    def _read_stored_checksums(self) -> dict:
+        """Return stored {document_id: checksum} map, or empty dict."""
+        if self._checksum_path and os.path.isfile(self._checksum_path):
+            try:
+                with open(self._checksum_path, "r", encoding="utf-8") as f:
+                    data = json.load(f)
+                if isinstance(data, dict):
+                    return data
+                # Legacy format — cannot migrate, start fresh.
+                self.logger.warning(
+                    "Checksum file contains legacy format — starting fresh."
+                )
+            except Exception as e:
+                self.logger.warning(f"Failed to read stored checksums: {e}")
+        return {}
+
+    def _write_checksums(self, checksums: dict) -> None:
+        """Save per-document checksums to disk."""
+        if self._checksum_path:
+            os.makedirs(os.path.dirname(self._checksum_path) or ".", exist_ok=True)
+            with open(self._checksum_path, "w", encoding="utf-8") as f:
+                json.dump(checksums, f, indent=2, ensure_ascii=False)
+
+    def _compute_chunk_checksum(self, chunk) -> str:
+        """SHA-256 checksum of a single chunk's content."""
+        payload = (chunk.content or "").encode("utf-8")
+        return self._compute_checksum(payload)

     def run(self, input: Optional[List[Document]] = None) -> List[Document]:
         if not input:
             self.logger.warning("JSONWriterSkill received no input — nothing to write.")
             return input or []

-        # Collect only the content from every chunk across all documents
+        # Collect chunk content across all documents
         contents = []
         for doc in input:
             for chunk in doc.chunks:
                 if chunk.content:
                     contents.append(chunk.content)

-        # Sort for deterministic output (stable checksums)
-        contents.sort()
+        contents.sort()  # deterministic order for stable checksums

         os.makedirs(os.path.dirname(self._output_path) or ".", exist_ok=True)

-        with open(self._output_path, "w", encoding="utf-8") as f:
-            json.dump(contents, f, indent=2, ensure_ascii=False)
+        json_bytes = json.dumps(contents, indent=2, ensure_ascii=False).encode("utf-8")
+
+        with open(self._output_path, "wb") as f:
+            f.write(json_bytes)

         self.logger.info(
             "Wrote %d chunk content entries to %s",
             len(contents),
             self._output_path,
         )

-        # Pass-through: downstream skills can still consume the documents
+        # ── Per-chunk checksum-based change gate ────────────────
+        # Each chunk is keyed by its document_id (e.g. question hash).
+        # Only chunks whose content has changed (or are new) are kept;
+        # unchanged chunks are removed so downstream skills skip them.
+        if self._checksum_path:
+            old_checksums = self._read_stored_checksums()
+            new_checksums: dict = {}
+
+            changed_count = 0
+            unchanged_count = 0
+
+            for doc in input:
+                unchanged_chunks = set()
+
+                for chunk in doc.chunks:
+                    doc_id = chunk.document_id or chunk.chunk_id or "unknown"
+                    chunk_checksum = self._compute_chunk_checksum(chunk)
+                    new_checksums[doc_id] = chunk_checksum
+
+                    old_checksum = old_checksums.get(doc_id)
+
+                    if old_checksum and chunk_checksum == old_checksum and self._skip_if_unchanged:
+                        unchanged_chunks.add(chunk)
+                        unchanged_count += 1
+                        self.logger.debug(
+                            "Chunk %s unchanged — will be stripped.",
+                            doc_id[:12],
+                        )
+                    else:
+                        changed_count += 1
+                        if old_checksum:
+                            self.logger.debug(
+                                "Chunk %s changed (old: %s, new: %s).",
+                                doc_id[:12],
+                                old_checksum[:12],
+                                chunk_checksum[:12],
+                            )
+                        else:
+                            self.logger.debug("Chunk %s is new.", doc_id[:12])
+
+                # Remove unchanged chunks from this document
+                if unchanged_chunks:
+                    doc.chunks -= unchanged_chunks
+
+            self.logger.info(
+                "Change detection: %d changed/new, %d unchanged out of %d chunks.",
+                changed_count,
+                unchanged_count,
+                changed_count + unchanged_count,
+            )
+
+            self._write_checksums(new_checksums)
+
         return input
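To make the gate's behavior across runs concrete, here is a minimal self-contained sketch of the same logic using plain dicts in place of the pipeline's `Document`/`Chunk` objects. The stand-in names and values are invented; the real skill additionally assumes `doc.chunks` supports set subtraction (`doc.chunks -= unchanged_chunks`), i.e. chunks are hashable and held in a set.

```python
import hashlib

def change_gate(chunks, stored):
    """Per-chunk gate: return (ids_to_keep, new_checksums) for a {doc_id: content} map."""
    new_checksums = {}
    keep = set()
    for doc_id, content in chunks.items():
        checksum = hashlib.sha256((content or "").encode("utf-8")).hexdigest()
        new_checksums[doc_id] = checksum
        if stored.get(doc_id) != checksum:
            keep.add(doc_id)  # new or changed: pass to embedding/indexing
    return keep, new_checksums

# Run 1: no stored checksums yet, everything passes downstream.
run1 = {"q1": "Q: A?\n\nA: one", "q2": "Q: B?\n\nA: two"}
keep, stored = change_gate(run1, {})
assert keep == {"q1", "q2"}

# Run 2: only q2's answer was edited, so q1 is stripped.
run2 = {"q1": "Q: A?\n\nA: one", "q2": "Q: B?\n\nA: two (updated)"}
keep, stored = change_gate(run2, stored)
assert keep == {"q2"}
```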
