Summary
Introduce a local SQLite-backed search index for Ten Second Tom that sits on top of the existing filesystem-based storage providers (Default + Obsidian, etc.). The goal is to provide fast, powerful full-text and (optionally) semantic search across notes, summaries, and audio transcripts without changing the core “Markdown + WAV on disk” storage model.
The SQLite database will be treated as an index/cache that can be rebuilt from the filesystem at any time.
Motivation
Current characteristics:
- Ten Second Tom persists memories primarily as Markdown files on disk:
- Daily notes, summaries, and weekly reviews.
- Recording transcripts and associated metadata.
- Audio is stored as `.wav` files alongside Markdown metadata.
- Storage is provider-based (default filesystem provider, Obsidian vault provider, etc.), and the on-disk text is considered the source of truth.
Limitations today:
- Search requires crawling/parsing files each time or is limited to basic filename/date patterns.
- There’s no notion of ranked full-text search or semantic similarity search.
- As the number of notes/recordings grows, search will degrade without an index.
What we want:
- Fast, ranked full-text search across all text content.
- Optional semantic (vector) search later, while staying fully local.
- Zero change to the existing Markdown/WAV storage model and provider abstractions.
Goals
- Add a SQLite-backed search/index layer:
- Single `.db` file per user installation (e.g., `~/.ten-second-tom/index.db` or `~/ten-second-tom/index/tst.db`).
- Treat SQLite as an index/cache, not the canonical data store.
- Use SQLite FTS5 for full-text search:
- Search across content, summaries, tags, and other textual metadata.
- Support ranked results, phrase search, and prefix search.
- Integrate with the existing storage provider model:
- Default filesystem provider and Obsidian provider continue to own the actual files.
- The index listens to writes/updates and stays in sync.
- Provide a CLI experience for search:
- `tom search <query>` as the main entry point.
- Filter by date range, kind, tags, presence of audio, etc.
- Provide a way to rebuild/repair the index:
- `tom index rebuild` (or similar) to rescan the filesystem into SQLite.
Non-goals (for this issue)
- Changing the canonical storage model away from Markdown + WAV on disk.
- Introducing or depending on external services (Meilisearch, Qdrant, etc.).
- Defining a complex UI beyond basic CLI commands and options.
- Designing a “query language” beyond reasonable flags and FTS-style queries.
- Implementing semantic/vector search end-to-end (can be a follow-up once the core index exists).
Proposed Design
1. High-level architecture
- Keep existing storage providers as-is:
- Default filesystem provider (Markdown + WAV in a local directory).
- Obsidian provider (Markdown inside a vault).
- Add a new component, e.g. `ISearchIndex`, which encapsulates all index logic:
- Responsible for reading/writing to SQLite.
- Exposed to the rest of the app via dependency injection.
- Wire search indexing into the existing command/handler pipeline:
- After successful write/update of a note/summary/transcript, emit a notification or domain event that the index can consume.
- Background tasks are optional; initial implementation can be synchronous or “eventually consistent” within the command handler.
Conceptually:
- `IStorageProvider` → canonical Markdown/WAV operations.
- `ISearchIndex` → SQLite-based index over those entries.
2. SQLite database location
- Default path:
- Option A: `~/.ten-second-tom/index.db`
- Option B: `~/ten-second-tom/index/tst.db`
- DB path should be configurable via config/env/CLI flag, but not required for normal usage.
- The app should create directories as needed on first use.
3. Data model
3.1 Core table: entries
Represents each searchable “unit” (note, summary, weekly review, transcript, etc.).
```sql
CREATE TABLE entries (
  id TEXT PRIMARY KEY,            -- stable identifier, e.g. "note/2025-11-24_1"
  kind TEXT NOT NULL,             -- e.g. note_raw, note_summary, weekly_review, recording_transcript
  title TEXT,                     -- parsed from first H1 or derived from filename
  date TEXT NOT NULL,             -- YYYY-MM-DD (primary date for filtering)
  created_at TEXT NOT NULL,       -- ISO-8601
  updated_at TEXT NOT NULL,       -- ISO-8601
  path TEXT NOT NULL,             -- relative path to the .md file
  storage_provider TEXT NOT NULL, -- e.g. DefaultFileSystem, Obsidian
  tags TEXT,                      -- comma-separated or JSON-encoded
  has_audio INTEGER NOT NULL DEFAULT 0,
  audio_path TEXT,                -- relative path to the associated .wav, if any
  duration_seconds INTEGER,
  stt_engine TEXT                 -- e.g. whisper-cpp, openai-whisper, etc.
);
```
Notes:
- `id` should remain stable regardless of future refactors; ideally derived from logical identity, not transient paths.
- `kind` and `storage_provider` allow for flexible filtering and joining.
3.2 Full-text index: entry_fts (FTS5)
Backed by SQLite FTS5. The exact schema may evolve, but an initial version:
```sql
CREATE VIRTUAL TABLE entry_fts USING fts5(
  id,
  content,
  summary,
  tags,
  kind,
  tokenize = 'unicode61'
);
```
Indexing behavior:
- `content`: raw Markdown body with frontmatter stripped, or a text-only representation.
- `summary`: summarized content if present; otherwise empty string.
- `tags`: all tags concatenated, or a normalized representation (e.g., `tag:foo tag:bar`).
- `kind`: helps boost/weight certain kinds of entries in scoring, if desired.
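For instance, the frontmatter-stripping step for `content` could look like the following sketch (Python purely for illustration; the real logic would live in the .NET indexing code, and `strip_frontmatter` is a hypothetical helper):

```python
import re

def strip_frontmatter(markdown: str) -> str:
    """Remove a leading YAML frontmatter block (--- ... ---) if present."""
    match = re.match(r"\A---\s*\n.*?\n---\s*\n", markdown, flags=re.DOTALL)
    return markdown[match.end():] if match else markdown

# Frontmatter is dropped; the body starting at the first H1 is what gets indexed.
body = strip_frontmatter("---\ntags: [hvac]\n---\n# Heat pump notes\nBody text.\n")
```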
Insert/update logic:
- On new/updated entry: upsert into `entries`, then upsert into `entry_fts` with the same `id`.
- On deletion: remove from both `entries` and `entry_fts`.
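This pairing can be sketched as follows (Python's built-in `sqlite3` used for illustration only; in the app this logic would live in the SQLite index implementation, and the schema is trimmed to a few columns). Note that FTS5 tables have no UNIQUE constraints, so the "upsert" into `entry_fts` is a delete-then-insert:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE entries (
  id TEXT PRIMARY KEY, kind TEXT NOT NULL, title TEXT,
  date TEXT NOT NULL, path TEXT NOT NULL
);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content, summary, tags, kind,
                                          tokenize = 'unicode61');
""")

def index_entry(entry: dict) -> None:
    """Upsert the metadata row, then replace the matching FTS row."""
    db.execute(
        """INSERT INTO entries (id, kind, title, date, path)
           VALUES (:id, :kind, :title, :date, :path)
           ON CONFLICT(id) DO UPDATE SET
             kind = excluded.kind, title = excluded.title,
             date = excluded.date, path = excluded.path""",
        entry,
    )
    db.execute("DELETE FROM entry_fts WHERE id = :id", entry)  # FTS5 has no upsert
    db.execute(
        """INSERT INTO entry_fts (id, content, summary, tags, kind)
           VALUES (:id, :content, :summary, :tags, :kind)""",
        entry,
    )
    db.commit()

def remove_entry(entry_id: str) -> None:
    """Delete the entry from both the metadata table and the FTS index."""
    db.execute("DELETE FROM entries WHERE id = ?", (entry_id,))
    db.execute("DELETE FROM entry_fts WHERE id = ?", (entry_id,))
    db.commit()

index_entry({"id": "note/2025-11-24_1", "kind": "note_raw", "title": "Daily note",
             "date": "2025-11-24", "path": "notes/2025-11-24.md",
             "content": "Compared heat pump models.", "summary": "", "tags": "tag:hvac"})
```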
4. Search behavior and CLI UX
Base CLI:
```
tom search "heat pump water heater"
```
Options (examples):
- Filter by date range: `--from 2025-01-01`, `--to 2025-03-31`
- Filter by kind: `--kind note_raw`, `--kind recording_transcript`
- Filter by storage provider: `--provider Obsidian`
- Filter by audio: `--has-audio`
- Limit results: `--limit 20`
Example query pipeline:
1. Parse CLI options into a `SearchQuery` object.
2. Translate into:
```sql
SELECT e.*, bm25(entry_fts) AS score
FROM entry_fts
JOIN entries e ON e.id = entry_fts.id
WHERE entry_fts MATCH @ftsQuery
  AND e.date >= @fromDate
  AND e.date <= @toDate
  -- additional filters...
ORDER BY score
LIMIT @limit;
```
3. Render results:
- Show date, title, kind, and an optional snippet.
- Optionally show the score for debugging.
The first version can keep search semantics simple and opinionated; we don’t need to expose raw FTS syntax to end users beyond basic phrases.
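The translated query can be exercised end-to-end with a small self-contained sketch (Python's `sqlite3` for illustration only; schema trimmed, sample data invented). FTS5's `bm25()` returns lower-is-better scores, which is why the plain ascending `ORDER BY score` puts the best match first:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE entries (id TEXT PRIMARY KEY, title TEXT, date TEXT NOT NULL, kind TEXT NOT NULL);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content, tokenize = 'unicode61');
INSERT INTO entries VALUES
  ('note/2025-01-10_1', 'HVAC research', '2025-01-10', 'note_raw'),
  ('note/2025-02-02_1', 'Grocery list',  '2025-02-02', 'note_raw');
INSERT INTO entry_fts VALUES
  ('note/2025-01-10_1', 'Compared heat pump water heater models and rebates.'),
  ('note/2025-02-02_1', 'Milk, eggs, and a water filter.');
""")

# Both rows match "water", but only the first also matches the phrase
# "heat pump", so it gets the better (more negative) bm25 score.
rows = db.execute(
    """SELECT e.id, e.title, bm25(entry_fts) AS score
       FROM entry_fts
       JOIN entries e ON e.id = entry_fts.id
       WHERE entry_fts MATCH ?
         AND e.date >= ? AND e.date <= ?
       ORDER BY score
       LIMIT ?""",
    ('"heat pump" OR water', "2025-01-01", "2025-03-31", 20),
).fetchall()
```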
5. Index lifecycle
5.1 Automatic updates
- After write operations (new note, update note, recording transcript, etc.): call `ISearchIndex.IndexAsync(entry)`.
- On delete: call `ISearchIndex.RemoveAsync(id)`.
5.2 Rebuild
CLI command:
```
tom index rebuild
```
Behavior:
- Drops and recreates the `entries` and `entry_fts` tables (or truncates them).
- Enumerates all known entries via the active `IStorageProvider`.
- Parses frontmatter and body for each entry.
- Re-populates the core table + FTS table.
This makes the index fully recoverable from the filesystem and is the escape hatch for bugs/migrations.
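A rebuild along these lines could be sketched as follows (Python for illustration; the id scheme and file enumeration are simplified placeholders). Running the whole rebuild in one transaction means a failed rebuild rolls back to the previous index state:

```python
import sqlite3
import tempfile
from pathlib import Path

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE entries (id TEXT PRIMARY KEY, path TEXT NOT NULL);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content);
""")

def rebuild_index(root: Path) -> int:
    """Truncate both index tables, rescan *.md files, and repopulate,
    all in a single transaction."""
    with db:
        db.execute("DELETE FROM entries")
        db.execute("DELETE FROM entry_fts")
        count = 0
        for md_path in sorted(root.rglob("*.md")):
            entry_id = str(md_path.relative_to(root))  # placeholder id scheme
            db.execute("INSERT INTO entries VALUES (?, ?)", (entry_id, str(md_path)))
            db.execute("INSERT INTO entry_fts VALUES (?, ?)",
                       (entry_id, md_path.read_text(encoding="utf-8")))
            count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "2025-11-24.md").write_text("# Daily note\nDiscussed heat pumps.")
    n = rebuild_index(vault)
    n = rebuild_index(vault)  # rebuilding twice still leaves one row per file
```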
Implementation Plan (incremental)
Phase 1: Minimal FTS-backed search
- Introduce an `ISearchIndex` abstraction and a concrete `SqliteSearchIndex` implementation.
- Add the SQLite dependency:
- Use a cross-platform .NET SQLite provider (e.g. `Microsoft.Data.Sqlite`).
- Create schema migrations:
- On startup, ensure the DB file exists.
- Apply or upgrade the schema (simple versioning via a `schema_version` table).
- Add indexing integration:
- Add a minimal internal representation of an entry for indexing (e.g. `IndexableEntry`).
- Hook into write/update/delete paths to keep the index up to date.
- Implement `tom search` using FTS5:
- Basic query string.
- Optional `--from`, `--to`, `--limit` flags.
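The simple `schema_version` approach could be sketched like this (Python/`sqlite3` for illustration; the migration bodies are hypothetical and trimmed):

```python
import sqlite3

# Ordered migration scripts; the schema version equals the list index + 1.
MIGRATIONS = [
    # v1: initial schema (trimmed for illustration).
    """CREATE TABLE entries (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
       CREATE VIRTUAL TABLE entry_fts USING fts5(id, content);""",
    # v2: a hypothetical later migration.
    "ALTER TABLE entries ADD COLUMN tags TEXT;",
]

def migrate(db: sqlite3.Connection) -> int:
    """Apply any pending migrations, recording progress in schema_version."""
    db.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = db.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, script in enumerate(MIGRATIONS, start=1):
        if version > current:
            db.executescript(script)
            db.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    db.commit()
    return max(current, len(MIGRATIONS))

db = sqlite3.connect(":memory:")
applied = migrate(db)   # brings a fresh DB to the latest version
applied = migrate(db)   # calling again is a no-op
```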
Phase 2: Rich filters and robustness
- Add filters for `kind`, `storage_provider`, `has_audio`, etc.
- Add `tom index rebuild` and possibly `tom index vacuum`.
- Introduce logging/telemetry around index operations (for debugging).
- Document usage and configuration in the README.
Phase 3 (follow-up): Optional semantic/vector search
(Separate issue, once core index is stable.)
- Add an `entry_embeddings` table using sqlite-vec or similar.
- Define an embedding pipeline and configuration.
- Extend `ISearchIndex` with `SearchSemanticAsync`.
- Add a `--semantic` flag to `tom search`.
Research & open questions
The following items need investigation and decisions before/during implementation.
1. SQLite provider & distribution
- Which .NET SQLite library to standardize on?
- Likely `Microsoft.Data.Sqlite` due to good cross-platform support and integration.
- Confirm behavior on macOS, Linux, and Windows.
- How to distribute it cleanly with the existing CLI packaging (Homebrew, etc.)?
- Confirm there are no platform-specific runtime issues.
2. FTS5 configuration
- Confirm FTS5 availability in the chosen SQLite provider.
- Tokenization strategy: is `unicode61` sufficient?
- Search semantics: how much of the FTS5 query syntax (phrases, prefixes, ranking) to expose.
3. Identifier strategy
- How should `entries.id` be generated?
- How to handle renames/moves of files?
4. Integration with existing slices and MediatR
- What is the cleanest way to hook indexing into existing command/query pipelines?
- Do we want indexing to be synchronous within handlers or eventually consistent?
This has UX vs complexity trade-offs that need to be evaluated.
5. Index location and configuration
- Finalize the default DB path: `~/.ten-second-tom/index.db` vs `~/ten-second-tom/index/tst.db`.
- Configuration: how the path is overridden via config/env/CLI flag.
6. Rebuild strategy and performance
- For large vaults: what is the expected cost of `tom index rebuild`?
- Should we consider incremental reconstruction strategies (e.g. only missing entries) or keep it simple for now?
7. Future semantic/vector search (follow-up design)
- Which vector extension to standardize on (e.g. `sqlite-vec`, `sqlite-vector`)?
- Installation, licensing, and cross-platform story.
- Embedding model choice:
- Local-only vs remote API.
- Integration with the existing LLM provider abstractions.
- Cost/performance implications:
- How/when to generate embeddings (e.g. on-demand vs background job).
- How to handle re-embedding when models or summarization strategies change.
Acceptance criteria
- A SQLite database is created and maintained automatically on local disk.
- `tom search "<query>"` returns meaningful, ranked results using FTS5 across notes, summaries, and transcripts.
- Index updates automatically when new entries are created or existing ones are updated/deleted.
- `tom index rebuild` fully reconstructs the index from the existing storage provider(s) without manual intervention.
- No changes to the canonical Markdown/WAV storage model or existing providers.
- The feature works on macOS, Linux, and Windows in typical CLI scenarios.