feat: Add SQLite-backed local search index (FTS + optional vectors) #61

@sirkirby

Summary

Introduce a local SQLite-backed search index for Ten Second Tom that sits on top of the existing filesystem-based storage providers (Default, Obsidian, etc.). The goal is fast, powerful full-text and (optionally) semantic search across notes, summaries, and audio transcripts without changing the core “Markdown + WAV on disk” storage model.

The SQLite database will be treated as an index/cache that can be rebuilt from the filesystem at any time.


Motivation

Current characteristics:

  • Ten Second Tom persists memories primarily as Markdown files on disk:
    • Daily notes, summaries, and weekly reviews.
    • Recording transcripts and associated metadata.
  • Audio is stored as .wav files alongside Markdown metadata.
  • Storage is provider-based (default filesystem provider, Obsidian vault provider, etc.), and the on-disk text is considered the source of truth.

Limitations today:

  • Search requires crawling/parsing files each time or is limited to basic filename/date patterns.
  • There’s no notion of ranked full-text search or semantic similarity search.
  • As the number of notes/recordings grows, search will degrade without an index.

What we want:

  • Fast, ranked full-text search across all text content.
  • Optional semantic (vector) search later, while staying fully local.
  • Zero change to the existing Markdown/WAV storage model and provider abstractions.

Goals

  • Add a SQLite-backed search/index layer:
    • Single .db file per user installation (e.g., ~/.ten-second-tom/index.db or ~/ten-second-tom/index/tst.db).
    • Treat SQLite as an index/cache, not the canonical data store.
  • Use SQLite FTS5 for full-text search:
    • Search across content, summaries, tags, and other textual metadata.
    • Support ranked results, phrase search, and prefix search.
  • Integrate with the existing storage provider model:
    • Default filesystem provider and Obsidian provider continue to own the actual files.
    • The index listens to writes/updates and stays in sync.
  • Provide a CLI experience for search:
    • tom search <query> as the main entry point.
    • Filter by date range, kind, tags, presence of audio, etc.
  • Provide a way to rebuild/repair the index:
    • tom index rebuild (or similar) to rescan the filesystem into SQLite.

Non-goals (for this issue)

  • Changing the canonical storage model away from Markdown + WAV on disk.
  • Introducing or depending on external services (Meilisearch, Qdrant, etc.).
  • Defining a complex UI beyond basic CLI commands and options.
  • Designing a “query language” beyond reasonable flags and FTS-style queries.
  • Implementing semantic/vector search end-to-end (can be a follow-up once the core index exists).

Proposed Design

1. High-level architecture

  • Keep existing storage providers as-is:
    • Default filesystem provider (Markdown + WAV in a local directory).
    • Obsidian provider (Markdown inside a vault).
  • Add a new component, e.g. ISearchIndex, which encapsulates all index logic:
    • Responsible for reading/writing to SQLite.
    • Exposed to the rest of the app via dependency injection.
  • Wire search indexing into the existing command/handler pipeline:
    • After successful write/update of a note/summary/transcript, emit a notification or domain event that the index can consume.
    • Background tasks are optional; initial implementation can be synchronous or “eventually consistent” within the command handler.

Conceptually:

  • IStorageProvider → canonical Markdown/WAV operations.
  • ISearchIndex → SQLite-based index over those entries.

2. SQLite database location

  • Default path:
    • Option A: ~/.ten-second-tom/index.db
    • Option B: ~/ten-second-tom/index/tst.db
  • DB path should be configurable via config/env/CLI flag, but not required for normal usage.
  • The app should create directories as needed on first use.

3. Data model

3.1 Core table: entries

Represents each searchable “unit” (note, summary, weekly review, transcript, etc.).

CREATE TABLE entries (
    id               TEXT PRIMARY KEY,         -- stable identifier, e.g. "note/2025-11-24_1"
    kind             TEXT NOT NULL,            -- e.g. note_raw, note_summary, weekly_review, recording_transcript
    title            TEXT,                     -- parsed from first H1 or derived from filename
    date             TEXT NOT NULL,            -- YYYY-MM-DD (primary date for filtering)
    created_at       TEXT NOT NULL,            -- ISO-8601
    updated_at       TEXT NOT NULL,            -- ISO-8601
    path             TEXT NOT NULL,            -- relative path to the .md file
    storage_provider TEXT NOT NULL,            -- e.g. DefaultFileSystem, Obsidian
    tags             TEXT,                     -- comma-separated or JSON-encoded
    has_audio        INTEGER NOT NULL DEFAULT 0,
    audio_path       TEXT,                     -- relative path to associated .wav if any
    duration_seconds INTEGER,
    stt_engine       TEXT                      -- e.g. whisper-cpp, openai-whisper, etc.
);

Notes:

  • id should remain stable regardless of future refactors; ideally derived from logical identity, not transient paths.
  • kind and storage_provider allow for flexible filtering and joining.
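As a sanity check, the schema and the upsert path can be exercised directly with Python's sqlite3 (illustrative only — the real implementation would live in the .NET SqliteSearchIndex, and the row values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entries (
    id               TEXT PRIMARY KEY,
    kind             TEXT NOT NULL,
    title            TEXT,
    date             TEXT NOT NULL,
    created_at       TEXT NOT NULL,
    updated_at       TEXT NOT NULL,
    path             TEXT NOT NULL,
    storage_provider TEXT NOT NULL,
    tags             TEXT,
    has_audio        INTEGER NOT NULL DEFAULT 0,
    audio_path       TEXT,
    duration_seconds INTEGER,
    stt_engine       TEXT
)""")

def upsert_entry(row):
    # ON CONFLICT on the primary key makes re-indexing idempotent:
    # re-running the indexer over an unchanged file is harmless.
    conn.execute("""
        INSERT INTO entries (id, kind, title, date, created_at, updated_at,
                             path, storage_provider, tags, has_audio)
        VALUES (:id, :kind, :title, :date, :created_at, :updated_at,
                :path, :storage_provider, :tags, :has_audio)
        ON CONFLICT(id) DO UPDATE SET
            title      = excluded.title,
            updated_at = excluded.updated_at,
            tags       = excluded.tags,
            has_audio  = excluded.has_audio
    """, row)

row = dict(id="note/2025-11-24_1", kind="note_raw", title="Heat pump notes",
           date="2025-11-24", created_at="2025-11-24T09:00:00Z",
           updated_at="2025-11-24T09:00:00Z", path="notes/2025-11-24.md",
           storage_provider="DefaultFileSystem", tags="hvac,home", has_audio=0)
upsert_entry(row)
row["updated_at"] = "2025-11-24T10:00:00Z"
upsert_entry(row)  # idempotent: still exactly one row for this id
```

Requires SQLite 3.24+ for the `ON CONFLICT ... DO UPDATE` syntax, which any modern provider bundles.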

3.2 Full-text index: entry_fts (FTS5)

Backed by SQLite FTS5. The exact schema may evolve, but an initial version:

CREATE VIRTUAL TABLE entry_fts USING fts5(
    id,
    content,
    summary,
    tags,
    kind,
    tokenize = 'unicode61'
);

Indexing behavior:

  • content:

    • Raw markdown body with frontmatter stripped.
    • Or a text-only representation.
  • summary:

    • Summarized content if present; otherwise empty string.
  • tags:

    • All tags concatenated, or a normalized representation (e.g., tag:foo tag:bar).
  • kind:

    • Helps boost/weight certain kinds of entries in scoring, if desired.

Insert/update logic:

  • On new/updated entry:

    • Upsert into entries.
    • Upsert into entry_fts with the same id.
  • On deletion:

    • Remove from both entries and entry_fts.
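A minimal Python sqlite3 sketch of this logic. One subtlety: FTS5 tables have no primary key, so the “upsert” is a delete followed by an insert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE VIRTUAL TABLE entry_fts USING fts5(
    id, content, summary, tags, kind, tokenize = 'unicode61'
)""")

def index_entry(entry_id, content, summary="", tags="", kind="note_raw"):
    # FTS5 has no PRIMARY KEY / ON CONFLICT, so upsert = delete + insert.
    conn.execute("DELETE FROM entry_fts WHERE id = ?", (entry_id,))
    conn.execute(
        "INSERT INTO entry_fts (id, content, summary, tags, kind) VALUES (?, ?, ?, ?, ?)",
        (entry_id, content, summary, tags, kind))

def remove_entry(entry_id):
    conn.execute("DELETE FROM entry_fts WHERE id = ?", (entry_id,))

index_entry("note/1", "Looked at heat pump water heaters today.")
index_entry("note/1", "Compared heat pump water heater quotes.")  # re-index: no duplicate row

hits = conn.execute(
    "SELECT id FROM entry_fts WHERE entry_fts MATCH ?", ('"heat pump"',)
).fetchall()
```

The `DELETE ... WHERE id = ?` is a full-table scan on an FTS5 table; if that ever matters, a contentless/external-content FTS5 layout keyed by rowid would avoid it, but that is an optimization for later.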

4. Search behavior and CLI UX

Base CLI:

tom search "heat pump water heater"

Options (examples):

  • Filter by date range:

    • --from 2025-01-01
    • --to 2025-03-31
  • Filter by kind:

    • --kind note_raw
    • --kind recording_transcript
  • Filter by storage provider:

    • --provider Obsidian
  • Filter by audio:

    • --has-audio
  • Limit results:

    • --limit 20

Example query pipeline:

  1. Parse CLI options into a SearchQuery object.

  2. Translate into:

    SELECT e.*, bm25(entry_fts) AS score
    FROM entry_fts
    JOIN entries e ON e.id = entry_fts.id
    WHERE entry_fts MATCH @ftsQuery
      AND e.date >= @fromDate
      AND e.date <= @toDate
      -- additional filters...
    ORDER BY score              -- bm25() is lower for better matches, so ascending ranks best first
    LIMIT @limit;
  3. Render results:

    • Show date, title, kind, optional snippet.
    • Optionally show score for debugging.

The first version can keep search semantics simple and opinionated; we don’t need to expose raw FTS syntax to end users beyond basic phrases.
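The pipeline above, sketched end-to-end with Python's sqlite3 (the sample data and the SearchQuery-to-SQL mapping are invented for illustration; the schemas are trimmed to the columns the query touches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (
    id TEXT PRIMARY KEY, kind TEXT NOT NULL, title TEXT,
    date TEXT NOT NULL, path TEXT NOT NULL
);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content, summary, tags, kind);
""")
conn.execute("INSERT INTO entries VALUES ('note/1','note_raw','HVAC','2025-02-10','notes/2025-02-10.md')")
conn.execute("INSERT INTO entries VALUES ('note/2','note_raw','Garden','2024-06-01','notes/2024-06-01.md')")
conn.execute("INSERT INTO entry_fts (id, content) VALUES ('note/1','heat pump water heater quotes')")
conn.execute("INSERT INTO entry_fts (id, content) VALUES ('note/2','planted tomatoes')")

def search(fts_query, from_date="0000-00-00", to_date="9999-12-31", limit=20):
    # bm25() returns lower (more negative) values for better matches,
    # so ORDER BY score ascending puts the best hits first.
    return conn.execute("""
        SELECT e.id, e.date, e.title, bm25(entry_fts) AS score
        FROM entry_fts
        JOIN entries e ON e.id = entry_fts.id
        WHERE entry_fts MATCH :q
          AND e.date >= :from_date AND e.date <= :to_date
        ORDER BY score
        LIMIT :limit
    """, dict(q=fts_query, from_date=from_date, to_date=to_date, limit=limit)).fetchall()

results = search('"heat pump"', from_date="2025-01-01", to_date="2025-03-31")
```

Rendering (date, title, kind, snippet) is then a straightforward formatting pass over `results`; FTS5's built-in `snippet()` function can supply the excerpt.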

5. Index lifecycle

5.1 Automatic updates

  • After write operations (new note, update note, recording transcript, etc.):

    • The relevant command handler (or a domain event handler) calls ISearchIndex.IndexAsync(entry).
  • On delete:

    • Call ISearchIndex.RemoveAsync(id).

5.2 Rebuild

CLI command:

tom index rebuild

Behavior:

  • Drops and recreates entries and entry_fts tables (or truncates them).
  • Enumerates all known entries via the active IStorageProvider.
  • Parses frontmatter and body for each entry.
  • Re-populates core table + FTS table.

This makes the index fully recoverable from the filesystem and is the escape hatch for bugs/migrations.
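A rough sketch of the rebuild loop, assuming a trivial frontmatter format (a real implementation would reuse the app's existing parsers and enumerate entries via the active IStorageProvider rather than globbing a directory):

```python
import sqlite3
import tempfile
from pathlib import Path

def rebuild(conn, root):
    # Truncate both tables, then re-populate from the filesystem.
    conn.execute("DELETE FROM entries")
    conn.execute("DELETE FROM entry_fts")
    for md in sorted(Path(root).rglob("*.md")):
        text = md.read_text()
        body = text
        if text.startswith("---"):
            # Deliberately naive frontmatter strip ("---\n...\n---");
            # real parsing would use a YAML library.
            end = text.find("---", 3)
            body = text[end + 3:].lstrip() if end != -1 else text
        entry_id = f"note/{md.stem}"  # hypothetical id scheme for this sketch
        conn.execute(
            "INSERT INTO entries (id, kind, date, path, storage_provider) VALUES (?,?,?,?,?)",
            (entry_id, "note_raw", md.stem, str(md), "DefaultFileSystem"))
        conn.execute("INSERT INTO entry_fts (id, content) VALUES (?, ?)",
                     (entry_id, body))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (id TEXT PRIMARY KEY, kind TEXT NOT NULL, date TEXT NOT NULL,
                      path TEXT NOT NULL, storage_provider TEXT NOT NULL);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content);
""")
root = tempfile.mkdtemp()
Path(root, "2025-11-24.md").write_text("---\ntags: [hvac]\n---\nHeat pump notes.")
rebuild(conn, root)
```

Wrapping the whole rebuild in a single transaction (as the `commit()` at the end implies) also matters for performance on large vaults: per-row commits are what makes SQLite bulk inserts slow.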


Implementation Plan (incremental)

Phase 1: Minimal FTS-backed search

  1. Introduce ISearchIndex abstraction and concrete SqliteSearchIndex implementation.

  2. Add SQLite dependency:

    • Use a cross-platform .NET SQLite provider (e.g. Microsoft.Data.Sqlite).
  3. Create schema migrations:

    • On startup, ensure DB file exists.
    • Apply or upgrade schema (simple versioning via a schema_version table).
  4. Add indexing integration:

    • Add a minimal internal representation of an entry for indexing (e.g. IndexableEntry).
    • Hook into write/update/delete paths to keep the index up to date.
  5. Implement tom search using FTS5:

    • Basic query string.
    • Optional --from, --to, --limit.
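The schema_version idea in step 3 can be as simple as an ordered migration list checked on startup (sketch only; the migration SQL here is placeholder):

```python
import sqlite3

# Ordered migrations; version N is MIGRATIONS[N-1]. Append-only.
MIGRATIONS = [
    "CREATE TABLE entries (id TEXT PRIMARY KEY, kind TEXT NOT NULL)",
    "CREATE VIRTUAL TABLE entry_fts USING fts5(id, content)",
]

def ensure_schema(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
ensure_schema(conn)
ensure_schema(conn)  # safe to call repeatedly: applied migrations are skipped
```

An alternative with the same shape is SQLite's built-in `PRAGMA user_version`, which avoids the extra table.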

Phase 2: Rich filters and robustness

  1. Add filters for kind, storage_provider, has_audio, etc.
  2. Add tom index rebuild and possibly tom index vacuum.
  3. Introduce logging/telemetry around index operations (for debugging).
  4. Document usage and configuration in the README.

Phase 3 (follow-up): Optional semantic/vector search

(Separate issue, once core index is stable.)

  • Add an entry_embeddings table using sqlite-vec or similar.
  • Define an embedding pipeline and configuration.
  • Extend ISearchIndex with SearchSemanticAsync.
  • Add --semantic flag to tom search.

Research & open questions

The following items need investigation and decisions before/during implementation.

1. SQLite provider & distribution

  • Which .NET SQLite library to standardize on?

    • Likely Microsoft.Data.Sqlite due to good cross-platform support and integration.
    • Confirm behavior on macOS, Linux, and Windows.
  • How to distribute it cleanly with the existing CLI packaging (Homebrew, etc.)?

    • Confirm there are no platform-specific runtime issues.

2. FTS5 configuration

  • Confirm FTS5 availability in the chosen SQLite provider:

    • Ensure FTS5 is compiled in by default.
    • Validate behavior across platforms.
  • Tokenization strategy:

    • Stick with unicode61?
    • Do we need custom tokenization or stop-word lists?
  • Search semantics:

    • Decide on whether to use FTS5’s advanced query syntax or keep it “hidden” behind a simpler model.
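If we keep the syntax hidden, one simple option is to quote every user token, so FTS5 operators (`OR`, `NEAR`, `*`, `-`) in the input are matched literally instead of being parsed as syntax (a sketch, not a committed design):

```python
import sqlite3

def as_hidden_query(user_input: str) -> str:
    # Quote each whitespace-separated token; inner double quotes are
    # doubled, which is FTS5's escape for a quote inside a string.
    tokens = user_input.split()
    return " ".join('"{}"'.format(t.replace('"', '""')) for t in tokens)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE t USING fts5(content)")
conn.execute("INSERT INTO t VALUES ('notes about heat pump OR furnace')")

# "OR" is matched as a literal token here, not treated as an operator.
hits = conn.execute("SELECT content FROM t WHERE t MATCH ?",
                    (as_hidden_query("heat pump OR"),)).fetchall()
```

The trade-off is that users lose prefix and phrase queries; a middle ground would be to allow `*` and quoted phrases through while escaping everything else.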

3. Identifier strategy

  • How should entries.id be generated?

    • Option A: path-based (easiest, but fragile if paths change).
    • Option B: derived from logical identifiers already in the app (preferred if available).
  • How to handle renames/moves of files?

    • Is there already a stable identity in the current model that should be reused?

4. Integration with existing slices and MediatR

  • What is the cleanest way to hook indexing into existing command/query pipelines?

    • Event-based (domain events + handlers) vs direct calls in command handlers.
  • Do we want indexing to be:

    • Strictly synchronous (writes only complete after index update)?
    • Best-effort/async (writes complete and index catches up shortly after)?

This has UX vs complexity trade-offs that need to be evaluated.

5. Index location and configuration

  • Finalize default DB path:

    • ~/.ten-second-tom/index.db vs ~/ten-second-tom/index/tst.db.
  • Configuration:

    • Should users be able to override the DB path via config/env?
    • Do we need a per-storage-provider index, or a single global index spanning all providers?

6. Rebuild strategy and performance

  • For large vaults:

    • How expensive is full rebuild?
    • Do we need a progress indicator or streaming feedback during tom index rebuild?
  • Should we consider incremental reconstruction strategies (e.g. only missing entries) or keep it simple for now?

7. Future semantic/vector search (follow-up design)

  • Which vector extension to standardize on (e.g. sqlite-vec, sqlite-vector)?

    • Installation, licensing, and cross-platform story.
  • Embedding model choice:

    • Local-only vs remote API.
    • Integration with the existing LLM provider abstractions.
  • Cost/performance implications:

    • How/when to generate embeddings (e.g. on-demand vs background job).
    • How to handle re-embedding when you change models or summarization strategies.

Acceptance criteria

  • A SQLite database is created and maintained automatically on local disk.
  • tom search "<query>" returns meaningful, ranked results using FTS5 across notes, summaries, and transcripts.
  • Index updates automatically when new entries are created or existing ones are updated/deleted.
  • tom index rebuild fully reconstructs the index from the existing storage provider(s) without manual intervention.
  • No changes to the canonical Markdown/WAV storage model or existing providers.
  • The feature works on macOS, Linux, and Windows in typical CLI scenarios.
