feat: Add SQLite-backed local search index (FTS + optional vectors) #61

@sirkirby

Summary

Introduce a local SQLite-backed search index for Ten Second Tom that sits on top of the existing filesystem-based storage providers (Default, Obsidian, etc.). The goal is fast, powerful full-text and (optionally) semantic search across notes, summaries, and audio transcripts without changing the core “Markdown + WAV on disk” storage model.

The SQLite database will be treated as an index/cache that can be rebuilt from the filesystem at any time.


Motivation

Current characteristics:

  • Ten Second Tom persists memories primarily as Markdown files on disk:
    • Daily notes, summaries, and weekly reviews.
    • Recording transcripts and associated metadata.
  • Audio is stored as .wav files alongside Markdown metadata.
  • Storage is provider-based (default filesystem provider, Obsidian vault provider, etc.), and the on-disk text is considered the source of truth.

Limitations today:

  • Search requires crawling/parsing files each time or is limited to basic filename/date patterns.
  • There’s no notion of ranked full-text search or semantic similarity search.
  • As the number of notes/recordings grows, search will degrade without an index.

What we want:

  • Fast, ranked full-text search across all text content.
  • Optional semantic (vector) search later, while staying fully local.
  • Zero change to the existing Markdown/WAV storage model and provider abstractions.

Goals

  • Add a SQLite-backed search/index layer:
    • Single .db file per user installation (e.g., ~/.ten-second-tom/index.db or ~/ten-second-tom/index/tst.db).
    • Treat SQLite as an index/cache, not the canonical data store.
  • Use SQLite FTS5 for full-text search:
    • Search across content, summaries, tags, and other textual metadata.
    • Support ranked results, phrase search, and prefix search.
  • Integrate with the existing storage provider model:
    • Default filesystem provider and Obsidian provider continue to own the actual files.
    • The index listens to writes/updates and stays in sync.
  • Provide a CLI experience for search:
    • tom search <query> as the main entry point.
    • Filter by date range, kind, tags, presence of audio, etc.
  • Provide a way to rebuild/repair the index:
    • tom index rebuild (or similar) to rescan the filesystem into SQLite.

Non-goals (for this issue)

  • Changing the canonical storage model away from Markdown + WAV on disk.
  • Introducing or depending on external services (Meilisearch, Qdrant, etc.).
  • Defining a complex UI beyond basic CLI commands and options.
  • Designing a “query language” beyond reasonable flags and FTS-style queries.
  • Implementing semantic/vector search end-to-end (can be a follow-up once the core index exists).

Proposed Design

1. High-level architecture

  • Keep existing storage providers as-is:
    • Default filesystem provider (Markdown + WAV in a local directory).
    • Obsidian provider (Markdown inside a vault).
  • Add a new component, e.g. ISearchIndex, which encapsulates all index logic:
    • Responsible for reading/writing to SQLite.
    • Exposed to the rest of the app via dependency injection.
  • Wire search indexing into the existing command/handler pipeline:
    • After successful write/update of a note/summary/transcript, emit a notification or domain event that the index can consume.
    • Background tasks are optional; initial implementation can be synchronous or “eventually consistent” within the command handler.

Conceptually:

  • IStorageProvider → canonical Markdown/WAV operations.
  • ISearchIndex → SQLite-based index over those entries.

2. SQLite database location

  • Default path:
    • Option A: ~/.ten-second-tom/index.db
    • Option B: ~/ten-second-tom/index/tst.db
  • DB path should be configurable via config/env/CLI flag, but not required for normal usage.
  • The app should create directories as needed on first use.

3. Data model

3.1 Core table: entries

Represents each searchable “unit” (note, summary, weekly review, transcript, etc.).

CREATE TABLE entries (
    id               TEXT PRIMARY KEY,         -- stable identifier, e.g. "note/2025-11-24_1"
    kind             TEXT NOT NULL,            -- e.g. note_raw, note_summary, weekly_review, recording_transcript
    title            TEXT,                     -- parsed from first H1 or derived from filename
    date             TEXT NOT NULL,            -- YYYY-MM-DD (primary date for filtering)
    created_at       TEXT NOT NULL,            -- ISO-8601
    updated_at       TEXT NOT NULL,            -- ISO-8601
    path             TEXT NOT NULL,            -- relative path to the .md file
    storage_provider TEXT NOT NULL,            -- e.g. DefaultFileSystem, Obsidian
    tags             TEXT,                     -- comma-separated or JSON-encoded
    has_audio        INTEGER NOT NULL DEFAULT 0,
    audio_path       TEXT,                     -- relative path to associated .wav if any
    duration_seconds INTEGER,
    stt_engine       TEXT                      -- e.g. whisper-cpp, openai-whisper, etc.
);

Notes:

  • id should remain stable regardless of future refactors; ideally derived from logical identity, not transient paths.
  • kind and storage_provider allow for flexible filtering and joining.
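As a sanity check, the schema and the upsert path can be exercised directly with Python's sqlite3 (illustrative only — the real implementation would live in the .NET SqliteSearchIndex, and the row values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entries (
    id               TEXT PRIMARY KEY,
    kind             TEXT NOT NULL,
    title            TEXT,
    date             TEXT NOT NULL,
    created_at       TEXT NOT NULL,
    updated_at       TEXT NOT NULL,
    path             TEXT NOT NULL,
    storage_provider TEXT NOT NULL,
    tags             TEXT,
    has_audio        INTEGER NOT NULL DEFAULT 0,
    audio_path       TEXT,
    duration_seconds INTEGER,
    stt_engine       TEXT
)""")

def upsert_entry(row):
    # ON CONFLICT on the primary key makes re-indexing idempotent:
    # re-running the indexer over an unchanged file is harmless.
    conn.execute("""
        INSERT INTO entries (id, kind, title, date, created_at, updated_at,
                             path, storage_provider, tags, has_audio)
        VALUES (:id, :kind, :title, :date, :created_at, :updated_at,
                :path, :storage_provider, :tags, :has_audio)
        ON CONFLICT(id) DO UPDATE SET
            title      = excluded.title,
            updated_at = excluded.updated_at,
            tags       = excluded.tags,
            has_audio  = excluded.has_audio
    """, row)

row = dict(id="note/2025-11-24_1", kind="note_raw", title="Heat pump notes",
           date="2025-11-24", created_at="2025-11-24T09:00:00Z",
           updated_at="2025-11-24T09:00:00Z", path="notes/2025-11-24.md",
           storage_provider="DefaultFileSystem", tags="hvac,home", has_audio=0)
upsert_entry(row)
row["updated_at"] = "2025-11-24T10:00:00Z"
upsert_entry(row)  # idempotent: still exactly one row for this id
```

Requires SQLite 3.24+ for the `ON CONFLICT ... DO UPDATE` syntax, which any modern provider bundles.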

3.2 Full-text index: entry_fts (FTS5)

Backed by SQLite FTS5. The exact schema may evolve, but an initial version:

CREATE VIRTUAL TABLE entry_fts USING fts5(
    id,
    content,
    summary,
    tags,
    kind,
    tokenize = 'unicode61'
);

Indexing behavior:

  • content:

    • Raw markdown body with frontmatter stripped.
    • Or a text-only representation.
  • summary:

    • Summarized content if present; otherwise empty string.
  • tags:

    • All tags concatenated, or a normalized representation (e.g., tag:foo tag:bar).
  • kind:

    • Helps boost/weight certain kinds of entries in scoring, if desired.

Insert/update logic:

  • On new/updated entry:

    • Upsert into entries.
    • Upsert into entry_fts with the same id.
  • On deletion:

    • Remove from both entries and entry_fts.
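A minimal Python sqlite3 sketch of this logic. One subtlety: FTS5 tables have no primary key, so the “upsert” is a delete followed by an insert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE VIRTUAL TABLE entry_fts USING fts5(
    id, content, summary, tags, kind, tokenize = 'unicode61'
)""")

def index_entry(entry_id, content, summary="", tags="", kind="note_raw"):
    # FTS5 has no PRIMARY KEY / ON CONFLICT, so upsert = delete + insert.
    conn.execute("DELETE FROM entry_fts WHERE id = ?", (entry_id,))
    conn.execute(
        "INSERT INTO entry_fts (id, content, summary, tags, kind) VALUES (?, ?, ?, ?, ?)",
        (entry_id, content, summary, tags, kind))

def remove_entry(entry_id):
    conn.execute("DELETE FROM entry_fts WHERE id = ?", (entry_id,))

index_entry("note/1", "Looked at heat pump water heaters today.")
index_entry("note/1", "Compared heat pump water heater quotes.")  # re-index: no duplicate row

hits = conn.execute(
    "SELECT id FROM entry_fts WHERE entry_fts MATCH ?", ('"heat pump"',)
).fetchall()
```

The `DELETE ... WHERE id = ?` is a full-table scan on an FTS5 table; if that ever matters, a contentless/external-content FTS5 layout keyed by rowid would avoid it, but that is an optimization for later.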

4. Search behavior and CLI UX

Base CLI:

tom search "heat pump water heater"

Options (examples):

  • Filter by date range:

    • --from 2025-01-01
    • --to 2025-03-31
  • Filter by kind:

    • --kind note_raw
    • --kind recording_transcript
  • Filter by storage provider:

    • --provider Obsidian
  • Filter by audio:

    • --has-audio
  • Limit results:

    • --limit 20

Example query pipeline:

  1. Parse CLI options into a SearchQuery object.

  2. Translate into:

    SELECT e.*, bm25(entry_fts) AS score
    FROM entry_fts
    JOIN entries e ON e.id = entry_fts.id
    WHERE entry_fts MATCH @ftsQuery
      AND e.date >= @fromDate
      AND e.date <= @toDate
      -- additional filters...
    ORDER BY score              -- bm25() is lower for better matches, so ascending ranks best first
    LIMIT @limit;
  3. Render results:

    • Show date, title, kind, optional snippet.
    • Optionally show score for debugging.

The first version can keep search semantics simple and opinionated; we don’t need to expose raw FTS syntax to end users beyond basic phrases.
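The pipeline above, sketched end-to-end with Python's sqlite3 (the sample data and the SearchQuery-to-SQL mapping are invented for illustration; the schemas are trimmed to the columns the query touches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (
    id TEXT PRIMARY KEY, kind TEXT NOT NULL, title TEXT,
    date TEXT NOT NULL, path TEXT NOT NULL
);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content, summary, tags, kind);
""")
conn.execute("INSERT INTO entries VALUES ('note/1','note_raw','HVAC','2025-02-10','notes/2025-02-10.md')")
conn.execute("INSERT INTO entries VALUES ('note/2','note_raw','Garden','2024-06-01','notes/2024-06-01.md')")
conn.execute("INSERT INTO entry_fts (id, content) VALUES ('note/1','heat pump water heater quotes')")
conn.execute("INSERT INTO entry_fts (id, content) VALUES ('note/2','planted tomatoes')")

def search(fts_query, from_date="0000-00-00", to_date="9999-12-31", limit=20):
    # bm25() returns lower (more negative) values for better matches,
    # so ORDER BY score ascending puts the best hits first.
    return conn.execute("""
        SELECT e.id, e.date, e.title, bm25(entry_fts) AS score
        FROM entry_fts
        JOIN entries e ON e.id = entry_fts.id
        WHERE entry_fts MATCH :q
          AND e.date >= :from_date AND e.date <= :to_date
        ORDER BY score
        LIMIT :limit
    """, dict(q=fts_query, from_date=from_date, to_date=to_date, limit=limit)).fetchall()

results = search('"heat pump"', from_date="2025-01-01", to_date="2025-03-31")
```

Rendering (date, title, kind, snippet) is then a straightforward formatting pass over `results`; FTS5's built-in `snippet()` function can supply the excerpt.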

5. Index lifecycle

5.1 Automatic updates

  • After write operations (new note, update note, recording transcript, etc.):

    • The relevant command handler (or a domain event handler) calls ISearchIndex.IndexAsync(entry).
  • On delete:

    • Call ISearchIndex.RemoveAsync(id).

5.2 Rebuild

CLI command:

tom index rebuild

Behavior:

  • Drops and recreates entries and entry_fts tables (or truncates them).
  • Enumerates all known entries via the active IStorageProvider.
  • Parses frontmatter and body for each entry.
  • Re-populates core table + FTS table.

This makes the index fully recoverable from the filesystem and is the escape hatch for bugs/migrations.
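A rough sketch of the rebuild loop, assuming a trivial frontmatter format (a real implementation would reuse the app's existing parsers and enumerate entries via the active IStorageProvider rather than globbing a directory):

```python
import sqlite3
import tempfile
from pathlib import Path

def rebuild(conn, root):
    # Truncate both tables, then re-populate from the filesystem.
    conn.execute("DELETE FROM entries")
    conn.execute("DELETE FROM entry_fts")
    for md in sorted(Path(root).rglob("*.md")):
        text = md.read_text()
        body = text
        if text.startswith("---"):
            # Deliberately naive frontmatter strip ("---\n...\n---");
            # real parsing would use a YAML library.
            end = text.find("---", 3)
            body = text[end + 3:].lstrip() if end != -1 else text
        entry_id = f"note/{md.stem}"  # hypothetical id scheme for this sketch
        conn.execute(
            "INSERT INTO entries (id, kind, date, path, storage_provider) VALUES (?,?,?,?,?)",
            (entry_id, "note_raw", md.stem, str(md), "DefaultFileSystem"))
        conn.execute("INSERT INTO entry_fts (id, content) VALUES (?, ?)",
                     (entry_id, body))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (id TEXT PRIMARY KEY, kind TEXT NOT NULL, date TEXT NOT NULL,
                      path TEXT NOT NULL, storage_provider TEXT NOT NULL);
CREATE VIRTUAL TABLE entry_fts USING fts5(id, content);
""")
root = tempfile.mkdtemp()
Path(root, "2025-11-24.md").write_text("---\ntags: [hvac]\n---\nHeat pump notes.")
rebuild(conn, root)
```

Wrapping the whole rebuild in a single transaction (as the `commit()` at the end implies) also matters for performance on large vaults: per-row commits are what makes SQLite bulk inserts slow.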


Implementation Plan (incremental)

Phase 1: Minimal FTS-backed search

  1. Introduce ISearchIndex abstraction and concrete SqliteSearchIndex implementation.

  2. Add SQLite dependency:

    • Use a cross-platform .NET SQLite provider (e.g. Microsoft.Data.Sqlite).
  3. Create schema migrations:

    • On startup, ensure DB file exists.
    • Apply or upgrade schema (simple versioning via a schema_version table).
  4. Add indexing integration:

    • Add a minimal internal representation of an entry for indexing (e.g. IndexableEntry).
    • Hook into write/update/delete paths to keep the index up to date.
  5. Implement tom search using FTS5:

    • Basic query string.
    • Optional --from, --to, --limit.
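The schema_version idea in step 3 can be as simple as an ordered migration list checked on startup (sketch only; the migration SQL here is placeholder):

```python
import sqlite3

# Ordered migrations; version N is MIGRATIONS[N-1]. Append-only.
MIGRATIONS = [
    "CREATE TABLE entries (id TEXT PRIMARY KEY, kind TEXT NOT NULL)",
    "CREATE VIRTUAL TABLE entry_fts USING fts5(id, content)",
]

def ensure_schema(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
ensure_schema(conn)
ensure_schema(conn)  # safe to call repeatedly: applied migrations are skipped
```

An alternative with the same shape is SQLite's built-in `PRAGMA user_version`, which avoids the extra table.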

Phase 2: Rich filters and robustness

  1. Add filters for kind, storage_provider, has_audio, etc.
  2. Add tom index rebuild and possibly tom index vacuum.
  3. Introduce logging/telemetry around index operations (for debugging).
  4. Document usage and configuration in the README.

Phase 3 (follow-up): Optional semantic/vector search

(Separate issue, once core index is stable.)

  • Add an entry_embeddings table using sqlite-vec or similar.
  • Define an embedding pipeline and configuration.
  • Extend ISearchIndex with SearchSemanticAsync.
  • Add --semantic flag to tom search.

Research & open questions

The following items need investigation and decisions before/during implementation.

1. SQLite provider & distribution

  • Which .NET SQLite library to standardize on?

    • Likely Microsoft.Data.Sqlite due to good cross-platform support and integration.
    • Confirm behavior on macOS, Linux, and Windows.
  • How to distribute it cleanly with the existing CLI packaging (Homebrew, etc.)?

    • Confirm there are no platform-specific runtime issues.

2. FTS5 configuration

  • Confirm FTS5 availability in the chosen SQLite provider:

    • Ensure FTS5 is compiled in by default.
    • Validate behavior across platforms.
  • Tokenization strategy:

    • Stick with unicode61?
    • Do we need custom tokenization or stop-word lists?
  • Search semantics:

    • Decide on whether to use FTS5’s advanced query syntax or keep it “hidden” behind a simpler model.
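If we keep the syntax hidden, one simple option is to quote every user token, so FTS5 operators (`OR`, `NEAR`, `*`, `-`) in the input are matched literally instead of being parsed as syntax (a sketch, not a committed design):

```python
import sqlite3

def as_hidden_query(user_input: str) -> str:
    # Quote each whitespace-separated token; inner double quotes are
    # doubled, which is FTS5's escape for a quote inside a string.
    tokens = user_input.split()
    return " ".join('"{}"'.format(t.replace('"', '""')) for t in tokens)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE t USING fts5(content)")
conn.execute("INSERT INTO t VALUES ('notes about heat pump OR furnace')")

# "OR" is matched as a literal token here, not treated as an operator.
hits = conn.execute("SELECT content FROM t WHERE t MATCH ?",
                    (as_hidden_query("heat pump OR"),)).fetchall()
```

The trade-off is that users lose prefix and phrase queries; a middle ground would be to allow `*` and quoted phrases through while escaping everything else.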

3. Identifier strategy

  • How should entries.id be generated?

    • Option A: path-based (easiest, but fragile if paths change).
    • Option B: derived from logical identifiers already in the app (preferred if available).
  • How to handle renames/moves of files?

    • Is there already a stable identity in the current model that should be reused?

4. Integration with existing slices and MediatR

  • What is the cleanest way to hook indexing into existing command/query pipelines?

    • Event-based (domain events + handlers) vs direct calls in command handlers.
  • Do we want indexing to be:

    • Strictly synchronous (writes only complete after index update)?
    • Best-effort/async (writes complete and index catches up shortly after)?

This has UX vs complexity trade-offs that need to be evaluated.

5. Index location and configuration

  • Finalize default DB path:

    • ~/.ten-second-tom/index.db vs ~/ten-second-tom/index/tst.db.
  • Configuration:

    • Should users be able to override the DB path via config/env?
    • Do we need a per-storage-provider index, or a single global index spanning all providers?

6. Rebuild strategy and performance

  • For large vaults:

    • How expensive is full rebuild?
    • Do we need a progress indicator or streaming feedback during tom index rebuild?
  • Should we consider incremental reconstruction strategies (e.g. only missing entries) or keep it simple for now?

7. Future semantic/vector search (follow-up design)

  • Which vector extension to standardize on (e.g. sqlite-vec, sqlite-vector)?

    • Installation, licensing, and cross-platform story.
  • Embedding model choice:

    • Local-only vs remote API.
    • Integration with the existing LLM provider abstractions.
  • Cost/performance implications:

    • How/when to generate embeddings (e.g. on-demand vs background job).
    • How to handle re-embedding when you change models or summarization strategies.

Acceptance criteria

  • A SQLite database is created and maintained automatically on local disk.
  • tom search "<query>" returns meaningful, ranked results using FTS5 across notes, summaries, and transcripts.
  • Index updates automatically when new entries are created or existing ones are updated/deleted.
  • tom index rebuild fully reconstructs the index from the existing storage provider(s) without manual intervention.
  • No changes to the canonical Markdown/WAV storage model or existing providers.
  • The feature works on macOS, Linux, and Windows in typical CLI scenarios.
