A command-line workflow that parses a browser bookmark export, enriches it with live page metadata, uses an LLM to suggest a reorganised folder structure, and emits refreshed HTML plus validation reports.
- Python 3.11+
- `uv` for environment and script management
- An OpenAI API key exported as `OPENAI_API_KEY`
1. Install the dependencies (into a local `.venv`):

   ```powershell
   uv sync
   ```

2. Activate the virtual environment if you prefer running scripts directly (optional when using `uv run`):

   ```powershell
   .\.venv\Scripts\Activate.ps1
   ```

3. Export your API key:

   ```powershell
   $env:OPENAI_API_KEY = "sk-..."
   ```
Run the orchestrator with `uv run python main.py` and choose a mode that targets a specific pipeline phase. Add `--verbose` for detailed logging.
| Mode | Purpose | Outputs (written/updated) |
|---|---|---|
| `parse` | Parse the HTML export and serialize initial bookmark records. | `bookmarks.json` (fresh write) |
| `metadata` | Enrich records with page metadata (reuses the cache unless `--fresh-scrape`). Concurrency accelerates I/O-bound fetches. | `bookmarks.json` (updated); may skip fetching if fully cached |
| `llm` | Run categorisation only (assumes parsed and optionally enriched JSON). | `bookmarks.json` (updated with reorg fields), `bookmarks_reorganised.html`, validation log |
| `html` | Generate reorganised HTML from existing JSON (requires a prior `llm` run). | `bookmarks_reorganised.html` |
| `compare` | Validate an existing reorganised HTML against the original export. | Console validation report |
| `all` | End-to-end: parse → metadata (concurrent) → LLM → HTML → validation. | All artefacts above |
- `--use-json-cache` reuses existing JSON instead of re-parsing the HTML.
- `--fresh-scrape` forces metadata re-fetches for every URL.
- `--verbose` surfaces detailed progress logs.
- `--instruction-file` points to free-form guidance that is injected into the user prompt.
- `--system-instruction-file` appends immutable guardrails to the system prompt before each LLM batch.
- `--metadata-mode` controls the enrichment strategy: `all` (default) or `only-missing` to skip already enriched entries.
| Flag | Default | Description |
|---|---|---|
| `--input` | `bookmarks_31_10_2025.html` (or `$env:BOOKMARKS_EXPORT_FILE` if set) | Source bookmark HTML export. |
| `--json-output` | `bookmarks.json` | Path for the intermediate, evolving JSON (overwritten at each stage). |
| `--html-output` | `bookmarks_reorganised.html` | Destination for the reorganised HTML. |
| `--model` | `gpt-4.1-mini` | OpenAI model for the LLM categorisation stage. |
| `--mode` | `llm` | Pipeline stage selector (now includes `all`). |
| `--metadata-mode` | `all` | Strategy for enrichment: `all` or `only-missing`. |
- `parse` mode cannot be combined with `--use-json-cache` or `--fresh-scrape` (they are meaningless before the JSON exists).
- `html` mode cannot be combined with `--fresh-scrape` (no scraping occurs).
- `llm` regenerates HTML and runs validation after categorisation; it skips parsing if `--use-json-cache` is provided.
- `all` executes every phase in order, respecting `--fresh-scrape` and the metadata mode.
- Metadata reuse requires `--use-json-cache` and a pre-existing JSON file.
- If any bookmark lacks a `location_after` in the JSON, `--mode html` will abort; run `llm` or `all` to regenerate it.
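These combination rules amount to a small pre-flight check. The sketch below is illustrative only, assuming a `mode` string and boolean flag values; `validate_flags` is a hypothetical name, not the project's actual CLI wiring:

```python
def validate_flags(mode: str, use_json_cache: bool, fresh_scrape: bool) -> list[str]:
    """Return error messages for invalid mode/flag combinations (empty list = OK)."""
    errors = []
    if mode == "parse" and (use_json_cache or fresh_scrape):
        # Cache/scrape flags are meaningless before the JSON exists.
        errors.append("parse mode cannot be combined with --use-json-cache or --fresh-scrape")
    if mode == "html" and fresh_scrape:
        # html mode performs no scraping at all.
        errors.append("html mode cannot be combined with --fresh-scrape")
    return errors
```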
The file indicated by `--json-output` is overwritten at each stage:

1. `parse`: base entries (no metadata, no reorganised fields).
2. `metadata`: adds or updates `metadata` (title/description/tags). Cached entries are reused unless `--fresh-scrape`.
3. `llm`: appends `title_after`, `location_after`, and refined tags; then the HTML is built and validation runs.
4. `all`: performs steps 1–3 sequentially.

There are no separate `bookmarks_with_metadata.json` or `bookmarks_with_llm.json` files; only the evolving `bookmarks.json`.
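To illustrate how a single record evolves inside `bookmarks.json`, here is a hedged sketch using the field names mentioned above (`metadata`, `title_after`, `location_after`); the actual schema may contain additional fields and the values shown are invented:

```python
# After `parse`: a bare record.
record = {"title": "Example", "url": "https://example.com"}

# After `metadata`: an enrichment block is added or updated.
record["metadata"] = {
    "title": "Example Domain",
    "description": "Illustrative page description",
    "tags": ["reference"],
}

# After `llm`: the reorganisation fields appear.
record["title_after"] = "Example Domain"
record["location_after"] = "Reference/Web"
```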
1. Parse the source export:

   ```powershell
   uv run python main.py --mode parse --verbose
   ```

2. Enrich metadata, reusing the freshly written JSON:

   ```powershell
   uv run python main.py --mode metadata --use-json-cache --verbose
   ```

3. Ask the LLM to reorganise, using the cached metadata:

   ```powershell
   uv run python main.py --mode llm --use-json-cache --verbose
   ```

4. Rebuild the bookmark HTML:

   ```powershell
   uv run python main.py --mode html --verbose
   ```

5. Confirm the output mirrors the original bookmark set:

   ```powershell
   uv run python main.py --mode compare --verbose
   ```
The orchestrator writes its results into the project root, so you can open `bookmarks_reorganised.html` in a browser and inspect the final structure. Validation logs appear in the console after `llm` and `compare`.
- Missing metadata typically means the site blocked scraping; the tool falls back to the domain root when possible (a 401/403/407 triggers the root retry).
- If the metadata cache seems stale, rerun with `--fresh-scrape`.
- The HTML build requires successful `llm` output; rerun step 3 (or `all`) if you see a "reorganised locations missing" error.
- Use `--system-instruction-file` for hard safety/format constraints; use `--instruction-file` for softer organisational hints.
- If all metadata is already present, `metadata` mode with `--metadata-mode only-missing` will skip network calls entirely.
- Concurrency: metadata enrichment uses a thread pool (default `workers=12`) and falls back to the root URL on permission errors.
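The thread-pool and root-fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `enrich_all` and its injectable `fetch` callable (assumed to return a `(status_code, metadata)` pair) are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

PERMISSION_ERRORS = {401, 403, 407}

def enrich_all(urls, fetch, workers=12):
    """Fetch metadata for each URL; on a permission error, retry the domain root."""
    def one(url):
        status, meta = fetch(url)
        if status in PERMISSION_ERRORS:
            # 401/403/407: fall back to the root of the same domain.
            root = "{0.scheme}://{0.netloc}/".format(urlsplit(url))
            status, meta = fetch(root)
        return url, meta

    if len(urls) <= 3:
        # Small sets run serially to avoid thread overhead.
        return dict(one(u) for u in urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, urls))
```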
The validator enforces:

- URL multiset equivalence between the original and reorganised output.
- Every record has a non-empty `location_after` after the LLM step.
- Index continuity and ordering expectations inside generated HTML blocks.
- Structural depth trimming to `MAX_FOLDER_DEPTH` (currently 4) to avoid overly nested folders.
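URL multiset equivalence is straightforward to express with `collections.Counter`; a minimal sketch (the validator's real implementation may differ):

```python
from collections import Counter

def urls_match(original_urls, reorganised_urls) -> bool:
    # Multiset comparison: duplicate URLs must survive reorganisation,
    # but ordering and folder placement are free to change.
    return Counter(original_urls) == Counter(reorganised_urls)
```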
`config.py` provides global constants (e.g. `MAX_FOLDER_DEPTH`, `DEFAULT_BATCH_SIZE`). Adjust carefully; a depth above 4 can inflate token usage and reduce clarity.
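Depth trimming against `MAX_FOLDER_DEPTH` amounts to truncating a folder path; a hedged sketch in which the function name and slash-separated path format are assumptions:

```python
MAX_FOLDER_DEPTH = 4  # mirrors the constant in config.py

def trim_location(location: str, max_depth: int = MAX_FOLDER_DEPTH) -> str:
    """Collapse a folder path like 'A/B/C/D/E' to at most max_depth levels."""
    parts = [p for p in location.split("/") if p]
    return "/".join(parts[:max_depth])
```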
Run the full test suite (pytest) via uv:

```powershell
uv run pytest -q
```

Ruff lint (auto-fix where possible):

```powershell
uv run ruff check . --fix
```

Add a dependency needed only for development (example):

```powershell
uv add --dev httpx
```

Regenerate the lock / sync after editing `pyproject.toml` manually:

```powershell
uv sync
```

The implemented test coverage currently targets:
- Parsing record count integrity.
- Metadata enrichment concurrency & skip logic (mocked requests).
- LLM response schema validation & retry (including malformed JSON batch scenarios).
- HTML generation depth enforcement.
- Validator edge cases (missing locations, URL mismatch, empty `location_after`).
`--input` can be omitted if the `BOOKMARKS_EXPORT_FILE` environment variable is set.
LLM calls use retries with exponential backoff and Pydantic schema validation. Malformed items are logged and skipped, keeping the downstream HTML consistent.
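A retry loop with exponential backoff of the kind described can be sketched as follows; the `sleep` parameter is injectable purely so the sketch is testable, and the real client code may differ:

```python
import time

def with_retry(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```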
Concurrent metadata fetching dramatically reduces total enrichment time for large bookmark sets. Small sets (≤3) run serially to avoid thread overhead.