Woozi stands for Wet Open Overheid Zoek Index. It aims to index all public NL government documents. It also serves as a next-gen replacement of Open-Raadsinformatie.
This folder contains new rewrite-oriented assets that are grounded in the existing ORI codebase.
The first step is a minimal schema package based on the current model and transformer output, not a full redesign of the domain model.
The current implementation is split into:
- a Deno backend for extraction, search APIs, admin APIs, and production serving
- a Vite + TypeScript frontend for the public UI and admin UI
- shared TypeScript contracts in `src/types.ts`
See `API.md`.

To run the full stack in Docker:

```
docker compose up -d
# visit http://0.0.0.0:8787
```

For frontend iteration with HMR, use one command:
```
pnpm run dev
```

That does two things:
- starts the Docker backend services needed for app development: `quickwit` and `openbesluitvorming`
- starts the Vite HMR frontend on the host
That gives you:
- the Docker backend on `http://127.0.0.1:8787`
- the Vite dev server with HMR on `http://127.0.0.1:4317`
Vite proxies `/api/*` calls to the Docker backend, so reruns and extraction use the same environment as the containerized app, including the installed `transmutation` binary.
Open http://127.0.0.1:4317 while iterating on the frontend.
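The proxy wiring could look roughly like this fragment of `vite.config.ts`; this is a sketch and the repo's actual config may differ:

```typescript
// Sketch of the /api proxy described above, as it might appear in
// vite.config.ts. The target port matches the Docker backend on 8787.
const devProxy = {
  "/api": {
    target: "http://127.0.0.1:8787", // the containerized backend
    changeOrigin: true,              // rewrite the Host header for the proxy target
  },
};
```

With a proxy like this, frontend code can call `/api/...` on the Vite dev server and still hit the containerized API.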
Object storage is taken from your environment configuration. If your `.env` points to external S3-compatible storage such as Hetzner, `pnpm run dev` uses that directly and does not start local MinIO.
Important:
- `pnpm run dev` is the intended development entrypoint
- `pnpm run dev` starts Docker-backed services and the API, then runs Vite with HMR
- `pnpm run dev` clears stale listeners on the HMR port before starting, so leftover Vite processes do not usually require manual cleanup
If you only want the infra:

```
pnpm run dev:infra
```

Run:

```
pnpm run dev
```

To stop the Docker side again:

```
pnpm run dev:down
```

If you want a different dev port:

```
WOOZI_WEB_PORT=4401 pnpm run dev:web
```

If you want a different HMR web port:

```
WOOZI_WEB_PORT=4401 pnpm run dev
```

The repo includes a production-oriented compose file and Caddy config:
That setup is intended for:

- `openbesluitvorming`
- `quickwit`
- `caddy`

with external S3-compatible object storage from `.env`.
Preferred beta deploy flow:

```
git push origin main
```

That triggers the GitHub Actions workflow in `.github/workflows/publish-openbesluitvorming.yml`, which builds and publishes:

- `ghcr.io/openstate/woozi-openbesluitvorming:main`
- `ghcr.io/openstate/woozi-openbesluitvorming:sha-<git-sha>`
- `ghcr.io/openstate/woozi-openbesluitvorming:latest`
Then update beta to the exact current commit image:

```
pnpm run deploy:beta
```

`deploy:beta` now does one thing: over SSH, it tells the server to pull `ghcr.io/openstate/woozi-openbesluitvorming:sha-<short-git-sha>` and restart the app container.
Before it deploys, it checks the running server for active imports and refuses to restart the app if any imports are still running.
To override that safety check:

```
FORCE=1 pnpm run deploy:beta
```

By default, `deploy:beta` derives the GHCR owner from your origin remote. If you need to override it explicitly:

```
IMAGE_REPOSITORY=ghcr.io/your-org/woozi-openbesluitvorming pnpm run deploy:beta
```

When production infra files change, sync those separately:

```
pnpm run deploy:beta:infra
```

That is only for runtime config such as the production compose file and Caddy config.
Required production env includes:
- `DOMAIN`
- `ADMIN_PASSWORD_HASH`
- `S3_ACCESS_KEY`
- `S3_SECRET_KEY`
- `S3_STORAGE_BUCKET_NAME`
- `S3_STORAGE_ENDPOINT`
- `S3_STORAGE_REGION`
- `QUICKWIT_INDEX_ID`
- `QUICKWIT_CLUSTER_ID`
- `QUICKWIT_NODE_ID`
- `QUICKWIT_INDEX_ROOT_PREFIX`
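A startup check over this list could look like the following sketch; `missingEnv` is an assumed helper, not part of the Woozi codebase:

```typescript
// Sketch: verifying the required production env vars are set before boot.
// The list mirrors the README; the actual startup validation may differ.
const REQUIRED_ENV = [
  "DOMAIN",
  "ADMIN_PASSWORD_HASH",
  "S3_ACCESS_KEY",
  "S3_SECRET_KEY",
  "S3_STORAGE_BUCKET_NAME",
  "S3_STORAGE_ENDPOINT",
  "S3_STORAGE_REGION",
  "QUICKWIT_INDEX_ID",
  "QUICKWIT_CLUSTER_ID",
  "QUICKWIT_NODE_ID",
  "QUICKWIT_INDEX_ROOT_PREFIX",
];

// Returns the names of required vars that are unset or empty.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]);
}
```

Failing fast on a non-empty result gives a clearer error than a half-configured container.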
The server should run production with:
```
docker compose -f docker-compose.production.yml up -d
```

Point your domain to the server first so Caddy can obtain Let's Encrypt certificates.
Important:
- the GHCR package must be public, or the server must be logged in to GHCR
- code deploys should update container images, not rsync source files
Quickwit defaults are intentionally different between local and production so both environments do not accidentally share the same S3-backed metastore and index:
- local/dev defaults:
  - `QUICKWIT_CLUSTER_ID=woozi-dev`
  - `QUICKWIT_NODE_ID=quickwit-dev`
  - `QUICKWIT_INDEX_ROOT_PREFIX=indexes-dev`
  - `QUICKWIT_INDEX_ID=woozi-events-dev`
- production defaults:
  - `QUICKWIT_CLUSTER_ID=woozi-prod`
  - `QUICKWIT_NODE_ID=quickwit-prod`
  - `QUICKWIT_INDEX_ROOT_PREFIX=indexes-prod`
  - `QUICKWIT_INDEX_ID=woozi-events-prod`
Important:
- if production previously used `indexes` + `woozi-events`, switching to `indexes-prod` + `woozi-events-prod` creates a fresh search projection
- after that change, production search needs a reindex/reimport before results appear again
Woozi is designed as an event-driven indexing system.
The core split is:
- extract data from source systems
- normalize it into canonical entities
- emit events only for changes
- build read projections from those events
The main flow is:
```
source system
  -> extractor / poller
  -> canonical entity
  -> event broker
  -> projection services
  -> search index / resolver / other read models
```
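The flow above can be sketched as typed functions; names and shapes here are illustrative, not the actual Woozi module layout:

```typescript
// Illustrative types for two stages of the pipeline.
type CanonicalEntity = { id: string; kind: string; payload: unknown };
type CommitEvent = { type: "entity.commit"; entityId: string };

// extractor / poller -> canonical entity
// (the stable ID scheme here is made up for the example)
function normalize(sourceId: string, body: unknown): CanonicalEntity {
  return { id: `notubiz:${sourceId}`, kind: "Meeting", payload: body };
}

// canonical entity -> event broker
function toCommitEvent(entity: CanonicalEntity): CommitEvent {
  return { type: "entity.commit", entityId: entity.id };
}
```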
Extractors talk to source systems such as Notubiz, iBabs, GO, and Parlaeus.
Responsibilities:
- poll for changes in a date range or from a source cursor
- fetch raw payloads and documents
- store retrieved files in S3-compatible object storage
- extract markdown-ready document text from PDFs and office files
- normalize source-specific structures
- produce canonical entities
The current ORI extractor logic is the starting point for this layer.
The system does not index raw source payloads directly.
Instead, each source payload is transformed into a canonical entity such as:
- `Meeting`
- `Document`
- `Committee`
- `Vote`

The schemas in `schemas/` are the first version of those contracts.
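As a rough illustration (not the actual contracts in `schemas/`, which will be richer), two of those entities could look like:

```typescript
// Hypothetical minimal shapes for two canonical entities.
interface Meeting {
  id: string;     // stable canonical ID
  source: string; // e.g. "notubiz"
  name: string;
  start: string;  // ISO 8601 timestamp
}

interface Document {
  id: string;
  meetingId?: string; // optional link back to a Meeting
  title: string;
  fileUrl: string;    // original file in object storage
}

const exampleMeeting: Meeting = {
  id: "meeting:notubiz:1",
  source: "notubiz",
  name: "Raadsvergadering",
  start: "2024-01-01T19:00:00Z",
};
```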
Quickwit is not used to determine whether something changed.
Change detection happens before indexing:
- each canonical entity gets a stable ID
- each canonical payload gets a content hash
- the latest known hash is compared against metadata storage
- if nothing changed, no event is emitted
- if something changed, a new commit event is emitted
Edits are modeled as new versions, not in-place mutations. Deletes are modeled as tombstones or delete commits.
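A minimal sketch of that change-detection step, assuming a SHA-256 hash over the canonical JSON (the real hashing scheme and metadata store may differ):

```typescript
import { createHash } from "node:crypto";

// Hash the canonical payload. A real implementation should use a stable
// stringification (sorted keys) so the hash is deterministic.
function contentHash(canonical: unknown): string {
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// Compare against the latest known hash from the metadata store;
// emit a commit event only when something actually changed.
function shouldEmit(lastKnownHash: string | undefined, canonical: unknown): boolean {
  return contentHash(canonical) !== lastKnownHash;
}
```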
When an entity changes, Woozi emits an event into an event broker.
The expected event model is:
- CloudEvents envelope
- `entity.commit` payload
This broker decouples extraction from indexing and makes replay possible.
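For illustration, an `entity.commit` event in a CloudEvents envelope could look roughly like this; all field values are made up, and the actual event schema lives in the code:

```typescript
// Hedged sketch of a CloudEvents-shaped entity.commit event.
// The data fields (entityId, contentHash, payloadRef) are assumptions.
const commitEvent = {
  specversion: "1.0",                 // CloudEvents spec version
  type: "entity.commit",              // the commit event type
  source: "woozi/extractor/notubiz",  // illustrative source URI
  id: "evt-0001",                     // unique event id (illustrative)
  time: "2024-01-01T00:00:00Z",
  datacontenttype: "application/json",
  data: {
    entityId: "meeting:notubiz:12345",                       // stable canonical ID (assumed shape)
    entityKind: "Meeting",
    contentHash: "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c", // example hash value
    payloadRef: "s3://bucket/canonical/meeting-12345.json",  // canonical JSON in object storage
  },
};
```

Keeping the canonical payload in object storage and only a reference in the event keeps events small and replay cheap.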
The canonical representation should not live in Quickwit.
Canonical JSON and original files should be stored in object storage, such as:
- S3
- MinIO
Example object classes:
- raw source payloads
- canonical JSON snapshots
- original files
- derived markdown and search text
Woozi still needs a small metadata store.
PostgreSQL remains useful for:
- source cursors
- latest entity head per ID
- content hashes
- commit metadata
- projector checkpoints
- resolver mappings
PostgreSQL should be small and boring in Woozi. It is not the document store and not the search index.
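As a sketch, the metadata rows could be typed like this; table and field names are assumptions, not the actual schema:

```typescript
// Latest entity head per ID: the hash and version of the newest commit.
interface EntityHead {
  entityId: string;    // stable canonical ID
  contentHash: string; // hash of the latest canonical payload
  version: number;     // increments on each commit event
}

// Projector checkpoint: how far a read-model builder has consumed events.
interface ProjectorCheckpoint {
  projector: string;   // e.g. "search"
  lastEventId: string; // last event applied to the read model
}

// Deciding whether an incoming payload is a new version of an entity.
function isNewVersion(head: EntityHead | undefined, hash: string): boolean {
  return head === undefined || head.contentHash !== hash;
}
```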
Current prototype note:
- extraction/admin run state is currently stored in a local SQLite file
- that is an implementation shortcut for the prototype, not the intended final metadata design
Projection services consume events and build read models.
Initial projections:
- search projection
- document resolver projection
- admin/reporting projection
Each projection can be rebuilt by replaying events and canonical snapshots.
Quickwit is the search projection, not the source of truth.
Responsibilities:
- store search-ready projected documents
- support search and filtering at large scale
- stay cheap by relying on object storage-backed indexing
Quickwit should receive projection documents derived from canonical entities. It should not receive raw source payloads and should not be the place where updates are detected.
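A hypothetical mapping from a canonical entity to a search projection document might look like this; field names are assumptions, not the deployed index schema:

```typescript
// Canonical input (simplified) and the flattened search document.
interface CanonicalMeeting {
  id: string;
  name: string;
  start: string;         // ISO 8601 timestamp
  markdownText?: string; // derived search text from attachments
}

interface SearchDoc {
  entity_id: string;
  kind: string;
  title: string;
  date: string;
  text: string;
}

// Project a canonical Meeting into a search-ready document.
function toSearchDoc(m: CanonicalMeeting): SearchDoc {
  return {
    entity_id: m.id,
    kind: "Meeting",
    title: m.name,
    date: m.start,
    text: m.markdownText ?? "", // missing derived text becomes an empty field
  };
}
```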
- Raw source payloads are not the public contract.
- Canonical entities are the internal contract.
- Events describe changes to canonical entities.
- Quickwit is a projection.
- Object storage holds canonical payloads and files.
- PostgreSQL holds metadata and coordination state.
This folder currently contains:
- minimal entity schemas
- a first Deno-based Notubiz extractor slice
- a Vite + TypeScript frontend with HMR for the public UI and admin UI
- shared frontend/backend TypeScript API types
- `entity.commit` events for canonical meetings and documents
- attachment download into S3-compatible object storage
- markdown extraction for PDF and Word-style documents
- a local Quickwit setup and projection client
- a small admin UI for reruns and extraction run inspection
- live e2e coverage that ingests Haarlem meetings and attached files into Quickwit and the GUI
From `woozi/`:

- `pnpm run dev`
- `pnpm run dev:infra`
- `pnpm run dev:docker`
- `pnpm run dev:down`
- `pnpm run dev:web`
- `pnpm run web`
- `pnpm run serve:web`
- `pnpm run build:web`
- `pnpm test`
- `pnpm test:e2e`
- `pnpm test:gui`
- `pnpm test:quickwit`
- `pnpm run extract:haarlem`
- `pnpm run ingest:haarlem`
- `pnpm run lint`
- `pnpm run format`
- `pnpm run check-format`
Quickwit helpers live in `quickwit/`.

For real S3-compatible storage, copy `.env.example` to `.env` and set:

- `S3_ACCESS_KEY`
- `S3_SECRET_KEY`
- `S3_STORAGE_BUCKET_NAME`
- `S3_STORAGE_ENDPOINT`
- `S3_STORAGE_REGION`
To run against external S3-compatible storage:
```
docker compose up -d
```

To run the local stack with MinIO for development and tests:

```
docker compose --profile local-s3 up -d
```

To extract one Haarlem day and ingest the resulting commit events into Quickwit:

```
pnpm run ingest:haarlem
```

That command now:
- extracts public Haarlem meetings from Notubiz
- downloads attached source files
- stores the originals in S3-compatible object storage
- extracts markdown from those files and serves it lazily in the detail view
- emits `entity.commit` events for `Meeting` and `Document`
- projects both entity types into Quickwit
To start the frontend prototype:
```
docker compose up -d --build openbesluitvorming
```

For local production-style serving without Docker:

```
pnpm run web
```

The host-side ingest command currently uses:
- the Rust `transmutation` CLI for PDFs
- direct text decoding for `.txt`, `.md`, `.json`, and HTML-like documents
- office formats like `.doc`, `.docx`, `.rtf`, and `.odt` are not supported yet and are logged as extraction warnings
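That routing could be sketched like this; the real dispatch logic in the ingest command may differ:

```typescript
// Sketch of extension-based extraction routing on the host.
type Strategy = "transmutation" | "direct-text" | "unsupported";

function extractionStrategy(filename: string): Strategy {
  const ext = filename.toLowerCase().split(".").pop() ?? "";
  if (ext === "pdf") return "transmutation"; // Rust CLI handles PDFs
  if (["txt", "md", "json", "html", "htm"].includes(ext)) return "direct-text";
  return "unsupported"; // e.g. .doc/.docx/.rtf/.odt, logged as a warning
}
```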
In Docker and compose, the `transmutation` CLI is installed in the image automatically.
On the host, Woozi will try to call `transmutation` from `PATH` or from `WOOZI_TRANSMUTATION_BIN`.
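That lookup could be sketched as follows; `resolveTransmutationBin` is a hypothetical helper, not the actual implementation:

```typescript
// Sketch: prefer an explicit WOOZI_TRANSMUTATION_BIN override, otherwise
// fall back to resolving "transmutation" from PATH at spawn time.
function resolveTransmutationBin(env: Record<string, string | undefined>): string {
  return env["WOOZI_TRANSMUTATION_BIN"] ?? "transmutation";
}
```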
For Apple Silicon macOS, you can install it like this:

```
cargo install --locked transmutation
```

If the CLI is missing, PDF extraction fails explicitly and the import records an extraction issue.
There is no host-only Office extractor left in the runtime. Unsupported office formats are currently skipped with an explicit extraction warning.