Trawl Web Extraction

trawl-web-extraction is a narrow UOS tool plugin for structured web extraction. It owns schema-guided extraction, provenance, trust scoring, and source-drift handling. It does not own the generic connector boundary or broader automation routing across unrelated tools.

Built as part of the UOS split workspace on top of Paperclip, which remains the upstream control-plane substrate.

Boundary Summary

Owns schema-guided extraction, provenance, trust scoring, and extraction-target drift handling.
Depends on uos-plugin-connectors only when provider auth or callback plumbing is required around extraction jobs.
Feeds extraction quality and provenance evidence into uos-plugin-operations-cockpit when operators need reviewable signals.
Stays narrower than the generic connector layer and narrower than the broader task-routing plugin.

What This Repo Owns

Schema-guided extraction and normalization.
Source provenance, trust scoring, and confidence handling.
Resilience to DOM or layout changes and fallback strategies.
Quality evaluation and drift detection for extraction targets.

Runtime Form

Tool plugin first. Extraction schemas, provenance, and downstream delivery workflows should strengthen targeted extraction jobs without becoming a generic connector or automation surface.

Highest-Value Workflows

Define or refine target schemas.
Extract structured content with provenance and confidence.
Detect drift when sources change.
Validate extracted output against expectations and downstream needs.

Key Connections and Operating Surfaces

HTTP fetch, browser automation, search engines, sitemaps, RSS/Atom feeds, file downloads, PDFs, and structured-data surfaces needed to extract trustworthy information from messy websites.
Schema registries, validation layers, screenshots, evidence capture, provenance logs, and QA workflows when downstream consumers need confidence and traceability, not just raw text.
Docs, spreadsheets, databases, warehouses, and search indexes when extracted data must land in a form other systems can review, query, compare, or act on.
Any adjacent system required to connect discovery, extraction, normalization, validation, enrichment, and downstream delivery into one complete workflow.

KPI Targets

Schema completeness reaches >= 90% on maintained extraction targets.
Provenance coverage reaches 100% for extracted fields in benchmark workflows.
Source drift is detected within one scheduled run of a material page or schema change.
Manual cleanup time per maintained extraction target falls by 50% after normalization and validation.

Implementation Backlog

Now

Define the first wave of target schemas and provenance capture rules.
Build validation and drift-detection logic for the maintained extraction targets.
Standardize downstream export formats so extracted data is immediately usable.

Later

Support richer enrichment and cross-source reconciliation for more complex extraction workflows.
Integrate extraction evidence into broader automation and review systems by default.

Local Plugin Use

curl -X POST http://127.0.0.1:3100/api/plugins/install \
  -H "Content-Type: application/json" \
  -d '{"packageName":"<absolute-path-to-this-repo>","isLocalPath":true}'

Validation

pnpm install
pnpm build
pnpm test
pnpm plugin:typecheck

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
src		src
tests		tests
.gitignore		.gitignore
PRD.md		PRD.md
README.md		README.md
esbuild.config.mjs		esbuild.config.mjs
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
rollup.config.mjs		rollup.config.mjs
tsconfig.json		tsconfig.json
tsconfig.plugin.json		tsconfig.plugin.json
vitest.config.ts		vitest.config.ts
vitest.plugin.config.ts		vitest.plugin.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Trawl Web Extraction

Boundary Summary

What This Repo Owns

Runtime Form

Highest-Value Workflows

Key Connections and Operating Surfaces

KPI Targets

Implementation Backlog

Now

Next

Later

Local Plugin Use

Validation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Trawl Web Extraction

Boundary Summary

What This Repo Owns

Runtime Form

Highest-Value Workflows

Key Connections and Operating Surfaces

KPI Targets

Implementation Backlog

Now

Next

Later

Local Plugin Use

Validation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages