Replace flat text-classification with tree-based page-structuring #277

Open
nicpottier wants to merge 7 commits into main from nicpottier/tree-text-extraction

@nicpottier
Contributor

Summary

  • Replaces the flat text-classification pipeline step with a tree-based page-structuring model that splits the single nodeType field into two orthogonal fields, structure (for containers) and role (for leaves), reducing the LLM concept space from 36 to 21 types
  • Adds an LLM self-review refinement loop for page structuring with a configurable maximum number of iterations
  • Adds an eval harness (judge, improver, report) for iterating on structuring quality
  • Adds a single-page re-structure API endpoint and a UI button in ExtractPageDetail
  • Modernizes the extract page detail tree view with connector lines, pill badges, and improved image container rendering
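
For reviewers who want a mental model of the self-review loop, here is a minimal sketch of a bounded refinement loop. It is not the code in this PR: `refineWithSelfReview`, `ReviewResult`, and the reviewer callback are hypothetical names, and the reviewer is synchronous here for brevity, whereas a real LLM call would be async.

```typescript
// Hypothetical shape of one review pass; the PR's real types are not shown.
interface ReviewResult<T> {
  approved: boolean; // true when the reviewer has nothing left to change
  revised: T;        // proposed replacement, ignored when approved
}

// Runs the reviewer until it approves or maxIterations reviews have been made.
function refineWithSelfReview<T>(
  initial: T,
  review: (candidate: T) => ReviewResult<T>,
  maxIterations: number,
): { result: T; reviews: number } {
  let current = initial;
  let reviews = 0;
  while (reviews < maxIterations) {
    const outcome = review(current);
    reviews++;
    if (outcome.approved) {
      break; // the candidate survived a review unchanged
    }
    current = outcome.revised;
  }
  return { result: current, reviews };
}
```

The cap matters because a reviewer that never approves would otherwise loop forever; with the cap, the last revision is returned as-is.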

Test plan

  • pnpm typecheck passes
  • Unit tests pass (page-structuring.test.ts, pipeline-translation.test.ts)
  • Run eval harness on test book to verify report renders correctly
  • Load a page in Studio UI and verify structure/role dropdowns and tree rendering
  • Test that the single-page re-structure button spins during the task and refreshes on completion
  • Verify i18n catalogs are complete for en, es, pt-BR

Replaces the flat text-classification pipeline step with a recursive
tree-based page-structuring step. Content nodes form a tree with
container types (heading, paragraph, list, etc.) holding children,
and leaf types split into text types and image types.
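
To make the tree shape concrete, here is a rough sketch of such a node type together with a depth-first walk that collects text in reading order, the kind of traversal a translation step over the tree would need. The `ContentNode` union, its field names, and `collectText` are illustrative assumptions, not the types added in this PR.

```typescript
// Hypothetical node shape: containers hold children, leaves are text or image.
type ContentNode =
  | { kind: "container"; containerType: string; children: ContentNode[] }
  | { kind: "text"; textType: string; text: string }
  | { kind: "image"; imageType: string; imageId: string };

// Depth-first, left-to-right collection of all text leaves under a node.
function collectText(node: ContentNode, out: string[] = []): string[] {
  switch (node.kind) {
    case "text":
      out.push(node.text);
      break;
    case "container":
      for (const child of node.children) {
        collectText(child, out);
      }
      break;
    case "image":
      break; // image leaves carry no translatable text in this sketch
  }
  return out;
}

const samplePage: ContentNode = {
  kind: "container",
  containerType: "list",
  children: [
    { kind: "text", textType: "list_item", text: "First" },
    { kind: "image", imageType: "illustration", imageId: "img-1" },
    {
      kind: "container",
      containerType: "paragraph",
      children: [{ kind: "text", textType: "body", text: "Second" }],
    },
  ],
};

const collected = collectText(samplePage);
```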

Key changes:
- New page-structuring types with unrolled 5-level LLM schema (avoids
  OpenAI $ref rejection), nullable fields (keeps required[] complete)
- New structurePage() pipeline function with image placement support
- Tree-based translatePageStructuring() with depth-first text collection
- Three-generation config migration (text_types/text_group_types →
  leaf_types/container_types → text_types/image_types/container_types)
- Frontend: tree view with separate dropdowns for text, image, and
  container types; collapsible containers; JSON debug dump
- Updated all pipeline runners (DAG, sequential, stage-runner)
- Full i18n for new image type and container type labels
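
As an illustration of the unrolled-schema idea in the key changes above, the sketch below builds a fixed-depth JSON Schema by nesting the node definition inline instead of pointing back at itself with `$ref`. Optional fields are expressed as nullable so that every property can stay listed in `required`. The `buildNodeSchema` helper and the `text` and `children` field names are assumptions made for this example; only `nodeType` and the 5-level depth come from the PR text.

```typescript
type JsonSchema = Record<string, unknown>;

// Builds a node schema that allows `levels` levels of nesting, with no $ref.
function buildNodeSchema(levels: number): JsonSchema {
  const properties: Record<string, JsonSchema> = {
    nodeType: { type: "string" },
    // Nullable rather than optional, so the field can remain in `required`.
    text: { type: ["string", "null"] },
  };
  if (levels > 1) {
    // Inline the next level down; the deepest level has no children field.
    properties.children = {
      type: ["array", "null"],
      items: buildNodeSchema(levels - 1),
    };
  }
  return {
    type: "object",
    properties,
    required: Object.keys(properties), // every declared property is required
    additionalProperties: false,
  };
}

const pageSchema = buildNodeSchema(5);
```

The trade-off is schema size: each extra level repeats the whole node definition, which is acceptable for a small fixed depth like 5.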
…improvements

Replace the single nodeType field with two orthogonal fields: structure (for
containers) and role (for leaves). This reduces LLM concept space from 36 to 21
types, improving classification accuracy. Add self-review refinement loop, eval
harness (judge, improver, report), single-page re-structure API endpoint, and
modernized tree UI in ExtractPageDetail.
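
A small sketch of what the two-field split could look like in practice: containers set structure, leaves set role, and a validator flags nodes that set both or neither. The `StructuredNode` interface, the "exactly one of the two" rule, and `validateNode` are assumptions for illustration; the PR text only states that structure is for containers and role is for leaves.

```typescript
// Hypothetical node shape after the split; nullable fields mirror an LLM schema.
interface StructuredNode {
  structure: string | null; // set on containers only
  role: string | null;      // set on leaves only
  text: string | null;
  children: StructuredNode[] | null;
}

// Returns human-readable problems; an empty array means the tree is consistent.
function validateNode(node: StructuredNode, path = "root"): string[] {
  const errors: string[] = [];
  const isContainer = node.structure !== null;
  const isLeaf = node.role !== null;
  if (isContainer === isLeaf) {
    // Assumed rule: a node is either a container or a leaf, never both or neither.
    errors.push(`${path}: exactly one of structure or role must be set`);
  }
  const children = node.children ?? [];
  if (isLeaf && !isContainer && children.length > 0) {
    errors.push(`${path}: leaf node has children`);
  }
  children.forEach((child, i) => {
    errors.push(...validateNode(child, `${path}.children[${i}]`));
  });
  return errors;
}
```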

The eval harness (judge, improver, report, CLI) and test book data are
not needed for the core page-structuring feature.

Add a Container Types tab to extract settings, mirroring the Text Types
tab for editing container type descriptions. Fix page navigation buttons
in the step header to be right-aligned next to the settings icon.

Size structure/role dropdown pills based on current label text instead
of longest option. Rename "Extraction Prompt" tab to "Structuring Prompt".