Replace flat text-classification with tree-based page-structuring by nicpottier · Pull Request #277 · unicef/adt-studio

nicpottier · 2026-04-13T17:15:11Z

Summary

- Replaces the flat text-classification pipeline step with a tree-based page-structuring model. The single nodeType field is split into two orthogonal fields, structure (for containers) and role (for leaves), which reduces the LLM concept space from 36 to 21 types
- Adds an LLM self-review refinement loop for page structuring with a configurable maximum number of iterations
- Adds an eval harness (judge, improver, report) for iterating on structuring quality
- Adds a single-page re-structure API endpoint and a UI button in ExtractPageDetail
- Modernizes the extract page detail tree view with connector lines, pill badges, and improved image container rendering
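
To illustrate the structure/role split described above, here is a minimal sketch of how a node could carry exactly one of the two fields. The type names, the role values, and the kindOf() helper are assumptions made for illustration; they are not taken from the PR's code.

```typescript
// All names below are illustrative assumptions, not the PR's actual definitions.
type Structure = "heading" | "paragraph" | "list"; // container kinds named in the PR text
type Role = "body" | "caption" | "illustration"; // hypothetical leaf kinds

// Containers carry a structure and children; leaves carry a role and content.
interface ContainerNode {
  structure: Structure;
  role: null;
  children: PageNode[];
}
interface LeafNode {
  structure: null;
  role: Role;
  text: string | null; // assumed to be null for image leaves
}
type PageNode = ContainerNode | LeafNode;

// Classify a raw node, rejecting output that sets both fields or neither.
function kindOf(raw: { structure: string | null; role: string | null }): "container" | "leaf" {
  const hasStructure = raw.structure !== null;
  const hasRole = raw.role !== null;
  if (hasStructure === hasRole) {
    throw new Error("exactly one of structure or role must be set");
  }
  return hasStructure ? "container" : "leaf";
}

// A small example tree: one list container holding two text leaves.
const page: PageNode = {
  structure: "list",
  role: null,
  children: [
    { structure: null, role: "body", text: "First item" },
    { structure: null, role: "body", text: "Second item" },
  ],
};
```

Because each field only has to cover its own half of the vocabulary, a check like kindOf() can catch a node that mixes the two before it reaches the rest of the pipeline.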

Test plan

- pnpm typecheck passes
- Unit tests pass (page-structuring.test.ts, pipeline-translation.test.ts)
- Run the eval harness on a test book and verify that the report renders correctly
- Load a page in the Studio UI and verify the structure/role dropdowns and tree rendering
- Verify that the single-page re-structure button spins during the task and the page refreshes on completion
- Verify that the i18n catalogs are complete for en, es, and pt-BR
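
The catalog completeness check in the last item could be automated along these lines. This is only a sketch: it assumes each catalog can be loaded as a flat key-to-string map, and the keys and the missingKeys() helper are hypothetical, since the project's actual catalog format is not shown here.

```typescript
// Assumption: a catalog is a flat map from message key to translated string.
type Catalog = Record<string, string>;

// Return keys that exist in the reference catalog but are absent or empty in the other one.
function missingKeys(reference: Catalog, catalog: Catalog): string[] {
  return Object.keys(reference).filter((key) => !catalog[key]);
}

// Hypothetical sample data; real catalogs would be loaded from the locale files.
const en: Catalog = { "structure.list": "List", "role.caption": "Caption" };
const es: Catalog = { "structure.list": "Lista", "role.caption": "Leyenda" };
const ptBR: Catalog = { "structure.list": "Lista" };

const gapsInEs = missingKeys(en, es);
const gapsInPtBR = missingKeys(en, ptBR);
```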

Replaces the flat text-classification pipeline step with a recursive tree-based page-structuring step. Content nodes form a tree with container types (heading, paragraph, list, etc.) holding children, and leaf types split into text types and image types.

Key changes:

- New page-structuring types with an unrolled 5-level LLM schema (avoids OpenAI $ref rejection) and nullable fields (keeps required[] complete)
- New structurePage() pipeline function with image placement support
- Tree-based translatePageStructuring() with depth-first text collection
- Three-generation config migration (text_types/text_group_types → leaf_types/container_types → text_types/image_types/container_types)
- Frontend: tree view with separate dropdowns for text, image, and container types; collapsible containers; JSON debug dump
- Updated all pipeline runners (DAG, sequential, stage-runner)
- Full i18n for new image type and container type labels
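
As an illustration of the "unrolled" schema idea mentioned above, the sketch below builds a nested JSON-Schema-like object to a fixed depth by inlining a fresh copy of the node schema at each level, so no $ref is ever emitted. The property names (structure, role, text, children) and the nodeSchema() and schemaDepth() helpers are assumptions for the example; the PR's real schema is not reproduced here.

```typescript
// Loose JSON-Schema-like object type, kept simple on purpose for the sketch.
type Schema = Record<string, unknown>;

// Build the node schema for one level. Above the deepest level, "children"
// inlines a copy of the next level instead of pointing back with $ref.
function nodeSchema(depth: number, maxDepth: number): Schema {
  const properties: Record<string, Schema> = {
    // Nullable fields: every key is always present, and unused ones are null.
    structure: { type: ["string", "null"] },
    role: { type: ["string", "null"] },
    text: { type: ["string", "null"] },
  };
  if (depth < maxDepth) {
    properties["children"] = {
      type: ["array", "null"],
      items: nodeSchema(depth + 1, maxDepth),
    };
  }
  return {
    type: "object",
    properties,
    required: Object.keys(properties), // required[] lists every property
    additionalProperties: false,
  };
}

// Count how many node levels a schema built by nodeSchema() contains.
function schemaDepth(schema: Schema): number {
  const properties = schema["properties"] as Record<string, Schema>;
  const children = properties["children"];
  if (children === undefined) return 1;
  return 1 + schemaDepth(children["items"] as Schema);
}

const pageSchema = nodeSchema(1, 5);
```

The trade-off of unrolling is a larger schema and a hard limit on nesting: in this sketch, nodes at the fifth level have no children property at all, so they can only be leaves.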

…improvements

Replace the single nodeType field with two orthogonal fields: structure (for containers) and role (for leaves). This reduces the LLM concept space from 36 to 21 types, improving classification accuracy. Add a self-review refinement loop, an eval harness (judge, improver, report), a single-page re-structure API endpoint, and a modernized tree UI in ExtractPageDetail.
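
The self-review refinement loop mentioned above could look roughly like the following. This is a sketch under stated assumptions: the two callbacks stand in for the LLM calls, and the names structureWithReview, StructureFn, ReviewFn, and the shape of Review are hypothetical rather than the PR's actual API.

```typescript
// Hypothetical verdict returned by the reviewing LLM call.
interface Review {
  approved: boolean;
  feedback: string;
}

// Hypothetical signatures standing in for the structuring and reviewing LLM calls.
type StructureFn<T> = (feedback: string | null) => Promise<T>;
type ReviewFn<T> = (candidate: T) => Promise<Review>;

// Produce a first result, then review and regenerate until the reviewer
// approves or maxIterations review rounds have been used.
async function structureWithReview<T>(
  structure: StructureFn<T>,
  review: ReviewFn<T>,
  maxIterations: number,
): Promise<{ result: T; iterations: number }> {
  let result = await structure(null); // first pass, no feedback yet
  let iterations = 0;
  while (iterations < maxIterations) {
    iterations++;
    const verdict = await review(result);
    if (verdict.approved) break;
    result = await structure(verdict.feedback); // retry with the reviewer's notes
  }
  return { result, iterations };
}
```

Capping the loop with maxIterations bounds the number of extra LLM calls per page. In this sketch, setting it to 0 skips the review entirely, and the result from the final retry is returned without a further review.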


nicpottier added 7 commits April 7, 2026 07:32

Add page structuring self-review loop

89551f0

Remove eval harness files

3c611d1

The eval harness (judge, improver, report, CLI) and test book data are not needed for the core page-structuring feature.

Merge origin/main and resolve locale conflicts

579b261

Add container types settings tab and fix header nav alignment

0fdd331

Add a Container Types tab to extract settings, mirroring the Text Types tab for editing container type descriptions. Fix page navigation buttons in the step header to be right-aligned next to the settings icon.

Fix tree view label sizing and rename Extraction Prompt tab

6121093

Size structure/role dropdown pills based on current label text instead of longest option. Rename "Extraction Prompt" tab to "Structuring Prompt".

nicpottier wants to merge 7 commits into main from nicpottier/tree-text-extraction
