Replace flat text-classification with tree-based page-structuring #277
Open
nicpottier wants to merge 7 commits into main
Replaces the flat text-classification pipeline step with a recursive tree-based page-structuring step. Content nodes form a tree with container types (heading, paragraph, list, etc.) holding children, and leaf types split into text types and image types.

Key changes:
- New page-structuring types with an unrolled 5-level LLM schema (avoids OpenAI `$ref` rejection) and nullable fields (keeps `required[]` complete)
- New `structurePage()` pipeline function with image placement support
- Tree-based `translatePageStructuring()` with depth-first text collection
- Three-generation config migration (text_types/text_group_types → leaf_types/container_types → text_types/image_types/container_types)
- Frontend: tree view with separate dropdowns for text, image, and container types; collapsible containers; JSON debug dump
- Updated all pipeline runners (DAG, sequential, stage-runner)
- Full i18n for new image type and container type labels
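To make the tree shape, the depth-first text collection, and the unrolled schema concrete, here is a minimal sketch. It is not the PR's code: `ContentNode`, `collectTexts`, `buildUnrolledSchema`, and all field names are hypothetical, chosen only to illustrate the ideas described above.

```typescript
// Hypothetical node shape; field names are assumptions, not the PR's actual types.
interface ContentNode {
  nodeType: string;               // e.g. "paragraph", "list", or a leaf type
  text: string | null;            // set on text leaves, null otherwise
  children: ContentNode[] | null; // set on containers, null on leaves
}

// Depth-first, left-to-right collection of leaf text (reading order).
function collectTexts(node: ContentNode, out: string[] = []): string[] {
  if (node.text !== null) out.push(node.text);
  for (const child of node.children ?? []) collectTexts(child, out);
  return out;
}

type JsonSchema = Record<string, unknown>;

// Build a JSON schema without $ref by nesting the node schema `depth` levels.
// Every property is listed in `required`; optional values are expressed as
// nullable types instead of being omitted.
function buildUnrolledSchema(depth: number): JsonSchema {
  const properties: Record<string, JsonSchema> = {
    nodeType: { type: "string" },
    text: { type: ["string", "null"] },
  };
  const required = ["nodeType", "text"];
  if (depth > 1) {
    // Only non-final levels may hold children, so recursion always terminates.
    properties["children"] = {
      type: ["array", "null"],
      items: buildUnrolledSchema(depth - 1),
    };
    required.push("children");
  }
  return { type: "object", properties, required, additionalProperties: false };
}
```

Calling `buildUnrolledSchema(5)` would yield five nested object schemas, matching the 5-level depth mentioned in the description. The trade-off of unrolling is a larger schema and a hard cap on nesting depth in exchange for not needing recursive references.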
…improvements
Replace the single nodeType field with two orthogonal fields: structure (for containers) and role (for leaves). This reduces the LLM's concept space from 36 to 21 types, improving classification accuracy. Add a self-review refinement loop, an eval harness (judge, improver, report), a single-page re-structure API endpoint, and a modernized tree UI in ExtractPageDetail.
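As an illustration of the structure/role split, the sketch below models a node where exactly one of the two fields is set and checks that invariant recursively. The `Structure` and `Role` value sets, the `StructuredNode` interface, and `validateNode` are all hypothetical; the PR's real type lists are not shown in this conversation.

```typescript
// Hypothetical value sets, for illustration only.
type Structure = "heading" | "paragraph" | "list";
type Role = "body_text" | "caption" | "photo";

interface StructuredNode {
  structure: Structure | null;       // non-null only on containers
  role: Role | null;                 // non-null only on leaves
  text: string | null;
  children: StructuredNode[] | null; // non-null only on containers
}

// Returns a list of problems; an empty list means the tree is consistent.
function validateNode(node: StructuredNode, path = "root"): string[] {
  const errors: string[] = [];
  const isContainer = node.structure !== null;
  const isLeaf = node.role !== null;
  if (isContainer === isLeaf) {
    errors.push(`${path}: exactly one of structure/role must be set`);
  }
  if (isLeaf && node.children !== null) {
    errors.push(`${path}: leaf must not have children`);
  }
  if (isContainer && node.children === null) {
    errors.push(`${path}: container must have a children array`);
  }
  (node.children ?? []).forEach((child, i) => {
    errors.push(...validateNode(child, `${path}.children[${i}]`));
  });
  return errors;
}
```

A check like this could run on each LLM response before a refinement pass, so that malformed nodes are reported with their path in the tree.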
The eval harness (judge, improver, report, CLI) and test book data are not needed for the core page-structuring feature.
Add a Container Types tab to extract settings, mirroring the Text Types tab for editing container type descriptions. Fix page navigation buttons in the step header to be right-aligned next to the settings icon.
Size structure/role dropdown pills based on current label text instead of longest option. Rename "Extraction Prompt" tab to "Structuring Prompt".
Summary
- `nodeType` field split into two orthogonal fields: `structure` (for containers) and `role` (for leaves), reducing the LLM concept space from 36 to 21 types

Test plan
- `pnpm typecheck` passes
- Tests: `page-structuring.test.ts`, `pipeline-translation.test.ts`