Commit ecd1cd8

Add page sectioning, web rendering steps and replace Listr2 with cli-progress
Implements the remaining pipeline steps (page-sectioning, web-rendering with HTML validation) and replaces Listr2 with cli-progress for reliable concurrent page processing. Listr2 had a fundamental bug where nested task.newListr() subtasks were not properly awaited under concurrency, causing pipeline steps to run out of order.

Key changes:
- Add page-sectioning and web-rendering pipeline steps with Liquid prompts
- Add HTML validation (data-id uniqueness, text containment, image refs)
- Replace Listr2 with cli-progress MultiBar + custom async concurrency pool
- Generate unique per-text IDs (pg001_gp001_tx001) for web rendering
- Add configurable max_retries to StepConfig (default 8 for web rendering)
- Add step/item_id columns to llm_log table with v2→v3 migration
- Progress bars per pipeline step instead of per page for better scaling
- Spinner for metadata extraction, progress bar for PDF extraction
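
The "custom async concurrency pool" named in the commit message is not shown in this excerpt. As a rough illustration of the idea only, a pool of this kind could look like the sketch below. The name `runPool`, its parameters, and the `onDone` progress callback are hypothetical and are not taken from the commit; the callback stands in for whatever updates a progress bar.

```typescript
// Hypothetical sketch, not the commit's implementation.
// Runs `worker` over `items` with at most `limit` calls in flight at once
// and resolves with the results in input order.
async function runPool<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  onDone?: (completed: number, total: number) => void,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item
  let completed = 0;

  // Each lane claims the next item, awaits it, then loops.
  // No await sits between the bounds check and `next++`, so two lanes
  // can never claim the same index.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
      completed++;
      if (onDone) onDone(completed, items.length);
    }
  }

  const laneCount = Math.min(limit, items.length);
  const lanes = Array.from({ length: laneCount }, () => lane());
  await Promise.all(lanes); // every lane is awaited before returning
  return results;
}
```

Because every lane is awaited through `Promise.all`, the caller only continues once all items have finished, which is the ordering guarantee the commit message says was missing. In this sketch a rejected worker rejects the whole pool while the other lanes keep draining; a real pipeline may want to collect errors per item instead.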
1 parent a17627a commit ecd1cd8
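
The per-text ID format quoted in the commit message (`pg001_gp001_tx001`) suggests three zero-padded counters. A minimal sketch of a formatter for that shape follows. The function names are hypothetical, and reading the prefixes as page, text group, and text, with 1-based indices, is an assumption drawn from the example ID rather than from the commit's code.

```typescript
// Hypothetical sketch: builds IDs shaped like "pg001_gp001_tx001".
// Assumes 1-based page, group, and text counters.
function pad3(n: number): string {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("index must be a positive integer");
  }
  // Values above 999 simply produce more digits, so IDs stay unique.
  return String(n).padStart(3, "0");
}

function textId(page: number, group: number, text: number): string {
  return `pg${pad3(page)}_gp${pad3(group)}_tx${pad3(text)}`;
}
```

With these assumptions, `textId(1, 1, 1)` yields the ID shown in the commit message.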

24 files changed

Lines changed: 2021 additions & 333 deletions

config.yaml

Lines changed: 38 additions & 1 deletion
@@ -27,20 +27,57 @@ text_group_types:
   list: A list of items (ordered or unordered)
   other: Anything that doesn't fit the above
 
+section_types:
+  front_cover: Front cover page (first page only)
+  inside_cover: Inside cover page
+  back_cover: Back cover page (last page only)
+  separator: Separator page between logical sections
+  credits: Credits and acknowledgements
+  foreword: Introduction, overview, or author's note
+  table_of_contents: Table of contents
+  boxed_text: Text in a box or callout
+  text_only: Reading section with only text
+  text_and_single_image: Section with text and a single image
+  text_and_images: Reading section with text and multiple images
+  images_only: Section with only images
+  activity_matching: Matching activity
+  activity_fill_in_a_table: Table fill-in activity
+  activity_multiple_choice: Multiple choice activity
+  activity_true_false: True or false activity
+  activity_open_ended_answer: Open-ended text response activity
+  activity_fill_in_the_blank: Fill in the blank activity
+  activity_sorting: Sorting activity
+  other: Any other section type
+
 metadata:
   prompt: metadata_extraction
   model: openai:gpt-4o
 
 text_classification:
   prompt: text_classification
   model: openai:gpt-4o
-  concurrency: 5
+  concurrency: 16
+
+page_sectioning:
+  prompt: page_sectioning
+  model: openai:gpt-4o
+
+web_rendering:
+  prompt: web_generation_html
+  model: openai:gpt-4o
+  concurrency: 16
+  max_retries: 8
 
 pruned_text_types:
   - header_text
   - footer_text
   - page_number
 
+pruned_section_types:
+  - back_cover
+  - credits
+  - inside_cover
+
 image_filters:
   min_side: 100
   max_side: 5000
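
The `max_retries` setting on `web_rendering` pairs with the HTML validation described in the commit message. How the pipeline consumes it is not visible in this diff, so the following is only a sketch under stated assumptions: the `StepConfig` shape is guessed from the keys in config.yaml, `generateWithRetries` and `findDuplicateIds` are hypothetical names, and `max_retries` is read as "one initial attempt plus up to that many retries".

```typescript
// Hypothetical sketch, not the commit's implementation.
// Field names mirror the keys shown in config.yaml.
interface StepConfig {
  prompt: string;
  model: string;
  concurrency?: number;
  max_retries?: number;
}

type Validation = { ok: true } | { ok: false; errors: string[] };

// Returns each ID that occurs more than once (data-id uniqueness check).
// Extracting the IDs from HTML is left to a real parser.
function findDuplicateIds(ids: readonly string[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const id of ids) {
    if (seen.has(id)) dupes.add(id);
    seen.add(id);
  }
  return Array.from(dupes);
}

// Calls `generate` until `validate` accepts the result or retries run out.
// The previous attempt's errors are passed back so a prompt can include them.
async function generateWithRetries<T>(
  config: StepConfig,
  generate: (attempt: number, previousErrors: string[]) => Promise<T>,
  validate: (candidate: T) => Validation,
  defaultMaxRetries = 8, // assumption: matches the default named in the commit message
): Promise<T> {
  const maxRetries = config.max_retries ?? defaultMaxRetries;
  let errors: string[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const candidate = await generate(attempt, errors);
    const result = validate(candidate);
    if (result.ok) return candidate;
    errors = result.errors;
  }
  throw new Error(
    `validation failed after ${maxRetries} retries: ${errors.join("; ")}`,
  );
}
```

Feeding the validation errors back into the next attempt is a design choice of this sketch; the commit may instead retry with the same prompt.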

packages/pipeline/package.json

Lines changed: 7 additions & 3 deletions
@@ -11,12 +11,16 @@
     "pipeline": "./dist/cli.js"
   },
   "dependencies": {
+    "@adt/llm": "workspace:*",
     "@adt/pdf": "workspace:*",
-    "@adt/types": "workspace:*",
     "@adt/storage": "workspace:*",
-    "@adt/llm": "workspace:*",
+    "@adt/types": "workspace:*",
+    "cli-progress": "^3.12.0",
+    "htmlparser2": "^10.1.0",
     "js-yaml": "^4.1.0",
-    "listr2": "^9.0.0",
     "zod": "^3.24.0"
+  },
+  "devDependencies": {
+    "@types/cli-progress": "^3.11.6"
   }
 }
