feat: add do_reading_order option to PdfPipelineOptions #3233

Open
Frank-Schruefer wants to merge 1 commit into docling-project:main from Frank-Schruefer:feat/do-reading-order-option

@Frank-Schruefer

Summary

This draft PR adds a do_reading_order flag to PdfPipelineOptions, analogous to the existing do_table_structure and do_ocr options.

Motivation

The ReadingOrderPredictor works well for most PDFs, but produces incorrect results for certain document types — in particular, scanned PDFs that also carry a native text layer. On such pages, each line of text typically becomes its own small orphan cluster, and the graph-based predictor struggles with 50–100 small clusters and reorders them incorrectly.

With do_reading_order=False, elements retain the order produced by the layout postprocessor. Combined with spatial cell sorting (sort_cells_spatial in BaseLayoutOptions), this gives correct reading order for the affected pages.

Changes

  • PdfPipelineOptions.do_reading_order: bool = True — new flag, defaults to True (no change in existing behaviour)
  • ReadingOrderOptions.skip_prediction: bool = False — internal flag on the model
  • StandardPdfPipeline wires do_reading_order → skip_prediction
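The wiring described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `*Sketch` classes, `build_reading_order_options`, and `order_elements` are hypothetical stand-ins, and the inversion `skip_prediction = not do_reading_order` is an assumption inferred from the two defaults (`True` and `False`).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PdfPipelineOptionsSketch:
    """Hypothetical stand-in for the user-facing pipeline options."""
    do_ocr: bool = True
    do_table_structure: bool = True
    do_reading_order: bool = True  # new flag; default keeps existing behaviour


@dataclass
class ReadingOrderOptionsSketch:
    """Hypothetical stand-in for the model-level options."""
    skip_prediction: bool = False  # internal flag on the model


def build_reading_order_options(
    opts: PdfPipelineOptionsSketch,
) -> ReadingOrderOptionsSketch:
    # Assumed wiring: the internal flag is the inverse of the user-facing one.
    return ReadingOrderOptionsSketch(skip_prediction=not opts.do_reading_order)


def order_elements(
    elements: List[str],
    ro_opts: ReadingOrderOptionsSketch,
    predictor: Callable[[List[str]], List[str]],
) -> List[str]:
    # When prediction is skipped, keep the layout postprocessor's order as-is.
    if ro_opts.skip_prediction:
        return list(elements)
    return predictor(elements)
```

With this change applied, a caller would presumably opt out via `PdfPipelineOptions(do_reading_order=False)`; leaving the flag at its default should run the predictor exactly as before.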

Open question

Is this the right approach, or would you prefer a different mechanism to handle cases where the reading-order predictor produces poor results? Happy to adjust based on feedback.

Add do_reading_order (default True) to PdfPipelineOptions, analogous to
the existing do_table_structure and do_ocr flags.

When set to False, ReadingOrderModel skips the ReadingOrderPredictor and
keeps the element order produced by the layout postprocessor as-is.

This is useful for PDFs where the reading-order predictor produces
incorrect results — for example scanned pages that also carry a native
text layer, where the many resulting small orphan clusters confuse the
graph-based predictor.

Signed-off-by: stone <frank.schruefer@t-online.de>
@github-actions
Contributor

github-actions bot commented Apr 5, 2026

DCO Check Passed

Thanks @Frank-Schruefer, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify bot commented Apr 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
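To see why this PR's title passes, the rule's pattern can be tried locally. The sketch below assumes the `~=` operator behaves like a Python `re` search; the helper name `is_conventional_title` is made up for illustration.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None
```

Under that assumption, `feat: add do_reading_order option to PdfPipelineOptions` matches, as would a scoped or breaking variant such as `feat(pdf)!: ...`, while a title without a recognised type prefix does not.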

@dolfim-ibm dolfim-ibm marked this pull request as ready for review April 6, 2026 07:11
@dosubot

dosubot bot commented Apr 6, 2026

Related Documentation

2 document(s) may need updating based on files changed in this PR:

Docling

How does Docling reconstruct reading order without using a large language model (LLM) call in an unsupervised manner?
@@ -32,6 +32,12 @@
 ## Why It Works Without LLMs
 Spatial positioning encodes reading order for most documents (top-to-bottom, left-to-right conventions), and the ordering problem reduces to a **topological sort over a directed graph**—something DFS solves efficiently and deterministically.
 
+## Configuration
+
+Reading order prediction can be controlled via the `do_reading_order` flag in `PdfPipelineOptions` (defaults to `True`). Setting `do_reading_order=False` disables the `ReadingOrderPredictor` entirely, and elements retain the order produced by the layout postprocessor.
+
+Disabling reading order prediction is useful for certain document types—particularly scanned PDFs with native text layers—where the graph-based predictor may produce incorrect results due to many small orphan text clusters.
+
 ## Known Limitations
 - The 15% dilation threshold is fixed and **not user-configurable**, which can cause issues with complex multi-column layouts.
 - Reading order quality depends heavily on the accuracy of upstream layout detection.
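
For readers unfamiliar with the "topological sort over a directed graph" mentioned in the suggested text, here is a generic depth-first sketch. It is a textbook algorithm shown under assumptions, not Docling's `ReadingOrderPredictor`; the function name and the example node labels are hypothetical.

```python
from typing import Dict, List, Set


def topological_order(graph: Dict[str, List[str]]) -> List[str]:
    """Order nodes so that every node precedes its successors.

    `graph` maps each node to the nodes that must be read after it.
    Raises ValueError if the graph contains a cycle.
    """
    visited: Set[str] = set()
    on_stack: Set[str] = set()
    postorder: List[str] = []

    def visit(node: str) -> None:
        if node in visited:
            return
        if node in on_stack:
            raise ValueError(f"cycle detected at {node!r}")
        on_stack.add(node)
        for successor in graph.get(node, []):
            visit(successor)
        on_stack.discard(node)
        visited.add(node)
        postorder.append(node)  # appended only after all successors

    for node in graph:
        visit(node)
    # Reversing the DFS postorder yields a valid topological order.
    return postorder[::-1]
```

For a page whose edges say "title before left column before right column before footer", the function returns the elements in that sequence. When a page yields many small unrelated clusters, the edges themselves may be unreliable, which is the situation `do_reading_order=False` is meant to sidestep.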


What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -4,6 +4,7 @@
     - `from_formats`: Supported input formats include `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md` (including `txt`, `text`, `qmd`, `rmd`), `csv`, `xlsx`, `xml_uspto`, `xml_jats`, `xml_xbrl`, `mets_gbs`, `json_docling`, `audio`, `vtt`, `latex`
     - `to_formats`: Supported output formats include `md`, `json`, `yaml`, `html`, `html_split_page`, `text`, `doctags`, `vtt`
     - `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
+      - `do_reading_order` (default True): Enable reading-order prediction to reorder document elements into logical reading sequence; when disabled, elements retain the order produced by the layout postprocessor; disabling can improve results for scanned PDFs with native text layers and many small orphan clusters
     - `do_ocr` (default True): Use OCR
     - `force_ocr`: Replace existing text with OCR-generated text
     - `ocr_engine`, `ocr_lang`: OCR engine and language options

@codecov

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../models/stages/reading_order/readingorder_model.py 75.00% 1 Missing ⚠️
