fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0 by Smeet23 · Pull Request #3254 · docling-project/docling

Smeet23 · 2026-04-09T11:04:11Z

Summary

Extends the PDF text sanitizer's ligature normalization map to cover three code points that were missing:

| Code point | Character | Normalized to | Reason |
| --- | --- | --- | --- |
| U+0132 | Ĳ | `IJ` | Latin capital ligature IJ, used in Dutch (e.g. IJssel, IJmuiden) |
| U+0133 | ĳ | `ij` | Latin small ligature IJ, used in Dutch |
| U+F0A0 | (PUA) | `""` (discarded) | Private-Use Area glyph emitted by some PDF fonts as a spurious character with no textual meaning |

The existing map already handles U+FB00–U+FB06 (ﬀ, ﬁ, ﬂ, ﬃ, ﬄ, ﬅ, ﬆ). This PR fills the remaining gaps reported in the issue.

Note: U+0152 (Œ) and U+0153 (œ) are intentionally left as-is — they are legitimate characters in French (e.g. œuvre, cœur) and normalizing them to OE/oe would corrupt French text.
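The mapping described above can be sketched as follows. This is a minimal illustration, not the code in `page_assemble_model.py`: the names `LIGATURE_MAP`, `LIGATURE_RE` and `normalize_ligatures` are hypothetical stand-ins, and the replacement chosen here for U+FB05 (`st`) is an assumption.

```python
import re

# Hypothetical stand-in for the sanitizer's map; the real _LIGATURE_MAP may differ.
LIGATURE_MAP = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature; "st" is an assumption
    "\ufb06": "st",
    "\u0132": "IJ",  # Dutch capital IJ ligature
    "\u0133": "ij",  # Dutch small ij ligature
    "\uf0a0": "",    # Private-Use Area glyph, discarded
}

# Build the character class from the map keys so the pattern and the map
# cannot drift apart (a design choice of this sketch, not of the PR).
LIGATURE_RE = re.compile("[" + "".join(re.escape(c) for c in LIGATURE_MAP) + "]")


def normalize_ligatures(text: str) -> str:
    """Replace every mapped code point with its plain-text equivalent."""
    return LIGATURE_RE.sub(lambda m: LIGATURE_MAP[m.group(0)], text)
```

Because U+0152 and U+0153 are not keys in the map, a word such as cœur passes through unchanged, while `normalize_ligatures("\u0132ssel")` yields `IJssel`.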

Changes

- `page_assemble_model.py`: add U+0132, U+0133, U+F0A0 to `_LIGATURE_MAP`; update `_LIGATURE_RE` to match the new code points
- `test_page_assemble_model.py`: add 4 new test cases covering the new entries

Test plan

- All page assemble model tests pass
- `test_ij_capital_ligature`: U+0132 Ĳ → IJ
- `test_ij_small_ligature`: U+0133 ĳ → ij
- `test_private_use_glyph_stripped`: U+F0A0 discarded
- `test_private_use_glyph_with_spurious_space_stripped`: U+F0A0 + spurious space discarded
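The two Private-Use Area cases could look roughly like the sketch below. `strip_pua_glyph` and its pattern are hypothetical stand-ins rather than the code under test, and the sketch assumes the spurious space directly follows the glyph, which the description does not spell out.

```python
import re

PUA_GLYPH = "\uf0a0"

# Hypothetical pattern: the glyph plus at most one directly following space.
_PUA_WITH_SPACE_RE = re.compile(PUA_GLYPH + " ?")


def strip_pua_glyph(text: str) -> str:
    """Drop U+F0A0 and, if present, the single space right after it."""
    return _PUA_WITH_SPACE_RE.sub("", text)


def test_private_use_glyph_stripped():
    assert strip_pua_glyph("a\uf0a0b") == "ab"


def test_private_use_glyph_with_spurious_space_stripped():
    # Assumed input shape: glyph, then a space that carries no meaning.
    assert strip_pua_glyph("\uf0a0 Item") == "Item"
```

Functions named `test_*` are collected by pytest automatically, so a file like this can be run with `pytest` without further wiring.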

Add two entries missing from the PDF text sanitizer's ligature map:

- U+0132 (Ĳ) → "IJ" and U+0133 (ĳ) → "ij": Latin capital/small ligature IJ, used in Dutch (e.g. IJssel, Ĳ becomes IJ at the start of words).
- U+F0A0 → "": a Private-Use Area glyph emitted by some PDF fonts as a spurious character with no textual meaning; it is silently discarded.

The _LIGATURE_RE pattern is updated to match these new code points.

Closes docling-project#2882

Signed-off-by: Smeet Agrawal <[email protected]>

github-actions · 2026-04-09T11:04:23Z

✅ DCO Check Passed

Thanks @Smeet23, all your commits are properly signed off. 🎉

mergify · 2026-04-09T11:04:46Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
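To see what this rule accepts, the pattern can be tried against a few titles locally. The sketch below assumes the condition behaves like a Python `re` search; `is_conventional` is a hypothetical helper, not part of Mergify.

```python
import re

# The pattern from the merge protection above, copied verbatim.
CONVENTIONAL_TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE_RE.search(title) is not None


titles = [
    "fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0",
    "style: apply ruff formatter fixes",
    "Extend ligature map",
]
for title in titles:
    print(is_conventional(title), title)
```

The scope in parentheses and the `!` breaking-change marker are both optional, so `style: ...` passes as well, while a title without a recognized type prefix is rejected.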

dosubot · 2026-04-09T11:06:22Z

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?

View Suggested Changes

@@ -107,12 +107,36 @@
 - **Key Options**:
     - `treat_singleton_as_text` (default False): Treat 1x1 cells as TextItem
     - `gap_tolerance` (default 0): Table merging tolerance for empty cells
+    - `sheet_names` (default None): An optional list of sheet names to include in conversion. When set, only sheets whose names appear in this list will be processed. Sheet names are matched case-sensitively. Set to None (default) to include all sheets.
     - Enrichment options (image description, chart extraction)
 - **Processing**:
     - Each sheet is treated as a page
     - Table detection via flood-fill, image extraction (bounding box based on cell anchor)
     - Includes provenance (location info), auto page size calculation
+    - When `sheet_names` is specified, only sheets with names in the list are converted. If a sheet name in the list doesn't exist in the workbook, a warning is logged but processing continues.
 - **Notes**: Table detection algorithm and singleton cell handling are configurable. [Backend options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/backend_options.py#L80-L97).
+
+**Usage Example**:
+
+```python
+from docling.datamodel.backend_options import MsExcelBackendOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, XlsxFormatOption
+
+# Process only specific sheets from an Excel file
+excel_options = MsExcelBackendOptions(
+    sheet_names=["Summary", "Data", "Q4 Results"]
+)
+
+converter = DocumentConverter(
+    format_options={
+        InputFormat.XLSX: XlsxFormatOption(backend_options=excel_options)
+    }
+)
+
+result = converter.convert("workbook.xlsx")
+doc = result.document
+```
 
 ---


Smeet23 · 2026-04-09T11:07:51Z

Hi @PeterStaar-IBM — this is a reopened version of #3247, now submitted from the correct account.

Two things were fixed compared to the previous PR:

1. Wrong account: the original was opened from Smeet-Agrawal by mistake. This PR is from Smeet23.
2. Stray test file changes removed: the previous submission accidentally included additions to tests/test_backend_msexcel.py (sheet-names filter tests from an unrelated branch in progress). Those caused the CI failures. This PR touches only the files relevant to the ligature fix:
- docling/models/stages/page_assemble/page_assemble_model.py
- tests/test_page_assemble_model.py

All page assemble model tests pass cleanly. Sorry for the noise on the previous PR.

codecov · 2026-04-10T07:44:37Z

Codecov Report

❌ Patch coverage is 70.00000% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/backend/msexcel_backend.py | 66.66% | 6 Missing ⚠️ |


smeetagrawal23-sys added 3 commits April 7, 2026 15:20

style: apply ruff formatter fixes

5c6f673

Signed-off-by: Smeet Agrawal <[email protected]>

fix: remove accidentally included msexcel tests from ligature branch

b486419

Signed-off-by: Smeet Agrawal <[email protected]>

Smeet23 wants to merge 3 commits into docling-project:main from Smeet23:fix/ligature-map-extend

2 participants
