fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0 #3254

Open
Smeet23 wants to merge 3 commits into docling-project:main from Smeet23:fix/ligature-map-extend

Conversation

Smeet23 commented Apr 9, 2026

Summary

Closes #2882

Extends the PDF text sanitizer's ligature normalization map with three code points that were missing (the Dutch IJ ligature pair and one Private-Use Area glyph):

| Code point | Character | Normalized to | Reason |
|---|---|---|---|
| U+0132 | Ĳ | IJ | Latin capital ligature IJ, used in Dutch (e.g. IJssel, IJmuiden) |
| U+0133 | ĳ | ij | Latin small ligature IJ, used in Dutch |
| U+F0A0 (PUA) | `` | (discarded) | Private-Use Area glyph emitted by some PDF fonts as a spurious character with no textual meaning |

The existing map already handles U+FB00–U+FB06 (ff, fi, fl, ffi, ffl, ſt, st). This PR fills the remaining gaps reported in the issue.

Note: U+0152 (Œ) and U+0153 (œ) are intentionally left as-is — they are legitimate characters in French (e.g. œuvre, cœur) and normalizing them to OE/oe would corrupt French text.
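The distinction drawn in this note can also be checked against the Unicode character database. The snippet below is only an illustration using Python's standard `unicodedata` module; the helper name `describe` is made up for this example and is not part of the PR. Ĳ and ĳ carry a compatibility decomposition to plain letters, whereas Œ and œ have none, and U+F0A0 falls in the private-use category.

```python
import unicodedata


def describe(ch: str) -> str:
    """Summarize the code point, general category and decomposition of one character."""
    decomposition = unicodedata.decomposition(ch) or "(none)"
    return (
        f"U+{ord(ch):04X} "
        f"category={unicodedata.category(ch)} "
        f"decomposition={decomposition}"
    )


# Code points discussed in this PR: the IJ ligatures, Œ/œ, and the PUA glyph.
for ch in "\u0132\u0133\u0152\u0153\uf0a0":
    print(describe(ch))
```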

Changes

  • page_assemble_model.py: add U+0132, U+0133, U+F0A0 to _LIGATURE_MAP; update _LIGATURE_RE to match new code points
  • test_page_assemble_model.py: add 4 new test cases covering the new entries
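For readers without the diff at hand, the change could look roughly like the sketch below. Only the names `_LIGATURE_MAP` and `_LIGATURE_RE` come from this description; the map contents beyond the three new entries, the way the pattern is built, and the helper `normalize_ligatures` are assumptions made for illustration and may differ from the real module.

```python
import re

# Hypothetical reconstruction, not the actual docling source.
_LIGATURE_MAP = {
    "\ufb01": "fi",  # one pre-existing entry shown for context; others omitted
    "\u0132": "IJ",  # new: Latin capital ligature IJ
    "\u0133": "ij",  # new: Latin small ligature ij
    "\uf0a0": "",    # new: Private-Use Area glyph, discarded
}

# Deriving the character class from the map keys keeps the pattern and the
# map in sync (an assumption; the PR may list the code points explicitly).
_LIGATURE_RE = re.compile(
    "[" + "".join(re.escape(ch) for ch in _LIGATURE_MAP) + "]"
)


def normalize_ligatures(text: str) -> str:
    """Replace each mapped code point with its normalized form."""
    return _LIGATURE_RE.sub(lambda m: _LIGATURE_MAP[m.group(0)], text)


print(normalize_ligatures("\u0132ssel, c\u0153ur"))
```

Because U+0152 and U+0153 are absent from the map, French words such as cœur pass through unchanged in this sketch.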

Test plan

  • All page assemble model tests pass
  • test_ij_capital_ligature: U+0132 Ĳ → IJ
  • test_ij_small_ligature: U+0133 ĳ → ij
  • test_private_use_glyph_stripped: U+F0A0 discarded
  • test_private_use_glyph_with_spurious_space_stripped: U+F0A0 + spurious space discarded
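The test names above suggest checks along the following lines. This is a self-contained sketch: `normalize` is a stand-in built on `str.translate`, not the sanitizer from `page_assemble_model.py`, and the input strings are invented. The fourth test is left out because the description does not say what input it uses or how the spurious space is handled.

```python
# Stand-in limited to the three code points added in this PR (assumption:
# the real sanitizer has a different entry point and covers more cases).
_NEW_ENTRIES = str.maketrans({"\u0132": "IJ", "\u0133": "ij", "\uf0a0": ""})


def normalize(text: str) -> str:
    """Apply the three new mappings to a string."""
    return text.translate(_NEW_ENTRIES)


def test_ij_capital_ligature():
    assert normalize("\u0132ssel") == "IJssel"


def test_ij_small_ligature():
    assert normalize("v\u0133f") == "vijf"


def test_private_use_glyph_stripped():
    assert normalize("foo\uf0a0bar") == "foobar"
```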

Add two entries missing from the PDF text sanitizer's ligature map:
- U+0132 (Ĳ) → "IJ" and U+0133 (ĳ) → "ij": Latin capital/small ligature
  IJ, used in Dutch (e.g. IJssel; Ĳ becomes IJ at the start of words).
- U+F0A0 → "": a Private-Use Area glyph emitted by some PDF fonts as a
  spurious character with no textual meaning; it is silently discarded.

The _LIGATURE_RE pattern is updated to match these new code points.

Closes docling-project#2882

Signed-off-by: Smeet Agrawal <[email protected]>
github-actions bot commented Apr 9, 2026

DCO Check Passed

Thanks @Smeet23, all your commits are properly signed off. 🎉

mergify bot commented Apr 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
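To try a title against this rule locally, the pattern can be run with Python's `re` module. This is a rough check that assumes the `~=` operator behaves like a regular-expression search; the helper name `is_conventional` is invented for the example.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0"))
print(is_conventional("Extend ligature map"))
```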

dosubot bot commented Apr 9, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
View Suggested Changes
@@ -107,12 +107,36 @@
 - **Key Options**:
     - `treat_singleton_as_text` (default False): Treat 1x1 cells as TextItem
     - `gap_tolerance` (default 0): Table merging tolerance for empty cells
+    - `sheet_names` (default None): An optional list of sheet names to include in conversion. When set, only sheets whose names appear in this list will be processed. Sheet names are matched case-sensitively. Set to None (default) to include all sheets.
     - Enrichment options (image description, chart extraction)
 - **Processing**:
     - Each sheet is treated as a page
     - Table detection via flood-fill, image extraction (bounding box based on cell anchor)
     - Includes provenance (location info), auto page size calculation
+    - When `sheet_names` is specified, only sheets with names in the list are converted. If a sheet name in the list doesn't exist in the workbook, a warning is logged but processing continues.
 - **Notes**: Table detection algorithm and singleton cell handling are configurable. [Backend options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/backend_options.py#L80-L97).
+
+**Usage Example**:
+
+```python
+from docling.datamodel.backend_options import MsExcelBackendOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, XlsxFormatOption
+
+# Process only specific sheets from an Excel file
+excel_options = MsExcelBackendOptions(
+    sheet_names=["Summary", "Data", "Q4 Results"]
+)
+
+converter = DocumentConverter(
+    format_options={
+        InputFormat.XLSX: XlsxFormatOption(backend_options=excel_options)
+    }
+)
+
+result = converter.convert("workbook.xlsx")
+doc = result.document
+```
 
 ---
 

Smeet23 commented Apr 9, 2026

Hi @PeterStaar-IBM — this is a reopened version of #3247, now submitted from the correct account.

Two things were fixed compared to the previous PR:

  1. Wrong account — the original was opened from Smeet-Agrawal by mistake. This PR is from Smeet23.

  2. Stray test file changes removed — the previous submission accidentally included additions to tests/test_backend_msexcel.py (sheet-names filter tests from an unrelated branch in progress). Those caused the CI failures. This PR touches only the files relevant to the ligature fix:

    • docling/models/stages/page_assemble/page_assemble_model.py
    • tests/test_page_assemble_model.py

All page assemble model tests pass cleanly. Sorry for the noise on the previous PR.

codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 70.00000% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/msexcel_backend.py | 66.66% | 6 Missing ⚠️ |

Development

Successfully merging this pull request may close these issues.

Ligature parsing