fix(pdf): extend ligature map with Dutch IJ and PUA glyph U+F0A0 #3254
Smeet23 wants to merge 3 commits into docling-project:main
Add two entries missing from the PDF text sanitizer's ligature map:

- U+0132 (Ĳ) → "IJ" and U+0133 (ĳ) → "ij": Latin capital/small ligature IJ, used in Dutch (e.g. IJssel, where the ligature becomes "IJ" at the start of a word).
- U+F0A0 → "": a Private-Use Area glyph emitted by some PDF fonts as a spurious character with no textual meaning; it is silently discarded.

The _LIGATURE_RE pattern is updated to match these new code points.

Closes docling-project#2882

Signed-off-by: Smeet Agrawal <[email protected]>
✅ DCO Check Passed

Thanks @Smeet23, all your commits are properly signed off. 🎉

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?

@@ -107,12 +107,36 @@
- **Key Options**:
- `treat_singleton_as_text` (default False): Treat 1x1 cells as TextItem
- `gap_tolerance` (default 0): Table merging tolerance for empty cells
+ - `sheet_names` (default None): An optional list of sheet names to include in conversion. When set, only sheets whose names appear in this list will be processed. Sheet names are matched case-sensitively. Set to None (default) to include all sheets.
- Enrichment options (image description, chart extraction)
- **Processing**:
- Each sheet is treated as a page
- Table detection via flood-fill, image extraction (bounding box based on cell anchor)
- Includes provenance (location info), auto page size calculation
+ - When `sheet_names` is specified, only sheets with names in the list are converted. If a sheet name in the list doesn't exist in the workbook, a warning is logged but processing continues.
- **Notes**: Table detection algorithm and singleton cell handling are configurable. [Backend options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/backend_options.py#L80-L97).
+
+**Usage Example**:
+
+```python
+from docling.datamodel.backend_options import MsExcelBackendOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, XlsxFormatOption
+
+# Process only specific sheets from an Excel file
+excel_options = MsExcelBackendOptions(
+ sheet_names=["Summary", "Data", "Q4 Results"]
+)
+
+converter = DocumentConverter(
+ format_options={
+ InputFormat.XLSX: XlsxFormatOption(backend_options=excel_options)
+ }
+)
+
+result = converter.convert("workbook.xlsx")
+doc = result.document
+```
---
Hi @PeterStaar-IBM — this is a reopened version of #3247, now submitted from the correct account. Two things were fixed compared to the previous PR:
All page assemble model tests pass cleanly. Sorry for the noise on the previous PR.
Codecov Report

❌ Patch coverage is
Summary
Closes #2882
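Extends the PDF text sanitizer's ligature normalization map with two entries that were missing:

- U+0132 (Ĳ) → `IJ` and U+0133 (ĳ) → `ij`: the Latin capital and small ligature IJ, used in Dutch
- U+F0A0 → empty string: a Private-Use Area glyph that some PDF fonts emit as a spurious character; it is discarded

The existing map already handles U+FB00–U+FB06 (ff, fi, fl, ffi, ffl, ſt, st). This PR fills the remaining gaps reported in the issue.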
Note: U+0152 (Œ) and U+0153 (œ) are intentionally left as-is — they are legitimate characters in French (e.g. œuvre, cœur) and normalizing them to OE/oe would corrupt French text.
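The normalization described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in `page_assemble_model.py`: the names `LIGATURE_MAP`, `LIGATURE_RE`, and `normalize_ligatures` are hypothetical stand-ins, only a subset of the existing U+FB00–U+FB06 entries is listed, and the spurious-space handling covered by the fourth test is not reproduced because its exact rule is not shown here.

```python
import re

# Hypothetical stand-in for the sanitizer's ligature map (assumed shape:
# one source character mapped to its plain-text replacement).
LIGATURE_MAP = {
    # Subset of the entries the existing map is said to cover.
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    # Entries described in this PR.
    "\u0132": "IJ",  # LATIN CAPITAL LIGATURE IJ
    "\u0133": "ij",  # LATIN SMALL LIGATURE IJ
    "\uf0a0": "",    # Private-Use Area glyph, discarded
}

# Build a character class from the map keys so the pattern and the map
# cannot drift apart (an assumption; the PR updates the pattern by hand).
LIGATURE_RE = re.compile("[" + "".join(re.escape(ch) for ch in LIGATURE_MAP) + "]")


def normalize_ligatures(text: str) -> str:
    """Replace every mapped character with its plain-text equivalent."""
    return LIGATURE_RE.sub(lambda m: LIGATURE_MAP[m.group(0)], text)


print(normalize_ligatures("\u0132sselmeer"))  # IJsselmeer
print(normalize_ligatures("vr\u0133"))        # vrij
print(normalize_ligatures("c\u0153ur"))       # cœur (U+0153 is left untouched)
```

Because U+0152 and U+0153 are absent from the map, French words such as cœur pass through unchanged, which matches the intent stated in the note above.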
Changes
- `page_assemble_model.py`: add U+0132, U+0133, U+F0A0 to `_LIGATURE_MAP`; update `_LIGATURE_RE` to match new code points
- `test_page_assemble_model.py`: add 4 new test cases covering the new entries

Test plan

- `test_ij_capital_ligature`: U+0132 Ĳ → IJ
- `test_ij_small_ligature`: U+0133 ĳ → ij
- `test_private_use_glyph_stripped`: U+F0A0 discarded
- `test_private_use_glyph_with_spurious_space_stripped`: U+F0A0 + spurious space discarded