fix: detect .txt files as Markdown instead of rejecting them #3260
lawrence3699 wants to merge 1 commit into docling-project:main
The .txt extension was added to FormatToExtensions[InputFormat.MD] in PR docling-project#3161, but _guess_from_content never learned to fall back to MD for text/plain content that isn't USPTO. A normal .txt file hits the text/plain branch, fails the USPTO check, and returns None, causing "File format not allowed" on conversion.

Add an elif that falls back to InputFormat.MD when the content doesn't match USPTO format. Update the existing test expectations to match and add a plain .txt detection assertion.

Fixes docling-project#3259
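The branch described above can be sketched in isolation. This is a minimal illustration, not the code in docling/datamodel/document.py: the `InputFormat` enum here is a two-member stand-in, and `guess_text_plain` is a hypothetical helper that models only the text/plain case. The `PATN\r\n` header check and the `elif InputFormat.MD in formats` fallback are taken from the PR text.

```python
from enum import Enum
from typing import Optional, Sequence


class InputFormat(str, Enum):
    # Stand-in for docling's InputFormat; only the two members needed here.
    MD = "md"
    XML_USPTO = "xml_uspto"


# Header that the PR text names as the marker of a USPTO .txt file.
USPTO_HEADER = "PATN\r\n"


def guess_text_plain(
    content: str, formats: Sequence[InputFormat]
) -> Optional[InputFormat]:
    """Hypothetical helper: resolve text/plain content to a format."""
    if InputFormat.XML_USPTO in formats and content.startswith(USPTO_HEADER):
        # The USPTO check runs first, so USPTO files keep their behavior.
        return InputFormat.XML_USPTO
    elif InputFormat.MD in formats:
        # The fallback: any other plain text is treated as Markdown.
        return InputFormat.MD
    # Neither format is allowed by the caller: nothing to return.
    return None
```

Because the USPTO check comes first, the fallback only ever sees content that has already failed it, which is why the order of the two branches matters.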
❌ DCO Check Failed

Hi @lawrence3699, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for lawrence3699 <lawrence3699@users.noreply.github.com>
    I, lawrence3699 <lawrence3699@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8ac7933b74ea056e1a9ef8fdc22eae12f55302ae"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Pull request overview
This PR fixes .txt input handling in Docling’s format-guessing logic so that non‑USPTO plain-text files are detected as Markdown (MD) instead of being rejected, while preserving the existing USPTO .txt detection via the PATN\r\n header.
Changes:
- Add an MD fallback in `_guess_from_content()` for `text/plain` when USPTO content is not detected.
- Update and extend `_guess_format` tests to expect `.txt` → `InputFormat.MD` for non‑USPTO plain text.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| docling/datamodel/document.py | Adds text/plain disambiguation fallback to InputFormat.MD when content is not USPTO. |
| tests/test_input_doc.py | Updates assertions for non‑USPTO .txt and adds a new plain .txt detection test. |
Codecov Report
❌ Patch coverage is
@lawrence3699 Can you please:
Summary

Fixes #3259

PR #3161 added `"txt"` to `FormatToExtensions[InputFormat.MD]`, but the content-based disambiguation in `_guess_from_content` was never updated to handle the new case. When a `.txt` file enters `_guess_format`:

1. `_mime_from_extension("txt")` returns `None` (intentionally, because `"txt"` is in both the XML_USPTO and MD extension lists).
2. `mime = "text/plain"`.
3. `_guess_from_content` checks for USPTO format (`PATN\r\n` header), finds nothing, and returns `None`.
4. `_guess_format` returns `None` → converter raises "File format not allowed".

The fix adds an `elif InputFormat.MD in formats` fallback so non-USPTO `text/plain` content resolves to `InputFormat.MD`. USPTO `.txt` files are unaffected because the existing `if` check runs first.

Changes

- `docling/datamodel/document.py`: add MD fallback in `_guess_from_content` for `text/plain`
- `tests/test_input_doc.py`: update expectations for non-USPTO `.txt` files (`None` → `InputFormat.MD`) and add a plain `.txt` detection case

Checks

- `pytest tests/test_input_doc.py`: 8/8 passed
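The detection path walked through in the summary can be modeled end to end in a few lines. The sketch below is a simplified stand-in, not docling's implementation: `FORMAT_TO_EXTENSIONS`, `mime_from_extension`, `guess_format`, and the `md_fallback` switch are all hypothetical names chosen for illustration, the extension table is an invented subset, and treating every unresolved extension as `text/plain` is an assumption. Only the ambiguity of `"txt"`, the `PATN\r\n` check, and the MD fallback come from the PR text.

```python
from enum import Enum
from typing import Dict, List, Optional


class InputFormat(str, Enum):
    # Two-member stand-in for docling's InputFormat.
    MD = "md"
    XML_USPTO = "xml_uspto"


# Illustrative subset of an extension table; "txt" is listed under both formats.
FORMAT_TO_EXTENSIONS: Dict[InputFormat, List[str]] = {
    InputFormat.MD: ["md", "txt"],
    InputFormat.XML_USPTO: ["xml", "txt"],
}

# Assumed MIME types for this sketch only.
FORMAT_TO_MIME: Dict[InputFormat, str] = {
    InputFormat.MD: "text/markdown",
    InputFormat.XML_USPTO: "application/xml",
}


def mime_from_extension(ext: str) -> Optional[str]:
    """Return a MIME type only when exactly one format claims the extension."""
    owners = [fmt for fmt, exts in FORMAT_TO_EXTENSIONS.items() if ext in exts]
    if len(owners) != 1:
        return None  # ambiguous or unknown: defer to content inspection
    return FORMAT_TO_MIME[owners[0]]


def guess_format(ext: str, content: str, md_fallback: bool) -> Optional[InputFormat]:
    """Model the flow before (md_fallback=False) and after (True) the fix."""
    # Assumption: an unresolved extension is inspected as text/plain.
    mime = mime_from_extension(ext) or "text/plain"
    if mime == "text/plain":
        if content.startswith("PATN\r\n"):
            return InputFormat.XML_USPTO
        if md_fallback:
            return InputFormat.MD
        return None  # the pre-fix outcome that led to the rejection
    return InputFormat.MD if mime == "text/markdown" else InputFormat.XML_USPTO
```

Calling `guess_format("txt", "plain notes", md_fallback=False)` reproduces the rejected case, while the same call with `md_fallback=True` resolves to `InputFormat.MD`. Content starting with `PATN\r\n` resolves to `InputFormat.XML_USPTO` under either setting.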