fix: detect .txt files as Markdown instead of rejecting them #3260

Open
lawrence3699 wants to merge 1 commit into docling-project:main from lawrence3699:fix/txt-format-detection

Conversation

@lawrence3699

Summary

Fixes #3259

PR #3161 added "txt" to FormatToExtensions[InputFormat.MD], but the content-based disambiguation in _guess_from_content was never updated to handle the new case. When a .txt file enters _guess_format:

  1. _mime_from_extension("txt") returns None (intentionally, because "txt" is in both the XML_USPTO and MD extension lists).
  2. The content-detection chain produces mime = "text/plain".
  3. _guess_from_content checks for USPTO format (PATN\r\n header), finds nothing, and returns None.
  4. _guess_format returns None → converter raises "File format not allowed".

The fix adds an elif InputFormat.MD in formats fallback so non-USPTO text/plain content resolves to InputFormat.MD. USPTO .txt files are unaffected because the existing if check runs first.
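
The disambiguation described above can be sketched roughly as follows. This is a simplified, self-contained model and not the actual code in docling/datamodel/document.py: the InputFormat enum here is a two-member stand-in, guess_from_content is a hypothetical free function, and the exact form of the USPTO check (a startswith test on the decoded text) is an assumption based on the "PATN\r\n header" wording in the summary.

```python
from enum import Enum
from typing import Optional, Set


class InputFormat(Enum):
    # Hypothetical stand-in for docling's InputFormat; only two members are modeled.
    MD = "md"
    XML_USPTO = "xml_uspto"


def guess_from_content(
    content: bytes, mime: str, formats: Set[InputFormat]
) -> Optional[InputFormat]:
    """Simplified model of the text/plain branch described in the summary."""
    if mime != "text/plain":
        return None  # other MIME types are out of scope for this sketch

    text = content.decode("utf-8", errors="ignore")

    # Existing behavior: the USPTO check runs first (assumed to be a header test).
    if InputFormat.XML_USPTO in formats and text.startswith("PATN\r\n"):
        return InputFormat.XML_USPTO
    # The fix: fall back to Markdown when the content is not USPTO.
    elif InputFormat.MD in formats:
        return InputFormat.MD

    # Before the fix, every non-USPTO text/plain input ended up here.
    return None
```

In this model, removing the elif branch reproduces the reported behavior: a plain .txt file yields None, which the caller then treats as a disallowed format.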

Changes

  • docling/datamodel/document.py: add MD fallback in _guess_from_content for text/plain
  • tests/test_input_doc.py: update expectations for non-USPTO .txt files (None → InputFormat.MD) and add a plain .txt detection case

Checks

  • pytest tests/test_input_doc.py — 8/8 passed

The .txt extension was added to FormatToExtensions[InputFormat.MD] in
PR docling-project#3161, but _guess_from_content never learned to fall back to MD for
text/plain content that isn't USPTO. A normal .txt file hits the
text/plain branch, fails the USPTO check, and returns None — causing
"File format not allowed" on conversion.

Add an elif that falls back to InputFormat.MD when the content doesn't
match USPTO format. Update the existing test expectations to match and
add a plain .txt detection assertion.

Fixes docling-project#3259
Copilot AI review requested due to automatic review settings April 9, 2026 20:11
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

DCO Check Failed

Hi @lawrence3699, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for lawrence3699 <lawrence3699@users.noreply.github.com>

I, lawrence3699 <lawrence3699@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8ac7933b74ea056e1a9ef8fdc22eae12f55302ae"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@mergify
Contributor

mergify bot commented Apr 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
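
To check a title against this rule locally before pushing, a minimal sketch might look like the following. The pattern is copied from the rule above; the helper name is_conventional is hypothetical, and it is an assumption that Python's re module interprets this pattern the same way Mergify's ~= operator does.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


print(is_conventional("fix: detect .txt files as Markdown instead of rejecting them"))
print(is_conventional("Detect .txt files as Markdown"))
```

The first call corresponds to this PR's title and satisfies the pattern; the second lacks a type prefix and does not.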

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes .txt input handling in Docling’s format-guessing logic so that non‑USPTO plain-text files are detected as Markdown (MD) instead of being rejected, while preserving the existing USPTO .txt detection via the PATN\r\n header.

Changes:

  • Add an MD fallback in _guess_from_content() for text/plain when USPTO content is not detected.
  • Update and extend _guess_format tests to expect .txt → InputFormat.MD for non‑USPTO plain text.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File                            Description
docling/datamodel/document.py   Adds text/plain disambiguation fallback to InputFormat.MD when content is not USPTO.
tests/test_input_doc.py         Updates assertions for non‑USPTO .txt and adds a new plain .txt detection test.

@codecov

codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Files with missing lines        Patch %   Lines
docling/datamodel/document.py   0.00%     2 Missing ⚠️

@PeterStaar-IBM
Member

@lawrence3699 Can you please:

  1. fix the tests
  2. run uv run pre-commit run --all-files a few times (after you fix the tests)

Development

Successfully merging this pull request may close these issues.

File format .txt not allowed despite v2.81.0

3 participants