Extracted Table from PDF to Markdown is different from one in JSON #17894

@etetteh

🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem description)

I just noticed that the table extracted into the exported Markdown file differs from the one in the exported JSON file, which causes data integrity and consistency issues.

The Markdown output is accurate, but the JSON output is not.

virology_pg2_0_res.json
virology_pg2_0.md
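
To pinpoint where the two attached files disagree, a rough comparison script along these lines might help. It is only a sketch and rests on an assumption not confirmed in this report: that both exports carry tables as HTML `<table>` markup (if the Markdown file uses pipe tables, it will find nothing there). The names `TableCellParser`, `html_tables`, `json_strings`, and `compare` are hypothetical helpers, not part of PaddleOCR or PaddleX. The script does not rely on any particular JSON key; it simply scans every string value in the JSON file, so a table stored under more than one key would be counted more than once.

```python
import json
import re
from html.parser import HTMLParser
from pathlib import Path


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being filled
        self._cell = None   # text chunks of the cell currently being filled

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            if self._row:   # previous row had no closing </tr>
                self.rows.append(self._row)
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace so formatting differences are ignored.
            text = " ".join("".join(self._cell).split())
            if self._row is None:
                self._row = []
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_tables(text):
    """Return one list of rows per <table>...</table> found in text."""
    tables = []
    for match in re.finditer(r"<table.*?</table>", text, flags=re.S | re.I):
        parser = TableCellParser()
        parser.feed(match.group(0))
        parser.close()
        tables.append(parser.rows)
    return tables


def json_strings(node):
    """Yield every string value nested anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from json_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from json_strings(value)


def compare(json_path, md_path):
    """Return (table_index, row_index, json_row, md_row) for each mismatch."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    json_tables = [t for s in json_strings(data) for t in html_tables(s)]
    md_tables = html_tables(Path(md_path).read_text(encoding="utf-8"))
    diffs = []
    for ti in range(max(len(json_tables), len(md_tables))):
        jt = json_tables[ti] if ti < len(json_tables) else []
        mt = md_tables[ti] if ti < len(md_tables) else []
        for ri in range(max(len(jt), len(mt))):
            jr = jt[ri] if ri < len(jt) else None
            mr = mt[ri] if ri < len(mt) else None
            if jr != mr:
                diffs.append((ti, ri, jr, mr))
    return diffs
```

For the files attached here, something like `compare("output/virology_pg2_0_res.json", "output/virology_pg2_0.md")` should list each row where the JSON table and the Markdown table differ, assuming both files were saved under `output`.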

🏃‍♂️ Environment (Runtime environment)

OS            macOS-26.2 
Environment   Jupyter
Python        3.13.2
PaddleOCR     3.4.0
Install       uv
RAM           16.00 GB
CPU           Apple M1
CUDA          None

🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.5")

output = pipeline.predict(input="./data/virology_pg2.pdf")

pages_res = list(output)

output = pipeline.restructure_pages(pages_res)

# output = pipeline.restructure_pages(pages_res, merge_table=True) # Merge tables across pages
# output = pipeline.restructure_pages(pages_res, merge_table=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
# output = pipeline.restructure_pages(pages_res, merge_table=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge multiple pages
for res in output:
    res.print() # Print the structured prediction output
    res.save_to_json(save_path="output") # Save the current image's structured result in JSON format
    res.save_to_markdown(save_path="output") # Save the current image's result in Markdown format

virology_pg2.pdf
