Convert-PDFToText: reused ExtractionStrategy causes cumulative/duplicated text on multi-page PDFs #80

@MegaTRON-IT

Description

Convert-PDFToText instantiates a single iText extraction strategy per call and
reuses it for every page in the document. iText strategies (both
SimpleTextExtractionStrategy and LocationTextExtractionStrategy) accumulate
internal state across GetTextFromPage() calls. As a result, the N-th string
emitted by the function contains the text of pages 1..N concatenated or overlapped,
instead of only page N. On 3+ page PDFs the output contains heavy duplication and,
with LocationTextExtractionStrategy (LT), position-based mash-ups (e.g. rows from
page 1 and page 2 interleaved at the same Y coordinate).
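
The accumulation can be illustrated without iText at all. The following is only a minimal sketch: FakeStrategy and Get-FakePageText are hypothetical stand-ins (not part of iText or PSWritePDF) that mimic a listener which appends to an internal buffer and never clears it.

```powershell
# Hypothetical stand-in for an iText extraction strategy: every chunk it
# receives is appended to one internal buffer that is never reset.
class FakeStrategy {
    hidden [System.Text.StringBuilder] $Buffer = [System.Text.StringBuilder]::new()

    [void] EventOccurred([string] $Chunk) { [void] $this.Buffer.Append($Chunk) }

    [string] GetResultantText() { return $this.Buffer.ToString() }
}

# Hypothetical stand-in for GetTextFromPage(): feed one page to the
# strategy, then return whatever the strategy has collected so far.
function Get-FakePageText {
    param([string] $PageContent, [FakeStrategy] $Strategy)
    $Strategy.EventOccurred($PageContent)
    $Strategy.GetResultantText()
}

$pages = 'A', 'B', 'C'

# One shared instance for all pages (the behaviour described in this issue)
$shared = [FakeStrategy]::new()
$reused = foreach ($p in $pages) { Get-FakePageText -PageContent $p -Strategy $shared }

# A fresh instance for every page
$fresh = foreach ($p in $pages) { Get-FakePageText -PageContent $p -Strategy ([FakeStrategy]::new()) }

$reused -join ','   # A,AB,ABC
$fresh -join ','    # A,B,C
```

With the shared instance, element N holds pages 1..N; with a fresh instance per page, each element holds only its own page.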

Reproduction

Install-Module PSWritePDF -Scope CurrentUser
Import-Module PSWritePDF

# Any 3+ page PDF with text on each page works
$out = Convert-PDFToText -FilePath '.\sample-3pages.pdf' -ExtractionStrategy LT
$out.Count                  # 3 (one element per page, as expected)
$out[0].Length              # only page 1 content
$out[1].Length              # contains page 1 + page 2 (duplicated)
$out[2].Length              # contains page 1 + page 2 + page 3 (full mash-up)

On my test PDF (3 pages, error-report style, ~98 unique rows total):

| Element | VT/3 entries found (regex VT/3\s+\d+) | Expected (rows on that page) |
|---|---|---|
| 0 | 24 | 24 |
| 1 | 83 | 59 |
| 2 | 98 | 15 |
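
The observed counts are exactly the running totals of the expected per-page counts, which is what a strategy that keeps all earlier pages would produce. A small arithmetic check, using only the numbers from the table above (variable names are illustrative):

```powershell
# Expected rows per page, taken from the table above
$expected = 24, 59, 15

# Running total: what each element holds if earlier pages are never cleared
$running = 0
$cumulative = foreach ($n in $expected) { $running += $n; $running }

$cumulative -join ', '                     # 24, 83, 98
($cumulative | Measure-Object -Sum).Sum    # 205
($expected | Measure-Object -Sum).Sum      # 98
```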

Workaround that produces correct output (each call re-enters the function, which
creates a fresh strategy):

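$pdf = '.\sample-3pages.pdf'
$totalPages = 3   # page count of the sample PDF

# One call per page, so each page is extracted with its own strategy instance
1..$totalPages | ForEach-Object {
    Convert-PDFToText -FilePath $pdf -Page $_ -ExtractionStrategy LT
}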

Root cause

PSWritePDF.psm1, function Convert-PDFToText (around lines 1737–1773 in 0.0.20):
the $iTextExtractionStrategy object is created once, then reused inside both the
for ($Count = 1; ...) loop (all pages) and the foreach ($Count in $Page) loop
(selected pages).

Proposed fix

Move the strategy instantiation inside the per-page loop, so every page gets
a fresh instance:

# Inside both loops, before the GetTextFromPage call:
if ($ExtractionStrategy -eq "SimpleTextExtractionStrategy" -or $ExtractionStrategy -eq "ST") {
    $iTextExtractionStrategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.SimpleTextExtractionStrategy]::new()
} elseif ($ExtractionStrategy -eq "LocationTextExtractionStrategy" -or $ExtractionStrategy -eq "LT") {
    $iTextExtractionStrategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new()
}

(The original block creating the strategy in the outer try can then be removed.)

I have applied this patch locally and confirmed that:

  • The N-th emitted string now contains only page-N text.
  • Total matches across all returned strings equal the expected per-page totals
    summed (24 + 59 + 15 = 98 in my case, previously 24 + 83 + 98 = 205).
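
The per-element count used for this check can be sketched as follows. Get-MatchCountPerElement is a hypothetical helper written for this report, not a PSWritePDF function, and the sample strings are placeholders standing in for real Convert-PDFToText output.

```powershell
# Hypothetical helper: count regex matches in each returned string.
function Get-MatchCountPerElement {
    param(
        [string[]] $Text,
        [string] $Pattern = 'VT/3\s+\d+'
    )
    for ($i = 0; $i -lt $Text.Count; $i++) {
        [pscustomobject]@{
            Element = $i
            Hits    = [regex]::Matches($Text[$i], $Pattern).Count
        }
    }
}

# Placeholder data: two "pages", with two rows and one row respectively
$sample = "VT/3 1`nVT/3 2", "VT/3 3"
$counts = Get-MatchCountPerElement -Text $sample
$counts | Format-Table -AutoSize
```

Against a real document, $sample would be replaced by the array returned from Convert-PDFToText.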

Environment

  • PSWritePDF 0.0.20
  • Windows PowerShell 5.1
  • Windows 11 Pro 24H2
