Convert-PDFToText instantiates one iText extraction strategy per call and
reuses it for every page in the document. iText strategies (both
SimpleTextExtractionStrategy and LocationTextExtractionStrategy) accumulate
internal state across GetTextFromPage() calls. As a result, the N-th string
emitted by the function contains the text of pages 1..N, concatenated or overlapped,
instead of only page N. On PDFs with 3 or more pages the output contains heavy
duplication and, with LT, position-based mash-ups (e.g. rows from page 1 and
page 2 interleaved at the same Y coordinate).
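The accumulation can be illustrated without iText at all. The following is a minimal toy model, not the library's code: FakeStrategy is a hypothetical stand-in that, like the real strategies, keeps everything it has seen in an internal buffer and returns the whole buffer on each call.

```powershell
# Hypothetical stand-in for a stateful extraction strategy (NOT an iText type).
class FakeStrategy {
    [System.Text.StringBuilder] $Buffer = [System.Text.StringBuilder]::new()

    # Appends the page text to the buffer and returns everything seen so far.
    [string] Extract([string] $PageText) {
        [void] $this.Buffer.Append($PageText)
        return $this.Buffer.ToString()
    }
}

$pages = 'P1 ', 'P2 ', 'P3 '

# One shared instance: element N holds the text of pages 1..N.
$shared    = [FakeStrategy]::new()
$sharedOut = foreach ($p in $pages) { $shared.Extract($p) }

# A fresh instance per page: element N holds only page N.
$freshOut  = foreach ($p in $pages) { [FakeStrategy]::new().Extract($p) }

$sharedOut   # 'P1 ', 'P1 P2 ', 'P1 P2 P3 '
$freshOut    # 'P1 ', 'P2 ', 'P3 '
```

The shared-instance output has the same shape as the table in the reproduction below: each element grows by the content of all earlier pages.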
Reproduction
Install-Module PSWritePDF -Scope CurrentUser
Import-Module PSWritePDF
# Any 3+ page PDF with text on each page works
$out = Convert-PDFToText -FilePath '.\sample-3pages.pdf' -ExtractionStrategy LT
$out.Count # 3 (one element per page, as expected)
$out[0].Length # only page 1 content
$out[1].Length # contains page 1 + page 2 (duplicated)
$out[2].Length # contains page 1 + page 2 + page 3 (full mash-up)
On my test PDF (3 pages, error-report style, ~98 unique rows total):
| Element | VT/3 entries found (regex VT/3\s+\d+) | Expected (rows on that page) |
|---|---|---|
| 0 | 24 | 24 |
| 1 | 83 | 59 |
| 2 | 98 | 15 |
Workaround that produces correct output, because each call to the function
creates a fresh strategy ($pdf is the file path, $totalPages the page count):
1..$totalPages | ForEach-Object {
    Convert-PDFToText -FilePath $pdf -Page $_ -ExtractionStrategy LT
}
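The workaround needs a page count from somewhere. One possible way to get it, sketched here under the assumption that importing PSWritePDF has already loaded the iText 7 assemblies into the session, is to open the document directly with iText and call GetNumberOfPages(). The sample file name is the one from the reproduction; the variable names are illustrative.

```powershell
Import-Module PSWritePDF   # assumption: this loads the iText 7 assemblies

# iText resolves relative paths against the process directory, so use a full path.
$pdf = (Resolve-Path '.\sample-3pages.pdf').Path

$document = [iText.Kernel.Pdf.PdfDocument]::new([iText.Kernel.Pdf.PdfReader]::new($pdf))
try {
    $totalPages = $document.GetNumberOfPages()
} finally {
    $document.Close()   # also closes the underlying reader
}

# One Convert-PDFToText call per page, so each call gets its own strategy.
$perPage = 1..$totalPages | ForEach-Object {
    Convert-PDFToText -FilePath $pdf -Page $_ -ExtractionStrategy LT
}
```

This reopens the file once per page, so it is slower than a single call on large documents; it is meant only as a stopgap until the function itself is fixed.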
Root cause
PSWritePDF.psm1, function Convert-PDFToText (around lines 1737–1773 in 0.0.20):
the $iTextExtractionStrategy object is created once, then reused inside both the
for ($Count = 1; ...) loop (all pages) and the foreach ($Count in $Page) loop
(selected pages).
Proposed fix
Move the strategy instantiation inside the per-page loop, so every page gets
a fresh instance:
# Inside both loops, before the GetTextFromPage call:
if ($ExtractionStrategy -eq "SimpleTextExtractionStrategy" -or $ExtractionStrategy -eq "ST") {
    $iTextExtractionStrategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.SimpleTextExtractionStrategy]::new()
} elseif ($ExtractionStrategy -eq "LocationTextExtractionStrategy" -or $ExtractionStrategy -eq "LT") {
    $iTextExtractionStrategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new()
}
(The original block creating the strategy in the outer try can then be removed.)
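For context, here is a standalone sketch of what a per-page loop with a fresh strategy looks like when written directly against iText. It is not the module's actual code: the variable names are illustrative, it hard-codes LocationTextExtractionStrategy, and it assumes the iText 7 assemblies are loaded by importing PSWritePDF.

```powershell
Import-Module PSWritePDF   # assumption: this loads the iText 7 assemblies

$path     = (Resolve-Path '.\sample-3pages.pdf').Path
$document = [iText.Kernel.Pdf.PdfDocument]::new([iText.Kernel.Pdf.PdfReader]::new($path))
try {
    $pages = for ($n = 1; $n -le $document.GetNumberOfPages(); $n++) {
        # Fresh strategy per page, so no text carries over from earlier pages.
        $strategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new()
        [iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($document.GetPage($n), $strategy)
    }
} finally {
    $document.Close()
}
```

Strategy objects are cheap to construct, so creating one per page should not add measurable cost compared with the extraction itself.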
I have applied this patch locally and confirmed that:
- The N-th emitted string now contains only page-N text.
- Total matches across all returned strings equal the expected per-page totals
summed (24 + 59 + 15 = 98 in my case, previously 24 + 83 + 98 = 205).
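The per-element counts can be reproduced with a small helper along these lines. Get-MatchCountPerPage is a hypothetical name introduced for illustration; the default pattern is the one used for the table above, and the demo strings are synthetic.

```powershell
# Hypothetical helper: counts regex matches in each per-page string.
function Get-MatchCountPerPage {
    param(
        [string[]] $PageText,
        [string]   $Pattern = 'VT/3\s+\d+'
    )
    foreach ($text in $PageText) {
        [regex]::Matches($text, $Pattern).Count
    }
}

# Synthetic demo: two "pages" with 2 and 1 matching rows.
$demo   = "VT/3 101`nVT/3 102", "VT/3 103"
$counts = @(Get-MatchCountPerPage -PageText $demo)
$total  = ($counts | Measure-Object -Sum).Sum

$counts   # 2, 1
$total    # 3
```

Against real output, pass the result of Convert-PDFToText as -PageText. With the fix applied, the per-element counts should match the expected rows per page and the total should equal the number of unique rows.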
Environment
- PSWritePDF 0.0.20
- Windows PowerShell 5.1
- Windows 11 Pro 24H2