Skip to content

fix(Deflate): clamp LZ77 distance to bytes of real prior data on first push#261

Open
PeterCoherent wants to merge 1 commit into
101arrowz:masterfrom
PeterCoherent:fix/streaming-deflate-invalid-distance
Open

fix(Deflate): clamp LZ77 distance to bytes of real prior data on first push#261
PeterCoherent wants to merge 1 commit into
101arrowz:masterfrom
PeterCoherent:fix/streaming-deflate-invalid-distance

Conversation

@PeterCoherent
Copy link
Copy Markdown

Fixes #260Deflate (and ZipDeflate by extension) emitted DEFLATE streams with back-references whose distance exceeded the bytes of real output behind them, producing "invalid distance too far back" on standard inflaters.

Root cause

Deflate's 96 KiB internal buffer reserves the first 32 KiB as an LZ77 lookback window; real input is written at offset 32768. On the first push, that lookback region is just buffer-init zeros (from new u8(98304)), not previously-emitted output.

The match-finder in dflt capped match distance with Math.min(32767, i), where i is the buffer position. On first push, i starts at 32768, so the cap was always 32767 — the full lookback. If an input byte happened to equal the zero fill at some offset, the encoder accepted the match and emitted a back-reference with a distance pointing before the start of the real output stream. RFC 1951 §3.2.5 forbids that, so standard inflaters (fflate.inflateSync, Node's zlib.inflateRawSync, etc.) reject the stream.

Data-dependent, which is why synthetic tests miss it. Discovered downstream on a 2.8 KB EMF image embedded in an XLSB workbook — bisection reduced the trigger to a 234-byte prefix containing only EMF format magic + standard GDI record opcodes.

Fix

Track a new f field on DeflateState representing the first valid source offset in dat:

  • Streaming Deflate constructor initializes f = 32768.
  • The dictionary path moves it back to 32768 - dict.length (dictionary IS valid prior data).
  • push() zeroes f after the first buffer-shift (from then on the entire 96 KiB holds real data).
  • dflt caps maxd with i - (st.f || 0). deflateSync never sets f, so its behavior is unchanged via undefined || 0 === 0.

Four insertions, one modification, total: +33 / -3 lines.

Verification

  • Streaming Deflate emits invalid back-references on some inputs (cross-validated against native zlib; deflateSync correct on same input) #260's 234-byte minimal repro at levels 1, 3, 6, 9: streaming output now inflates via both fflate.inflateSync and Node's native zlib.inflateRawSync; at level 3 it is byte-identical to deflateSync output.
  • Multi-push split of the same input across 3 pushes: round-trips correctly.
  • Chunked pushes (1-byte, 64-byte): produce the same valid output as single-push.
  • Dictionary path (compress short text against a dictionary, decompress with that same dict via fflate.Inflate({dictionary: dict})): round-trips correctly, no spurious distance errors.
  • Buffer-shift path exercised via random and all-zero 100 KB inputs (forces the multi-dflt-call branch in push()): valid output, round-trips correctly.
  • npm test: 120 / 120 pass.

Performance

Median-of-5 micro-benchmark on Node 22.22.0 (Linux), comparing the patched lib against current master:

Input master µs/op patched µs/op Δ output bytes match
234 B EMF (the trigger) 53.68 48.72 −9.2 % 153 = 153
4 KB random 142.77 144.34 +1.1 % (noise) 4 134 = 4 134
64 KB random 1 306 1 312 +0.4 % (noise) 65 590 = 65 590
1 MB random 30 243 24 956 −17.5 % 1 048 834 = 1 048 834
1 MB zeros 9 820 7 914 −19.4 % 1 034 = 1 034
1 MB text 7 966 7 670 −3.7 % 3 130 = 3 130

Net: faster on large inputs and identical on small. The speedup is because the LZ77 match-finder's chain-walk now terminates earlier when the cap prevents it from chasing matches into the (now correctly excluded) zero-init region. Output sizes are identical across all six inputs — the cap only suppresses invalid matches, never valid ones.

Memory

One extra numeric field per Deflate state object (~8 bytes inside V8's hidden class). No buffer or array changes.

Alternatives considered

  • Init head[] with a sentinel instead of zero — adds a per-hash-lookup check in the LZ77 inner loop; my approach pays the cost once per dflt call instead.
  • Boolean first flag — can't represent the dictionary case cleanly (dict moves the first-valid offset to a non-zero, non-32768 value).
  • Per-match-acceptance check inside the LZ77 loop — wastes work in the match-finder and rejects matches at the boundary instead of letting it find a shorter valid one. Capping maxd at search-start is strictly better.

…n first push

Streaming `Deflate` initializes its internal 96 KiB buffer with the
first 32 KiB reserved as an LZ77 lookback window. Real input is
written at offset 32768. On the **first push**, that 32 KiB lookback
region is just buffer-init zeros (from `new u8(98304)`), not
previously-emitted output.

The match-finder in `dflt` capped match distance with
`Math.min(32767, i)`, where `i` is the buffer position. On first push
`i` starts at 32768, so the cap was always 32767 — the full lookback
window. If an input byte happened to equal the zero fill at some
offset in the lookback region, the encoder would emit a match with a
distance pointing **before the start of the real output stream**.
RFC 1951 §3.2.5 forbids that; standard inflaters (fflate's own
`inflateSync`, Node's `zlib.inflateRawSync`, etc.) reject the stream
with "invalid distance too far back."

The bug is data-dependent and easy to miss in synthetic tests:
random data and highly compressible data don't trigger it. It does
trigger on real-world binary blobs containing zero runs at the
right offsets — discovered on an EMF image embedded in an XLSB
workbook (downstream issue 101arrowz#260).

Fix tracks a new `f` field in `DeflateState` holding the first
valid source offset in `dat`:
- Initialized to 32768 by the streaming `Deflate` constructor.
- Dictionary path moves it back to `32768 - dict.length` because
  the dictionary IS valid prior data.
- Zeroed by `push()` after the first buffer-shift (from then on
  the entire 96 KiB buffer holds real data).
- `dflt` caps `maxd` with `i - (st.f || 0)`. `deflateSync` doesn't
  set `f`, so its behavior is unchanged (`undefined || 0 === 0`).

Verification on the 234-byte downstream-bug minimal repro: streaming
output at level 3 is now byte-identical to `deflateSync` output, and
both fflate's own `inflateSync` and Node's native
`zlib.inflateRawSync` accept it. Full test suite passes (120/120).

Performance on a median-of-5 micro-benchmark (Node 22.22.0):
- 234 B EMF (the trigger): -9.2%
- 4 KB random: +1.1% (noise)
- 64 KB random: +0.4% (noise)
- 1 MB random: -17.5%
- 1 MB zeros: -19.4%
- 1 MB text: -3.7%

Faster on large inputs because the match-finder's chain-walk
terminates earlier when the cap prevents it from chasing matches
into the (now correctly excluded) zero-init region. Output sizes
identical across all six inputs — the patch only suppresses
invalid matches, never valid ones.

Closes 101arrowz#260
@101arrowz
Copy link
Copy Markdown
Owner

I'll review this in a few days as I'm currently a bit swamped. For my reference - was this PR created partially or fully with the assistance of AI coding agents?

@PeterCoherent
Copy link
Copy Markdown
Author

Hello -- this is Peter writing (hence the likely spelling / grammar mistakes). You are correct- 100% of the issue and PR was written by AI. However, I reviewed it and went through 3 cycles on it (asking for a minimal repeatable self-contained example, making sure we can prove RFC non-conformance and validating it against fflate's own inflate) -- so I feel pretty good about it. Also, the PR fixed the bug and has passed my relatively extensive regression bed.

However, as with all things, best to be skeptical, but I hope that the above makes it easy for you (or your AI) to validate the issue and fix with minimal effort and you find it a well written Issue.

BTW - thanks for managing such an important public project. Super helpful -- it's the only external dependency we have in the entire project.

Peter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming Deflate emits invalid back-references on some inputs (cross-validated against native zlib; deflateSync correct on same input)

2 participants