fix(Deflate): clamp LZ77 distance to bytes of real prior data on first push#261
Conversation
…n first push Streaming `Deflate` initializes its internal 96 KiB buffer with the first 32 KiB reserved as an LZ77 lookback window. Real input is written at offset 32768. On the **first push**, that 32 KiB lookback region is just buffer-init zeros (from `new u8(98304)`), not previously-emitted output. The match-finder in `dflt` capped match distance with `Math.min(32767, i)`, where `i` is the buffer position. On first push `i` starts at 32768, so the cap was always 32767 — the full lookback window. If an input byte happened to equal the zero fill at some offset in the lookback region, the encoder would emit a match with a distance pointing **before the start of the real output stream**. RFC 1951 §3.2.5 forbids that; standard inflaters (fflate's own `inflateSync`, Node's `zlib.inflateRawSync`, etc.) reject the stream with "invalid distance too far back." The bug is data-dependent and easy to miss in synthetic tests: random data and highly compressible data don't trigger it. It does trigger on real-world binary blobs containing zero runs at the right offsets — discovered on an EMF image embedded in an XLSB workbook (downstream issue 101arrowz#260). Fix tracks a new `f` field in `DeflateState` holding the first valid source offset in `dat`: - Initialized to 32768 by the streaming `Deflate` constructor. - Dictionary path moves it back to `32768 - dict.length` because the dictionary IS valid prior data. - Zeroed by `push()` after the first buffer-shift (from then on the entire 96 KiB buffer holds real data). - `dflt` caps `maxd` with `i - (st.f || 0)`. `deflateSync` doesn't set `f`, so its behavior is unchanged (`undefined || 0 === 0`). Verification on the 234-byte downstream-bug minimal repro: streaming output at level 3 is now byte-identical to `deflateSync` output, and both fflate's own `inflateSync` and Node's native `zlib.inflateRawSync` accept it. Full test suite passes (120/120). Performance on a median-of-5 micro-benchmark (Node 22.22.0): - 234 B EMF (the trigger): -9.2% - 4 KB random: +1.1% (noise) - 64 KB random: +0.4% (noise) - 1 MB random: -17.5% - 1 MB zeros: -19.4% - 1 MB text: -3.7% Faster on large inputs because the match-finder's chain-walk terminates earlier when the cap prevents it from chasing matches into the (now correctly excluded) zero-init region. Output sizes identical across all six inputs — the patch only suppresses invalid matches, never valid ones. Closes 101arrowz#260
|
I'll review this in a few days as I'm currently a bit swamped. For my reference - was this PR created partially or fully with the assistance of AI coding agents? |
|
Hello -- this is Peter writing (hence the likely spelling / grammar mistakes). You are correct- 100% of the issue and PR was written by AI. However, I reviewed it and went through 3 cycles on it (asking for a minimal repeatable self-contained example, making sure we can prove RFC non-conformance and validating it against fflate's own inflate) -- so I feel pretty good about it. Also, the PR fixed the bug and has passed my relatively extensive regression bed. However, as with all things, best to be skeptical, but I hope that the above makes it easy for you (or your AI) to validate the issue and fix with minimal effort and you find it a well written Issue. BTW - thanks for managing such an important public project. Super helpful -- it's the only external dependency we have in the entire project. Peter. |
Fixes #260 —
Deflate(andZipDeflateby extension) emitted DEFLATE streams with back-references whose distance exceeded the bytes of real output behind them, producing "invalid distance too far back" on standard inflaters.Root cause
Deflate's 96 KiB internal buffer reserves the first 32 KiB as an LZ77 lookback window; real input is written at offset 32768. On the first push, that lookback region is just buffer-init zeros (fromnew u8(98304)), not previously-emitted output.The match-finder in
dfltcapped match distance withMath.min(32767, i), whereiis the buffer position. On first push,istarts at 32768, so the cap was always 32767 — the full lookback. If an input byte happened to equal the zero fill at some offset, the encoder accepted the match and emitted a back-reference with a distance pointing before the start of the real output stream. RFC 1951 §3.2.5 forbids that, so standard inflaters (fflate.inflateSync, Node'szlib.inflateRawSync, etc.) reject the stream.Data-dependent, which is why synthetic tests miss it. Discovered downstream on a 2.8 KB EMF image embedded in an XLSB workbook — bisection reduced the trigger to a 234-byte prefix containing only EMF format magic + standard GDI record opcodes.
Fix
Track a new
ffield onDeflateStaterepresenting the first valid source offset indat:Deflateconstructor initializesf = 32768.32768 - dict.length(dictionary IS valid prior data).push()zeroesfafter the first buffer-shift (from then on the entire 96 KiB holds real data).dfltcapsmaxdwithi - (st.f || 0).deflateSyncnever setsf, so its behavior is unchanged viaundefined || 0 === 0.Four insertions, one modification, total: +33 / -3 lines.
Verification
Deflateemits invalid back-references on some inputs (cross-validated against native zlib;deflateSynccorrect on same input) #260's 234-byte minimal repro at levels 1, 3, 6, 9: streaming output now inflates via bothfflate.inflateSyncand Node's nativezlib.inflateRawSync; at level 3 it is byte-identical todeflateSyncoutput.fflate.Inflate({dictionary: dict})): round-trips correctly, no spurious distance errors.dflt-call branch inpush()): valid output, round-trips correctly.npm test: 120 / 120 pass.Performance
Median-of-5 micro-benchmark on Node 22.22.0 (Linux), comparing the patched lib against current
master:Net: faster on large inputs and identical on small. The speedup is because the LZ77 match-finder's chain-walk now terminates earlier when the cap prevents it from chasing matches into the (now correctly excluded) zero-init region. Output sizes are identical across all six inputs — the cap only suppresses invalid matches, never valid ones.
Memory
One extra numeric field per
Deflatestate object (~8 bytes inside V8's hidden class). No buffer or array changes.Alternatives considered
head[]with a sentinel instead of zero — adds a per-hash-lookup check in the LZ77 inner loop; my approach pays the cost once perdfltcall instead.firstflag — can't represent the dictionary case cleanly (dict moves the first-valid offset to a non-zero, non-32768 value).maxdat search-start is strictly better.