perf(encoding): Dispatch loop for Varint bulk decode to handle "sporadic large values" (#594) by srsuryadev · Pull Request #594 · facebookincubator/nimble

srsuryadev · 2026-03-19T22:38:12Z

Summary:

Dispatch loop for varint bulk decode to handle sporadic large heads, i.e., large values that appear sporadically within a long sequence of 1-byte/2-byte values.

Differential Revision: D97366866
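The PR text does not show the dispatch loop itself, so the following is only a minimal sketch of the general idea under stated assumptions: try a fast path for a word of single-byte varints, and when a larger value interrupts the run, decode exactly one value on a scalar path and return to the fast path. The function name `dispatchDecodeSketch` and its signature are hypothetical and are not taken from Nimble.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the Nimble implementation.
// Decodes up to `count` 32-bit varints from `in` and returns how many
// values were written to `out`.
inline size_t dispatchDecodeSketch(
    const uint8_t* in, size_t inSize, uint32_t* out, size_t count) {
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  size_t pos = 0;   // bytes consumed
  size_t done = 0;  // values decoded
  while (done < count && pos < inSize) {
    // Fast path: 8 single-byte varints in one 8-byte word.
    if (count - done >= 8 && inSize - pos >= 8) {
      uint64_t word;
      std::memcpy(&word, in + pos, sizeof(word));
      if ((word & kHighBits) == 0) {
        for (int i = 0; i < 8; ++i) {
          out[done + i] = in[pos + i];
        }
        done += 8;
        pos += 8;
        continue;  // stay on the fast path
      }
    }
    // Slow path: decode exactly one varint (at most 5 bytes for 32 bits),
    // then go back to the dispatch point above.
    uint32_t value = 0;
    for (int shift = 0; shift <= 28 && pos < inSize; shift += 7) {
      const uint8_t byte = in[pos++];
      value |= static_cast<uint32_t>(byte & 0x7f) << shift;
      if ((byte & 0x80) == 0) {
        break;
      }
    }
    out[done++] = value;
  }
  return done;
}
```

With this shape, one large value in the middle of a long single-byte run costs a few scalar iterations rather than forcing the rest of the buffer onto the general decoder.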

Summary: Add `decodeSingleByteRun` fast path to `bulkVarintDecode32` and `bulkVarintDecode64` that processes leading runs of single-byte varints (values 0-127) using 8-byte word reads before falling through to the BMI2 switch-based decoder. For each 8-byte word where no continuation bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch overhead.

This is placed in the caller functions rather than inside `bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and icache behavior for mixed-width data.

Benchmark results (1M elements, mode/opt):

| Scenario | Before | After | Speedup |
|-----------------------|-----------|-----------|---------------|
| 1-byte (32-bit) | 465us | 260us | 1.79x |
| 5-byte (32-bit) | slower | 1.22ms | fixed |
| 3-byte (32-bit) | 1.04ms | 864us | 1.20x |
| 4-byte (32-bit) | 1.50ms | 1.04ms | 1.44x |
| 64-bit 1-byte | 294us | 232us | 1.27x |
| batch1024 | 1.96us | 1.20us | 1.63x |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regression |

Also enhances the varint benchmark with fixed byte-width benchmarks (1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and batch size benchmarks.

Differential Revision: D96617939
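The word-at-a-time check described above can be sketched as follows. This is an illustration only, not the code from the diff: the name `decodeSingleByteRunSketch`, its signature, and its return convention (number of values decoded, which for single-byte varints equals the number of bytes consumed) are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the Nimble implementation.
// Decodes a leading run of single-byte varints (values 0-127), 8 per step.
// Returns the number of values decoded; a caller would pass the remaining
// input to the general decoder.
inline size_t decodeSingleByteRunSketch(
    const uint8_t* in, size_t inSize, uint32_t* out, size_t count) {
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  size_t done = 0;
  while (count - done >= 8 && inSize - done >= 8) {
    uint64_t word;
    std::memcpy(&word, in + done, sizeof(word));  // alignment-safe load
    if ((word & kHighBits) != 0) {
      break;  // some byte has its continuation bit set
    }
    // Every byte is a complete varint. Widening the bytes directly is
    // equivalent to (word >> (8 * i)) & 0xff on a little-endian machine.
    for (int i = 0; i < 8; ++i) {
      out[done + i] = in[done + i];
    }
    done += 8;
  }
  return done;
}
```

Note the parentheses around the mask test: in C++, `==` binds more tightly than `&`, so the unparenthesized form would compare the constant with zero first.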

… single-byte varints

Summary: Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:

1. 32-element (4-word) unrolled loop with combined high-bit check `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()` helper for clarity.

Differential Revision: D96619597
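The three tiers can be sketched roughly as below. The names `expandWordSketch`, `loadWordSketch`, and `decodeSingleByteRunUnrolledSketch`, along with their signatures, are hypothetical; only the tier structure and the combined `(w0 | w1 | w2 | w3) & kHighBits` check come from the summary.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the Nimble implementation.
constexpr uint64_t kHighBitsSketch = 0x8080808080808080ULL;

inline uint64_t loadWordSketch(const uint8_t* p) {
  uint64_t w;
  std::memcpy(&w, p, sizeof(w));
  return w;
}

// Widens 8 input bytes into 8 output elements.
template <typename T>
inline void expandWordSketch(const uint8_t* in, T* out) {
  for (int i = 0; i < 8; ++i) {
    out[i] = in[i];
  }
}

// Returns the length of the leading single-byte run that was decoded.
template <typename T>
size_t decodeSingleByteRunUnrolledSketch(
    const uint8_t* in, size_t inSize, T* out, size_t count) {
  const size_t n = count < inSize ? count : inSize;  // one byte per value
  size_t i = 0;
  // Tier 1: 32 elements (4 words) per iteration, one combined branch.
  while (n - i >= 32) {
    const uint64_t w0 = loadWordSketch(in + i);
    const uint64_t w1 = loadWordSketch(in + i + 8);
    const uint64_t w2 = loadWordSketch(in + i + 16);
    const uint64_t w3 = loadWordSketch(in + i + 24);
    if (((w0 | w1 | w2 | w3) & kHighBitsSketch) != 0) {
      break;
    }
    expandWordSketch(in + i, out + i);
    expandWordSketch(in + i + 8, out + i + 8);
    expandWordSketch(in + i + 16, out + i + 16);
    expandWordSketch(in + i + 24, out + i + 24);
    i += 32;
  }
  // Tier 2: 8 elements (1 word) per iteration.
  while (n - i >= 8) {
    if ((loadWordSketch(in + i) & kHighBitsSketch) != 0) {
      break;
    }
    expandWordSketch(in + i, out + i);
    i += 8;
  }
  // Tier 3: single elements up to the first multi-byte varint.
  while (i < n && (in[i] & 0x80) == 0) {
    out[i] = in[i];
    ++i;
  }
  return i;
}
```

A break in tier 1 falls through to tier 2 and then tier 3, so the run is always consumed up to the exact byte where the first multi-byte value starts.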

…gleByteRun

Summary: Replace scalar byte expansion and `reinterpret_cast`-based `uint64_t` loads in `decodeSingleByteRun` with xsimd-based SIMD operations:

- Use `xsimd::batch<uint8_t>::load_unaligned` for a single wide load (32 bytes on AVX2) + `vptest` to check all high bits at once, replacing 4 separate `uint64_t` loads + OR chain.
- Use `xsimd::batch<T>` construction and `store_unaligned` for byte-to-element widening (compiles to `vpmovzxbd` on AVX2, `vmovl` on NEON).
- Replace `reinterpret_cast<const uint64_t*>` with `std::memcpy` in the 8-byte loop to avoid strict-aliasing/alignment issues.

Differential Revision: D96628007

Reviewed By: xiaoxmeng

…ncoding to make it robust Summary: Add further tests to the varint encoding to make it robust Differential Revision: D96665765

…ecode Summary: Use a pre-compiled lookup table for varint decode to eliminate the switch statement. Differential Revision: D96756546

Summary: Add encoding fuzzer testing for varint encoding with randomized encode-decode-verify cycles across diverse data patterns and access patterns. Differential Revision: D97054913
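An encode-decode-verify cycle of the kind described can be sketched as below. This is a standalone illustration against a plain reference LEB128-style varint, not the fuzzer added in the diff; all names (`encodeVarintSketch`, `decodeVarintSketch`, `fuzzRoundTripSketch`) are hypothetical, and the access-pattern coverage mentioned in the summary is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch, not the Nimble fuzzer.
inline void encodeVarintSketch(uint64_t v, std::vector<uint8_t>& out) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// Decodes one varint starting at `pos`; returns false on truncated input.
inline bool decodeVarintSketch(
    const std::vector<uint8_t>& in, size_t& pos, uint64_t& value) {
  value = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const uint8_t byte = in[pos++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      return true;
    }
  }
  return false;
}

// One randomized cycle: generate values of mixed bit widths, encode,
// decode, and verify that every value and the total length round-trip.
inline bool fuzzRoundTripSketch(uint64_t seed, size_t n) {
  std::mt19937_64 rng(seed);
  std::vector<uint64_t> values(n);
  for (auto& v : values) {
    // Pick a bit width in [1, 64] so all encoded lengths get exercised.
    const unsigned bits = 1 + static_cast<unsigned>(rng() % 64);
    v = rng() >> (64 - bits);
  }
  std::vector<uint8_t> buffer;
  for (uint64_t v : values) {
    encodeVarintSketch(v, buffer);
  }
  size_t pos = 0;
  for (uint64_t expected : values) {
    uint64_t actual = 0;
    if (!decodeVarintSketch(buffer, pos, actual) || actual != expected) {
      return false;
    }
  }
  return pos == buffer.size();
}
```

Choosing the bit width first, rather than drawing uniformly from the full 64-bit range, keeps short encodings from being vanishingly rare in the generated data.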

Summary: Add `bulkDecodeTwoByteRun()` that detects runs of 2-byte varints by checking for the alternating high-bit pattern (`0x0080008000800080`) in 8-byte words, decoding 4 varints per word with simple scalar ops.

This fixes the 2-byte regression introduced by the table-driven BMI2 decode (D96756546). The old switch perfectly predicted uniform 2-byte data, while the table-driven approach paid lookup overhead without benefiting from misprediction elimination.

Benchmark results (varint_benchmark, mode/opt):

- BulkDecode_2byte: 1.12ms → 506µs (2.2x faster, 11% faster than baseline)
- NimbleBulkDecodeUniform: 1.49ms (preserved, 69% faster than baseline)
- All other benchmarks: unchanged

Differential Revision: D97189009
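The alternating high-bit check can be sketched as follows, assuming a little-endian machine so that the first byte of each varint lands in the low byte of its 16-bit lane. The name `decodeTwoByteRunSketch` and its signature are hypothetical and do not reflect the actual `bulkDecodeTwoByteRun()` interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the Nimble implementation. Little-endian only.
// Decodes a leading run of 2-byte varints, 4 per 8-byte word.
// Returns the number of values decoded (2 input bytes each).
inline size_t decodeTwoByteRunSketch(
    const uint8_t* in, size_t inSize, uint32_t* out, size_t count) {
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  // Continuation bit set on the first byte of each pair, clear on the second.
  constexpr uint64_t kTwoBytePattern = 0x0080008000800080ULL;
  size_t done = 0;
  while (count - done >= 4 && inSize - 2 * done >= 8) {
    uint64_t word;
    std::memcpy(&word, in + 2 * done, sizeof(word));
    if ((word & kHighBits) != kTwoBytePattern) {
      break;  // not four consecutive 2-byte varints
    }
    for (int i = 0; i < 4; ++i) {
      const uint64_t lane = word >> (16 * i);
      const uint32_t low = static_cast<uint32_t>(lane & 0x7f);
      const uint32_t high = static_cast<uint32_t>((lane >> 8) & 0x7f);
      out[done + i] = low | (high << 7);
    }
    done += 4;
  }
  return done;
}
```

Comparing the masked word against the exact pattern, rather than testing individual bits, rejects any word that contains a 1-byte or 3-byte-or-longer varint in a single branch.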

meta-codesync · 2026-03-19T22:38:35Z

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97366866.

…dic large values" (#594) Summary: Pull Request resolved: #594 Dispatch loop for varint bulk decode to handle sporadic large heads, i.e., large values that appear sporadically within a long sequence of 1-byte/2-byte values. Differential Revision: D97366866

srsuryadev and others added 7 commits March 19, 2026 06:44

test(encoding): add further tests for varint encoding to the varint e…

fb93c36


perf(encoding): varint encoding - use pre-compiled lookup table for d…

ac567f9


test(encoding): Add encoding fuzz testing for varint encoding

29ba27d


meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 19, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 19, 2026

meta-codesync bot changed the title ~~perf(encoding): Dispatch loop for Varint bulk decode to handle "sporadic large values"~~ perf(encoding): Dispatch loop for Varint bulk decode to handle "sporadic large values" (#594) Mar 19, 2026

srsuryadev force-pushed the export-D97366866 branch from 3a3822c to 81a4d26 Compare March 19, 2026 23:50

srsuryadev force-pushed the export-D97366866 branch from 81a4d26 to d23c0d5 Compare March 20, 2026 00:06

srsuryadev force-pushed the export-D97366866 branch from d23c0d5 to 8dfa177 Compare March 20, 2026 03:39

srsuryadev force-pushed the export-D97366866 branch from 8dfa177 to f55003f Compare March 20, 2026 03:43

srsuryadev wants to merge 8 commits into main from export-D97366866
