perf(encoding): varint encoding - manual loop-unroll in decodeSingleByteRun for single-byte varints (#578) by srsuryadev · Pull Request #578 · facebookincubator/nimble

srsuryadev · 2026-03-18T06:35:39Z

Summary:

Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:

1. 32-element (4-word) unrolled loop with combined high-bit check `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()` helper for clarity.

Reviewed By: xiaoxmeng

Differential Revision: D96619597
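
The three tiers described in the summary can be sketched roughly as follows. This is an illustration, not the code from the diff: the signature of `decodeSingleByteRun`, the `loadWord` helper, and the body of `expandWord` are assumptions, and the sketch assumes a little-endian target so that byte `i` of the input lands in bits `8*i..8*i+7` of the loaded word.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One continuation (high) bit per byte of a 64-bit word.
constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Hypothetical helper: unaligned-safe 8-byte load.
inline uint64_t loadWord(const uint8_t* p) {
  uint64_t w;
  std::memcpy(&w, p, sizeof(w));
  return w;
}

// Assumed shape of expandWord(): widen the 8 bytes of `w` into 8 values.
// Relies on little-endian byte order.
inline void expandWord(uint64_t w, uint32_t* out) {
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint32_t>((w >> (8 * i)) & 0xFF);
  }
}

// Decodes the leading run of single-byte varints (values 0-127) and returns
// how many were decoded. For single-byte varints, bytes consumed equals
// values produced, so one counter tracks both.
inline size_t decodeSingleByteRun(
    const uint8_t* in, size_t inBytes, size_t maxValues, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  const size_t limit = std::min(inBytes, maxValues);
  size_t n = 0;

  // Tier 1: 32 elements (4 words) per iteration, one combined branch.
  while (limit - n >= 32) {
    const uint64_t w0 = loadWord(in + n);
    const uint64_t w1 = loadWord(in + n + 8);
    const uint64_t w2 = loadWord(in + n + 16);
    const uint64_t w3 = loadWord(in + n + 24);
    if (((w0 | w1 | w2 | w3) & kHighBits) != 0) {
      break; // some byte in this block starts a multi-byte varint
    }
    expandWord(w0, out + n);
    expandWord(w1, out + n + 8);
    expandWord(w2, out + n + 16);
    expandWord(w3, out + n + 24);
    n += 32;
  }

  // Tier 2: 8 elements (1 word) per iteration.
  while (limit - n >= 8) {
    const uint64_t w = loadWord(in + n);
    if ((w & kHighBits) != 0) {
      break;
    }
    expandWord(w, out + n);
    n += 8;
  }

  // Tier 3: single elements up to the first multi-byte varint.
  while (n < limit && (in[n] & 0x80) == 0) {
    out[n] = in[n];
    ++n;
  }
  return n;
}
```

When tier 1 exits on a block that mixes widths, tiers 2 and 3 still recover the single-byte values in front of the first multi-byte varint, so the caller's general decoder starts exactly at that varint.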

…Width, MainlyConstant for faster iteration for SST workload

Summary: Add v2 encoding scaffolding for the Varint, RLE, FixedBitWidth, and MainlyConstant encodings for faster iteration or perf tuning

Differential Revision: D96684714

Summary: Add `decodeSingleByteRun` fast path to `bulkVarintDecode32` and `bulkVarintDecode64` that processes leading runs of single-byte varints (values 0-127) using 8-byte word reads before falling through to the BMI2 switch-based decoder.

For each 8-byte word where no continuation bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch overhead. This is placed in the caller functions rather than inside `bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and icache behavior for mixed-width data.

Benchmark results (1M elements, mode/opt):

| Scenario              | Before    | After     | Speedup       |
|-----------------------|-----------|-----------|---------------|
| 1-byte (32-bit)       | 465us     | 260us     | 1.79x         |
| 5-byte (32-bit)       | slower    | 1.22ms    | fixed         |
| 3-byte (32-bit)       | 1.04ms    | 864us     | 1.20x         |
| 4-byte (32-bit)       | 1.50ms    | 1.04ms    | 1.44x         |
| 64-bit 1-byte         | 294us     | 232us     | 1.27x         |
| batch1024             | 1.96us    | 1.20us    | 1.63x         |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regression |

Also enhances the varint benchmark with fixed byte-width benchmarks (1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and batch size benchmarks.

Differential Revision: D96617939
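
As a rough illustration of how a caller-side fast path like this can sit in front of a general decoder, the sketch below checks one 8-byte word at a time and hands the rest of the input to a fallback. The names `bulkDecode32Sketch`, `decodeVarintScalar`, and `kContinuationMask` are hypothetical, and the plain scalar LEB128 loop is only a stand-in for the BMI2 switch-based decoder, which is not reproduced here. The sketch assumes little-endian byte order and well-formed input containing `count` varints.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Continuation bit of every byte in a 64-bit word.
constexpr uint64_t kContinuationMask = 0x8080808080808080ULL;

// Stand-in for the general decoder: plain scalar LEB128 decode of one value.
// Returns the pointer just past the decoded varint.
inline const uint8_t* decodeVarintScalar(const uint8_t* p, uint32_t* value) {
  uint32_t result = 0;
  int shift = 0;
  while (true) {
    const uint8_t b = *p++;
    result |= static_cast<uint32_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) {
      break;
    }
    shift += 7;
  }
  *value = result;
  return p;
}

// Decodes `count` varints from [in, end) into `out` and returns the pointer
// past the last consumed byte.
inline const uint8_t* bulkDecode32Sketch(
    const uint8_t* in, const uint8_t* end, size_t count, uint32_t* out) {
  size_t i = 0;

  // Fast path: while a whole word has no continuation bits, it holds
  // exactly 8 single-byte varints. Note the parentheses around the mask
  // test: in C++, == binds tighter than &.
  while (count - i >= 8 && static_cast<size_t>(end - in) >= 8) {
    uint64_t word;
    std::memcpy(&word, in, sizeof(word));
    if ((word & kContinuationMask) != 0) {
      break;
    }
    for (int k = 0; k < 8; ++k) {
      out[i + k] = static_cast<uint32_t>((word >> (8 * k)) & 0xFF);
    }
    in += 8;
    i += 8;
  }

  // Fallback for everything after the leading single-byte run.
  for (; i < count; ++i) {
    in = decodeVarintScalar(in, &out[i]);
  }
  return in;
}
```

Keeping the word check in the caller means the fallback decoder is entered unchanged once the leading run ends, which matches the layout rationale given in the summary.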

meta-codesync · 2026-03-18T06:36:03Z

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96619597.

…yteRun for single-byte varints (#578)

Summary: Pull Request resolved: #578

Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:

1. 32-element (4-word) unrolled loop with combined high-bit check `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()` helper for clarity.

Reviewed By: xiaoxmeng

Differential Revision: D96619597

srsuryadev added 2 commits March 15, 2026 22:32

meta-cla bot added the CLA Signed label Mar 18, 2026

meta-codesync bot added the fb-exported and meta-exported labels Mar 18, 2026

srsuryadev force-pushed the export-D96619597 branch from a8aaf4f to 0a79725 Compare March 18, 2026 06:41

meta-codesync bot changed the title ~~perf(encoding): varint encoding - manual loop-unroll in decodeSingleByteRun for single-byte varints~~ perf(encoding): varint encoding - manual loop-unroll in decodeSingleByteRun for single-byte varints (#578) Mar 18, 2026

srsuryadev force-pushed the export-D96619597 branch from 0a79725 to 19f8107 Compare March 18, 2026 06:46

srsuryadev force-pushed the export-D96619597 branch from 19f8107 to 41d6d3c Compare March 19, 2026 17:24

srsuryadev force-pushed the export-D96619597 branch from 41d6d3c to cba0ad6 Compare March 19, 2026 22:39

srsuryadev force-pushed the export-D96619597 branch from cba0ad6 to 8ccce3e Compare March 19, 2026 22:45

srsuryadev force-pushed the export-D96619597 branch from 8ccce3e to 9ed5394 Compare March 20, 2026 03:40

srsuryadev force-pushed the export-D96619597 branch from 9ed5394 to 728aec3 Compare March 20, 2026 03:46

srsuryadev wants to merge 3 commits into main from export-D96619597
