
perf(encoding): Dispatch loop for Varint bulk decode to handle "sporadic large values" (#594)

Open
srsuryadev wants to merge 8 commits into main from export-D97366866


@srsuryadev
Contributor

@srsuryadev srsuryadev commented Mar 19, 2026

Summary:

Add a dispatch loop to varint bulk decode to handle sporadic large values, i.e., a long sequence of 1-byte/2-byte values with large values placed sporadically within it.
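
One plausible reading of the dispatch loop, as a minimal sketch rather than the code in this PR: try the 8-values-per-word fast path at every position, and when a word contains a multi-byte value, decode a single value generically and return to the top of the loop instead of abandoning the fast path for the rest of the input. The function name `dispatchDecodeSketch` and its signature are hypothetical, and the input is assumed to hold well-formed 32-bit varints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the PR's implementation.
// Decodes up to `count` varints from `in` and returns the bytes consumed.
inline std::size_t dispatchDecodeSketch(
    const uint8_t* in, std::size_t size, uint32_t* out, std::size_t count) {
  assert(in != nullptr && out != nullptr);
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  std::size_t pos = 0;
  std::size_t n = 0;
  while (n < count && pos < size) {
    // Fast path: eight single-byte varints in one 8-byte word.
    if (count - n >= 8 && size - pos >= 8) {
      uint64_t word;
      std::memcpy(&word, in + pos, sizeof(word));
      if ((word & kHighBits) == 0) {
        for (int i = 0; i < 8; ++i) {
          out[n + i] = in[pos + i];
        }
        n += 8;
        pos += 8;
        continue;
      }
    }
    // Slow path: decode exactly one varint, then dispatch again.
    uint32_t value = 0;
    int shift = 0;
    while (pos < size && shift <= 28) {
      const uint8_t b = in[pos++];
      value |= static_cast<uint32_t>(b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    out[n++] = value;
  }
  return pos;
}
```

With this shape, one 5-byte value in the middle of a long single-byte run costs a handful of slow-path iterations before the loop returns to the word-at-a-time path.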

Differential Revision: D97366866

srsuryadev and others added 7 commits March 19, 2026 06:44
Summary:
Add `decodeSingleByteRun` fast path to `bulkVarintDecode32` and
`bulkVarintDecode64` that processes leading runs of single-byte varints
(values 0-127) using 8-byte word reads before falling through to the
BMI2 switch-based decoder. For each 8-byte word where no continuation
bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are
decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch
overhead.

This is placed in the caller functions rather than inside
`bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and
icache behavior for mixed-width data.
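
The word-level check described above can be sketched as follows. This is an illustrative stand-in, not the function from this change: the name `decodeSingleByteRunSketch` and its signature are assumptions, the load uses `std::memcpy`, and the shift-based expansion assumes a little-endian host. Note the parentheses around the mask test, since `==` binds tighter than `&` in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a single-byte-run fast path.
// Decodes leading single-byte varints (values 0-127) eight at a time and
// returns how many values were written to `out`. The caller would hand the
// remaining input to the general decoder.
inline std::size_t decodeSingleByteRunSketch(
    const uint8_t* in, std::size_t size, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  std::size_t n = 0;
  while (n + 8 <= size) {
    uint64_t word;
    std::memcpy(&word, in + n, sizeof(word));  // alignment-safe 8-byte load
    if ((word & kHighBits) != 0) {
      break;  // at least one byte has its continuation bit set
    }
    // Little-endian assumption: byte i of the input sits at bits [8i, 8i+7].
    for (int i = 0; i < 8; ++i) {
      out[n + i] = static_cast<uint32_t>((word >> (8 * i)) & 0x7F);
    }
    n += 8;
  }
  return n;
}
```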

Benchmark results (1M elements, mode/opt):
| Scenario              | Before    | After     | Speedup   |
|-----------------------|-----------|-----------|-----------|
| 1-byte (32-bit)       | 465us     | 260us     | 1.79x     |
| 5-byte (32-bit)       | slower    | 1.22ms    | fixed     |
| 3-byte (32-bit)       | 1.04ms    | 864us     | 1.20x     |
| 4-byte (32-bit)       | 1.50ms    | 1.04ms    | 1.44x     |
| 64-bit 1-byte         | 294us     | 232us     | 1.27x     |
| batch1024             | 1.96us    | 1.20us    | 1.63x     |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regress|

Also enhances the varint benchmark with fixed byte-width benchmarks
(1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and
batch size benchmarks.

Differential Revision: D96617939
… single-byte varints

Summary:
Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:
1. 32-element (4-word) unrolled loop with combined high-bit check
   `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte
   varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()`
helper for clarity.
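
The three tiers listed above might look roughly like this. It is a sketch under assumptions: `expandWordSketch`, `loadWordSketch`, and `decodeSingleByteRunTiered` are illustrative names, the output type is fixed to `uint32_t`, and the expansion assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint64_t kHighBitsSketch = 0x8080808080808080ULL;

// Alignment-safe 8-byte load.
inline uint64_t loadWordSketch(const uint8_t* p) {
  uint64_t w;
  std::memcpy(&w, p, sizeof(w));
  return w;
}

// Widens the 8 bytes of `w` into 8 output elements (little-endian order).
inline void expandWordSketch(uint64_t w, uint32_t* out) {
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint32_t>((w >> (8 * i)) & 0xFF);
  }
}

// Returns the number of leading single-byte varints decoded into `out`.
inline std::size_t decodeSingleByteRunTiered(
    const uint8_t* in, std::size_t size, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  std::size_t n = 0;
  // Tier 1: 32 elements per iteration, one combined high-bit check.
  while (n + 32 <= size) {
    const uint64_t w0 = loadWordSketch(in + n);
    const uint64_t w1 = loadWordSketch(in + n + 8);
    const uint64_t w2 = loadWordSketch(in + n + 16);
    const uint64_t w3 = loadWordSketch(in + n + 24);
    if (((w0 | w1 | w2 | w3) & kHighBitsSketch) != 0) {
      break;
    }
    expandWordSketch(w0, out + n);
    expandWordSketch(w1, out + n + 8);
    expandWordSketch(w2, out + n + 16);
    expandWordSketch(w3, out + n + 24);
    n += 32;
  }
  // Tier 2: 8 elements per iteration.
  while (n + 8 <= size) {
    const uint64_t w = loadWordSketch(in + n);
    if ((w & kHighBitsSketch) != 0) {
      break;
    }
    expandWordSketch(w, out + n);
    n += 8;
  }
  // Tier 3: trailing single-byte varints before the first multi-byte value.
  while (n < size && (in[n] & 0x80) == 0) {
    out[n] = in[n];
    ++n;
  }
  return n;
}
```

ORing the four words before masking replaces four branches with one, at the cost of falling back to the 8-element tier whenever any of the 32 bytes has its high bit set.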

Differential Revision: D96619597
…gleByteRun

Summary:
Replace scalar byte expansion and reinterpret_cast-based uint64_t loads in
decodeSingleByteRun with xsimd-based SIMD operations:

- Use xsimd::batch<uint8_t>::load_unaligned for a single wide load (32 bytes
  on AVX2) + vptest to check all high bits at once, replacing 4 separate
  uint64_t loads + OR chain.
- Use xsimd::batch<T> construction and store_unaligned for byte-to-element
  widening (compiles to vpmovzxbd on AVX2, vmovl on NEON).
- Replace reinterpret_cast<const uint64_t*> with std::memcpy in the 8-byte
  loop to avoid strict-aliasing/alignment issues.

Differential Revision: D96628007

Reviewed By: xiaoxmeng
…ncoding to make it robust

Summary: Add further tests to the varint encoding to make it robust

Differential Revision: D96665765
…ecode

Summary: Use a precompiled lookup table for varint decode, eliminating the switch statement.
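
To illustrate the general idea of table-driven decoding, here is a minimal sketch. The table layout (`MaskEntrySketch`), the 8-bit mask index, and all function names are assumptions made for this example, not the layout used in the diff. The continuation mask is extracted with a portable loop; on BMI2 hardware, `_pext_u64(word, 0x8080808080808080)` on a little-endian load produces the same mask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: describes the varints inside an 8-byte window.
struct MaskEntrySketch {
  uint8_t count;     // varints that end inside the window
  uint8_t consumed;  // bytes covered by those varints
  uint8_t len[8];    // byte length of each complete varint
};

struct MaskTableSketch {
  MaskEntrySketch e[256];
};

// Built at compile time: bit i of the index is byte i's continuation bit.
constexpr MaskTableSketch makeMaskTableSketch() {
  MaskTableSketch t{};
  for (int m = 0; m < 256; ++m) {
    int start = 0;
    int count = 0;
    for (int i = 0; i < 8; ++i) {
      if (((m >> i) & 1) == 0) {  // byte i terminates a varint
        t.e[m].len[count] = static_cast<uint8_t>(i - start + 1);
        ++count;
        start = i + 1;
      }
    }
    t.e[m].count = static_cast<uint8_t>(count);
    t.e[m].consumed = static_cast<uint8_t>(start);
  }
  return t;
}

constexpr MaskTableSketch kMaskTableSketch = makeMaskTableSketch();

// Portable stand-in for a pext-based continuation-bit extraction.
inline unsigned continuationMaskSketch(const uint8_t* p) {
  unsigned m = 0;
  for (int i = 0; i < 8; ++i) {
    m |= static_cast<unsigned>(p[i] >> 7) << i;
  }
  return m;
}

// Decodes every varint that ends within p[0..7]; returns the value count
// and stores the number of bytes consumed in `*consumed`.
inline std::size_t decodeWindowSketch(
    const uint8_t* p, uint32_t* out, std::size_t* consumed) {
  assert(p != nullptr && out != nullptr && consumed != nullptr);
  const MaskEntrySketch& e = kMaskTableSketch.e[continuationMaskSketch(p)];
  const uint8_t* cur = p;
  for (int k = 0; k < e.count; ++k) {
    uint64_t v = 0;
    for (int j = 0; j < e.len[k]; ++j) {
      v |= static_cast<uint64_t>(cur[j] & 0x7F) << (7 * j);
    }
    out[k] = static_cast<uint32_t>(v);
    cur += e.len[k];
  }
  *consumed = e.consumed;
  return e.count;
}
```

The lookup replaces a data-dependent branch with a data-dependent load, which helps when value widths vary unpredictably and costs a little when they are uniform.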

Differential Revision: D96756546
Summary: Add encoding fuzzer testing for varint encoding with randomized encode-decode-verify cycles across diverse data patterns and access patterns.

Differential Revision: D97054913
Summary:
Add bulkDecodeTwoByteRun() that detects runs of 2-byte varints by
checking for the alternating high-bit pattern (0x0080008000800080) in
8-byte words, decoding 4 varints per word with simple scalar ops.

This fixes the 2-byte regression introduced by the table-driven BMI2
decode (D96756546). The old switch perfectly predicted uniform 2-byte
data, while the table-driven approach paid lookup overhead without
benefiting from misprediction elimination.
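
The alternating high-bit test can be sketched as below. This is an illustration, not the function added in the commit: `bulkDecodeTwoByteRunSketch` and its signature are assumptions, and the lane extraction assumes a little-endian host, where the first byte of each 2-byte varint (continuation bit set) occupies the low byte of each 16-bit lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: decodes a leading run of 2-byte varints, four per
// 8-byte word, and returns the number of values written. Bytes consumed
// equal twice the return value.
inline std::size_t bulkDecodeTwoByteRunSketch(
    const uint8_t* in, std::size_t size, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  constexpr uint64_t kTwoBytePattern = 0x0080008000800080ULL;
  std::size_t bytes = 0;
  std::size_t n = 0;
  while (bytes + 8 <= size) {
    uint64_t word;
    std::memcpy(&word, in + bytes, sizeof(word));
    // High bit set on bytes 0, 2, 4, 6 and clear on bytes 1, 3, 5, 7.
    if ((word & kHighBits) != kTwoBytePattern) {
      break;
    }
    for (int i = 0; i < 4; ++i) {
      const uint32_t lane = static_cast<uint32_t>((word >> (16 * i)) & 0xFFFF);
      // Low 7 bits come from the first byte, next 7 bits from the second.
      out[n++] = (lane & 0x7F) | (((lane >> 8) & 0x7F) << 7);
    }
    bytes += 8;
  }
  return n;
}
```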

Benchmark results (varint_benchmark, mode/opt):
- BulkDecode_2byte: 1.12ms → 506µs (2.2x faster, 11% faster than baseline)
- NimbleBulkDecodeUniform: 1.49ms (preserved, 69% faster than baseline)
- All other benchmarks: unchanged

Differential Revision: D97189009
@meta-cla meta-cla bot added the CLA Signed label Mar 19, 2026
@meta-codesync

meta-codesync bot commented Mar 19, 2026

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97366866.

srsuryadev added a commit that referenced this pull request Mar 19, 2026
…dic large values" (#594)

Summary:
Pull Request resolved: #594

Add a dispatch loop to varint bulk decode to handle sporadic large values, i.e., a long sequence of 1-byte/2-byte values with large values placed sporadically within it.

Differential Revision: D97366866
@meta-codesync meta-codesync bot changed the title from perf(encoding): Dispatch loop for Varint bulk decode to handle "sporadic large values" to perf(encoding): Dispatch loop for Varint bulk decode to handle "sporadic large values" (#594) Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 20, 2026
…dic large values" (#594)

Summary:
Pull Request resolved: #594

Add a dispatch loop to varint bulk decode to handle sporadic large values, i.e., a long sequence of 1-byte/2-byte values with large values placed sporadically within it.

Differential Revision: D97366866

Labels

CLA Signed, fb-exported, meta-exported
