perf(encoding): varint encoding - use manual vectorization/simd in decodeSingleByteRun (#579) #579

Open
srsuryadev wants to merge 4 commits into main from export-D96628007
@srsuryadev srsuryadev commented Mar 18, 2026

Summary:

Replace scalar byte expansion and reinterpret_cast-based uint64_t loads in
decodeSingleByteRun with xsimd-based operations.

Reviewed By: xiaoxmeng

Differential Revision: D96628007

…Width, MainlyConstant for faster iteration for SST workload

Summary: Add v2 encoding scaffolding for the Varint, RLE, FixedBitWidth, and MainlyConstant encodings to allow faster iteration and perf tuning.

Differential Revision: D96684714

Summary:
Add a `decodeSingleByteRun` fast path to `bulkVarintDecode32` and
`bulkVarintDecode64` that processes leading runs of single-byte varints
(values 0-127) using 8-byte word reads before falling through to the
BMI2 switch-based decoder. For each 8-byte word where no continuation
bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are
decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch
overhead.

This is placed in the caller functions rather than inside
`bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and
icache behavior for mixed-width data.
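
The fast path and its placement in the caller can be illustrated with a minimal sketch. It is not the code from this diff: the names `decodeSingleByteRunSketch` and `decodeVarintsSketch` are hypothetical, a plain scalar loop stands in for the BMI2 decoder, and the sketch assumes a little-endian target and well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One continuation bit (0x80) per byte of an 8-byte word.
constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Hypothetical sketch: decodes leading single-byte varints, 8 per word.
// Returns the number of values decoded, which equals the bytes consumed.
inline size_t decodeSingleByteRunSketch(
    const uint8_t* in, size_t inputBytes, uint32_t* out, size_t count) {
  assert(in != nullptr && out != nullptr);
  size_t i = 0;
  while (i + 8 <= count && i + 8 <= inputBytes) {
    uint64_t word;
    std::memcpy(&word, in + i, sizeof(word)); // unaligned-safe 8-byte load
    if ((word & kHighBits) != 0) {
      break; // at least one byte in this word has its continuation bit set
    }
    // Little-endian assumption: byte k of the input is bits [8k, 8k+8).
    for (int k = 0; k < 8; ++k) {
      out[i + k] = static_cast<uint32_t>((word >> (8 * k)) & 0xFF);
    }
    i += 8;
  }
  return i;
}

// Hypothetical caller: runs the fast path first, then falls through to a
// general decoder (a scalar loop here, standing in for the BMI2 path).
// Returns the number of input bytes consumed.
inline size_t decodeVarintsSketch(
    const uint8_t* in, size_t inputBytes, uint32_t* out, size_t count) {
  size_t done = decodeSingleByteRunSketch(in, inputBytes, out, count);
  size_t pos = done;
  for (; done < count; ++done) {
    uint32_t value = 0;
    int shift = 0;
    while (true) {
      assert(pos < inputBytes);
      const uint8_t byte = in[pos++];
      value |= static_cast<uint32_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) {
        break; // last byte of this varint
      }
      shift += 7;
    }
    out[done] = value;
  }
  return pos;
}
```

Keeping the fast path in the caller, as in this sketch, leaves the general decoder untouched, which matches the stated goal of not disturbing its code layout.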

Benchmark results (1M elements, mode/opt):
| Scenario              | Before    | After     | Speedup       |
|-----------------------|-----------|-----------|---------------|
| 1-byte (32-bit)       | 465us     | 260us     | 1.79x         |
| 5-byte (32-bit)       | slower    | 1.22ms    | fixed         |
| 3-byte (32-bit)       | 1.04ms    | 864us     | 1.20x         |
| 4-byte (32-bit)       | 1.50ms    | 1.04ms    | 1.44x         |
| 64-bit 1-byte         | 294us     | 232us     | 1.27x         |
| batch1024             | 1.96us    | 1.20us    | 1.63x         |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regression |

Also enhances the varint benchmark with fixed byte-width benchmarks
(1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and
batch size benchmarks.
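
As an illustration of what a fixed byte-width benchmark input could look like, the sketch below generates values whose varint encoding takes exactly `width` bytes. It assumes the usual 7-bits-per-byte layout with a high continuation bit; `encodeVarint`, `minForWidth`, `maxForWidth`, and `makeFixedWidthValues` are hypothetical names, not helpers from the actual benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Appends `value` as a 7-bits-per-byte varint; returns the encoded length.
inline size_t encodeVarint(uint64_t value, std::vector<uint8_t>& out) {
  size_t length = 1;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
    ++length;
  }
  out.push_back(static_cast<uint8_t>(value));
  return length;
}

// Smallest value whose encoding takes exactly `width` bytes (1 <= width <= 9).
inline uint64_t minForWidth(int width) {
  assert(width >= 1 && width <= 9);
  return width == 1 ? 0 : (1ULL << (7 * (width - 1)));
}

// Largest value whose encoding takes exactly `width` bytes (1 <= width <= 9).
inline uint64_t maxForWidth(int width) {
  assert(width >= 1 && width <= 9);
  return (1ULL << (7 * width)) - 1;
}

// Generates `count` random values that all encode to `width` bytes.
// `cap` limits the range, e.g. UINT32_MAX for the 32-bit 5-byte case.
inline std::vector<uint64_t> makeFixedWidthValues(
    int width, size_t count, uint64_t cap, uint64_t seed) {
  const uint64_t lo = minForWidth(width);
  const uint64_t hi = std::min(maxForWidth(width), cap);
  assert(lo <= hi);
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<uint64_t> dist(lo, hi);
  std::vector<uint64_t> values(count);
  for (auto& v : values) {
    v = dist(rng);
  }
  return values;
}
```

A value needs exactly `width` bytes when it lies in the range from 2^(7(width-1)) to 2^(7·width) - 1 (with 0 as the lower bound for one byte), which is what the two range helpers compute.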

Differential Revision: D96617939
… single-byte varints

Summary:
Manually unroll `decodeSingleByteRun` with a 3-tier approach:
1. 32-element (4-word) unrolled loop with combined high-bit check
   `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte
   varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()`
helper for clarity.
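
The three tiers can be sketched roughly as follows. This is an illustration under assumptions, not the code from this diff: the signatures of `expandWord` and `decodeSingleByteRunUnrolled` are guesses, `loadWord` is a hypothetical helper, the target is assumed little-endian, and `in` is assumed to have at least `n` readable bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Hypothetical helper: unaligned-safe 8-byte load.
inline uint64_t loadWord(const uint8_t* p) {
  uint64_t word;
  std::memcpy(&word, p, sizeof(word));
  return word;
}

// Expands the 8 bytes of `word` into 8 output values (little-endian order).
inline void expandWord(uint64_t word, uint32_t* out) {
  for (int k = 0; k < 8; ++k) {
    out[k] = static_cast<uint32_t>((word >> (8 * k)) & 0xFF);
  }
}

// Decodes up to `n` leading single-byte varints; returns how many were
// decoded before the first multi-byte varint (or `n` if none is found).
inline size_t decodeSingleByteRunUnrolled(
    const uint8_t* in, size_t n, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  size_t i = 0;
  // Tier 1: 32 elements (4 words) per iteration, one combined check.
  while (i + 32 <= n) {
    const uint64_t w0 = loadWord(in + i);
    const uint64_t w1 = loadWord(in + i + 8);
    const uint64_t w2 = loadWord(in + i + 16);
    const uint64_t w3 = loadWord(in + i + 24);
    if (((w0 | w1 | w2 | w3) & kHighBits) != 0) {
      break;
    }
    expandWord(w0, out + i);
    expandWord(w1, out + i + 8);
    expandWord(w2, out + i + 16);
    expandWord(w3, out + i + 24);
    i += 32;
  }
  // Tier 2: 8 elements (1 word) per iteration.
  while (i + 8 <= n) {
    const uint64_t w = loadWord(in + i);
    if ((w & kHighBits) != 0) {
      break;
    }
    expandWord(w, out + i);
    i += 8;
  }
  // Tier 3: remaining single-byte varints, one at a time.
  while (i < n && (in[i] & 0x80) == 0) {
    out[i] = in[i];
    ++i;
  }
  return i;
}
```

OR-ing the four words before masking means tier 1 pays one branch per 32 elements; when that check fails, tiers 2 and 3 still pick up any single-byte values that precede the first multi-byte varint.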

Differential Revision: D96619597
@meta-cla meta-cla bot added the CLA Signed label Mar 18, 2026
meta-codesync bot commented Mar 18, 2026

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96628007.

srsuryadev added a commit that referenced this pull request Mar 18, 2026
…codeSingleByteRun (#579)

Summary:
Pull Request resolved: #579

Replace scalar byte expansion and reinterpret_cast-based uint64_t loads in
decodeSingleByteRun with xsimd-based operations.

Reviewed By: xiaoxmeng

Differential Revision: D96628007
@meta-codesync meta-codesync bot changed the title from "perf(encoding): varint encoding - use manual vectorization/simd in decodeSingleByteRun" to "perf(encoding): varint encoding - use manual vectorization/simd in decodeSingleByteRun (#579)" Mar 18, 2026
@srsuryadev srsuryadev force-pushed the export-D96628007 branch 2 times, most recently from 892911b to 82e9336 Compare March 19, 2026 17:24
srsuryadev added a commit that referenced this pull request Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 20, 2026
Labels

CLA Signed, fb-exported, meta-exported