perf(encoding): varint encoding - manual loop-unroll in decodeSingleByteRun for single-byte varints (#578) #578

Open
srsuryadev wants to merge 3 commits into main from export-D96619597

@srsuryadev srsuryadev commented Mar 18, 2026

Summary:

Manually loop-unroll decodeSingleByteRun with a 3-tier approach:

  1. 32-element (4-word) unrolled loop with combined high-bit check
    (w0 | w1 | w2 | w3) & kHighBits to minimize branch overhead
  2. 8-element (1-word) loop for smaller runs
  3. Single-element trailing loop to pick up individual single-byte
    varints before multi-byte values

Also extracts the byte-expansion logic into a reusable expandWord()
helper for clarity.

Reviewed By: xiaoxmeng

Differential Revision: D96619597

…Width, MainlyConstant for faster iteration for SST workload

Summary: Add v2 encoding scaffolding for the Varint, RLE, FixedBitWidth, and MainlyConstant encodings, for faster iteration or perf tuning

Differential Revision: D96684714

Summary:
Add `decodeSingleByteRun` fast path to `bulkVarintDecode32` and
`bulkVarintDecode64` that processes leading runs of single-byte varints
(values 0-127) using 8-byte word reads before falling through to the
BMI2 switch-based decoder. For each 8-byte word where no continuation
bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are
decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch
overhead.

This is placed in the caller functions rather than inside
`bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and
icache behavior for mixed-width data.

Benchmark results (1M elements, mode/opt):

| Scenario              | Before    | After     | Speedup       |
|-----------------------|-----------|-----------|---------------|
| 1-byte (32-bit)       | 465us     | 260us     | 1.79x         |
| 5-byte (32-bit)       | slower    | 1.22ms    | fixed         |
| 3-byte (32-bit)       | 1.04ms    | 864us     | 1.20x         |
| 4-byte (32-bit)       | 1.50ms    | 1.04ms    | 1.44x         |
| 64-bit 1-byte         | 294us     | 232us     | 1.27x         |
| batch1024             | 1.96us    | 1.20us    | 1.63x         |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regression |

Also enhances the varint benchmark with fixed byte-width benchmarks
(1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and
batch size benchmarks.
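
One way to produce input for a fixed byte-width benchmark is to encode only values whose varint form has exactly the desired length. The sketch below illustrates that idea; `appendVarint`, `minValueForWidth`, and `makeFixedWidthInput` are hypothetical names, and the actual benchmark may generate its data differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the varint encoding of `value`: 7 payload bits per byte, with the
// high bit set on every byte except the last.
inline void appendVarint(uint64_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Smallest value whose encoding occupies exactly `width` bytes (1 to 9).
inline uint64_t minValueForWidth(int width) {
  return width == 1 ? 0 : (uint64_t{1} << (7 * (width - 1)));
}

// Builds `count` varints that are each exactly `width` bytes long. Values
// stay within 99 of the minimum, so the width never changes.
inline std::vector<uint8_t> makeFixedWidthInput(int width, size_t count) {
  std::vector<uint8_t> buf;
  buf.reserve(count * static_cast<size_t>(width));
  const uint64_t base = minValueForWidth(width);
  for (size_t i = 0; i < count; ++i) {
    appendVarint(base + (i % 100), buf);
  }
  return buf;
}
```

For example, `makeFixedWidthInput(1, n)` yields the all-single-byte case that the fast path targets, while a width of 5 exercises the longest 32-bit encoding.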

Differential Revision: D96617939
@meta-cla meta-cla bot added the CLA Signed label Mar 18, 2026
@meta-codesync
meta-codesync bot commented Mar 18, 2026

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96619597.

srsuryadev added a commit that referenced this pull request Mar 18, 2026
…yteRun for single-byte varints (#578)

Summary:
Pull Request resolved: #578

Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:
1. 32-element (4-word) unrolled loop with combined high-bit check
   `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte
   varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()`
helper for clarity.

Reviewed By: xiaoxmeng

Differential Revision: D96619597
@meta-codesync meta-codesync bot changed the title from "perf(encoding): varint encoding - manual loop-unroll in decodeSingleByteRun for single-byte varints" to "perf(encoding): varint encoding - manual loop-unroll in decodeSingleByteRun for single-byte varints (#578)" Mar 18, 2026
srsuryadev added a commit that referenced this pull request Mar 18, 2026
srsuryadev added a commit that referenced this pull request Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 20, 2026

Labels

CLA Signed, fb-exported, meta-exported