
Increase embedded string threshold from 64 to 128 bytes#3397

Merged
zuiderkwast merged 3 commits into valkey-io:unstable from Nikhil-Manglore:embedded_strings_update on Apr 7, 2026

Conversation

@Nikhil-Manglore (Member) commented Mar 23, 2026

This PR increases the embedded string threshold from 64 to 128 bytes (two cache lines) to reduce memory overhead. It also fixes the tests that pertain to embedded values.

Closes #3025

@Nikhil-Manglore Nikhil-Manglore force-pushed the embedded_strings_update branch from e38f9c1 to 9d808ac Compare March 23, 2026 20:00

codecov bot commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.52%. Comparing base (6822a67) to head (41aa5d5).
⚠️ Report is 17 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3397      +/-   ##
============================================
+ Coverage     76.38%   76.52%   +0.13%     
============================================
  Files           159      159              
  Lines         79681    79680       -1     
============================================
+ Hits          60865    60972     +107     
+ Misses        18816    18708     -108     
Files with missing lines | Coverage | Δ
src/object.c | 92.73% <100.00%> | +5.11% ⬆️
src/unit/test_object.cpp | 98.98% <100.00%> | (ø)

... and 19 files with indirect coverage changes


@github-actions

Benchmark ran on this commit: e38f9c1

RPS Benchmark Comparison: unstable vs 16d2c42

2 significant change(s)

  • ✅ +31±2% GET rps pipe=10 threads=9 (data_size=96)
  • ✅ +5.4±0.7% GET rps pipe=10 threads=1 (data_size=96)

14 with no significant change

Click to expand full comparison tables

data_size = 16

% Change Test unstable 16d2c42 unstable stats 16d2c42 stats
+1.4±0.8% GET rps P1 T1 227K 230K n=3, σ=1.51K, CV=0.7%, CI99%=±3.8%, PI99%=±7.6% n=3, σ=1.09K, CV=0.5%, CI99%=±2.7%, PI99%=±5.4%
-0±2% GET rps P1 T9 1.53M 1.53M n=3, σ=28.2K, CV=1.8%, CI99%=±10.6%, PI99%=±21.2% n=3, σ=19.4K, CV=1.3%, CI99%=±7.3%, PI99%=±14.5%
+1.6±0.4% GET rps P10 T1 1.233M 1.253M n=3, σ=3.96K, CV=0.3%, CI99%=±1.8%, PI99%=±3.7% n=3, σ=2.49K, CV=0.2%, CI99%=±1.1%, PI99%=±2.3%
+1±1% GET rps P10 T9 2.86M 2.896M n=3, σ=29.7K, CV=1.0%, CI99%=±5.9%, PI99%=±11.9% n=3, σ=7.14K, CV=0.2%, CI99%=±1.4%, PI99%=±2.8%
+0.3±0.9% SET rps P1 T1 221K 221.5K n=3, σ=1.98K, CV=0.9%, CI99%=±5.1%, PI99%=±10.3% n=3, σ=358, CV=0.2%, CI99%=±0.9%, PI99%=±1.9%
+0.6±0.9% SET rps P1 T9 1.483M 1.491M n=3, σ=9.15K, CV=0.6%, CI99%=±3.5%, PI99%=±7.1% n=3, σ=9.86K, CV=0.7%, CI99%=±3.8%, PI99%=±7.6%
+1.8±0.9% SET rps P10 T1 1.038M 1.057M n=3, σ=8.76K, CV=0.8%, CI99%=±4.8%, PI99%=±9.7% n=3, σ=3.02K, CV=0.3%, CI99%=±1.6%, PI99%=±3.3%
+3.7±0.8% SET rps P10 T9 1.93M 2.006M n=3, σ=14.8K, CV=0.8%, CI99%=±4.4%, PI99%=±8.8% n=3, σ=5.21K, CV=0.3%, CI99%=±1.5%, PI99%=±3.0%

data_size = 96

% Change Test unstable 16d2c42 unstable stats 16d2c42 stats
+3±1% GET rps P1 T1 221K 227K n=3, σ=2.01K, CV=0.9%, CI99%=±5.2%, PI99%=±10.5% n=3, σ=2.16K, CV=1.0%, CI99%=±5.5%, PI99%=±10.9%
+3±1% GET rps P1 T9 1.470M 1.51M n=3, σ=8.00K, CV=0.5%, CI99%=±3.1%, PI99%=±6.2% n=3, σ=15.1K, CV=1.0%, CI99%=±5.7%, PI99%=±11.4%
+5.4±0.7% GET rps P10 T1 1.165M 1.228M n=3, σ=3.47K, CV=0.3%, CI99%=±1.7%, PI99%=±3.4% n=3, σ=6.87K, CV=0.6%, CI99%=±3.2%, PI99%=±6.4%
+31±2% GET rps P10 T9 2.26M 2.95M n=3, σ=21.6K, CV=1.0%, CI99%=±5.5%, PI99%=±11.0% n=3, σ=26.7K, CV=0.9%, CI99%=±5.2%, PI99%=±10.4%
+4±1% SET rps P1 T1 213K 220.4K n=3, σ=1.96K, CV=0.9%, CI99%=±5.3%, PI99%=±10.5% n=3, σ=885, CV=0.4%, CI99%=±2.3%, PI99%=±4.6%
+2.1±0.7% SET rps P1 T9 1.458M 1.49M n=3, σ=1.30K, CV=0.1%, CI99%=±0.5%, PI99%=±1.0% n=3, σ=10.3K, CV=0.7%, CI99%=±4.0%, PI99%=±7.9%
+1.8±0.8% SET rps P10 T1 1.025M 1.044M n=3, σ=7.80K, CV=0.8%, CI99%=±4.4%, PI99%=±8.7% n=3, σ=2.85K, CV=0.3%, CI99%=±1.6%, PI99%=±3.1%
+8±2% SET rps P10 T9 1.86M 2.01M n=3, σ=34.8K, CV=1.9%, CI99%=±10.7%, PI99%=±21.5% n=3, σ=23.6K, CV=1.2%, CI99%=±6.7%, PI99%=±13.5%

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration
  • clients: 1600
  • cluster_mode: False
  • duration: 180
  • tls: False
  • valkey_benchmark_threads: 90
  • warmup: 30

Legend:

  • Test column: Command, metric, P=pipeline depth, T=io-threads
  • Significance: ✅ significant improvement, ❌ significant regression, ➖ not significant, ❔ insufficient data

Statistical Notes:

  • CV: Coefficient of Variation - relative variability (σ/μ × 100%)
  • CI99%: 99% Confidence Interval - range where the true population mean is likely to fall
  • PI99%: 99% Prediction Interval - range where a single future observation is likely to fall

@rainsupreme (Contributor) commented Mar 24, 2026

+31% performance is wild, to the point of being suspicious. What's going on there? I want to see if it's repeatable.

@github-actions

Benchmark ran on this commit: 9d808ac

RPS Benchmark Comparison: unstable vs a75d344

3 significant change(s)

  • ✅ +30±2% GET rps pipe=10 threads=9 (data_size=96)
  • ✅ +6.0±0.8% GET rps pipe=10 threads=1 (data_size=96)
  • ✅ +4.0±0.4% SET rps pipe=1 threads=1 (data_size=96)

13 with no significant change

Click to expand full comparison tables

data_size = 16

% Change Test unstable a75d344 unstable stats a75d344 stats
+0.4±0.6% GET rps P1 T1 230K 230.6K n=3, σ=1.06K, CV=0.5%, CI99%=±2.7%, PI99%=±5.3% n=3, σ=904, CV=0.4%, CI99%=±2.2%, PI99%=±4.5%
-0±2% GET rps P1 T9 1.53M 1.52M n=3, σ=14.2K, CV=0.9%, CI99%=±5.3%, PI99%=±10.7% n=3, σ=31.1K, CV=2.0%, CI99%=±11.7%, PI99%=±23.4%
+2±2% GET rps P10 T1 1.224M 1.25M n=3, σ=2.53K, CV=0.2%, CI99%=±1.2%, PI99%=±2.4% n=3, σ=18.4K, CV=1.5%, CI99%=±8.4%, PI99%=±16.9%
+3±1% GET rps P10 T9 2.845M 2.92M n=3, σ=3.63K, CV=0.1%, CI99%=±0.7%, PI99%=±1.5% n=3, σ=31.3K, CV=1.1%, CI99%=±6.2%, PI99%=±12.3%
-1±1% SET rps P1 T1 222.2K 220K n=3, σ=684, CV=0.3%, CI99%=±1.8%, PI99%=±3.5% n=3, σ=3.06K, CV=1.4%, CI99%=±7.9%, PI99%=±15.9%
+1±1% SET rps P1 T9 1.49M 1.503M n=3, σ=19.0K, CV=1.3%, CI99%=±7.3%, PI99%=±14.6% n=3, σ=4.24K, CV=0.3%, CI99%=±1.6%, PI99%=±3.2%
+1.1±0.6% SET rps P10 T1 1.043M 1.055M n=3, σ=1.39K, CV=0.1%, CI99%=±0.8%, PI99%=±1.5% n=3, σ=6.09K, CV=0.6%, CI99%=±3.3%, PI99%=±6.6%
+2±1% SET rps P10 T9 1.962M 2.00M n=3, σ=2.67K, CV=0.1%, CI99%=±0.8%, PI99%=±1.6% n=3, σ=26.6K, CV=1.3%, CI99%=±7.6%, PI99%=±15.3%

data_size = 96

% Change Test unstable a75d344 unstable stats a75d344 stats
+5±2% GET rps P1 T1 217K 228K n=3, σ=3.45K, CV=1.6%, CI99%=±9.1%, PI99%=±18.3% n=3, σ=2.43K, CV=1.1%, CI99%=±6.1%, PI99%=±12.2%
+2±1% GET rps P1 T9 1.477M 1.51M n=3, σ=6.01K, CV=0.4%, CI99%=±2.3%, PI99%=±4.7% n=3, σ=17.5K, CV=1.2%, CI99%=±6.6%, PI99%=±13.3%
+6.0±0.8% GET rps P10 T1 1.162M 1.231M n=3, σ=3.81K, CV=0.3%, CI99%=±1.9%, PI99%=±3.8% n=3, σ=8.30K, CV=0.7%, CI99%=±3.9%, PI99%=±7.7%
+30±2% GET rps P10 T9 2.26M 2.95M n=3, σ=12.1K, CV=0.5%, CI99%=±3.1%, PI99%=±6.1% n=3, σ=39.6K, CV=1.3%, CI99%=±7.7%, PI99%=±15.4%
+4.0±0.4% SET rps P1 T1 212.5K 221.1K n=3, σ=610, CV=0.3%, CI99%=±1.6%, PI99%=±3.3% n=3, σ=641, CV=0.3%, CI99%=±1.7%, PI99%=±3.3%
+1±2% SET rps P1 T9 1.46M 1.48M n=3, σ=19.5K, CV=1.3%, CI99%=±7.6%, PI99%=±15.3% n=3, σ=11.4K, CV=0.8%, CI99%=±4.4%, PI99%=±8.9%
+1±1% SET rps P10 T1 1.03M 1.046M n=3, σ=10.5K, CV=1.0%, CI99%=±5.8%, PI99%=±11.7% n=3, σ=3.09K, CV=0.3%, CI99%=±1.7%, PI99%=±3.4%
+8±3% SET rps P10 T9 1.85M 2.00M n=3, σ=39.9K, CV=2.2%, CI99%=±12.4%, PI99%=±24.7% n=3, σ=21.7K, CV=1.1%, CI99%=±6.2%, PI99%=±12.5%

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration
  • clients: 1600
  • cluster_mode: False
  • duration: 180
  • tls: False
  • valkey_benchmark_threads: 90
  • warmup: 30

Legend:

  • Test column: Command, metric, P=pipeline depth, T=io-threads
  • Significance: ✅ significant improvement, ❌ significant regression, ➖ not significant, ❔ insufficient data

Statistical Notes:

  • CV: Coefficient of Variation - relative variability (σ/μ × 100%)
  • CI99%: 99% Confidence Interval - range where the true population mean is likely to fall
  • PI99%: 99% Prediction Interval - range where a single future observation is likely to fall

@rainsupreme (Contributor)

I'm very curious why this is producing a 31% performance gain, but I'm not complaining either 😆

@sarthakaggarwal97 (Contributor)

That's quite a gain. Curious if this is because a 96-byte value is now treated as EMBSTR instead of RAW? I guess it reduces the number of lookups since the value is already embedded with the key? But that probably still doesn't explain this much. It would be great to profile this.

Explanation from Codex about the total size of the EMBSTR

base robj-ptr slot   8
embedded key        18
96B value SDS      100
total              126

@Nikhil-Manglore Nikhil-Manglore marked this pull request as ready for review March 26, 2026 21:57
@Nikhil-Manglore (Member, Author) commented Mar 26, 2026

The tests embedded_string_with_key and embedded_string_with_key_and_expire pass now. The failing test is an unrelated flaky test; we can rerun it.

@zuiderkwast (Contributor) left a comment

The jemalloc size classes in this range come in 16-byte intervals (64, 80, 96, 112, 128), so in the worst case we get 15 bytes of internal fragmentation in the allocation, but we save the 8 bytes for the pointer.

Previously, for unembedded strings in the 64-128 byte range, the internal fragmentation was up to 15 bytes (for 96B strings, the 100B sds representation lands in a 112B bucket, wasting 12B), plus the fixed 8 bytes for obj->ptr.

Performance also looks optimal in most cases, it seems. 💯

@sarthakaggarwal97 (Contributor) left a comment

LGTM! We should also change our performance benchmarks to include more string sizes after this PR gets merged.

@rainsupreme (Contributor) left a comment

Huzzah 🥳
(lgtm!)

@Nikhil-Manglore (Member, Author) commented Mar 30, 2026

> That's quite a gain. Curious if this is because a 96-byte value is now treated as EMBSTR instead of RAW? I guess it reduces the number of lookups since the value is already embedded with the key? But that probably still doesn't explain this much. It would be great to profile this.
>
> Explanation from Codex about the total size of the EMBSTR
>
> base robj-ptr slot   8
> embedded key        18
> 96B value SDS      100
> total              126

Yeah, 96-byte values are now treated as EMBSTR. Now that the value is embedded, the number of cache misses is drastically reduced and there's no pointer dereference for 96-byte values. In high-throughput configurations the gain is larger because the per-call savings are multiplied across all io-threads.

@zuiderkwast zuiderkwast merged commit 034d16a into valkey-io:unstable Apr 7, 2026
57 of 58 checks passed
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 9.1 Apr 7, 2026
@zuiderkwast zuiderkwast added the release-notes This issue should get a line item in the release notes label Apr 7, 2026
@Nikhil-Manglore Nikhil-Manglore deleted the embedded_strings_update branch April 7, 2026 17:59


Development

Successfully merging this pull request may close these issues.

[PERFORMANCE] Potential regression due to Removal of internal server object pointer overhead

4 participants