
feat: consume asyncAggregateWithRandomness branch from lodestar-z#9342

Open
spiral-ladder wants to merge 4 commits into bing/blst-z from bing/async-agg-bls

Conversation

@spiral-ladder
Contributor

Unfortunately, experiments with the sync version of aggregateWithRandomness, while performant on bare metal, led to regressions on the dockerized node: metrics showed late head imports and significantly larger peer churn.

Reverting back to asynchronously aggregating with randomness; this was previously put off because I could not get it up to par with the Rust version (but now it is!). PR here: ChainSafe/lodestar-z#353

Link to lodekeeper's analysis on discord: https://discord.com/channels/593655374469660673/1479085402395508976/1502163320176775239
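
For context on what the consuming side looks like, here is a minimal sketch, assuming a binding that exposes asyncAggregateWithRandomness and performs the randomized aggregation off the main thread. The declarations below are illustrative placeholders, not the actual lodestar-z / blst-z API; the real surface is defined in ChainSafe/lodestar-z#353.

```ts
// Illustrative sketch only: these declarations are placeholders modeled on the PR
// title, not the real blst-z binding surface.
declare class PublicKey {}
declare class Signature {}
declare function asyncAggregateWithRandomness(
  sets: {pk: PublicKey; sig: Uint8Array}[]
): Promise<{pk: PublicKey; sig: Signature}>;

// The randomized pubkey/signature aggregation runs off the main thread inside the
// binding; the caller only awaits the result instead of blocking the event loop.
async function aggregateSameMessageSet(
  sets: {pk: PublicKey; sig: Uint8Array}[]
): Promise<{publicKey: PublicKey; signature: Signature}> {
  const {pk, sig} = await asyncAggregateWithRandomness(sets);
  return {publicKey: pk, signature: sig};
}
```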

Screenshots

Note that the key date here is April 27: before that, we were running the bare-metal changes from #8900, and after that date, the dockerized version. Data taken from feat3-super.

BLS job wait time 2x worse on the dockerized version:

Screenshot 2026-05-08 at 12 22 40 PM

Event loop spike:

Screenshot 2026-05-08 at 12 23 36 PM

Wrong head ratio up substantially:

Screenshot 2026-05-08 at 12 33 23 PM

Gossip validation queues backing up:

Screenshot 2026-05-08 at 12 34 41 PM

Taking longer to process blocks:

Screenshot 2026-05-08 at 12 35 15 PM

@spiral-ladder self-assigned this May 8, 2026
@spiral-ladder requested a review from a team as a code owner May 8, 2026 04:35
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors BLS signature aggregation by moving the aggregateWithRandomness logic from worker threads to the main thread using an asynchronous implementation. Key changes include making jobItemWorkReq asynchronous, simplifying the BlsWorkReq and BlsWorkResult types by removing the type discriminator and specific aggregation metrics, and updating the metrics system to track asynchronous aggregation duration. Feedback suggests parallelizing the asynchronous aggregation tasks using Promise.all to prevent sequential execution from delaying worker pool processing.

// Note: This can throw, must be handled per-job.
// Pubkey and signature aggregation is deferred here
workReq = jobItemWorkReq(job, this.pubkeyCache, this.metrics);
workReq = await jobItemWorkReq(job, this.pubkeyCache, this.metrics);
Contributor


medium

Awaiting jobItemWorkReq inside the loop results in sequential execution of the asynchronous aggregation tasks. If the jobs array contains multiple items of type sameMessage, this will block the preparation of the entire batch, delaying the start of the worker pool processing. Consider using Promise.all to parallelize these calls before sending the batch to the worker pool, while ensuring that per-job error handling is preserved.
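
A sketch of that suggestion, assuming the surrounding queue code looks roughly like the diff above: prepare all work requests concurrently with Promise.allSettled so that one failing job does not reject the whole batch, then retire failed jobs individually before dispatching the rest to the worker pool. jobItemWorkReq, this.pubkeyCache, this.metrics, and BlsWorkReq are taken from the diff and review summary; onJobError is a hypothetical stand-in for the queue's existing per-job failure path.

```ts
// Sketch only: prepares work requests in parallel while keeping per-job error handling.
// `onJobError` is a hypothetical placeholder for the queue's existing per-job failure path.
const prepared = await Promise.allSettled(
  jobs.map((job) => jobItemWorkReq(job, this.pubkeyCache, this.metrics))
);

const workReqs: BlsWorkReq[] = [];
for (let i = 0; i < prepared.length; i++) {
  const result = prepared[i];
  if (result.status === "fulfilled") {
    workReqs.push(result.value);
  } else {
    // A failure here only retires this job; the rest of the batch still goes to the pool.
    this.onJobError(jobs[i], result.reason);
  }
}
// workReqs is now ready to be dispatched to the worker pool as one batch.
```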

Unfortunately experiments with the sync version of aggregateWithRandomness, while performant on bare-metal, led to some regressions on the dockerized node - metrics showed late head imports and significantly larger peer churn.

Reverting back to asynchronously aggregating with randomness; this was previously put off due to me failing to get it up to par with the rust version (but now it is!). PR here: ChainSafe/lodestar-z#353

The nice thing about this reversion is that we are now a complete 1-to-1 swap on the bls side; no rearchitecting needed.
@spiral-ladder
Contributor Author

spiral-ladder commented May 8, 2026

Also, a relevant comment from @twoeths on Discord regarding the dockerized deployment of PR #8900:

I found memory did not increase but gc did a lot and that affected all metrics

Screenshot 2026-05-08 at 8 20 45 PM

