feat: consume asyncAggregateWithRandomness branch from lodestar-z#9342
feat: consume asyncAggregateWithRandomness branch from lodestar-z#9342spiral-ladder wants to merge 4 commits intobing/blst-zfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors BLS signature aggregation by moving the aggregateWithRandomness logic from worker threads to the main thread using an asynchronous implementation. Key changes include making jobItemWorkReq asynchronous, simplifying the BlsWorkReq and BlsWorkResult types by removing the type discriminator and specific aggregation metrics, and updating the metrics system to track asynchronous aggregation duration. Feedback suggests parallelizing the asynchronous aggregation tasks using Promise.all to prevent sequential execution from delaying worker pool processing.
| // Note: This can throw, must be handled per-job. | ||
| // Pubkey and signature aggregation is defered here | ||
| workReq = jobItemWorkReq(job, this.pubkeyCache, this.metrics); | ||
| workReq = await jobItemWorkReq(job, this.pubkeyCache, this.metrics); |
There was a problem hiding this comment.
Awaiting jobItemWorkReq inside the loop results in sequential execution of the asynchronous aggregation tasks. If the jobs array contains multiple items of type sameMessage, this will block the preparation of the entire batch, delaying the start of the worker pool processing. Consider using Promise.all to parallelize these calls before sending the batch to the worker pool, while ensuring that per-job error handling is preserved.
Unfortunately experiments with the sync version of aggregateWithRandomness, while performant on bare-metal, led to some regressions on the dockerized node - metrics showed late head imports and significantly larger peer churn. Reverting back to asynchronously aggregating with randomness; this was previously put off due to me failing to get it up to par with the rust version (but now it is!). PR here: ChainSafe/lodestar-z#353 The nice thing about this reversion is that we are now a complete 1-to-1 swap on the bls side; no rearchitecting needed.
19d0316 to
184a2ff
Compare
c315f78 to
be0c717
Compare

Unfortunately experiments with the sync version of aggregateWithRandomness, while performant on bare-metal, led to some regressions on the dockerized node - metrics showed late head imports and significantly larger peer churn.
Reverting back to asynchronously aggregating with randomness; this was previously put off due to me failing to get it up to par with the rust version (but now it is!). PR here: ChainSafe/lodestar-z#353
Link to lodekeeper's analysis on discord: https://discord.com/channels/593655374469660673/1479085402395508976/1502163320176775239
Screenshots
Note that the thing to look out for here is the date April 27 - before that, we were running bare-metal changes from #8900, and after that date, the dockerized version. Data taken from feat3-super.
BLS job wait time 2x worse on the dockerized version:
Event loop spike:
Wrong head ratio up substantially:
Gossip validation queues up:
Taking longer to process blocks: