fix: prevent seed state cache poisoning in loadState clone#9246
fix: prevent seed state cache poisoning in loadState clone#9246twoeths merged 7 commits intoChainSafe:unstablefrom
Conversation
The default `clone()` in `@chainsafe/ssz` transfers the source sub-view's
cache to the new instance, which means `migrated.validators` and the seed
container's cached child-view snapshot share the SAME internal `nodes[]`
and `caches[]` arrays. A subsequent `migratedState.commit()` then writes
modified validator / inactivity-score nodes into those shared arrays,
silently corrupting the seed state's cache snapshot at the modified
indices.
The corruption stays latent until the seed is later cloned with
transfer-cache enabled - the path `verifyBlock` takes via
`preState.clone({dontTransferCache: false})`. On that next-block clone,
reads at the modified index return the migrated state's validator instead
of the seed's, which surfaces in production as
"Withdrawal mismatch at index=0" divergences between Lodestar and EL.
Use `clone(true)` so the migrated sub-view starts with a fresh empty
cache and its commit cannot reach into the seed's cache arrays. A
regression test exercises the `loadState -> seedState.clone() ->
validators.getReadonly(modifiedIndex)` sequence.
Root cause was introduced by ChainSafe#8857 which added the `loadOtherState` /
shared-head seed path that exercises this clone in production.
🤖 Generated with AI assistance
There was a problem hiding this comment.
Code Review
This pull request addresses a cache aliasing bug in the loadState utility where cloning subviews with default settings could lead to corruption of the seed state's cache during a commit. The fix involves using clone(true) for inactivityScores and validators to ensure a fresh cache. A regression test has been added to prevent future occurrences. Feedback suggests using the more explicit object syntax {dontTransferCache: true} in the clone calls to improve code readability and maintain consistency with the inline documentation.
| // silently corrupting the seed container's cache snapshot. That corruption only surfaces | ||
| // on the next `seedState.clone({dontTransferCache: false})` read, producing a | ||
| // "Withdrawal mismatch at index=0"-style divergence downstream. | ||
| migratedState.inactivityScores = seedState.inactivityScores.clone(true); |
There was a problem hiding this comment.
For better readability and to avoid "magic boolean" parameters, it is recommended to use the object syntax for the clone options. This also ensures consistency with the documentation provided in the comments above (line 118) and the PR description, which both reference the {dontTransferCache: ...} syntax.
| migratedState.inactivityScores = seedState.inactivityScores.clone(true); | |
| migratedState.inactivityScores = seedState.inactivityScores.clone({dontTransferCache: true}); |
There was a problem hiding this comment.
Thanks — but this suggestion doesn't match the actual SSZ API. TreeViewDU.clone() is typed as clone(dontTransferCache?: boolean) (see @chainsafe/ssz packages/ssz/src/viewDU/abstract.ts:111 and partialListComposite.ts:156), so {dontTransferCache: true} wouldn't type-check.
The inconsistency you flagged is real, but the right fix is to update my comments to match the boolean API rather than the code to match the (incorrect) comments. Pushed dde4635 correcting the three places that referenced the object-syntax form.
| // arrays, silently corrupting the seed container's cache snapshot. That corruption only | ||
| // surfaces on the next `seedState.clone({dontTransferCache: false})` read, producing a | ||
| // "Withdrawal mismatch at index=0"-style divergence downstream. | ||
| migratedState.validators = seedState.validators.clone(true); |
There was a problem hiding this comment.
For better readability and to avoid "magic boolean" parameters, it is recommended to use the object syntax for the clone options. This also ensures consistency with the documentation provided in the comments above (line 201) and the PR description, which both reference the {dontTransferCache: ...} syntax.
| migratedState.validators = seedState.validators.clone(true); | |
| migratedState.validators = seedState.validators.clone({dontTransferCache: true}); |
There was a problem hiding this comment.
Same as above — TreeViewDU.clone() only accepts a boolean, so the object syntax wouldn't compile. I updated the inline comments here and on the inactivity-scores path in dde4635 so they no longer reference a {dontTransferCache: false} form that doesn't exist.
The inline comments referenced a `seedState.clone({dontTransferCache: false})`
object-syntax call that does not exist — the SSZ `TreeViewDU.clone()` signature
is `clone(dontTransferCache?: boolean)`. Clarify that the corruption surfaces
on the next default `seedState.clone()` read, and that `clone(true)` is the
dontTransferCache flag.
🤖 Generated with AI assistance
|
@twoeths are we good with this one? waiting for your final approval |
| // migratedState starts with the same inactivityScores as seed state. | ||
| // `clone(true)` (dontTransferCache) is REQUIRED: the default transfer-cache clone would alias | ||
| // the seed subview's internal `nodes[]` / `caches[]` arrays, and a subsequent | ||
| // `migratedState.commit()` would write the modified score nodes into those shared arrays, |
There was a problem hiding this comment.
I don't see the nodes are shared after a transferred cache https://github.com/ChainSafe/ssz/blob/07fb1216144c99c2070282bbc07b6d438ba88453/packages/ssz/src/viewDU/abstract.ts#L115
the cache is actually transferred as the variable name inferred
There was a problem hiding this comment.
Good point — the semantics here are: (default) transfers the cache to the new instance, while () leaves the new clone with a fresh empty cache. The nodes (tree structure) follow copy-on-write semantics regardless.
The stale multi-line comment that described this incorrectly was the root of the confusion. Fixed in 6779e05 — both sites now use the concise form , consistent with the rest of the codebase. See the reply on the adjacent thread for more context.
twoeths
left a comment
There was a problem hiding this comment.
while I agree with the change to use clone(true), I don't see the cached nodes are shared between the two, it's transferred instead
/**
* Return a new ViewDU instance referencing the same internal `Node`.
*
* By default it will transfer the cache of this ViewDU to the new cloned instance. Set `dontTransferCache` to true
* to NOT transfer the cache to the cloned instance.
*/
clone(dontTransferCache?: boolean): this {
if (dontTransferCache) return this.type.getViewDU(this.node) as this;
const cache = this.cache;
this.clearCache();
return this.type.getViewDU(this.node, cache) as this;
}@lodekeeper please double check this, and update the code comment if needed
Clarify that default clone() transfers cache off the source subview, but still reuses the arrays referenced by the seed container's cached child snapshot.\n\n🤖 Generated with AI assistance
|
Double-checked it — you're right that The bug is a bit narrower: the transferred I pushed |
| // the next default `seedState.clone()` read, producing a "Withdrawal mismatch at index=0"- | ||
| // style divergence downstream. | ||
| // `clone(true)` (dontTransferCache) is REQUIRED: the default transfer-cache clone reuses the | ||
| // same `nodes[]` / `caches[]` arrays already referenced by the seed container's cached |
There was a problem hiding this comment.
this comment is wrong, nodes[] are not shared but transferred
I think having these multiline comments are not needed
instead, just add 1 line comment: true = do not transfer cache
similar to other places
There was a problem hiding this comment.
Good catch — you're right that nodes[] are transferred (not shared), and the multi-line explanation overcomplicated it. Replaced both blocks (line 115 inactivityScores + line 197 validators) with the one-line true = do not transfer cache convention used elsewhere in the codebase. Pushed in 6779e05.
There was a problem hiding this comment.
Addressed in 6779e05 — see reply on the parent thread.
There was a problem hiding this comment.
Addressed in the parent thread — see r3198891374. Replaced both multiline blocks with true = do not transfer cache (equivalent to dontTransferCache=true).
Replace multi-line cache-transfer rationale at loadState.ts:115 / 197 with the one-line convention used elsewhere (`true = do not transfer cache`), per ChainSafe#9246 review feedback. 🤖 Generated with AI assistance
|
@twoeths you're right that at the ViewDU level the cache is transferred, not shared — The sharing I described in the PR body is actually one level higher, at the parent container's cache. Here's the precise path:
The SSZ source even has a TODO at // TODO: Consider these properties are not accessible in the cache object persisted in the parent's cache.
// nodes, caches, _length, and nodesPopulated are mutated. Consider having them in a _cache object such that
// mutations affect the cache already found in the parent objectSo the code comment // Pass true (dontTransferCache) to avoid the parent container's cached nodes[] array
// being shared with the clone — a default clone() would poison seedState's validator cache on commitLet me know if you'd like a follow-up PR to improve the comment, or if you're satisfied leaving it as-is since the root cause is well-documented in the PR description. |
Summary
Fixes the v1.42.0 "Withdrawal mismatch at index=0" regression by changing the two
clone()calls inloadState.tstoclone(true)(i.e.dontTransferCache=true).The default
clone()in@chainsafe/ssztransfers the source sub-view's cache to the new instance, which means themigratedState.validatorssub-view and the seed container's cached child-view snapshot share the same internalnodes[]andcaches[]arrays. A subsequentmigratedState.commit()writes modified validator / inactivity-score nodes into those shared arrays, silently corrupting the seed state's cache snapshot at the modified indices.The corruption stays latent until the seed state is later cloned with transfer-cache enabled — the path
verifyBlocktakes viapreState.clone({dontTransferCache: false}). On that next-block clone, reads at the modified index return the migrated state's validator instead of the seed's, which surfaces asWithdrawal mismatch at index=0divergences between Lodestar and EL.Timeline of the corruption
On the next block:
Root cause
Introduced by #8857 (
chore: consume BeaconStateView) which added theloadOtherState/ shared-head seed path that exercises this clone in production.Test plan
loadState does not poison seed state's cacheinpackages/state-transition/test/unit/util/loadState.test.ts0xaa-filled validator onpostState.clone()) and PASSES with the fixstate-transitionutil tests passcheck-typesandlintcleanRelation to #9245
PR #9245 (
fix: gate loadOtherState validators/balances preload behind opt-in) addresses a different regression from the same #8857-era changes — the eagergetAllReadonlyValues()preload causing memory spikes on the API path. The two fixes are independent and both needed:persistentCheckpointsCachealso exercisesloadState()with the problematicclone(), so fix: gate loadOtherState validators/balances preload behind opt-in #9245's preload gating alone doesn't cover the full surface.🤖 Generated with AI assistance