Skip to content

fix: prevent seed state cache poisoning in loadState clone#9246

Merged
twoeths merged 7 commits intoChainSafe:unstablefrom
lodekeeper:fix/loadstate-clone-cache-aliasing
May 7, 2026
Merged

fix: prevent seed state cache poisoning in loadState clone#9246
twoeths merged 7 commits intoChainSafe:unstablefrom
lodekeeper:fix/loadstate-clone-cache-aliasing

Conversation

@lodekeeper
Copy link
Copy Markdown
Contributor

Summary

Fixes the v1.42.0 "Withdrawal mismatch at index=0" regression by changing the two clone() calls in loadState.ts to clone(true) (i.e. dontTransferCache=true).

The default clone() in @chainsafe/ssz transfers the source sub-view's cache to the new instance, which means the migratedState.validators sub-view and the seed container's cached child-view snapshot share the same internal nodes[] and caches[] arrays. A subsequent migratedState.commit() writes modified validator / inactivity-score nodes into those shared arrays, silently corrupting the seed state's cache snapshot at the modified indices.

The corruption stays latent until the seed state is later cloned with transfer-cache enabled — the path verifyBlock takes via preState.clone({dontTransferCache: false}). On that next-block clone, reads at the modified index return the migrated state's validator instead of the seed's, which surfaces as Withdrawal mismatch at index=0 divergences between Lodestar and EL.

Timeline of the corruption

// Inside loadState() with seedState = head state:
migratedState.validators = seedState.validators.clone();
// ^ migrated.validators.nodes === seedState.caches[validatorsIndex].nodes (shared)

for (const i of modifiedValidators) {
  migratedState.validators.set(i, loadValidator(...));  // staged in viewsChanged
}

migratedState.commit();
// ^ arrayComposite.js commit(): `this.nodes[index] = node`
//   writes newValidator into the SHARED nodes[] array.
//   seedState.caches[validatorsIndex].nodes[modifiedIndex] is now poisoned.

On the next block:

const preState = headState.clone();  // default dontTransferCache=false
// ^ transfers the poisoned caches[] snapshot to preState
preState.validators.getReadonly(i);  // returns migrated validator, not seed's
// -> getExpectedWithdrawals reads wrong validator at index 0
// -> "Withdrawal mismatch at index=0"

Root cause

Introduced by #8857 (chore: consume BeaconStateView) which added the loadOtherState / shared-head seed path that exercises this clone in production.

Test plan

  • New regression test loadState does not poison seed state's cache in packages/state-transition/test/unit/util/loadState.test.ts
  • Verified the test FAILS without the fix (reads 0xaa-filled validator on postState.clone()) and PASSES with the fix
  • All 176 existing state-transition util tests pass
  • check-types and lint clean

Relation to #9245

PR #9245 (fix: gate loadOtherState validators/balances preload behind opt-in) addresses a different regression from the same #8857-era changes — the eager getAllReadonlyValues() preload causing memory spikes on the API path. The two fixes are independent and both needed:

🤖 Generated with AI assistance

The default `clone()` in `@chainsafe/ssz` transfers the source sub-view's
cache to the new instance, which means `migrated.validators` and the seed
container's cached child-view snapshot share the SAME internal `nodes[]`
and `caches[]` arrays. A subsequent `migratedState.commit()` then writes
modified validator / inactivity-score nodes into those shared arrays,
silently corrupting the seed state's cache snapshot at the modified
indices.

The corruption stays latent until the seed is later cloned with
transfer-cache enabled - the path `verifyBlock` takes via
`preState.clone({dontTransferCache: false})`. On that next-block clone,
reads at the modified index return the migrated state's validator instead
of the seed's, which surfaces in production as
"Withdrawal mismatch at index=0" divergences between Lodestar and EL.

Use `clone(true)` so the migrated sub-view starts with a fresh empty
cache and its commit cannot reach into the seed's cache arrays. A
regression test exercises the `loadState -> seedState.clone() ->
validators.getReadonly(modifiedIndex)` sequence.

Root cause was introduced by ChainSafe#8857 which added the `loadOtherState` /
shared-head seed path that exercises this clone in production.

🤖 Generated with AI assistance
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a cache aliasing bug in the loadState utility where cloning subviews with default settings could lead to corruption of the seed state's cache during a commit. The fix involves using clone(true) for inactivityScores and validators to ensure a fresh cache. A regression test has been added to prevent future occurrences. Feedback suggests using the more explicit object syntax {dontTransferCache: true} in the clone calls to improve code readability and maintain consistency with the inline documentation.

// silently corrupting the seed container's cache snapshot. That corruption only surfaces
// on the next `seedState.clone({dontTransferCache: false})` read, producing a
// "Withdrawal mismatch at index=0"-style divergence downstream.
migratedState.inactivityScores = seedState.inactivityScores.clone(true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and to avoid "magic boolean" parameters, it is recommended to use the object syntax for the clone options. This also ensures consistency with the documentation provided in the comments above (line 118) and the PR description, which both reference the {dontTransferCache: ...} syntax.

Suggested change
migratedState.inactivityScores = seedState.inactivityScores.clone(true);
migratedState.inactivityScores = seedState.inactivityScores.clone({dontTransferCache: true});

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — but this suggestion doesn't match the actual SSZ API. TreeViewDU.clone() is typed as clone(dontTransferCache?: boolean) (see @chainsafe/ssz packages/ssz/src/viewDU/abstract.ts:111 and partialListComposite.ts:156), so {dontTransferCache: true} wouldn't type-check.

The inconsistency you flagged is real, but the right fix is to update my comments to match the boolean API rather than the code to match the (incorrect) comments. Pushed dde4635 correcting the three places that referenced the object-syntax form.

// arrays, silently corrupting the seed container's cache snapshot. That corruption only
// surfaces on the next `seedState.clone({dontTransferCache: false})` read, producing a
// "Withdrawal mismatch at index=0"-style divergence downstream.
migratedState.validators = seedState.validators.clone(true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and to avoid "magic boolean" parameters, it is recommended to use the object syntax for the clone options. This also ensures consistency with the documentation provided in the comments above (line 201) and the PR description, which both reference the {dontTransferCache: ...} syntax.

Suggested change
migratedState.validators = seedState.validators.clone(true);
migratedState.validators = seedState.validators.clone({dontTransferCache: true});

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as aboveTreeViewDU.clone() only accepts a boolean, so the object syntax wouldn't compile. I updated the inline comments here and on the inactivity-scores path in dde4635 so they no longer reference a {dontTransferCache: false} form that doesn't exist.

The inline comments referenced a `seedState.clone({dontTransferCache: false})`
object-syntax call that does not exist — the SSZ `TreeViewDU.clone()` signature
is `clone(dontTransferCache?: boolean)`. Clarify that the corruption surfaces
on the next default `seedState.clone()` read, and that `clone(true)` is the
dontTransferCache flag.

🤖 Generated with AI assistance
wemeetagain
wemeetagain previously approved these changes Apr 21, 2026
nflaig
nflaig previously approved these changes Apr 21, 2026
Copy link
Copy Markdown
Member

@nflaig nflaig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, leaving up to @twoeths for final approval/merge

@nflaig
Copy link
Copy Markdown
Member

nflaig commented May 5, 2026

@twoeths are we good with this one? waiting for your final approval

// migratedState starts with the same inactivityScores as seed state.
// `clone(true)` (dontTransferCache) is REQUIRED: the default transfer-cache clone would alias
// the seed subview's internal `nodes[]` / `caches[]` arrays, and a subsequent
// `migratedState.commit()` would write the modified score nodes into those shared arrays,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the nodes are shared after a transferred cache https://github.com/ChainSafe/ssz/blob/07fb1216144c99c2070282bbc07b6d438ba88453/packages/ssz/src/viewDU/abstract.ts#L115

the cache is actually transferred as the variable name inferred

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — the semantics here are: (default) transfers the cache to the new instance, while () leaves the new clone with a fresh empty cache. The nodes (tree structure) follow copy-on-write semantics regardless.

The stale multi-line comment that described this incorrectly was the root of the confusion. Fixed in 6779e05 — both sites now use the concise form , consistent with the rest of the codebase. See the reply on the adjacent thread for more context.

Copy link
Copy Markdown
Contributor

@twoeths twoeths left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I agree with the change to use clone(true), I don't see the cached nodes are shared between the two, it's transferred instead

  /**
   * Return a new ViewDU instance referencing the same internal `Node`.
   *
   * By default it will transfer the cache of this ViewDU to the new cloned instance. Set `dontTransferCache` to true
   * to NOT transfer the cache to the cloned instance.
   */
  clone(dontTransferCache?: boolean): this {
    if (dontTransferCache) return this.type.getViewDU(this.node) as this;

    const cache = this.cache;
    this.clearCache();
    return this.type.getViewDU(this.node, cache) as this;
  }

@lodekeeper please double check this, and update the code comment if needed

Clarify that default clone() transfers cache off the source subview, but still reuses the arrays referenced by the seed container's cached child snapshot.\n\n🤖 Generated with AI assistance
@lodekeeper lodekeeper dismissed stale reviews from wemeetagain and nflaig via 55e7a8c May 7, 2026 02:45
@lodekeeper
Copy link
Copy Markdown
Contributor Author

Double-checked it — you're right that clone() transfers cache off the source subview itself, it doesn't leave both subview instances sharing that source-view cache.

The bug is a bit narrower: the transferred nodes[] / caches[] arrays are still the same ones already referenced by the seed container's cached child-view snapshot, so migratedState.commit() can still mutate what a later default seedState.clone() will observe.

I pushed 55e7a8ca7b to clarify the comments/tests to describe that exact mechanism instead of saying the source subview cache is shared between both instances. No behavior change, just wording cleanup.

// the next default `seedState.clone()` read, producing a "Withdrawal mismatch at index=0"-
// style divergence downstream.
// `clone(true)` (dontTransferCache) is REQUIRED: the default transfer-cache clone reuses the
// same `nodes[]` / `caches[]` arrays already referenced by the seed container's cached
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is wrong, nodes[] are not shared but transferred

I think having these multiline comments are not needed
instead, just add 1 line comment: true = do not transfer cache
similar to other places

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right that nodes[] are transferred (not shared), and the multi-line explanation overcomplicated it. Replaced both blocks (line 115 inactivityScores + line 197 validators) with the one-line true = do not transfer cache convention used elsewhere in the codebase. Pushed in 6779e05.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6779e05 — see reply on the parent thread.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the parent thread — see r3198891374. Replaced both multiline blocks with true = do not transfer cache (equivalent to dontTransferCache=true).

Replace multi-line cache-transfer rationale at loadState.ts:115 / 197
with the one-line convention used elsewhere (`true = do not transfer
cache`), per ChainSafe#9246 review feedback.

🤖 Generated with AI assistance
@twoeths twoeths merged commit e3c96e7 into ChainSafe:unstable May 7, 2026
19 checks passed
@lodekeeper
Copy link
Copy Markdown
Contributor Author

@twoeths you're right that at the ViewDU level the cache is transferred, not shared — clone() moves V.nodes to the clone and leaves the original ViewDU with a fresh empty array. Good catch.

The sharing I described in the PR body is actually one level higher, at the parent container's cache. Here's the precise path:

  1. When seedState.validators is accessed, the parent container (seedState) stores caches[validatorsIndex] = { nodes: V.nodes, caches: V.caches, ... } — a direct reference to the same nodes[] array.
  2. V.clone() grabs cache = this.cache (a new object but with the same V.nodes array ref), then clearCache() does this.nodes = [] (a brand new array). The clone now holds the original array; the parent container's caches[validatorsIndex].nodes also still points to that original array.
  3. migratedState.commit()this.nodes[index] = node mutates the shared array in place, which simultaneously corrupts seedState.caches[validatorsIndex].nodes[index].

The SSZ source even has a TODO at arrayComposite.js lines 12–14 that explicitly flags this:

// TODO: Consider these properties are not accessible in the cache object persisted in the parent's cache.
// nodes, caches, _length, and nodesPopulated are mutated. Consider having them in a _cache object such that
// mutations affect the cache already found in the parent object

So the code comment // true = do not transfer cache is minimal but accurate. I could update it to something like:

// Pass true (dontTransferCache) to avoid the parent container's cached nodes[] array
// being shared with the clone — a default clone() would poison seedState's validator cache on commit

Let me know if you'd like a follow-up PR to improve the comment, or if you're satisfied leaving it as-is since the root cause is well-documented in the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants