Try updating to orc #4016

Draft
bhartnett wants to merge 4 commits into master from try-update-to-orc

bhartnett commented Feb 25, 2026

So far the following works with orc on Linux:

  • Building the standalone execution client: make nimbus_execution_client
  • Apart from the test_txpool test case, all tests pass when running: make test
  • The era file block import appears to work fine.
  • All eest blockchain tests pass when running: make eest_blockchain_test
  • All eest engine tests pass when running: make eest_engine_test

Comment thread tests/all_tests.nim
test_rpc,
test_snap,
test_transaction_json,
test_txpool,
Contributor Author

This test is crashing with a segfault on Linux.

Contributor

It's a stack-based segfault caused by the ulimit -s that runs as part of make test.

Contributor

Notably, running ./env.sh nim c -r tests/all_tests without ulimit does not reproduce it.
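For anyone trying to reproduce this outside make test, here is a minimal sketch of running a command under a reduced soft stack limit. The helper name run_with_stack_limit and the 1024 KiB value are assumptions for illustration; the limit that make test actually applies is not shown in this thread.

```shell
# Print the current soft stack limit (in KiB, or "unlimited").
echo "current soft stack limit: $(ulimit -s)"

# Run a command under a smaller soft stack limit without affecting this shell.
# The parentheses start a subshell, so the lowered limit is discarded on exit.
run_with_stack_limit() {
  limit_kib=$1
  shift
  ( ulimit -S -s "$limit_kib" && "$@" )
}

# Hypothetical usage (1024 KiB is an assumed value, not taken from the Makefile):
# run_with_stack_limit 1024 ./env.sh nim c -r tests/all_tests
```

Because the limit is set inside a subshell, the surrounding shell keeps its original limit, which makes it easy to compare a run with and without the restriction.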

Comment thread tests/config.nims
# and look for .su files in "./build/", "./nimcache/" or $TMPDIR that list the
# stack size of each function.
switch("passC", "-fstack-usage -Werror=stack-usage=1048576")
switch("passL", "-fstack-usage -Werror=stack-usage=1048576")
Contributor Author

It appears that orc increases stack usage.

/home/user/development/status-im/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system.nim: In function ‘_ZN7eip759434validateBlobTransactionWrapper7594EN10pooled_txs17PooledTransactionE’:
/home/user/development/status-im/nimbus-eth1/execution_chain/core/eip7594.nim:19:15: error: stack usage might be 1205200 bytes [-Werror=stack-usage=]
   19 | proc validateBlobTransactionWrapper7594*(tx: PooledTransaction):
      |               ^
lto1: some warnings being treated as errors
make[1]: *** [/tmp/ccZV498U.mk:227: /tmp/ccLQlohJ.ltrans75.ltrans.o] Error 1
make[1]: *** Waiting for unfinished jobs....
lto-wrapper: fatal error: make returned 2 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
Error: execution of an external program failed: 'g++  @all_tests_linkerArgs.txt'
stack trace: (most recent call last)
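To see which functions grow under orc, the .su files mentioned in the tests/config.nims comment can be ranked by size. The sketch below assumes GCC's tab-separated .su layout (location and function name, size in bytes, qualifier). The helper name largest_stack_frames is hypothetical, and the directories to search are supplied by the caller.

```shell
# List the ten largest per-function stack frames recorded by -fstack-usage.
# Each .su line is assumed to be tab-separated:
#   location:function <TAB> size in bytes <TAB> qualifier
largest_stack_frames() {
  find "$@" -name '*.su' -exec cat {} + 2>/dev/null |
    sort -t "$(printf '\t')" -k2,2nr |
    head -n 10
}

# Hypothetical usage, with the directories named in the config comment:
# largest_stack_frames ./build ./nimcache "${TMPDIR:-/tmp}"
```

Running it once on a refc build and once on an orc build, then comparing the two lists, should show whether the growth is concentrated in a few functions such as validateBlobTransactionWrapper7594 or spread more widely.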

Contributor

Yes, this seems to be behind the segfault.

bhartnett commented Feb 25, 2026

There is a compile failure in nimbus-eth2:

Building: build/nimbus
/home/user/development/status-im/nimbus-eth1/vendor/nimbus-eth2/beacon_chain/networking/peer_protocol.nim(191, 1) template/generic instantiation of `p2pProtocol` from here
/home/user/development/status-im/nimbus-eth1/vendor/nimbus-eth2/beacon_chain/networking/eth2_protocol_dsl.nim(349, 17) template/generic instantiation of `createPeerState` from here
/home/user/development/status-im/nimbus-eth1/vendor/nimbus-eth2/beacon_chain/networking/eth2_protocol_dsl.nim(187, 10) Error: expression cannot be cast to 'RootRef'
make: *** [Makefile:227: nimbus] Error 1

Casting to RootRef is no longer allowed with orc: nim-lang/Nim#20016

@agnxsh Perhaps you could have a look at fixing this compile error on the nimbus-eth2 side? I'm not too familiar with this code.

@bhartnett
Contributor Author

The portal tests fail with a segfault:

user@pop-os:~/development/status-im/nimbus-eth1/portal/tests/beacon_network_tests$ nim compile -d:danger --verbosity:0 --hints:off --run "/home/user/development/status-im/nimbus-eth1/portal/tests/beacon_network_tests/test_beacon_content.nim"
Beacon Content Keys and Values ..Segmentation fault (core dumped)
Error: execution of an external program failed: '/home/user/development/status-im/nimbus-eth1/portal/tests/beacon_network_tests/test_beacon_content'

Appears to be caused by a stack overflow:

valgrind --leak-check=full ./test_beacon_content
Beacon Content Keys and Values ..==214801== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==214801== Can't extend stack to 0x1ffe800d08 during signal delivery for thread 1:
==214801==   no stack segment
==214801== 
==214801== Process terminating with default action of signal 11 (SIGSEGV)
==214801==  Access not within mapped region at address 0x1FFE800D08
==214801== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==214801==    at 0x237BD0: beacon_init_loader::loadNetworkData(string) [clone .constprop.0] (beacon_init_loader.nim:25)
==214801==  If you believe this happened as a result of a stack
==214801==  overflow in your program's main thread (unlikely but
==214801==  possible), you can try to increase the size of the
==214801==  main thread stack using the --main-stacksize= flag.
==214801==  The main thread stack size used in this run was 8388608.
==214801== 
==214801== HEAP SUMMARY:
==214801==     in use at exit: 1,024 bytes in 1 blocks
==214801==   total heap usage: 3 allocs, 2 frees, 2,520 bytes allocated
==214801== 
==214801== LEAK SUMMARY:
==214801==    definitely lost: 0 bytes in 0 blocks
==214801==    indirectly lost: 0 bytes in 0 blocks
==214801==      possibly lost: 0 bytes in 0 blocks
==214801==    still reachable: 1,024 bytes in 1 blocks
==214801==         suppressed: 0 bytes in 0 blocks
==214801== Reachable blocks (those to which a pointer was found) are not shown.
==214801== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==214801== 
==214801== For lists of detected and suppressed errors, rerun with: -s
==214801== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

After increasing the stack size locally (on Linux), the segfault disappears.

tersec commented Feb 25, 2026

So far everything points to ORC using more stack; the question is why.

tersec commented Feb 25, 2026

@bhartnett
Contributor Author

Is this a blocker for moving to orc or do you think we could work around by just increasing stack size?

tersec commented Feb 26, 2026

Is this a blocker for moving to orc or do you think we could work around by just increasing stack size?

That it also occurs with a pure-ref/heap version of this setup escalates it for me into something closer to a blocker, because it's more difficult to avoid. It's something truly invisible in the Nim source; there's no obvious stack usage in:

let h = new array[8192, int]
let s = new seq[array[8192, int]]
add(s[], h[])

We've run into this issue before in other circumstances where we have to work around Nim materializing large objects on the stack. Working around it is possible, but it's best to avoid getting into this situation because the stack usage is unbounded. We have significantly larger objects than a 128KiB blob, and it's important that these never be materialized on a stack.

status-im-auto commented Feb 26, 2026

Jenkins Builds

Commit #️⃣ Finished (UTC) Duration Platform Result
⁉️ 77ec4b0 #2 2026-02-26 17:03:04 ~16 min unknown 📄log
⁉️ a9ccccc3 #3 2026-02-27 06:39:59 ~23 min unknown 📄log
⁉️ 9025639e #4 2026-02-28 09:09:46 ~23 min unknown 📄log
⁉️ 65a6705e #5 2026-03-03 08:57:46 ~11 min unknown 📄log
⁉️ f09bc5ef #6 2026-03-06 08:57:35 ~10 min unknown 📄log
⁉️ 2b120901 #7 2026-03-07 08:57:10 ~10 min unknown 📄log
⁉️ e7aa4b2f #8 2026-03-09 08:56:30 ~9 min unknown 📄log
⁉️ 7b91b6eb #9 2026-03-10 08:57:28 ~10 min unknown 📄log
⁉️ 96f34958 #10 2026-03-12 08:57:32 ~10 min unknown 📄log
⁉️ 4e1f5214 #11 2026-03-17 08:57:45 ~11 min unknown 📄log
⁉️ 9d418362 #12 2026-03-18 08:57:00 ~10 min unknown 📄log
⁉️ 9f33731f #13 2026-03-19 08:57:15 ~10 min unknown 📄log
⁉️ 040440a8 #14 2026-03-20 08:57:08 ~10 min unknown 📄log
⁉️ 26fd182e #15 2026-03-21 08:57:34 ~11 min unknown 📄log
⁉️ b8e59575 #16 2026-03-23 08:57:43 ~11 min unknown 📄log
⁉️ 5c1103fc #17 2026-03-25 08:56:35 ~10 min unknown 📄log
⁉️ f75d1552 #18 2026-03-26 08:56:20 ~9 min unknown 📄log
⁉️ 2f800551 #19 2026-03-27 08:57:56 ~11 min unknown 📄log
⁉️ 3252c3e0 #20 2026-03-28 08:57:32 ~11 min unknown 📄log
⁉️ 677abc61 #21 2026-03-30 08:57:47 ~11 min unknown 📄log
⁉️ 9972ea16 #22 2026-03-31 08:58:21 ~11 min unknown 📄log
⁉️ d444f833 #23 2026-04-01 08:57:55 ~11 min unknown 📄log
⁉️ 13e982f7 #24 2026-04-02 08:57:04 ~10 min unknown 📄log
⁉️ 96ae40cf #25 2026-04-03 08:57:52 ~11 min unknown 📄log
⁉️ 9e29caa8 #26 2026-04-04 08:58:17 ~11 min unknown 📄log
⁉️ aacc6129 #27 2026-04-08 08:57:51 ~11 min unknown 📄log
⁉️ 3d69990d #28 2026-04-10 08:57:35 ~10 min unknown 📄log
⁉️ c0d9ec3d #29 2026-04-11 08:58:04 ~11 min unknown 📄log
⁉️ 073bd10b #30 2026-04-12 08:57:32 ~10 min unknown 📄log

@status-im-auto
Member

✔️ nimbus-eth1/prs/linux/x86_64/hive/PR-4016#25 🔹 ~11 min 🔹 96ae40cf 🔹 📦 null package

@status-im-auto
Member

✔️ nimbus-eth1/prs/linux/x86_64/hive/PR-4016#26 🔹 ~11 min 🔹 9e29caa8 🔹 📦 null package

@bhartnett
Contributor Author

@tersec So it turns out that moving to orc will be problematic for the multithreaded use case because it doesn't support atomic reference counting. Any ref type used by multiple threads concurrently will cause crashes due to corrupted ref counts. The current parallel stateroot computation doesn't work in orc for this reason: it reads hashes and vertexes from the database on multiple threads in parallel, and the database, txFrame, and vertex types are ref types throughout the code. It does work with --mm:atomicArc, which confirms that the issue is related to the reference counts.

Getting this to work in orc is possible: I could pass around ptr to object types, but that would require updating much of the codebase and, in my opinion, leads to messy, unmaintainable code. I pass these ref types into each thread because I'm going for the shared memory model, where multiple threads read and write shared state, which is generally faster than copying data between threads. When we implement full parallel execution and batch IO, we need to be able to read state from the in-memory layers and then the database in parallel. To do this, each thread needs to access the shared database and txFrame ref types.

tersec commented Apr 19, 2026

Is it harder than refc or just neutral? https://nim-lang.org/docs/mm.html#other-mm-modes doesn't list either refc or ORC as atomic.

There is atomicArc, but I've never tried it. ARC in general isn't designed to collect cycles, which might be too big a constraint for Nimbus.

bhartnett commented Apr 20, 2026

Is it harder than refc or just neutral? https://nim-lang.org/docs/mm.html#other-mm-modes doesn't list either refc or ORC as atomic.

There is atomicArc, but I've never tried it. ARC in general isn't designed to collect cycles, which might be too big a constraint for Nimbus.

refc is better than orc for my use case because I can use/share ref types between threads. I just need to make sure the ref type in the main thread outlives the tasks where it is used by the worker threads.

When I compile with arc or atomicArc there are some warnings about cycles, so I guess we can't use arc-based memory management.

That link says 'The reference counting operations (= "RC ops") do not use atomic instructions' under the arc/orc section. The fact that atomicArc exists suggests that arc is not atomic, and this matches my testing, where I see crashes when using orc/arc.

Actually, I'm not 100% sure that refc uses atomic ref counts. It might be working because of the thread-local heaps: the ref counts are stored separately on each heap, so worker threads are unable to touch the ref counts of any other thread. Either way, refc works for me, as does atomicArc (though atomicArc will likely leak memory).
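A quick way to repeat this comparison is to build the same multithreaded test once per memory manager. The sketch below only prints the commands unless RUN=1 is set. The function name try_mm_modes and the file argument are hypothetical, and in this repository the nim invocation would presumably need to go through ./env.sh.

```shell
# Print (or, with RUN=1, execute) one build-and-run command per memory manager.
# The test file path is supplied by the caller; nothing here is repo-specific.
try_mm_modes() {
  test_file=$1
  for mm in refc orc arc atomicArc; do
    cmd="nim c -r --threads:on --mm:$mm $test_file"
    if [ "${RUN:-0}" = "1" ]; then
      echo "== $mm =="
      # Keep going after a failure so every mode gets reported.
      $cmd || echo "$mm: failed with exit status $?"
    else
      echo "$cmd"
    fi
  done
}

# Hypothetical usage:
# RUN=1 try_mm_modes path/to/threaded_test.nim
```

A crash that shows up under orc and arc but not under refc or atomicArc would be consistent with the non-atomic ref count explanation, although it would not by itself prove it.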
