Conversation
  test_rpc,
  test_snap,
  test_transaction_json,
  test_txpool,
This test is crashing with a segfault on Linux.
It's a stack-based segfault caused by the `ulimit -s` that runs as part of `make test`.
Notably, running `./env.sh nim c -r tests/all_tests` without the ulimit does not reproduce it.
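For anyone trying to reproduce this locally, a subshell keeps the lowered limit contained; the 2048 KiB value below is illustrative, not the one `make test` actually sets:

```shell
# Lower the soft stack limit inside a subshell only; the parent shell
# keeps its original limit. `ulimit -s` values are in KiB.
(
  ulimit -s 2048   # illustrative value, not what the Makefile uses
  ulimit -s        # prints the limit now in effect for this subshell
  # ./env.sh nim c -r tests/all_tests   # would now run under the low limit
)
ulimit -s          # back in the parent shell: limit unchanged
```

Lowering a soft limit never needs privileges, so this works in any shell; only raising it above the hard limit would fail.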
# and look for .su files in "./build/", "./nimcache/" or $TMPDIR that list the
# stack size of each function.
switch("passC", "-fstack-usage -Werror=stack-usage=1048576")
switch("passL", "-fstack-usage -Werror=stack-usage=1048576")
It appears that orc increases stack usage.
/home/user/development/status-im/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system.nim: In function ‘_ZN7eip759434validateBlobTransactionWrapper7594EN10pooled_txs17PooledTransactionE’:
/home/user/development/status-im/nimbus-eth1/execution_chain/core/eip7594.nim:19:15: error: stack usage might be 1205200 bytes [-Werror=stack-usage=]
19 | proc validateBlobTransactionWrapper7594*(tx: PooledTransaction):
| ^
lto1: some warnings being treated as errors
make[1]: *** [/tmp/ccZV498U.mk:227: /tmp/ccLQlohJ.ltrans75.ltrans.o] Error 1
make[1]: *** Waiting for unfinished jobs....
lto-wrapper: fatal error: make returned 2 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
Error: execution of an external program failed: 'g++ @all_tests_linkerArgs.txt'
stack trace: (most recent call last)
Yes, this seems to be behind the segfault.
There is a compile failure in nimbus-eth2: casting to RootRef is no longer allowed with orc (nim-lang/Nim#20016). @agnxsh, perhaps you could have a look at fixing this compile error on the nimbus-eth2 side? I'm not too familiar with this code.
The portal tests fail with a segfault, which appears to be caused by a stack overflow: after increasing the stack size locally (on Linux), the segfault disappears.
So far everything points to ORC using more stack; the question is why.
Is this a blocker for moving to orc, or do you think we could work around it by just increasing the stack size?
That it occurs also with a pure-Nim snippet:

```nim
let h = new array[8192, int]
let s = new seq[array[8192, int]]
add(s[], h[])
```

We've run into this issue before in other circumstances where we have to work around Nim materializing large objects on the stack. It's possible, but best avoided, to put oneself into this situation, because it's unbounded. We have significantly larger objects than a 128 KiB blob, and it's important that these never be materialized on a stack.
Jenkins Builds
✔️ nimbus-eth1/prs/linux/x86_64/hive/PR-4016#25 🔹 ~11 min 🔹 96ae40cf 🔹 📦 null package
✔️ nimbus-eth1/prs/linux/x86_64/hive/PR-4016#26 🔹 ~11 min 🔹 9e29caa8 🔹 📦 null package
@tersec So it turns out that moving to orc will be problematic for the multithreaded use case because it doesn't support atomic reference counting. Any ref types used concurrently from multiple threads will cause crashes due to the ref counts being corrupted. The current parallel state-root computation doesn't work in orc for this reason, because it reads from the database from multiple threads in parallel in order to read the hashes and vertexes, and the database, txFrame, and vertex types throughout the code are ref types. I've found that it does work when using --mm:atomicArc, which confirms the issue is related to the reference counts.

Getting this to work in orc is possible: I could pass around ptr to object types, but that would require updating much of the codebase, and it leads to messy, unmaintainable code in my opinion. The reason for passing these ref types into each thread is that I'm going for the shared-memory model, where multiple threads read and write shared state, which is generally faster than copying data between threads. When we implement full parallel execution and batch IO, we need to be able to read state from the in-memory layers and then the database in parallel. To do this, each thread needs access to the shared database and txFrame ref types.
Is it harder than refc or just neutral? https://nim-lang.org/docs/mm.html#other-mm-modes doesn't list either. There is
refc is better than orc for my use case because I can use/share ref types between threads; I just need to make sure the ref type in the main thread outlives the tasks where it is used by the worker threads. When I compile with arc or atomicArc there are some warnings about cycles, so I guess we can't use arc-based memory management.

That link says 'The reference counting operations (= "RC ops") do not use atomic instructions' under the arc/orc section. The fact that atomicArc exists suggests that arc is not atomic, and this matches my conclusions from testing, where I'm seeing crashes when using orc/arc.

Actually, I'm not 100% sure that refc does in fact use atomic ref counts; it might actually be working because of the thread-local heaps, where the ref counts are stored separately on each heap and therefore the worker threads are unable to touch the ref counts of any other threads. Either way, refc works for me, as does atomicArc (but atomicArc will likely be leaking memory).
So far the following works with orc on linux:

- make nimbus_execution_client
- make test
- make eest_blockchain_test
- make eest_engine_test