[v25.3.x] c/tx_gateway_frontend: hold tm_stm gate in with() and with_free() #30084
Open
vbotbuildovich wants to merge 1 commit into redpanda-data:v25.3.x from
Fix for this segfault:
```
Backtrace:
[Backtrace #0]
seastar::guarded_backtrace(void**, int) at ././external/+non_module_dependencies+seastar/src/util/backtrace.cc:102
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)redpanda-data#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)redpanda-data#1}&&, bool) at ./external/+non_module_dependencies+seastar/include/seastar/util/backtrace.hh:89
seastar::backtrace_buffer::append_backtrace() at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:801
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:839
seastar::print_with_backtrace(char const*, bool, bool) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:859
seastar::sigsegv_action(siginfo_t*, ucontext_t*) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:4260
(inlined by) seastar::install_oneshot_signal_handler<11, (void (*)(siginfo_t*, ucontext_t*))(&seastar::sigsegv_action)>()::{lambda(int, siginfo_t*, void*)redpanda-data#1}::operator()(int, siginfo_t*, void*) const at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:4194
(inlined by) seastar::install_oneshot_signal_handler<11, (void (*)(siginfo_t*, ucontext_t*))(&seastar::sigsegv_action)>()::{lambda(int, siginfo_t*, void*)redpanda-data#1}::__invoke(int, siginfo_t*, void*) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:4189
addr2line: '/opt/redpanda/lib/libc.so.6': No such file
/opt/redpanda/lib/libc.so.6 0x4251f
std::__1::vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard, std::__1::allocator<ankerl::unordered_dense::v4_4_0::bucket_type::standard> >::operator[][abi:ne200100](unsigned long) at ./external/toolchains_llvm++llvm+current_llvm_toolchain/bin/../../toolchains_llvm++llvm+current_llvm_toolchain_llvm/bin/../include/c++/v1/__vector/vector.h:404
(inlined by) chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>::operator[](unsigned long) at ./bazel-out/k8-opt/bin/src/v/container/_virtual_includes/chunked_vector/container/chunked_vector.h:241
(inlined by) ankerl::unordered_dense::v4_4_0::detail::table<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper, ankerl::unordered_dense::v4_4_0::hash<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, void>, std::__1::equal_to<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >, chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >, ankerl::unordered_dense::v4_4_0::bucket_type::standard, chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>, true>::at(chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>&, unsigned long) at ./bazel-out/k8-opt/bin/external/+non_module_dependencies+unordered_dense/_virtual_includes/unordered_dense/ankerl/unordered_dense.h:873
(inlined by) chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >::iter<false> ankerl::unordered_dense::v4_4_0::detail::table<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper, ankerl::unordered_dense::v4_4_0::hash<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, void>, std::__1::equal_to<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >, chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >, ankerl::unordered_dense::v4_4_0::bucket_type::standard, chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>, true>::do_find<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >(detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&) at ./bazel-out/k8-opt/bin/external/+non_module_dependencies+unordered_dense/_virtual_includes/unordered_dense/ankerl/unordered_dense.h:1161
ankerl::unordered_dense::v4_4_0::detail::table<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper, ankerl::unordered_dense::v4_4_0::hash<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, void>, std::__1::equal_to<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >, chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >, ankerl::unordered_dense::v4_4_0::bucket_type::standard, chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>, true>::find(detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&) at ./bazel-out/k8-opt/bin/external/+non_module_dependencies+unordered_dense/_virtual_includes/unordered_dense/ankerl/unordered_dense.h:1804
(inlined by) cluster::tm_stm::try_rm_lock(detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&) at ./bazel-out/k8-opt/bin/src/v/cluster/_virtual_includes/cluster/cluster/tm_stm.h:233
seastar::continuation<seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}&&)::{lambda(auto:1)redpanda-data#1}::operator()<cluster::txlock_unit>(cluster::txlock_unit)::{lambda()redpanda-data#1}, false>, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::then_wrapped_nrvo<seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, 
std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}&&)::{lambda(auto:1)redpanda-data#1}::operator()<cluster::txlock_unit>(cluster::txlock_unit)::{lambda()redpanda-data#1}, false> >(seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}&&)::{lambda(auto:1)redpanda-data#1}::operator()<cluster::txlock_unit>(cluster::txlock_unit)::{lambda()redpanda-data#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >&&, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()redpanda-data#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, 
kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, auto:1&&)::{lambda(auto:1)redpanda-data#1}::operator()<cluster::txlock_unit>(auto:1)::{lambda()redpanda-data#1}, false>&&, seastar::future_state<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >&&)redpanda-data#1}, boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::run_and_dispose() at ./bazel-out/k8-opt/bin/src/v/cluster/_virtual_includes/cluster/cluster/tm_stm.h:447
```
The root cause appears to be that the lock units access the stm state
(via a raw pointer) _after_ the stm has been destroyed.
This primarily happens via with() and with_free(). The scenario is that
the stm is shut down and the partition is stopped, racing with the
return of the units.
There are multiple solutions to this, but holding the gate in
with()/with_free(), and thereby preventing the stm shutdown, seems the
easiest to reason about.
The code here is very old, very convoluted and hard to reason about, and
any deeper change carries a risk of introducing deadlocks, so the
surface area of this change is intentionally kept small.
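To illustrate the idea (not the actual patch), here is a minimal,
synchronous sketch in plain C++. All names (toy_gate, toy_stm,
with_gate) are hypothetical stand-ins and not Redpanda or Seastar APIs;
the real code is asynchronous and uses the stm's own gate. The point it
shows is ordering: the gate holder is taken before the lock and released
after the unlock, so shutdown cannot complete while the unlock path
still touches the stm.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Toy stand-in for a gate (assumption, not the Seastar type): counts
// in-flight operations and refuses new ones once closing has begun.
class toy_gate {
public:
    class holder {
    public:
        explicit holder(toy_gate& g) : _g(&g) { ++_g->_count; }
        holder(holder&& o) noexcept : _g(std::exchange(o._g, nullptr)) {}
        holder(const holder&) = delete;
        holder& operator=(const holder&) = delete;
        holder& operator=(holder&&) = delete;
        ~holder() {
            if (_g) {
                --_g->_count;
            }
        }

    private:
        toy_gate* _g;
    };

    // Enter the gate, or throw if shutdown has already been requested.
    holder hold() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        return holder(*this);
    }
    void request_close() { _closed = true; }
    // The owner may only be destroyed once closed and fully drained.
    bool can_destroy() const { return _closed && _count == 0; }

private:
    int _count = 0;
    bool _closed = false;
};

// Hypothetical stm: the lock bookkeeping lives inside the object, so
// unlocking after destruction would be a use-after-free.
struct toy_stm {
    toy_gate gate;
    int locks_held = 0;
    void lock() { ++locks_held; }
    void unlock() { --locks_held; }
};

// Sketch of with(): hold the gate for the whole locked section.
template <typename Func>
auto with_gate(toy_stm& stm, Func&& f) {
    auto h = stm.gate.hold(); // declared first, so destroyed last
    stm.lock();
    struct unlock_guard {
        toy_stm& s;
        ~unlock_guard() { s.unlock(); }
    } g{stm}; // destroyed before h: unlock runs while the gate is held
    return std::forward<Func>(f)();
}
```

In this sketch, a shutdown requested while the callback runs leaves
can_destroy() false until with_gate() returns, and any later with_gate()
call throws instead of touching the stm.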
(cherry picked from commit a43cf17)
bharathv approved these changes on Apr 6, 2026
CI test results: test results on build #82799
Backport of PR #30081