[v25.3.x] c/tx_gateway_frontend: hold tm_stm gate in with() and with_free() #30084

Open
vbotbuildovich wants to merge 1 commit into redpanda-data:v25.3.x from vbotbuildovich:backport-pr-30081-v25.3.x-323

@vbotbuildovich
Collaborator

Backport of PR #30081

  Fix for this segfault:

  ```
  Backtrace:
[Backtrace #0]
seastar::guarded_backtrace(void**, int) at ././external/+non_module_dependencies+seastar/src/util/backtrace.cc:102
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&, bool) at ./external/+non_module_dependencies+seastar/include/seastar/util/backtrace.hh:89
seastar::backtrace_buffer::append_backtrace() at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:801
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:839
seastar::print_with_backtrace(char const*, bool, bool) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:859
seastar::sigsegv_action(siginfo_t*, ucontext_t*) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:4260
 (inlined by) seastar::install_oneshot_signal_handler<11, (void (*)(siginfo_t*, ucontext_t*))(&seastar::sigsegv_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:4194
 (inlined by) seastar::install_oneshot_signal_handler<11, (void (*)(siginfo_t*, ucontext_t*))(&seastar::sigsegv_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ././external/+non_module_dependencies+seastar/src/core/reactor.cc:4189
addr2line: '/opt/redpanda/lib/libc.so.6': No such file
/opt/redpanda/lib/libc.so.6 0x4251f
std::__1::vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard, std::__1::allocator<ankerl::unordered_dense::v4_4_0::bucket_type::standard> >::operator[][abi:ne200100](unsigned long) at ./external/toolchains_llvm++llvm+current_llvm_toolchain/bin/../../toolchains_llvm++llvm+current_llvm_toolchain_llvm/bin/../include/c++/v1/__vector/vector.h:404
 (inlined by) chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>::operator[](unsigned long) at ./bazel-out/k8-opt/bin/src/v/container/_virtual_includes/chunked_vector/container/chunked_vector.h:241
 (inlined by) ankerl::unordered_dense::v4_4_0::detail::table<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper, ankerl::unordered_dense::v4_4_0::hash<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, void>, std::__1::equal_to<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >, chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >, ankerl::unordered_dense::v4_4_0::bucket_type::standard, chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>, true>::at(chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>&, unsigned long) at ./bazel-out/k8-opt/bin/external/+non_module_dependencies+unordered_dense/_virtual_includes/unordered_dense/ankerl/unordered_dense.h:873
 (inlined by) chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >::iter<false> ankerl::unordered_dense::v4_4_0::detail::table<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper, ankerl::unordered_dense::v4_4_0::hash<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, void>, std::__1::equal_to<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >, chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >, ankerl::unordered_dense::v4_4_0::bucket_type::standard, chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>, true>::do_find<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >(detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&) at ./bazel-out/k8-opt/bin/external/+non_module_dependencies+unordered_dense/_virtual_includes/unordered_dense/ankerl/unordered_dense.h:1161
ankerl::unordered_dense::v4_4_0::detail::table<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper, ankerl::unordered_dense::v4_4_0::hash<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, void>, std::__1::equal_to<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > >, chunked_vector<std::__1::pair<detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> >, cluster::tm_stm::tx_wrapper> >, ankerl::unordered_dense::v4_4_0::bucket_type::standard, chunked_vector<ankerl::unordered_dense::v4_4_0::bucket_type::standard>, true>::find(detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&) at ./bazel-out/k8-opt/bin/external/+non_module_dependencies+unordered_dense/_virtual_includes/unordered_dense/ankerl/unordered_dense.h:1804
 (inlined by) cluster::tm_stm::try_rm_lock(detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&) at ./bazel-out/k8-opt/bin/src/v/cluster/_virtual_includes/cluster/cluster/tm_stm.h:233
seastar::continuation<seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<cluster::txlock_unit>(cluster::txlock_unit)::{lambda()#1}, false>, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::then_wrapped_nrvo<seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<cluster::txlock_unit>(cluster::txlock_unit)::{lambda()#1}, false> >(seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<cluster::txlock_unit>(cluster::txlock_unit)::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >&&, seastar::future<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::finally_body<cluster::with<cluster::tx_gateway_frontend::process_locally(seastar::shared_ptr<cluster::tm_stm>, cluster::try_abort_request)::$_1::operator()()::{lambda()#1}>(seastar::shared_ptr<cluster::tm_stm>, detail::base_named_type<seastar::basic_sstring<char, unsigned int, 15u, true>, kafka::kafka_transactional_id, std::__1::integral_constant<bool, false> > const&, std::__1::basic_string_view<char, std::__1::char_traits<char> >, auto:1&&)::{lambda(auto:1)#1}::operator()<cluster::txlock_unit>(auto:1)::{lambda()#1}, false>&&, seastar::future_state<boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >&&)#1}, boost::outcome_v2::basic_result<cluster::tx_metadata, cluster::tx::errc, boost::outcome_v2::policy::throw_bad_result_access<cluster::tx::errc, void> > >::run_and_dispose() at ./bazel-out/k8-opt/bin/src/v/cluster/_virtual_includes/cluster/cluster/tm_stm.h:447
  ```

  The root cause appears to be that the lock units access the stm state
  (via a raw pointer) _after_ the stm has been destroyed.

  This primarily happens via with() and with_free(). The scenario is
  that the stm is shut down and the partition is stopped, racing with an
  in-flight operation, before the units are returned.

  There are multiple solutions to this, but holding the gate in
  with()/with_free(), and thereby preventing the stm shutdown until the
  units are returned, seems the easiest to reason about.

  The code here is very old, highly convoluted, and hard to reason about,
  and any deeper change carries a risk of introducing deadlocks, so the
  surface area of this change is intentionally kept small.
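
For readers unfamiliar with the pattern, the idea of holding a gate for the lifetime of the lock units can be sketched in plain C++ without Seastar. This is only an illustrative model, not the code in this PR: `toy_gate`, `toy_stm`, `lock_unit`, `in_flight`, and `shutdown_waits_for_units` are hypothetical names, and the asynchronous stop is approximated by a callback that fires once the gate has drained.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a gate (not Seastar's API): it counts holders and
// reports "drained" only once it is closed AND every holder is released.
class toy_gate {
public:
    class holder {
    public:
        explicit holder(toy_gate& g) : g_(&g) { ++g_->count_; }
        holder(holder&& o) noexcept : g_(std::exchange(o.g_, nullptr)) {}
        holder(const holder&) = delete;
        holder& operator=(const holder&) = delete;
        ~holder() { if (g_ != nullptr) { g_->leave(); } }
    private:
        toy_gate* g_;
    };

    // Entering a closed gate fails, so no new work starts during shutdown.
    holder hold() {
        if (closed_) { throw std::runtime_error("gate closed"); }
        return holder(*this);
    }

    // Close the gate; on_drained fires once the holder count reaches zero.
    void close(std::function<void()> on_drained) {
        closed_ = true;
        on_drained_ = std::move(on_drained);
        maybe_drain();
    }

private:
    void leave() { --count_; maybe_drain(); }
    void maybe_drain() {
        if (closed_ && count_ == 0 && on_drained_) {
            auto cb = std::exchange(on_drained_, nullptr);
            cb();
        }
    }
    int count_ = 0;
    bool closed_ = false;
    std::function<void()> on_drained_;
};

// Toy state machine: per-transaction locks plus the gate guarding shutdown.
struct toy_stm {
    std::set<std::string> locked;
    toy_gate gate;
};

// Lock unit holding a raw pointer to the stm; releasing it touches stm state.
class lock_unit {
public:
    lock_unit(toy_stm& stm, std::string id) : stm_(&stm), id_(std::move(id)) {
        stm_->locked.insert(id_);
    }
    lock_unit(const lock_unit&) = delete;
    lock_unit& operator=(const lock_unit&) = delete;
    ~lock_unit() { stm_->locked.erase(id_); }
private:
    toy_stm* stm_;
    std::string id_;
};

// Models one with() call in flight: the gate holder is declared first, so it
// is destroyed last, after the lock unit has been returned to the stm.
struct in_flight {
    in_flight(toy_stm& stm, const std::string& tx_id)
      : gate_hold(stm.gate.hold()), unit(stm, tx_id) {}
    toy_gate::holder gate_hold;
    lock_unit unit;
};

// Returns true when shutdown waits for the in-flight operation to finish.
bool shutdown_waits_for_units() {
    auto stm = std::make_unique<toy_stm>();
    auto op = std::make_unique<in_flight>(*stm, "tx-1");

    bool stop_done = false;
    stm->gate.close([&stop_done] { stop_done = true; });
    const bool blocked_while_held = !stop_done;

    op.reset();  // unit released against a live stm, then the gate drains
    const bool drained_after_release = stop_done && stm->locked.empty();
    stm.reset(); // destroying the stm is safe only now
    return blocked_while_held && drained_after_release;
}
```

In this model, calling `shutdown_waits_for_units()` exercises the race from the description: the stop is requested while an operation still holds a lock unit, and the stop callback fires only after the unit has been released. Without the `gate_hold` member, nothing would stop the owner from destroying `toy_stm` first, and `~lock_unit` would then dereference a dangling pointer.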

(cherry picked from commit a43cf17)
vbotbuildovich added this to the v25.3.x-next milestone on Apr 6, 2026
vbotbuildovich added the kind/backport (PRs targeting a stable branch) label on Apr 6, 2026
vbotbuildovich requested a review from bharathv on April 6, 2026 21:43
bharathv enabled auto-merge on April 6, 2026 21:47
@vbotbuildovich
Collaborator Author

CI test results

test results on build#82799
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | TestReadReplicaService | test_writes_forbidden | {"cloud_storage_type": 2, "partition_count": 10} | integration | https://buildkite.com/redpanda/redpanda/builds/82799#019d64d1-bca3-4366-b9a6-3d4690bcd76d | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TestReadReplicaService&test_method=test_writes_forbidden |

Labels

area/redpanda, kind/backport (PRs targeting a stable branch)
