[BugFix] Fallback to FE cancel API when rollback fails for PREPARE-state lingering transactions #488

Open
banmoy wants to merge 4 commits into StarRocks:main from banmoy:cursor/starrocks-prepare-c1fb

@banmoy banmoy commented Mar 10, 2026

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Which issues does this PR fix:

Problem Summary (Required):

Background

When a Flink exactly-once sink job restores from a checkpoint, LingeringTransactionAborter attempts to abort any lingering transactions from the previous run. It first queries the transaction status via GET /api/{db}/get_load_state, and if the status is PREPARE or PREPARED, it sends a rollback request via POST /api/transaction/rollback.

Bug

For transactions in the PREPARE state (data loading not yet completed, as opposed to PREPARED where pre-commit has finished), the FE's TransactionWithoutChannelHandler.handleRollbackTransaction() does not handle the PREPARE case directly. Instead, it falls through to the default branch, returns null, and the request is redirected to the coordinator BE. However, after a Flink job crash, the BE's stream load context has already been cleaned up, so the BE returns TXN_NOT_EXISTS. This creates a contradiction: get_load_state says the transaction exists (PREPARE), but rollback says it does not (TXN_NOT_EXISTS).

The connector's tryAbortTransaction catch block re-queries the status, finds it is still PREPARE (neither UNKNOWN nor ABORTED), and throws a fatal exception, causing the sink to fail on startup.

Call chain:

LingeringTransactionAborter.tryAbortTransaction()
  → streamLoader.getLoadStatus()         → GET /api/{db}/get_load_state  → FE returns PREPARE ✅
  → streamLoader.rollback()              → POST /api/transaction/rollback
    → FE TransactionWithoutChannelHandler: PREPARE hits default → redirect to coordinator BE
    → BE: stream load context lost → TXN_NOT_EXISTS ❌
  → re-query status: still PREPARE
  → throw "Fail to abort lingering transaction, status: PREPARE"

Fix

When /api/transaction/rollback fails and the re-queried transaction status is still PREPARE or PREPARED, fall back to calling POST /api/{db}/{table}/_cancel?label={label}. This endpoint is handled by StarRocks FE's CancelStreamLoadAction, which directly calls GlobalTransactionMgr.abortTransaction() on the FE side without redirecting to BE, effectively bypassing the TransactionWithoutChannelHandler PREPARE-state routing issue.

Fixed call chain:

LingeringTransactionAborter.tryAbortTransaction()
  → streamLoader.rollback()              → POST /api/transaction/rollback → fails
  → re-query status: still PREPARE
  → streamLoader.cancelLoad()            → POST /api/{db}/{table}/_cancel?label={label}
    → FE CancelStreamLoadAction: GlobalTransactionMgr.abortTransaction() ✅
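
The fixed call chain can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the `Loader` interface is a hypothetical stand-in for `StreamLoader`, failures are assumed to surface as unchecked exceptions, and the status values are treated as plain strings. The re-check after a failed cancel mirrors the behavior described later in this PR.

```java
class AbortFallbackSketch {

    /** Hypothetical, trimmed-down stand-in for the connector's StreamLoader. */
    interface Loader {
        String getLoadStatus(String db, String table, String label);
        void rollback(String db, String table, String label);
        void cancelLoad(String db, String table, String label);
    }

    static boolean isStillOpen(String status) {
        return "PREPARE".equals(status) || "PREPARED".equals(status);
    }

    static boolean isAlreadyGone(String status) {
        return "UNKNOWN".equals(status) || "ABORTED".equals(status);
    }

    static void tryAbort(Loader loader, String db, String table, String label) {
        try {
            loader.rollback(db, table, label);
            return; // rollback worked, nothing else to do
        } catch (RuntimeException rollbackFailure) {
            // e.g. TXN_NOT_EXISTS from the coordinator BE; decide below what to do
        }
        String status = loader.getLoadStatus(db, table, label);
        if (isAlreadyGone(status)) {
            return; // the transaction no longer needs aborting
        }
        if (isStillOpen(status)) {
            try {
                loader.cancelLoad(db, table, label); // FE-side abort, no BE redirect
                return;
            } catch (RuntimeException cancelFailure) {
                // The FE may have aborted the transaction before the client saw an error.
                status = loader.getLoadStatus(db, table, label);
                if (isAlreadyGone(status)) {
                    return;
                }
            }
        }
        throw new IllegalStateException(
                "Fail to abort lingering transaction, status: " + status);
    }
}
```

The cancel call is attempted only after rollback has failed and the status is confirmed to be PREPARE or PREPARED, so the existing rollback path stays the primary mechanism.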

Changes

  • StreamLoader: Added cancelLoad(db, table, label) default method.
  • DefaultStreamLoader: Implemented cancelLoad — sends POST /api/{db}/{table}/_cancel?label={label} to FE with proper auth and redirect-following.
  • StreamLoadConstants: Added getCancelLoadUrl helper.
  • LingeringTransactionAborter: In tryAbortTransaction catch block, when rollback fails and status is still PREPARE/PREPARED, tries cancelLoad as fallback before throwing the fatal exception.
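
As an illustration of the URL helper listed above, a sketch of how the cancel URL might be assembled is shown here. The parameter list, the class name `CancelUrlSketch`, and the decision to URL-encode the label are assumptions made for this example; the actual `getCancelLoadUrl` in `StreamLoadConstants` may differ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CancelUrlSketch {

    // Hypothetical stand-in for StreamLoadConstants.getCancelLoadUrl.
    // Builds {host}/api/{db}/{table}/_cancel?label={label}.
    static String getCancelLoadUrl(String host, String db, String table, String label) {
        // Drop a trailing slash so the path does not contain "//".
        String base = host.endsWith("/") ? host.substring(0, host.length() - 1) : host;
        // Encode the label because it travels as a query parameter (an assumption here).
        String encodedLabel = URLEncoder.encode(label, StandardCharsets.UTF_8);
        return base + "/api/" + db + "/" + table + "/_cancel?label=" + encodedLabel;
    }
}
```

For example, under these assumptions `getCancelLoadUrl("http://fe:8030", "db", "tbl", "job-1")` yields `http://fe:8030/api/db/tbl/_cancel?label=job-1`.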

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR will affect users' behaviors
  • This PR needs user documentation (for new or modified features or behaviors)
  • I have added documentation for my new feature or new function

Note

Medium Risk
Changes exactly-once sink recovery behavior by adding a new FE-side cancel fallback when rollback fails for PREPARE/PREPARED transactions, which could affect startup/recovery outcomes if the cancel endpoint or response parsing behaves differently across clusters.

Overview
Fixes lingering-transaction cleanup during Flink exactly-once sink recovery by falling back to the FE /_cancel endpoint when rollback fails and the transaction remains in PREPARE/PREPARED.

Adds StreamLoader.cancelLoad(db, table, label) (implemented in DefaultStreamLoader with a POST to /_cancel?label=... plus JSON status parsing) and extends unit tests to cover cancel success, cancel failure/exception, and mixed-label scenarios, including verifying a re-check can treat an aborted txn as success even if cancel threw.

Written by Cursor Bugbot for commit 6bf2b5d. This will update automatically on new commits.

CLAassistant commented Mar 10, 2026

CLA assistant check
All committers have signed the CLA.

@CelerData-Reviewer

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a060c037f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@CelerData-Reviewer

@codex review

1 similar comment
@CelerData-Reviewer

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6510132f95

Comment thread .pr_body.md Outdated
banmoy added 3 commits March 10, 2026 10:50
When aborting lingering transactions during Flink exactly-once sink startup,
if /api/transaction/rollback fails (e.g. returns TXN_NOT_EXISTS because the
coordinator BE lost the stream load context), fall back to calling
POST /api/{db}/{table}/_cancel which directly invokes
GlobalTransactionMgr.abortTransaction() on the FE side, bypassing the
TransactionWithoutChannelHandler's PREPARE-state redirect-to-BE logic.

This addresses the orphan transaction scenario where:
- FE's get_load_state returns PREPARE (transaction exists in FE)
- FE's rollback handler redirects to coordinator BE for PREPARE state
- BE returns TXN_NOT_EXISTS (context lost after Flink job crash)
- The cancel API works because it goes directly through FE's
  CancelStreamLoadAction, not through TransactionWithoutChannelHandler

Co-authored-by: PengFei Li <lpengfei2016@gmail.com>

StarRocks REST APIs may return either 'OK' or 'Success' as the status
field. The rollback path already handles both. Align cancelLoad to do
the same.

Co-authored-by: PengFei Li <lpengfei2016@gmail.com>
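
The status alignment described in this commit could look something like the sketch below. It assumes the status field has already been extracted from the JSON response as a string; the class and method names are hypothetical, and exact-match comparison is a choice made for the example rather than something the PR states.

```java
class StatusSketch {

    // Hypothetical helper: treats both 'OK' and 'Success' as a successful response,
    // matching the two values the commit message says StarRocks REST APIs may return.
    static boolean isSuccess(String status) {
        return "OK".equals(status) || "Success".equals(status);
    }
}
```

Any other value, including a missing status, is treated as a failure so that the caller falls through to its re-query logic.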

Co-authored-by: PengFei Li <lpengfei2016@gmail.com>

@cursor cursor Bot force-pushed the cursor/starrocks-prepare-c1fb branch from aca916d to 8656e6b Compare March 10, 2026 10:50
@CelerData-Reviewer

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8656e6b33b

When cancelLoad throws (e.g. client timeout), FE may have already
aborted the transaction. Re-query getLoadStatus after the exception
to avoid false-negative startup failures.

Co-authored-by: PengFei Li <lpengfei2016@gmail.com>
@CelerData-Reviewer

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 👍

banmoy commented Mar 10, 2026

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

banmoy commented Mar 10, 2026

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

banmoy commented Mar 12, 2026

@claude review
