[BugFix] Fallback to FE cancel API when rollback fails for PREPARE-state lingering transactions#488
banmoy wants to merge 4 commits into StarRocks:main
@codex review
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a060c037f6
When aborting lingering transactions during Flink exactly-once sink startup,
if /api/transaction/rollback fails (e.g. returns TXN_NOT_EXISTS because the
coordinator BE lost the stream load context), fall back to calling
POST /api/{db}/{table}/_cancel which directly invokes
GlobalTransactionMgr.abortTransaction() on the FE side, bypassing the
TransactionWithoutChannelHandler's PREPARE-state redirect-to-BE logic.
This addresses the orphan transaction scenario where:
- FE's get_load_state returns PREPARE (transaction exists in FE)
- FE's rollback handler redirects to coordinator BE for PREPARE state
- BE returns TXN_NOT_EXISTS (context lost after Flink job crash)
- The cancel API works because it goes directly through FE's
CancelStreamLoadAction, not through TransactionWithoutChannelHandler
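The cancel endpoint named above, POST /api/{db}/{table}/_cancel with a label query parameter, could be assembled roughly as follows. This is a minimal sketch, not the connector's code: the class name, the method name cancelLoadUrl, and the feHost value used in the usage line are hypothetical, and URL-encoding the label is an assumption rather than something the PR states.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class CancelLoadUrlSketch {
    /**
     * Builds {feHost}/api/{db}/{table}/_cancel?label={label}.
     * Hypothetical helper for illustration; the PR's real helper lives in
     * StreamLoadConstants and its signature is not shown in the source.
     */
    static String cancelLoadUrl(String feHost, String db, String table, String label) {
        // Assumption: the label is URL-encoded before being placed in the query string.
        String encodedLabel = URLEncoder.encode(label, StandardCharsets.UTF_8);
        return feHost + "/api/" + db + "/" + table + "/_cancel?label=" + encodedLabel;
    }
}
```

For example, `CancelLoadUrlSketch.cancelLoadUrl("http://fe-host:8030", "db", "tbl", "label-1")` yields a URL ending in `/api/db/tbl/_cancel?label=label-1`; the host and port are placeholders.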
Co-authored-by: PengFei Li <lpengfei2016@gmail.com>
StarRocks REST APIs may return either 'OK' or 'Success' as the status field. The rollback path already handles both. Align cancelLoad to do the same.
Co-authored-by: PengFei Li <lpengfei2016@gmail.com>
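A check that accepts both status values might look like the sketch below. The class and method names are hypothetical, and exact, case-sensitive matching is an assumption; the source only says that both 'OK' and 'Success' are treated as success.

```java
import java.util.Set;

final class StatusCheckSketch {
    // The two status values the commit message names as success.
    private static final Set<String> SUCCESS_STATUSES = Set.of("OK", "Success");

    /** Returns true when the response's status field signals success. */
    static boolean isSuccessStatus(String status) {
        // Assumption: a missing status field counts as failure.
        return status != null && SUCCESS_STATUSES.contains(status);
    }
}
```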
Co-authored-by: PengFei Li <lpengfei2016@gmail.com>
aca916d to 8656e6b
When cancelLoad throws (e.g. client timeout), FE may have already aborted the transaction. Re-query getLoadStatus after the exception to avoid false-negative startup failures.
Co-authored-by: PengFei Li <lpengfei2016@gmail.com>
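The re-query described in this commit message can be illustrated with the sketch below. The Canceller interface and cancelAndVerify method are hypothetical stand-ins for the connector's cancelLoad and getLoadStatus calls. Treating UNKNOWN as success alongside ABORTED is an assumption, based on the PR description's remark that the existing catch block accepts either state.

```java
import java.util.function.Supplier;

final class CancelWithRecheckSketch {
    /** Hypothetical stand-in for a cancelLoad call that may throw (e.g. on a client timeout). */
    interface Canceller {
        boolean cancel() throws Exception;
    }

    /**
     * Tries the cancel call; if it throws, re-queries the transaction status,
     * because FE may already have aborted the transaction before the client saw the error.
     */
    static boolean cancelAndVerify(Canceller canceller, Supplier<String> statusQuery) {
        try {
            return canceller.cancel();
        } catch (Exception e) {
            // The request failed on the client side; check what FE actually did.
            String status = statusQuery.get();
            return "ABORTED".equals(status) || "UNKNOWN".equals(status);
        }
    }
}
```

With this shape, a timeout followed by an ABORTED status is reported as success, while a timeout followed by a status that is still PREPARE is reported as failure.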
@codex review
Codex Review: Didn't find any major issues. 👍
@cursor review
@claude review
What type of PR is this:
Which issues of this PR fixes :
Problem Summary(Required) :
Background
When a Flink exactly-once sink job restores from a checkpoint, LingeringTransactionAborter attempts to abort any lingering transactions from the previous run. It first queries the transaction status via GET /api/{db}/get_load_state, and if the status is PREPARE or PREPARED, it sends a rollback request via POST /api/transaction/rollback.
Bug
For transactions in the PREPARE state (data loading not yet completed, as opposed to PREPARED, where pre-commit has finished), the FE's TransactionWithoutChannelHandler.handleRollbackTransaction() does not handle the PREPARE case directly. Instead, it falls through to the default branch, returns null, and the request is redirected to the coordinator BE. However, after a Flink job crash, the BE's stream load context has already been cleaned up, so the BE returns TXN_NOT_EXISTS. This creates a contradiction: get_load_state says the transaction exists (PREPARE), but rollback says it does not (TXN_NOT_EXISTS).
The connector's tryAbortTransaction catch block re-queries the status, finds it is still PREPARE (neither UNKNOWN nor ABORTED), and throws a fatal exception, causing the sink to fail on startup.
Call chain:
Fix
When /api/transaction/rollback fails and the re-queried transaction status is still PREPARE or PREPARED, fall back to calling POST /api/{db}/{table}/_cancel?label={label}. This endpoint is handled by StarRocks FE's CancelStreamLoadAction, which directly calls GlobalTransactionMgr.abortTransaction() on the FE side without redirecting to BE, effectively bypassing the TransactionWithoutChannelHandler PREPARE-state routing issue.
Fixed call chain:
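A minimal Java sketch of the fallback order described in this section: rollback first, then a status re-query, then cancel. It is an illustration under stated assumptions, not the connector's implementation. The TxnClient interface and the abort method are hypothetical, and the real tryAbortTransaction logic in LingeringTransactionAborter may differ in its error handling.

```java
final class AbortWithFallbackSketch {
    /** Hypothetical view of the three REST calls involved; not the connector's StreamLoader API. */
    interface TxnClient {
        boolean rollback(String label);      // POST /api/transaction/rollback
        String getLoadStatus(String label);  // GET /api/{db}/get_load_state
        boolean cancelLoad(String label);    // POST /api/{db}/{table}/_cancel?label={label}
    }

    static void abort(TxnClient client, String label) {
        try {
            if (client.rollback(label)) {
                return; // Rollback succeeded, nothing more to do.
            }
        } catch (RuntimeException e) {
            // Rollback failed (e.g. BE answered TXN_NOT_EXISTS); fall through to the re-query.
        }
        String status = client.getLoadStatus(label);
        if ("UNKNOWN".equals(status) || "ABORTED".equals(status)) {
            return; // The transaction is already gone or aborted.
        }
        if (("PREPARE".equals(status) || "PREPARED".equals(status)) && client.cancelLoad(label)) {
            return; // The FE-side cancel aborted the transaction.
        }
        throw new IllegalStateException(
                "Failed to abort transaction " + label + ", status: " + status);
    }
}
```

The cancel call is attempted only when the re-queried status is still PREPARE or PREPARED, so transactions that rollback already handled never reach the fallback.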
Changes
- StreamLoader: Added cancelLoad(db, table, label) default method.
- DefaultStreamLoader: Implemented cancelLoad, which sends POST /api/{db}/{table}/_cancel?label={label} to FE with proper auth and redirect-following.
- StreamLoadConstants: Added getCancelLoadUrl helper.
- LingeringTransactionAborter: In the tryAbortTransaction catch block, when rollback fails and the status is still PREPARE/PREPARED, tries cancelLoad as a fallback before throwing the fatal exception.
Checklist:
Note
Medium Risk
Changes exactly-once sink recovery behavior by adding a new FE-side cancel fallback when rollback fails for PREPARE/PREPARED transactions, which could affect startup/recovery outcomes if the cancel endpoint or response parsing behaves differently across clusters.
Overview
Fixes lingering-transaction cleanup during Flink exactly-once sink recovery by falling back to the FE /_cancel endpoint when rollback fails and the transaction remains in PREPARE/PREPARED.
Adds StreamLoader.cancelLoad(db, table, label) (implemented in DefaultStreamLoader with a POST to /_cancel?label=... plus JSON status parsing) and extends unit tests to cover cancel success, cancel failure/exception, and mixed-label scenarios, including verifying that a re-check can treat an aborted txn as success even if cancel threw.
Written by Cursor Bugbot for commit 6bf2b5d.