fix: Fix dangling lock from storage based lock by linliu-code · Pull Request #18439 · apache/hudi

linliu-code · 2026-04-01T17:17:30Z

Describe the issue this Pull Request addresses

When StorageBasedLockProvider attempts to expire/release a lock and the underlying storage write returns HTTP 429 (rate-limited), S3StorageLockClient, GCSStorageLockClient, and AzureStorageLockClient were logging the warning but missing the return statement, causing them to fall through to the UNKNOWN_ERROR path. The provider had no way to distinguish a throttle from a truly unknown state, so it would report the lock release as failed without retrying — potentially leaving the lock in a dangling (non-expired) state and blocking other writers.

Summary and Changelog

The bug fix.

Add THROTTLED(3) to LockUpsertResult to distinguish rate-limit responses from other transient errors.
Fix the missing return in S3StorageLockClient, GCSStorageLockClient, and AzureStorageLockClient so HTTP 429 returns THROTTLED instead of falling through to UNKNOWN_ERROR.

Lock-release retry, restructured.

Introduce enum ExpireLockResult { SUCCESS, THROTTLED, FAILED }. tryExpireCurrentLock is now a @VisibleForTesting single-attempt primitive that returns the enum; callers own the retry policy.
Restructure StorageBasedLockProvider.unlock() so the 1-second backoff retry sleeps outside the provider monitor (releases it, sleeps, re-acquires, re-checks believesLockMightBeHeld() in case the shutdown hook expired the lock during the wait). This avoids blocking other threads on the provider monitor for a full second during a transient throttle. The shutdown-hook path still holds the monitor by design.
- Addresses the "synchronized sleeping" concern raised in the prior code review.

Other lock paths that observed THROTTLED.

renewLock(): treat THROTTLED as transient — log a warning and return true so the heartbeat retries on its next cycle.
tryLock() failure metrics: replace the single updateLockAcquirePreconditionFailureMetric call with a switch over LockUpsertResult so ACQUIRED_BY_OTHERS, THROTTLED|UNKNOWN_ERROR, and other states each map to their correct metric instead of all sharing the precondition-failure counter.

Tests.

New: testRenewLockThrottledReturnsTrue, Azure 429 → THROTTLED assertion.
Updated: testUnlockSucceedsAfterThrottledRetry, testUnlockThrowsExceptionWhenStillThrottledAfterRetry for the new monitor-release behavior.
All targeted lock tests pass locally: TestStorageBasedLockProvider (45), TestS3StorageLockClient (18), TestGCSStorageLockClient (16), TestAzureStorageLockClient (27) — 106 tests, 0 failures.

Impact

Reduces the likelihood of dangling locks in S3-, GCS-, and Azure-backed StorageBasedLockProvider deployments under high write-rate conditions (e.g. GCS's 1-write/sec per object limit). The 1s retry no longer blocks the provider monitor, so concurrent operations on the same provider instance can proceed during the wait. No behavior change for non-429 error codes.

Risk Level

Medium. Touches the lock-release fast path and changes its concurrency semantics (sleep moved outside the monitor). Verified by the unit-test set above.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice work — the core bug fix (missing return statements for 429 in both S3 and GCS clients) is clearly correct and the new THROTTLED enum value is a clean way to surface this. The single-retry approach in expireLock is pragmatic. Just a couple of minor points in the inline comments around log accuracy and synchronized sleeping.

yihua · 2026-04-04T14:02:16Z

    result = this.storageLockClient.tryUpsertLockFile(expiredLockData, Option.of(this.getLock()));
+    if (result.getLeft() == LockUpsertResult.THROTTLED && !fromShutdownHook) {
+      logger.warn("Owner {}: Lock expiration write was throttled, retrying after 1 second.", ownerId);
+      try {


🤖 This method is synchronized, so the 1-second sleep will block any concurrent tryLock(), renewLock(), or getLock() call on the same provider instance. In practice, is the heartbeat always stopped before we reach this point? If so, it's probably fine — just want to confirm there's no window where the heartbeat thread could be waiting on this monitor while we sleep.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR fixes the dangling-lock bug by adding a THROTTLED enum value, restoring the missing return statements in S3/GCS/Azure clients, and restructuring unlock() so the 1-second retry sleeps outside the provider monitor. The round 1 concerns (log-message accuracy and synchronized sleeping) appear to be addressed. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming suggestion below; the overall change is clean and the threading design in the refactored unlock() is well-commented.

cc @yihua

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The fix correctly addresses the dangling-lock scenario by adding a THROTTLED result code, returning it from each cloud client's 429 path, and restructuring unlock() to retry once outside the provider monitor with a reference-equality bail-out against concurrent tryLock(). No critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of comment phrasing nits in the concurrent locking logic; code structure is otherwise clean.

cc @yihua

hudi-agent · 2026-05-04T21:23:15Z

-      logger.debug("Owner {}: Gracefully shutting down heartbeat.", ownerId);
-      believesNoLongerHoldsLock &= heartbeatManager.stopHeartbeat(true);
+    // If throttled, retry once after sleeping outside the monitor to avoid blocking other threads.
+    // Note: when unlock() is called via close() -> shutdown(), the caller's synchronized methods


🤖 nit: "the sleep will still block in that path" is a bit ambiguous — it reads as if the sleep is itself blocking, but the actual concern is that the outer synchronized close()/shutdown() caller still holds the monitor throughout the sleep. Could you rephrase to something like "the outer synchronized caller still holds the monitor through the sleep, so other threads remain blocked in that scenario"?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

When StorageBasedLockProvider attempts to expire/release a lock and the underlying storage write returns HTTP 429 (rate-limited), the S3, GCS, and Azure storage lock clients were logging the warning but missing the return statement, causing them to fall through to the UNKNOWN_ERROR path. The provider had no way to distinguish a throttle from a truly unknown state, so it would report the lock release as failed without retrying — potentially leaving the lock in a dangling (non-expired) state and blocking other writers. Changes: - Add THROTTLED(3) to the LockUpsertResult enum to distinguish rate-limit responses from other transient errors. - Fix the missing return in S3StorageLockClient, GCSStorageLockClient, and AzureStorageLockClient so HTTP 429 responses now correctly return THROTTLED instead of falling through to UNKNOWN_ERROR. - Introduce ExpireLockResult { SUCCESS, THROTTLED, FAILED } to make tryExpireCurrentLock a single-attempt primitive; callers own the retry policy. - Restructure StorageBasedLockProvider.unlock() so the 1-second backoff retry sleeps OUTSIDE the provider monitor (releases the lock, sleeps, re-acquires and re-checks believesLockMightBeHeld). Avoids blocking other threads during the retry. The shutdown-hook path still holds the monitor by design. - Refine tryLock() failure metrics: switch over LockUpsertResult so ACQUIRED_BY_OTHERS, THROTTLED/UNKNOWN_ERROR, and other states are accounted in their own metric instead of all sharing the precondition failure counter. - Handle THROTTLED in renewLock(): treat as transient, log a warning and return true so the heartbeat retries on its next cycle. - Add unit tests: throttled-then-success unlock, throttled-then-still- throttled unlock, renewLock under THROTTLED, and Azure 429 → THROTTLED.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the careful follow-up here! The fix correctly adds the missing return for HTTP 429 paths and restructures unlock() so the retry sleep happens outside the provider monitor, with the lock-identity check guarding against a concurrent tryLock() replacing the lock during the sleep window. I traced the relevant race scenarios (concurrent unlock succeeding, natural expiration plus new acquirer, heartbeat-already-stopped invariant) and the logic holds up. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One minor simplification suggestion below.

cc @yihua

voonhous

The fix LGTM, however in addressing HTTP status code 429.

However, i was thinking of other HTTP status/error codes that other providers might return. (Which may not be in the scope in this PR, we can track it separately and fix it in another PR).

S3 returns 503

AWS S3 documents 503 Slow Down as the rate-limit response for object operations S3 rarely returns 429 for puts, so on the S3 path, this PR fixes a case that almost never fires while leaving the actual S3 throttle path (503 SlowDown) falling into UNKNOWN_ERROR. (CMIIW)

Source:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html#:~:text=you%20may%20see%20some%20503%20(Slow%20Down)%20errors

Azure does something similar: https://learn.microsoft.com/en-us/rest/api/storageservices/common-rest-api-error-codes))

400

GCS retry strategy doc (https://cloud.google.com/storage/docs/retry-strategy) lists both 408, 429, and 5xx as retryable.

408

No source, but 408 is a request timeout and might be due to slow or network jitters which we might want to retry too instead of throwing UNKNOWN_ERROR.

We might want to address these separately.

yihua · 2026-05-20T05:18:33Z

However, i was thinking of other HTTP status/error codes that other providers might return. (Which may not be in the scope in this PR, we can track it separately and fix it in another PR).

We might want to address these separately.

In production deployment of the storage-based lock provider, so far we only see throttling on conditional writes on the same object on GCS due to the maximum rate of writes to the same object name (one write per second) (https://docs.cloud.google.com/storage/quotas#objects). So these follow-ups are low-priority.

Per review feedback, distinguish THROTTLED responses from generic "state unknown" outcomes by adding a dedicated lock.throttled counter. The three THROTTLED branches in StorageBasedLockProvider (tryLock, tryExpireCurrentLock, renewLock) now increment the new metric instead of reusing updateLockStateUnknownMetric.

Replace the single 1-second retry in unlock() with up to 3 retries on exponential backoff (1s, 2s, 4s). The other lock-provider paths (tryLock, renewLock) already retry many times against THROTTLED responses since the underlying SDK clients are configured with maxAttempts=1; this brings unlock closer to that pattern while keeping the wait bounded for callers. Extracted the sleep into sleepForThrottleRetry to keep tests fast and to allow verifying the backoff schedule via ArgumentCaptor.

hudi-bot · 2026-05-20T05:57:46Z

CI report:

cfdb474 Azure: SUCCESS
23d30cd Azure: PENDING

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua

I made revision of adding a new metric and 3 retries with exponential backoff during unlock. LGTM!

codecov-commenter · 2026-05-20T07:11:59Z

Codecov Report

❌ Patch coverage is 1.66667% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.47%. Comparing base (4d0e9cd) to head (23d30cd).
⚠️ Report is 61 commits behind head on master.

Files with missing lines	Patch %	Lines
...ent/transaction/lock/StorageBasedLockProvider.java	0.00%	52 Missing ⚠️
...nt/transaction/lock/metrics/HoodieLockMetrics.java	25.00%	3 Missing ⚠️
...ient/transaction/lock/models/LockUpsertResult.java	0.00%	2 Missing ⚠️
...hudi/aws/transaction/lock/S3StorageLockClient.java	0.00%	1 Missing ⚠️
...udi/gcp/transaction/lock/GCSStorageLockClient.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18439      +/-   ##
============================================
- Coverage     68.08%   65.47%   -2.61%     
+ Complexity    28940    23719    -5221     
============================================
  Files          2519     2058     -461     
  Lines        140646   117991   -22655     
  Branches      17427    15381    -2046     
============================================
- Hits          95757    77259   -18498     
+ Misses        37030    33701    -3329     
+ Partials       7859     7031     -828

Flag	Coverage Δ
common-and-other-modules	`?`
hadoop-mr-java-client	`44.90% <1.72%> (-0.06%)`	⬇️
spark-client-hadoop-common	`48.23% <1.72%> (-0.21%)`	⬇️
spark-java-tests	`48.84% <1.66%> (+0.19%)`	⬆️
spark-scala-tests	`44.93% <0.00%> (+0.17%)`	⬆️
utilities	`37.46% <1.66%> (-0.24%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...hudi/aws/transaction/lock/S3StorageLockClient.java	`0.00% <0.00%> (-74.83%)`	⬇️
...udi/gcp/transaction/lock/GCSStorageLockClient.java	`0.00% <0.00%> (-86.41%)`	⬇️
...ient/transaction/lock/models/LockUpsertResult.java	`0.00% <0.00%> (-100.00%)`	⬇️
...nt/transaction/lock/metrics/HoodieLockMetrics.java	`58.62% <25.00%> (-32.95%)`	⬇️
...ent/transaction/lock/StorageBasedLockProvider.java	`0.00% <0.00%> (-90.56%)`	⬇️

... and 884 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 1, 2026

linliu-code marked this pull request as ready for review April 1, 2026 19:44

yihua reviewed Apr 4, 2026

View reviewed changes

linliu-code force-pushed the fix_dangling_lock branch from eefd11f to 9a7cfae Compare May 4, 2026 18:22

github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels May 4, 2026

hudi-agent reviewed May 4, 2026

View reviewed changes

Comment thread ...t-common/src/main/java/org/apache/hudi/client/transaction/lock/StorageBasedLockProvider.java

linliu-code force-pushed the fix_dangling_lock branch from 9a7cfae to b05ad36 Compare May 4, 2026 20:04

hudi-agent reviewed May 4, 2026

View reviewed changes

linliu-code force-pushed the fix_dangling_lock branch from b05ad36 to cfdb474 Compare May 5, 2026 16:28

hudi-agent reviewed May 5, 2026

View reviewed changes

Comment thread ...t-common/src/main/java/org/apache/hudi/client/transaction/lock/StorageBasedLockProvider.java

yihua added this to the release-1.2.0 milestone May 15, 2026

voonhous reviewed May 20, 2026

View reviewed changes

yihua reviewed May 20, 2026

View reviewed changes

Comment thread ...t-common/src/main/java/org/apache/hudi/client/transaction/lock/StorageBasedLockProvider.java Outdated

Comment thread ...t-common/src/main/java/org/apache/hudi/client/transaction/lock/StorageBasedLockProvider.java Outdated

yihua reviewed May 20, 2026

View reviewed changes

Comment thread ...t-common/src/main/java/org/apache/hudi/client/transaction/lock/StorageBasedLockProvider.java Outdated

yihua approved these changes May 20, 2026

View reviewed changes

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 20, 2026

yihua merged commit 12e87b8 into apache:master May 20, 2026
62 of 63 checks passed

yihua added a commit that referenced this pull request May 20, 2026

fix: Fix dangling lock from storage based lock (#18439)

83f99c8

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

Conversation

linliu-code commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

yihua left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yihua Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hudi-agent May 4, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

voonhous left a comment

Choose a reason for hiding this comment

S3 returns 503

400

408

Uh oh!

yihua commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hudi-bot commented May 20, 2026

CI report:

Uh oh!

yihua left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 20, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

linliu-code commented Apr 1, 2026 •

edited

Loading

yihua commented May 20, 2026 •

edited

Loading