SDSTOR-21760: Fix table not found issue during index recovery #877

Open
yuwmao wants to merge 1 commit into eBay:master from yuwmao:index_crash


@yuwmao yuwmao commented Apr 17, 2026

Journal–table metadata mismatch due to CP vs destroy table ordering

Here is the issue description:
A split hits a crash flip and marks its parent buffer with m_crash_flag_on during transact_bufs (src/lib/index/wb_cache.cpp:237-247). Within the same logical window the table is removed: index_table::destroy() immediately removes its superblock from meta via MetaBlkService::remove_sub_sb (src/include/homestore/index/index_table.hpp:135-147 → src/lib/meta/meta_blk_service.cpp:872+).
CP flush starts later and writes the txn_journal to meta first, then begins flushing dirty buffers; when it reaches the flagged parent buffer, it crashes (src/lib/index/wb_cache.cpp:860-871, 896-903). On restart, recovery replays the persisted txn_journal and attempts to repair the table by ordinal, but the table superblock is gone and the table is not loaded, which triggers HS_DBG_ASSERT in repair_index_node (src/lib/index/index_service.cpp:205-212).

Key ordering rules that cause the mismatch
- Table destroy persistence is immediate at destroy(): meta superblock is removed synchronously (not tied to CP).
- Index CP flush ordering is fixed: (1) persist txn_journal; (2) flush dirty buffers; crash can occur at (2).
- Thus it’s possible to have a persisted journal entry for a table whose superblock was already removed.
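
The ordering rules above can be reproduced in a small stand-alone model. This is only a sketch: ToyDisk, ToyIndex, orphaned_entries and every member name are hypothetical and are not HomeStore APIs; the single crash_flag_on boolean stands in for m_crash_flag_on on the parent buffer, and clearing the journal after a completed flush is a simplification of the toy, not a statement about the real CP.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model only; none of these types or functions are HomeStore APIs.
struct ToyDisk {
    std::set<uint32_t> superblocks;     // table ordinals that still have a persisted superblock
    std::vector<uint32_t> txn_journal;  // table ordinals named by the persisted journal
};

struct ToyIndex {
    ToyDisk disk;                   // state that survives a crash
    std::vector<uint32_t> pending;  // journal entries still held in memory
    bool crash_flag_on = false;     // stands in for m_crash_flag_on on the parent buffer

    void create_table(uint32_t ordinal) { disk.superblocks.insert(ordinal); }

    // A split records a journal entry; a crash flip flags the parent buffer.
    void split(uint32_t ordinal, bool crash_flip) {
        pending.push_back(ordinal);
        crash_flag_on = crash_flag_on || crash_flip;
    }

    // Rule 1: destroy removes the superblock immediately, independent of CP.
    void destroy_table(uint32_t ordinal) {
        assert(disk.superblocks.count(ordinal) == 1);  // toy precondition: table exists
        disk.superblocks.erase(ordinal);
    }

    // Rule 2: CP flush persists the journal first, then flushes dirty buffers.
    // Returns false when the simulated crash fires during the buffer flush.
    bool cp_flush() {
        disk.txn_journal.insert(disk.txn_journal.end(), pending.begin(), pending.end());
        pending.clear();
        if (crash_flag_on) { return false; }  // crash at step (2); journal is already on disk
        disk.txn_journal.clear();             // toy simplification: a completed CP needs no replay
        return true;
    }
};

// Recovery-side check: journal entries whose table has no superblock left.
std::vector<uint32_t> orphaned_entries(const ToyDisk& disk) {
    std::vector<uint32_t> out;
    for (uint32_t ordinal : disk.txn_journal) {
        if (disk.superblocks.count(ordinal) == 0) { out.push_back(ordinal); }
    }
    return out;
}
```

In this model, calling create_table(7), split(7, true), destroy_table(7) and then cp_flush() leaves ordinal 7 in the persisted journal with no matching superblock, which is the state recovery trips over.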

Solution:
Trigger a CP flush when deleting an index table, so that the deletion is separated from other modifications.
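
One way to picture the intended effect is the toy model below. It is a sketch under assumptions, not the code in this PR: ToyState and the toy_* functions are hypothetical names, and it assumes the flush runs before the superblock is removed, which the description does not spell out. The point it illustrates is that any journal entry persisted by that flush still refers to a table whose superblock exists.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy model of the proposed ordering; not HomeStore code.
struct ToyState {
    std::set<uint32_t> superblocks;         // persisted table superblocks, by ordinal
    std::vector<uint32_t> journal_on_disk;  // persisted txn_journal entries, by ordinal
    std::vector<uint32_t> journal_in_mem;   // entries waiting for the next CP
    bool crash_armed = false;               // a flagged buffer will crash the next buffer flush
};

// CP flush: persist the journal, then flush buffers. Returns false on the simulated crash.
bool toy_cp_flush(ToyState& s) {
    s.journal_on_disk.insert(s.journal_on_disk.end(),
                             s.journal_in_mem.begin(), s.journal_in_mem.end());
    s.journal_in_mem.clear();
    if (s.crash_armed) { return false; }  // crashed after the journal reached disk
    s.journal_on_disk.clear();            // toy simplification: a completed CP needs no replay
    return true;
}

// Destroy preceded by a CP flush (assumed order). Returns false if the flush
// crashed, in which case the superblock has not been touched yet.
bool toy_destroy_with_flush(ToyState& s, uint32_t ordinal) {
    assert(s.superblocks.count(ordinal) == 1);  // toy precondition: table exists
    if (!toy_cp_flush(s)) { return false; }
    s.superblocks.erase(ordinal);
    return true;
}

// Recovery-side check: every persisted journal entry must name an existing table.
bool toy_journal_is_repairable(const ToyState& s) {
    for (uint32_t ordinal : s.journal_on_disk) {
        if (s.superblocks.count(ordinal) == 0) { return false; }
    }
    return true;
}
```

With crash_armed set and a pending entry for ordinal 7, toy_destroy_with_flush returns false and the superblock for 7 is still present, so the journal remains repairable. Without the crash, the flush completes, the journal is empty, and only then is the superblock removed.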

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 48.15%. Comparing base (1a0cef8) to head (d590f42).
⚠️ Report is 327 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/include/homestore/index/index_table.hpp | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #877      +/-   ##
==========================================
- Coverage   56.51%   48.15%   -8.37%     
==========================================
  Files         108      110       +2     
  Lines       10300    12905    +2605     
  Branches     1402     6207    +4805     
==========================================
+ Hits         5821     6214     +393     
+ Misses       3894     2573    -1321     
- Partials      585     4118    +3533     
