
ct/l1: integrate metastore with cloud cache #30045

Open
andrwng wants to merge 10 commits into redpanda-data:dev from andrwng:ct-l1-cache

Conversation

@andrwng
Contributor

@andrwng andrwng commented Apr 2, 2026

Switches the metastore's LSM data persistence from staging-directory-based to cloud-cache-backed. SST files are now written through a new staging_file handle that manages cache reservations and commits into the cloud cache after upload, so reads can hit locally cached files instead of re-downloading.

Some things to note:

  • commit_staging_file uses link()+unlink() instead of O_EXCL+rename — a crash between the two steps leaves real data at both paths rather than an empty sentinel at the destination.
  • trim_exhaustive() no longer deletes .part files. Their space is tracked via reservations, and deleting them would break in-flight readers. This matters more now that staging_file creates longer-lived .part files.
  • All persistence writers now use O_EXCL — the LSM guarantees unique file handles, so duplicate writes are bugs, not something to silently handle. Worth settling now, since supporting overwrites would be harder for the cloud cache persistence implementation.
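The link()+unlink() commit step can be sketched in plain POSIX terms (a minimal illustration, not the PR's Seastar code; `commit_staging` and the paths are hypothetical):

```cpp
#include <cassert>
#include <cerrno>
#include <fstream>
#include <string>
#include <unistd.h>

// Hypothetical sketch of the commit step. link() makes the fully written
// staging data visible at dest first; unlink() then drops the staging
// name. A crash between the two calls leaves complete data at both
// paths, whereas an O_EXCL create followed by rename could crash after
// creating an empty sentinel at the destination.
inline bool commit_staging(const std::string& staging, const std::string& dest) {
    if (::link(staging.c_str(), dest.c_str()) != 0) {
        return false; // e.g. EEXIST if the destination already exists
    }
    ::unlink(staging.c_str()); // best-effort cleanup of the staging name
    return true;
}
```

Because link() creates a second name for the same inode, the data is never copied and the destination is either absent or complete.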

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Improvements

  • Cloud topics long-term storage metadata will now be managed as part of the cloud cache.

@andrwng andrwng force-pushed the ct-l1-cache branch 9 times, most recently from 2c47aff to 28e7af4 Compare April 7, 2026 23:21
andrwng added 2 commits April 7, 2026 16:25
Plumb an optional lowres_clock deadline through the reserve_space
path so blocking waits respect it instead of waiting indefinitely.

Upcoming work to integrate the metastore with the cloud cache will use
this to bound reservation waits during staging file SST writes.

Lets a single guard accumulate reservations from multiple chunks.

Upcoming work will add a staging_file handle that reserves cache space
incrementally as data is appended, consolidating each chunk into
one guard so commits have exactly one reservation to finalize.
@andrwng andrwng marked this pull request as ready for review April 7, 2026 23:30
Copilot AI review requested due to automatic review settings April 7, 2026 23:30
Contributor

Copilot AI left a comment


Pull request overview

Integrates the L1 metastore’s LSM SST persistence with the cloud cache by introducing a cache-backed data_persistence implementation and a new cloud_io::staging_file write path that reserves cache space, uploads to object storage, and then commits into the cache for local-hit reads.

Changes:

  • Add cloud_cache_persistence for LSM data files, plus a shared cloud_data_persistence_base to deduplicate cloud CRUD logic.
  • Extend cloud cache with staging_file/commit APIs and deadline-aware reservation/trim throttling behavior.
  • Update cloud_topics (domain/db/snapshot paths) and tests to use cloud-cache-backed persistence instead of a staging-directory-backed implementation.

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 6 comments.

File Description
src/v/lsm/io/tests/persistence_test.cc Adds duplicate-write semantics test and introduces a cloud-cache-backed persistence factory for parametrized tests.
src/v/lsm/io/tests/BUILD Adds deps needed for cloud-cache-backed persistence tests.
src/v/lsm/io/persistence.h Updates open_sequential_writer contract to require unique file handles.
src/v/lsm/io/memory_persistence.cc Enforces unique-handle behavior by throwing on duplicate writer creation.
src/v/lsm/io/disk_persistence.cc Switches SST creation to exclusive creation semantics.
src/v/lsm/io/cloud_persistence.cc Refactors cloud persistence to use shared base + exclusive local staging creation.
src/v/lsm/io/cloud_data_persistence_base.h Introduces shared base for cloud-backed SST persistence operations.
src/v/lsm/io/cloud_data_persistence_base.cc Implements shared cloud operations (RTC, result checking, upload helper, remove/list/close).
src/v/lsm/io/cloud_cache_persistence.h Declares cloud-cache-backed SST persistence opener.
src/v/lsm/io/cloud_cache_persistence.cc Implements cache-backed SST reads/writes using cloud_io::staging_file.
src/v/lsm/io/BUILD Adds build targets for cloud_data_persistence_base and cloud_cache_persistence.
src/v/cloud_topics/read_replica/snapshot_manager.cc Switches snapshot manager’s LSM data persistence to cloud-cache-backed implementation.
src/v/cloud_topics/read_replica/BUILD Adds dependency on cloud_cache_persistence.
src/v/cloud_topics/level_one/metastore/lsm/tests/replicated_db_test.cc Updates replicated DB tests to initialize/use a cloud cache and new persistence.
src/v/cloud_topics/level_one/metastore/lsm/tests/BUILD Adds cache + persistence deps for updated tests.
src/v/cloud_topics/level_one/metastore/lsm/replicated_db.h Changes replicated DB open API to take a cache pointer instead of staging directory.
src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc Uses cloud-cache-backed persistence when opening databases.
src/v/cloud_topics/level_one/metastore/lsm/BUILD Adds deps for cache + cloud-cache persistence integration.
src/v/cloud_topics/level_one/domain/tests/db_domain_manager_test.cc Updates domain manager tests to initialize/use cloud cache.
src/v/cloud_topics/level_one/domain/tests/BUILD Adds cache + persistence deps for updated tests.
src/v/cloud_topics/level_one/domain/domain_supervisor.h Changes supervisor ctor to accept cache pointer (replaces staging dir).
src/v/cloud_topics/level_one/domain/domain_supervisor.cc Plumbs cache pointer through to domain manager construction.
src/v/cloud_topics/level_one/domain/db_domain_manager.h Changes db_domain_manager ctor/state to store cache pointer (replaces staging dir).
src/v/cloud_topics/level_one/domain/db_domain_manager.cc Plumbs cache into replicated DB open/reopen paths.
src/v/cloud_topics/level_one/domain/BUILD Adds cache dependencies for updated domain components.
src/v/cloud_topics/app.cc Passes cloud cache into domain supervisor construction; expands startup staging-dir cleanup behavior.
src/v/cloud_io/tests/cache_test.cc Adds tests for staging_file lifecycle, trim behavior, double-commit safety, and reservation release.
src/v/cloud_io/staging_file.h Introduces cloud_io::staging_file API for reserved staged writes into cache.
src/v/cloud_io/staging_file.cc Implements staging_file append/flush/close/commit behavior with reservations.
src/v/cloud_io/cache_service.h Exposes staging_file creation/commit and deadline-aware reservation plumbing.
src/v/cloud_io/cache_service.cc Implements staging file creation/commit, deadline-aware reserve/trim behavior, and .part handling changes.
src/v/cloud_io/BUILD Adds staging_file implementation/headers and required deps.
src/v/cloud_io/basic_cache_service_api.h Adds reservation guard merge support.
src/v/cloud_io/basic_cache_service_api.cc Implements reservation guard merge and keeps template instantiations.

andrwng added 5 commits April 7, 2026 16:50
Open files with O_EXCL instead of truncate in disk_persistence and
cloud_persistence. Reject duplicate file handles in
memory_persistence. The LSM engine guarantees unique file handles,
so duplicates indicate a bug and should fail loudly.

This is useful because an upcoming persistence type will share this
behavior, and it is easier for all persistence impls to guarantee
unique handles than to support overwrites (especially since the LSM
should guarantee unique IDs).
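The exclusive-create behavior reduces to the underlying open(2) flags (a sketch; `open_exclusive` is a hypothetical helper, not the persistence code):

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// O_CREAT | O_EXCL fails with EEXIST when the path already exists, so a
// duplicate file handle surfaces as a loud error instead of silently
// truncating data that a previous writer produced.
inline int open_exclusive(const char* path) {
    return ::open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
}
```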

.part files represent in-flight writes whose space is already
accounted for via reservations. Deleting them during trim_exhaustive
would cause subsequent reads to fail, only to retry the download and
recreate the file, so it doesn't really help much.

Upcoming work will add a staging_file handle that creates new .part
files for writes, making this race more likely.

The original behavior was added (redpanda-data#11860) more as a precaution
against hypothetical runtime bugs than as a functional fix of behavior.
As is, .part files are cleaned up at cache startup anyway.

Adds a handle bundling everything needed to write a file into the cache.
Callers append data (cache space is reserved automatically in chunks),
then commit or close depending on whether the appends were successful.

This will be used to write SST staging files in an upcoming LSM
persistence implementation backed by the cloud cache. The idea is that
SST writes will append to these staging files (rather than into the L1
staging directory) and then, after uploading, commit them into the
cloud cache so subsequent reads can use the local file.
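The lifecycle described above can be modeled roughly as follows (a toy synchronous model with hypothetical names; the real staging_file is asynchronous and backed by actual cache reservations):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model of the handle's lifecycle: appends reserve cache
// space in fixed-size chunks, all merged into one guard; commit
// finalizes exactly one consolidated reservation; close without commit
// releases it.
class staging_file_model {
    static constexpr size_t chunk = 4096;
    size_t _reserved = 0; // bytes of cache space held by this handle
    size_t _written = 0;
    bool _committed = false;

public:
    void append(const std::string& data) {
        _written += data.size();
        while (_reserved < _written) {
            _reserved += chunk; // reserve another chunk as data grows
        }
    }

    // Commit: the consolidated reservation shrinks to the actual size.
    size_t commit() {
        _committed = true;
        _reserved = _written;
        return _written;
    }

    // Close without commit (e.g. on failure): release the reservation.
    void close() {
        if (!_committed) {
            _reserved = 0;
        }
    }

    size_t reserved() const { return _reserved; }
};
```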

Factor reusable code out of cloud_persistence.cc into
cloud_data_persistence_base so the cache-backed implementation can reuse
it. At a high level, this is code that interacts with the cloud
directly, with the idea that we'll use this in a persistence
implementation that is similar to the cloud persistence but uses cloud
cache instead of the staging directory for local files. No behavior
change.

Cache-backed data_persistence. Reads check the cache first and
download on miss. Writes stage through the cache and upload to cloud,
so subsequent reads hit locally.
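The read path amounts to a read-through cache (an illustrative sketch with hypothetical names, not the cloud_cache_persistence code):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Illustrative read-through lookup: serve from the local cache when
// present, otherwise download and populate the cache so subsequent
// reads of the same key hit locally.
class read_through_cache {
    std::map<std::string, std::string> _local;

public:
    std::string get(
      const std::string& key,
      const std::function<std::string(const std::string&)>& download) {
        if (auto it = _local.find(key); it != _local.end()) {
            return it->second; // cache hit: no cloud round trip
        }
        auto data = download(key); // miss: fetch from object storage
        _local.emplace(key, data); // commit into the cache
        return data;
    }
};
```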
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.

Comment on lines +2038 to +2051
    auto key_path = _cache_dir / key;
    auto filename = key_path.filename();
    auto dir_path = key_path;
    dir_path.remove_filename();

    // TODO: share code with put().
    auto tmp_filename = std::filesystem::path(
      ss::format(
        "{}_{}_{}{}",
        filename.native(),
        ss::this_shard_id(),
        (++_cnt),
        cache_tmp_file_extension));
    auto tmp_filepath = dir_path / tmp_filename;

Copilot AI Apr 8, 2026


create_staging_path() builds key_path using _cache_dir / key without the path normalization / containment check used by put(). If key is absolute or contains .., this can create .part files outside the cache directory (path traversal). Validate/normalize key (and reject out-of-cache paths) similarly to put() before using it to construct filesystem paths.

Comment on lines +2093 to +2106
    auto guard = _gate.hold();

    auto dest_path = _cache_dir / key;
    vlog(
      log.debug, "Committing staging file {} to {}", staging_path, dest_path);

    // We use link() rather than O_EXCL create + rename because a crash between
    // the O_EXCL create and the rename would leave an empty file at the
    // destination: on restart the cache would see a zero-byte entry for this
    // key. With link(), a crash between link and unlink leaves the real data
    // at the destination.
    auto link_fut = co_await ss::coroutine::as_future(
      ss::link_file(staging_path.native(), dest_path.native()));
    if (link_fut.failed()) {

Copilot AI Apr 8, 2026


commit_staging_file() uses dest_path = _cache_dir / key without validating/normalizing key. A malicious or buggy caller could commit (link) arbitrary paths outside the cache directory. Apply the same containment validation as put() (and consider rejecting absolute/parent-relative keys) before linking into place.

Comment on lines +2120 to +2125
    }
    co_await ss::remove_file(staging_path.native());

    auto file_size = co_await ss::file_size(dest_path.native());
    reservation.wrote_data(file_size, 1);

Copilot AI Apr 8, 2026


After link_file() succeeds, commit_staging_file() immediately remove_file(staging_path) and only then updates reservation/accounting. If the unlink throws, the destination file is already committed but the reservation/accounting update is skipped, leaving cache usage undercounted and the key present. Consider making staging cleanup best-effort (ignore/log failures) and ensure reservation.wrote_data(...) / access_time_tracker updates still run once the destination exists.

Comment on lines +103 to +116
// Writing to the same file handle twice must not corrupt the original data.
TEST_P(PersistenceTest, DuplicateWritePreservesOriginal) {
    {
        auto w = persistence->open_sequential_writer({}).get();
        auto _ = ss::defer([&w] { w->close().get(); });
        w->append(iobuf::from("original data")).get();
    }
    // A second write to the same handle may throw or silently no-op
    // depending on the backend. Either way, the original must survive.
    try {
        auto w = persistence->open_sequential_writer({}).get();
        auto _ = ss::defer([&w] { w->close().get(); });
        w->append(iobuf::from("replacement")).get();
    } catch (...) {

Copilot AI Apr 8, 2026


The updated contract in data_persistence::open_sequential_writer says handles must be unique, but this test allows a second write to the same handle to succeed silently. For the cloud-cache backend a duplicate writer can still overwrite the remote object during upload while the cache commit becomes a no-op, so this test can pass while introducing latent corruption. Prefer asserting the second writer fails (exception) or otherwise verifying the remote object was not modified when a duplicate write is attempted.

andrwng added 3 commits April 7, 2026 17:59
replicated_db::open() takes cloud_io::cache* instead of a staging
directory path. Thread the cache pointer through domain_supervisor
and db_domain_manager in place of the staging directory.

The LSM previously staged SST files in the l1_staging directory via
cloud_data_persistence. Now that it uses the cloud cache, those files
are orphaned. Widen the startup cleanup to delete all regular files
in the staging directory, not just .tmp files.

Switch the read replica database_refresher to cache-backed LSM
persistence. The staging directory is still passed through for
l1::file_io which needs it for L1 object staging.