feat(iceberg): support order key #25202

Open

xxhZs wants to merge 2 commits into main from xxh/add-sort-id-for-iceberg

Conversation

@xxhZs
Contributor

@xxhZs xxhZs commented Mar 30, 2026

Summary

This PR adds order_key support for Iceberg tables and ensures the configured sort order is carried through both file writing and Iceberg compaction metadata.

Main changes:

  • parse and validate order_key for Iceberg table creation
  • reject invalid expressions and unsupported/system columns early in frontend/connector validation
  • build Iceberg SortOrder from the configured key when creating the table
  • write sort_order_id into generated Iceberg data files so the sort order is visible in metadata
  • preserve the same sort-order metadata on compaction outputs
  • add e2e coverage for Iceberg engine, append-only Iceberg engine, and Iceberg sink

In practice, this means a table like:

WITH (order_key = 'v1 desc nulls last, id asc') ENGINE = ICEBERG

will materialize an Iceberg sort order in table metadata, and the produced data files can be observed from rw_iceberg_files.sort_order_id.
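
Concretely, a full statement of this shape might look like the following sketch (the table name and column types are assumptions; only the order_key syntax, the v1/id columns, and rw_iceberg_files come from this PR):

```sql
-- Sketch: an Iceberg-engine table with a two-column order key.
CREATE TABLE t_ordered (
    id INT PRIMARY KEY,
    v1 INT
)
WITH (order_key = 'v1 desc nulls last, id asc')
ENGINE = ICEBERG;

-- The sort order id written into each data file can then be inspected:
SELECT * FROM rw_iceberg_files;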

Validation

Validated locally with Iceberg engine + storage catalog.

Test case:

  • order_key = 'v1 desc nulls last, v2 asc nulls first, id desc'
  • inserted rows in 3 batches with FLUSH after each batch:
    • (1, 100, 2, 'a'), (2, 100, 1, 'b')
    • (3, 100, 1, 'c'), (4, 90, 2, 'd')
    • (5, 90, 2, 'e'), (6, 90, 1, 'f')

Before compaction:

  • rw_iceberg_files showed 6 data files
  • all data files had non-null sort_order_id = 1

After starting a dedicated iceberg compactor and running:

VACUUM FULL t_order_multi;

After compaction:

  • rw_iceberg_files showed 1 data file
  • the compacted file still had non-null sort_order_id = 1

The compacted parquet file was read directly and its physical row order was:

(3, 100, 1, 'c')
(2, 100, 1, 'b')
(1, 100, 2, 'a')
(6, 90, 1, 'f')
(5, 90, 2, 'e')
(4, 90, 2, 'd')

This matches the configured multi-column order key:

  • v1 desc
  • v2 asc
  • id desc
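
The observed physical order can be reproduced with the equivalent multi-column comparator. This is a stdlib-only sketch, not code from the PR; NULL handling is omitted because the test rows contain no NULLs:

```rust
// Sketch: reproducing the observed row order with the comparator implied by
// order_key = 'v1 desc nulls last, v2 asc nulls first, id desc'.

// Rows are (id, v1, v2, label), as inserted in the test case.
fn sorted_labels() -> Vec<&'static str> {
    let mut rows = vec![
        (1, 100, 2, "a"),
        (2, 100, 1, "b"),
        (3, 100, 1, "c"),
        (4, 90, 2, "d"),
        (5, 90, 2, "e"),
        (6, 90, 1, "f"),
    ];
    // v1 desc, then v2 asc, then id desc.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.2.cmp(&b.2)).then(b.0.cmp(&a.0)));
    rows.into_iter().map(|r| r.3).collect()
}

fn main() {
    // Matches the physical order read back from the compacted parquet file.
    assert_eq!(sorted_labels(), ["c", "b", "a", "f", "e", "d"]);
}
```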

Dependency

The compaction-side ordering behavior depends on an upstream change in iceberg-compaction-core. This PR should wait for that upstream PR to be merged.

@xxhZs xxhZs requested a review from a team as a code owner March 30, 2026 09:48
@xxhZs xxhZs requested review from MrCroxx and removed request for a team March 30, 2026 09:48
@xxhZs xxhZs requested review from Li0k and chenzl25 March 30, 2026 09:49
    let data_files = result
        .into_iter()
        .map(|f| {
            let f = apply_sort_order_id_to_data_file(f, table_sort_order_id)
Contributor

I think we can't apply sort order key to the parquet file written by iceberg sink because there is no order guarantee in the streaming sink.

Copilot AI left a comment


Pull request overview

Adds order_key support for Iceberg table creation and sink file metadata, ensuring the configured sort order is propagated through table metadata and written data files (and preserved through compaction via updated compaction-core integration).

Changes:

  • Add order_key option parsing/validation (frontend + connector) and build Iceberg SortOrder during table auto-creation.
  • Write sort_order_id into produced Iceberg data files (visible via rw_iceberg_files) and update compaction integration configuration.
  • Add e2e coverage for Iceberg engine, append-only engine, and Iceberg sink.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/storage/src/hummock/compactor/iceberg_compaction/iceberg_compactor_runner.rs — update compaction planning config API usage (max_input_parallelism).
  • src/storage/Cargo.toml — bump the iceberg-compaction-core git rev.
  • src/frontend/src/handler/create_table.rs — parse/validate order_key during Iceberg engine table creation; forward the option to the sink and strip it from source props.
  • src/connector/src/sink/iceberg/mod.rs — add order_key to IcebergConfig, parse/validate it, build the Iceberg SortOrder, and set sort_order_id on written data files; add unit tests.
  • e2e_test/iceberg/test_case/pure_slt/iceberg_sink.slt — add sink coverage verifying sort_order_id is populated.
  • e2e_test/iceberg/test_case/pure_slt/iceberg_engine.slt — add engine table coverage verifying sort_order_id is populated.
  • e2e_test/iceberg/test_case/pure_slt/iceberg_engine_append_only.slt — add append-only engine coverage verifying sort_order_id is populated.
  • Cargo.toml — bump iceberg / catalog crates git rev (aligned with compaction-core).
  • Cargo.lock — lockfile updates for the bumped git dependencies.


    #[serde(default)]
    pub partition_by: Option<String>,

    #[serde(default)]

Copilot AI Mar 31, 2026

IcebergConfig derives WithOptions and a new order_key field is added, but the checked-in auto-generated src/connector/with_options_sink.yaml does not include order_key (no matches for order_key in that file). Please regenerate and commit the updated YAML (via ./risedev generate-with-options) so CI/docs stay in sync with the Rust option definitions.

Suggested change:

    #[serde(default)]
    #[with_option(skip)]
Comment on lines +3088 to +3094
    let column = tokens[0];
    let valid_column = Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*$").unwrap();
    if !valid_column.is_match(column) {
        bail!(
            "Invalid order key column `{column}`\nHINT: Only plain column names are supported in order_key"
        );
    }

Copilot AI Mar 31, 2026

parse_order_key_exprs recompiles Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*$") for every item and uses unwrap(). Please precompile this regex once (e.g., static Lazy<Regex>) to avoid repeated allocations/CPU and remove the per-item unwrap() in the hot path.
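
The fix direction is to compile the pattern once into a static instead of per item. As a stdlib-only sketch of the same validation, the character class ^[A-Za-z_][A-Za-z0-9_]*$ can even be checked without a regex at all (the function name here is hypothetical, not from the PR):

```rust
// Sketch: validate an order_key column name against the same rule as
// ^[A-Za-z_][A-Za-z0-9_]*$, with no regex compilation in the hot path.
fn is_plain_column_name(name: &str) -> bool {
    let mut chars = name.chars();
    // First char: ASCII letter or underscore.
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    // Remaining chars: ASCII letters, digits, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn main() {
    assert!(is_plain_column_name("v1"));
    assert!(is_plain_column_name("_id"));
    assert!(!is_plain_column_name("1v"));
    assert!(!is_plain_column_name("t.col"));
    assert!(!is_plain_column_name(""));
}
```

If the regex itself must stay, the equivalent change is to move it into a `static` initialized once (e.g. via `std::sync::OnceLock<Regex>`), so the per-item `unwrap()` disappears.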

Comment on lines +975 to +979
    validate_order_key_columns(
        order_key,
        param.columns.iter().map(|column| column.name.as_str()),
    )
    .context("invalid order_key")?;

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In IcebergSink::new, validate_order_key_columns(...).context("invalid order_key")? returns an anyhow::Error and will be converted via From<anyhow::Error> into SinkError::Internal. For a user-provided WITH option this should be reported as SinkError::Config (with context) so invalid configuration is surfaced correctly instead of looking like an internal failure.

Suggested change:

    - validate_order_key_columns(
    -     order_key,
    -     param.columns.iter().map(|column| column.name.as_str()),
    - )
    - .context("invalid order_key")?;
    + if let Err(e) = validate_order_key_columns(
    +     order_key,
    +     param.columns.iter().map(|column| column.name.as_str()),
    + )
    + .context("invalid order_key")
    + {
    +     return Err(SinkError::Config(e));
    + }

@chenzl25
Contributor

chenzl25 commented Apr 2, 2026

@xxhZs any updates?

@xxhZs
Contributor Author

xxhZs commented Apr 7, 2026

Can review again, @chenzl25.

Contributor

@chenzl25 chenzl25 left a comment


Generally LGTM


3 participants