Skip to content

Add codec pipeline framework for raw forward index encoding#18229

Open
xiangfu0 wants to merge 4 commits intoapache:masterfrom
xiangfu0:codec-v2
Open

Add codec pipeline framework for raw forward index encoding#18229
xiangfu0 wants to merge 4 commits intoapache:masterfrom
xiangfu0:codec-v2

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Apr 15, 2026

Summary

Introduces a new codec-pipeline forward index format (version 7) for SV fixed-byte raw columns (INT, LONG), replacing the legacy compressionCodec field with an extensible codecSpec DSL. Both paths remain supported in 1.5.0 for mixed-version cluster safety.

What's in this PR

Codec pipeline SPI (pinot-segment-spi)

  • New types: CodecDefinition, CodecContext, CodecOptions, CodecKind, CodecInvocation, CodecPipeline, CodecSpecParser
  • DSL examples: "DELTA", "ZSTD(3)", "LZ4", "CODEC(DELTA,ZSTD(3))", "CODEC(DELTA,LZ4)"

Codec pipeline runtime (pinot-segment-local)

  • CodecRegistry — immutable singleton with built-in DELTA, ZSTD, LZ4 codecs
  • CodecPipelineExecutor — bakes pipeline flags at construction; thread-safe compress/decompress
  • DeltaCodecDefinition, ZstdCodecDefinition, Lz4CodecDefinition
  • FixedByteChunkForwardIndexWriterV7 — writes version-7 files with canonical spec in header
  • FixedByteChunkSVForwardIndexReaderV7 — reads version-7 files; validates sizeOfEntry header field; wires ForwardIndexReaderFactory dispatch
  • SingleValueFixedByteCodecPipelineIndexCreator — segment creator integration

FieldConfig / ForwardIndexConfig wiring

  • New codecSpec field on FieldConfig (with @JsonCreator on the 10-arg constructor to prevent silent null during ZK/HTTP deserialization)
  • compressionCodec and codecSpec are mutually exclusive (enforced by Preconditions and TableConfigUtils.validateCodecSpecIfPresent)
  • getCompressionCodec() / withCompressionCodec() deprecated on both FieldConfig and ForwardIndexConfig

Migration tooling

  • CompressionCodecMigrator — maps LZ4→"LZ4", ZSTANDARD→"ZSTD(3)", DELTA→"CODEC(DELTA,LZ4)"; annotated @Deprecated with 2.0 removal TODO

Tests

  • CodecPipelineForwardIndexTest — INT/LONG roundtrips for DELTA, ZSTD, LZ4, CODEC(DELTA,ZSTD), CODEC(DELTA,LZ4); canonical spec header; partial last chunk; ForwardIndexReaderFactory dispatch
  • CompressionCodecMigrationRoundtripTest — legacy codecs (PASS_THROUGH, SNAPPY, LZ4, ZSTANDARD, GZIP) written with old path, read back with V4 reader; migratable codecs verified to produce identical values via new V7 path
  • CompressionCodecMigratorTest — 14 unit tests covering toCodecSpec, isMigratable, migrate(FieldConfig), migrateTableConfig
  • FieldConfigTest — guards @JsonCreator regression; mutual exclusion enforcement
  • ForwardIndexTypeTest, TableConfigUtilsTest, backward-compat integration tests

Compatibility

  • New version-7 files can only be read by 1.5.0+ nodes; segment-level, not table-level
  • Old format (V4 legacy codecs) continues to be readable on all versions
  • Mixed-version rolling upgrade safe: both reader paths coexist

Test plan

  • ./mvnw -pl pinot-segment-local,pinot-segment-spi,pinot-spi -Dtest="CodecPipelineForwardIndexTest,CompressionCodecMigrationRoundtripTest,CompressionCodecMigratorTest,FieldConfigTest" test
  • Existing forward index tests in pinot-segment-local
  • Integration test: ForwardIndexHandlerTest

🤖 Generated with Claude Code

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an initial “codec pipeline” framework for raw (no-dict) single-value forward index encoding, introducing a DSL (DELTA, ZSTD(N), CODEC(...)) and a new self-describing on-disk format (writer/reader v7) that persists the canonical codec spec in the file header. This is wired through ForwardIndexConfig.codecSpec and the forward-index creator/reader factories, with tests covering parsing, validation, and INT/LONG round-trips.

Changes:

  • Introduce codec DSL AST + parser in pinot-segment-spi, plus ForwardIndexConfig.codecSpec (mutually exclusive with compressionCodec) and writer-version forcing.
  • Add codec registry/validator/executor in pinot-segment-local and implement v7 fixed-byte chunk writer/reader that stores the canonical spec in the header.
  • Wire new creator/reader paths via ForwardIndexCreatorFactory and ForwardIndexReaderFactory, and add comprehensive unit tests.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/ForwardIndexConfigTest.java Adds tests for codecSpec JSON round-trip, equality, and mutual-exclusion/ordering guards.
pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/codec/CodecSpecParserTest.java New unit tests for codec DSL parsing/canonicalization and invalid specs.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/ForwardIndexConfig.java Adds codecSpec, forces raw writer version 7 when set, updates equals/hashCode and Builder guards.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecSpecParser.java Implements structural (phase-1) recursive-descent parser for the DSL.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecPipeline.java New AST node representing an ordered pipeline with canonical spec rendering.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecOptions.java Marker interface for typed, validated per-codec options.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecKind.java Enum classifying codecs as TRANSFORM vs COMPRESSION for pipeline rules.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecInvocation.java New AST node representing a single codec invocation + args.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecDefinition.java SPI interface describing codec parsing/validation/canonicalization.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecContext.java Context object for per-column type validation during pipeline validation.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/forward/CodecPipelineForwardIndexTest.java Integration tests for v7 writer/reader round-trip, header spec storage, factory dispatch, partial chunk.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/io/codec/CodecPipelineValidatorTest.java Tests pipeline validation rules (ordering, type checks, unknown codecs, arg ranges).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/forward/FixedByteChunkSVForwardIndexReaderV7.java New v7 reader that reads canonical spec from header and decodes chunks via executor.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexReaderFactory.java Dispatches fixed-width SV version-7 raw indexes to the new v7 reader.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexCreatorFactory.java Creates codec-pipeline raw forward index creators when codecSpec is present and supported.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/SingleValueFixedByteCodecPipelineIndexCreator.java New creator wiring v7 writer + executor for INT/LONG SV raw forward indexes.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkForwardIndexWriterV7.java New v7 writer that embeds canonical spec in header and writes variable-size encoded chunks with long offsets.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/ZstdCodecDefinition.java Adds ZSTD codec definition with typed options and canonicalization.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/DeltaCodecDefinition.java Adds DELTA transform codec definition with INT/LONG validation and canonicalization.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/CodecRegistry.java Introduces immutable default registry + mutable registry for tests/future plugins, with reserved keyword protection.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/CodecPipelineValidator.java Validates structural rules and per-codec context compatibility for pipelines.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/CodecPipelineExecutor.java Executes the validated pipeline per chunk (DELTA + ZSTD v1 wiring) and produces canonical spec.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 67.37720% with 352 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.54%. Comparing base (3eff094) to head (4673d78).

Files with missing lines Patch % Lines
...ment/local/io/codec/DeltaDeltaCodecDefinition.java 56.29% 49 Missing and 10 partials ⚠️
...t/segment/local/io/codec/DeltaCodecDefinition.java 58.58% 35 Missing and 6 partials ⚠️
.../forward/FixedByteChunkSVForwardIndexReaderV7.java 61.36% 23 Missing and 11 partials ⚠️
...ot/segment/local/io/codec/ZstdCodecDefinition.java 60.31% 17 Missing and 8 partials ⚠️
...ot/segment/local/io/codec/GzipCodecDefinition.java 72.83% 16 Missing and 6 partials ⚠️
...he/pinot/segment/local/utils/TableConfigUtils.java 13.04% 19 Missing and 1 partial ⚠️
...riter/impl/FixedByteChunkForwardIndexWriterV7.java 79.31% 10 Missing and 8 partials ⚠️
...SingleValueFixedByteCodecPipelineIndexCreator.java 0.00% 15 Missing ⚠️
...pache/pinot/segment/spi/codec/CodecSpecParser.java 82.75% 6 Missing and 9 partials ⚠️
.../segment/local/io/codec/CodecPipelineExecutor.java 78.78% 12 Missing and 2 partials ⚠️
... and 15 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18229      +/-   ##
============================================
+ Coverage     63.48%   63.54%   +0.05%     
- Complexity     1627     1659      +32     
============================================
  Files          3244     3263      +19     
  Lines        197365   198427    +1062     
  Branches      30540    30704     +164     
============================================
+ Hits         125306   126098     +792     
- Misses        62019    62207     +188     
- Partials      10040    10122      +82     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.52% <67.37%> (+0.06%) ⬆️
java-21 63.51% <67.37%> (+0.04%) ⬆️
temurin 63.54% <67.37%> (+0.05%) ⬆️
unittests 63.54% <67.37%> (+0.05%) ⬆️
unittests1 55.20% <12.78%> (-0.27%) ⬇️
unittests2 35.09% <56.25%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 10 comments.

Comment thread pinot-spi/src/test/java/org/apache/pinot/spi/config/table/FieldConfigTest.java Outdated
@xiangfu0 xiangfu0 changed the title Add codec pipeline framework for raw forward index encoding (v1) Add codec pipeline framework for raw forward index encoding Apr 16, 2026
Copy link
Copy Markdown
Contributor Author

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-confidence correctness issue; see inline comment.

codec.validateContext(options, ctx);

if (codec.kind() == CodecKind.COMPRESSION) {
compressionCount++;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only counts compression stages. The executor collapses transform stages into booleans, so CODEC(DELTA,DELTA) would pass validation, get encoded as a single delta layer, and then be decoded once, silently corrupting values. Please reject duplicate transform stages here or preserve the full ordered stage list in the executor.


/**
* Integration test for the codec pipeline forward index (version 7).
* Writes an offline table with INT and LONG raw columns encoded via the CODEC(DELTA,ZSTD(3)) pipeline,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test for string columns with raw dictionary + compression for codec.

* <li>{@link #maxCompressedSize}: returns an upper bound on encoded size.</li>
* </ul>
*/
public final class CodecPipelineExecutor {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This executor should be a generic executor init with a list of codec implementations. It should be codec agnostic.

Introduces a self-describing codec DSL (DELTA, DELTADELTA, ZSTD(N),
LZ4, SNAPPY, GZIP, CODEC(...)) and a new on-disk format version 7 that
stores the canonical codec spec in the file header. The pipeline is
codec-agnostic: each codec implements the ChunkCodecHandler SPI
(encode/decode/decodeInto/maxEncodedSize) and is dispatched
polymorphically by CodecPipelineExecutor.

Key components:
- Codec DSL AST + parser (CodecSpecParser) in pinot-segment-spi
- ChunkCodecHandler SPI and CodecRegistry/Validator/Executor in
  pinot-segment-local
- V7 fixed-byte chunk writer (FixedByteChunkForwardIndexWriterV7) and
  reader (FixedByteChunkSVForwardIndexReaderV7) for INT/LONG SV columns
- ForwardIndexCreatorFactory + ForwardIndexReaderFactory dispatch for V7
- ForwardIndexConfig.codecSpec field (mutually exclusive with
  compressionCodec); forces writer version 7
- ForwardIndexHandler V7-to-legacy rollback detection on segment reload
- CompressionCodecMigrator: type-agnostic and schema-aware helpers to
  translate legacy compressionCodec to codecSpec
- TableConfigUtils.validateCodecSpecIfPresent for table-level validation
- FieldConfig.withCodecSpec builder method
- CodecPipelineIntegrationTest: end-to-end segment build + query test
  covering INT/LONG codec-pipeline columns and STRING dictionary column

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
xiangfu0 and others added 3 commits April 19, 2026 06:25
…remature deprecations

- SnappyCodecDefinition.decodeInto: add dst.clear() to satisfy ChunkCodecHandler contract
  (other impls already did this; missing here could corrupt reads on buffer reuse)
- CodecKind Javadoc: correct "any number of TRANSFORM stages" to "at most one" to match
  the validator's enforced constraint
- FieldConfig.CompressionCodec: remove @deprecated from SNAPPY/LZ4/ZSTANDARD/GZIP since
  codecSpec only supports SV INT/LONG — these codecs remain the only valid option for
  STRING, BYTES, and multi-value columns; DELTA/DELTADELTA keep their deprecation (SV INT/LONG
  only, codecSpec is the correct replacement)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…on dispatch, add totalDocs bounds check

- CodecPipelineExecutor.buildCanonical: reuse already-parsed BoundStage options instead of
  calling parseOptions a second time per codec during executor construction
- ForwardIndexReaderFactory: replace confusing >=/>= guard chain with explicit equality checks
  for V7/V4 and an explicit fallthrough-to-old-reader for versions < V4, making the
  dispatch table self-documenting and easy to extend
- FixedByteChunkSVForwardIndexReaderV7: store totalDocs from header and validate docId
  bounds in getInt/getLong rather than silently discarding the header field

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ard totalDocs < 0

- Move canonicalize() into BoundStage as an instance method, eliminating the raw cast and
  @SuppressWarnings in buildCanonical — each BoundStage already holds typed handler+options
- Move docId bounds check from getInt/getLong into getChunkBuffer so future accessors are
  protected by default; also validate totalDocs >= 0 during header parsing
- Add comment in ForwardIndexReaderFactory clarifying why versions 5–6 throw for fixed-byte SV

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@xiangfu0 xiangfu0 requested a review from Copilot April 20, 2026 03:44
@xiangfu0 xiangfu0 added feature New functionality index Related to indexing (general) labels Apr 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 40 out of 40 changed files in this pull request and generated 6 comments.

int compressedSize = encoded.remaining();

// Per-chunk header: compressedSize (int) + uncompressedSize (int)
ByteBuffer chunkHeader = ByteBuffer.allocateDirect(CHUNK_HEADER_BYTES);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeChunk() allocates a new direct ByteBuffer for the per-chunk header on every chunk. For large segments this can create substantial direct-memory allocation/cleanup overhead. Consider reusing a single (heap or direct) 8-byte buffer for the chunk header, or writing the two ints via a reusable ByteBuffer field.

Suggested change
ByteBuffer chunkHeader = ByteBuffer.allocateDirect(CHUNK_HEADER_BYTES);
ByteBuffer chunkHeader = ByteBuffer.allocate(CHUNK_HEADER_BYTES);

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +60
Map<String, ChunkCodecHandler<?>> m = new LinkedHashMap<>();
m.put(DeltaCodecDefinition.INSTANCE.name().toUpperCase(), DeltaCodecDefinition.INSTANCE);
m.put(DeltaDeltaCodecDefinition.INSTANCE.name().toUpperCase(), DeltaDeltaCodecDefinition.INSTANCE);
m.put(ZstdCodecDefinition.INSTANCE.name().toUpperCase(), ZstdCodecDefinition.INSTANCE);
m.put(Lz4CodecDefinition.INSTANCE.name().toUpperCase(), Lz4CodecDefinition.INSTANCE);
m.put(SnappyCodecDefinition.INSTANCE.name().toUpperCase(), SnappyCodecDefinition.INSTANCE);
m.put(GzipCodecDefinition.INSTANCE.name().toUpperCase(), GzipCodecDefinition.INSTANCE);
DEFAULT = new CodecRegistry(Collections.unmodifiableMap(m));
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says the built-in codec registry contains DELTA/ZSTD/LZ4, but the default registry here also registers DELTADELTA, SNAPPY, and GZIP. Please either update the PR description to match the actual shipped built-ins, or drop these codecs from DEFAULT if they are not intended to be part of the initial contract.

Copilot uses AI. Check for mistakes.
Comment on lines +142 to +145
if (!Character.isLetter(first) && first != '_') {
throw new IllegalArgumentException(
"Expected codec name starting with letter at position " + _pos + " in: " + _input);
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CodecSpecParser.parseName(), the error message says the name must start with a letter, but the implementation also allows '_' as the first character. This makes diagnostics misleading for specs like "_FOO". Please update the exception message to match the accepted grammar (letter or underscore).

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +59
// Fixed seed for deterministic test data — this test verifies correctness, not random coverage
private static final Random RANDOM = new Random(42L);

Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses a single static shared Random instance. Because it is stateful (and not thread-safe), the generated expected values depend on test execution order and can become flaky under parallel TestNG execution. Use a per-test/per-invocation Random seeded deterministically (e.g., derive a seed from spec+type) or precompute deterministic inputs without shared mutable RNG state.

Suggested change
// Fixed seed for deterministic test data — this test verifies correctness, not random coverage
private static final Random RANDOM = new Random(42L);
// Base seed for deterministic per-invocation test data generation.
private static final long RANDOM_SEED = 42L;
private static Random createRandom(String spec, DataType dataType) {
long seed = RANDOM_SEED;
seed = 31 * seed + spec.hashCode();
seed = 31 * seed + dataType.hashCode();
return new Random(seed);
}

Copilot uses AI. Check for mistakes.
* rewritten back to the legacy format — enabling rollback to pre-V7 servers.
*/
@Test
public void testComputeOperationNoChangeCompressionForV7CodecPipelineColumn()
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test method name is misleading: the Javadoc and assertions verify that a rewrite is scheduled for a V7 codec-pipeline segment when the config is reverted to legacy compressionCodec, but the name says "NoChangeCompression". Please rename the test to reflect the rollback/rewrite behavior being asserted.

Suggested change
public void testComputeOperationNoChangeCompressionForV7CodecPipelineColumn()
public void testComputeOperationChangeCompressionForV7CodecPipelineRollbackToLegacyCompressionCodec()

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +87
_numChunks = dataBuffer.getInt(offset);
offset += Integer.BYTES;

_numDocsPerChunk = dataBuffer.getInt(offset);
offset += Integer.BYTES;
if (_numDocsPerChunk <= 0 || (_numDocsPerChunk & (_numDocsPerChunk - 1)) != 0) {
throw new IllegalArgumentException(
"Invalid numDocsPerChunk in forward index header: " + _numDocsPerChunk
+ ". Expected a positive power of two.");
}
_shift = Integer.numberOfTrailingZeros(_numDocsPerChunk);

Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FixedByteChunkSVForwardIndexReaderV7 reads numChunks from the header but never uses it to validate header consistency. If totalDocs, numDocsPerChunk, and numChunks disagree (corrupt or mismatched file), getChunkOffset() can read past the chunk-offset table. Consider validating numChunks == ceil(totalDocs/numDocsPerChunk) during construction and using numChunks to bounds-check chunkId before reading offsets.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New functionality index Related to indexing (general)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants