Add codec pipeline framework for raw forward index encoding #18229

xiangfu0 wants to merge 4 commits into apache:master from
Conversation
Pull request overview
Adds an initial “codec pipeline” framework for raw (no-dict) single-value forward index encoding, introducing a DSL (DELTA, ZSTD(N), CODEC(...)) and a new self-describing on-disk format (writer/reader v7) that persists the canonical codec spec in the file header. This is wired through ForwardIndexConfig.codecSpec and the forward-index creator/reader factories, with tests covering parsing, validation, and INT/LONG round-trips.
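The grammar behind these spec strings (a bare stage name, an optional integer argument, and a `CODEC(...)` wrapper around an ordered stage list) can be illustrated with a toy recursive-descent parser. This is a self-contained sketch only; the real implementation is `CodecSpecParser` in `pinot-segment-spi`, and its exact API is not shown in this PR excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Toy parser for a DSL shaped like the one described above:
//   spec  := "CODEC(" stage ("," stage)* ")" | stage
//   stage := NAME ( "(" INT ")" )?
// Canonicalization here just means: uppercase, no whitespace.
public class CodecSpecSketch {
  private final String _input;
  private int _pos;

  private CodecSpecSketch(String input) {
    _input = input.replace(" ", "");
    _pos = 0;
  }

  /** Parses a spec and returns its canonical rendering. */
  public static String canonicalize(String spec) {
    CodecSpecSketch p = new CodecSpecSketch(spec.toUpperCase());
    String result = p.parseSpec();
    if (p._pos != p._input.length()) {
      throw new IllegalArgumentException("Trailing input at position " + p._pos);
    }
    return result;
  }

  private String parseSpec() {
    if (_input.startsWith("CODEC(", _pos)) {
      _pos += "CODEC(".length();
      List<String> stages = new ArrayList<>();
      stages.add(parseStage());
      while (peek() == ',') {
        _pos++;
        stages.add(parseStage());
      }
      expect(')');
      return "CODEC(" + String.join(",", stages) + ")";
    }
    return parseStage();
  }

  private String parseStage() {
    int start = _pos;
    while (_pos < _input.length()
        && (Character.isLetterOrDigit(_input.charAt(_pos)) || _input.charAt(_pos) == '_')) {
      _pos++;
    }
    if (start == _pos) {
      throw new IllegalArgumentException("Expected codec name at position " + start);
    }
    String name = _input.substring(start, _pos);
    if (peek() == '(') {
      _pos++;
      int argStart = _pos;
      while (_pos < _input.length() && Character.isDigit(_input.charAt(_pos))) {
        _pos++;
      }
      String arg = _input.substring(argStart, _pos);
      expect(')');
      return name + "(" + arg + ")";
    }
    return name;
  }

  private char peek() {
    return _pos < _input.length() ? _input.charAt(_pos) : '\0';
  }

  private void expect(char c) {
    if (peek() != c) {
      throw new IllegalArgumentException("Expected '" + c + "' at position " + _pos);
    }
    _pos++;
  }

  public static void main(String[] args) {
    System.out.println(canonicalize("codec(delta, zstd(3))")); // CODEC(DELTA,ZSTD(3))
  }
}
```

The canonical form is what the v7 writer would persist in the header, so a reader can reconstruct the pipeline without any out-of-band configuration.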
Changes:
- Introduce codec DSL AST + parser in `pinot-segment-spi`, plus `ForwardIndexConfig.codecSpec` (mutually exclusive with `compressionCodec`) and writer-version forcing.
- Add codec registry/validator/executor in `pinot-segment-local` and implement v7 fixed-byte chunk writer/reader that stores the canonical spec in the header.
- Wire new creator/reader paths via `ForwardIndexCreatorFactory` and `ForwardIndexReaderFactory`, and add comprehensive unit tests.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/ForwardIndexConfigTest.java | Adds tests for codecSpec JSON round-trip, equality, and mutual-exclusion/ordering guards. |
| pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/codec/CodecSpecParserTest.java | New unit tests for codec DSL parsing/canonicalization and invalid specs. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/ForwardIndexConfig.java | Adds codecSpec, forces raw writer version 7 when set, updates equals/hashCode and Builder guards. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecSpecParser.java | Implements structural (phase-1) recursive-descent parser for the DSL. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecPipeline.java | New AST node representing an ordered pipeline with canonical spec rendering. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecOptions.java | Marker interface for typed, validated per-codec options. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecKind.java | Enum classifying codecs as TRANSFORM vs COMPRESSION for pipeline rules. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecInvocation.java | New AST node representing a single codec invocation + args. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecDefinition.java | SPI interface describing codec parsing/validation/canonicalization. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/codec/CodecContext.java | Context object for per-column type validation during pipeline validation. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/forward/CodecPipelineForwardIndexTest.java | Integration tests for v7 writer/reader round-trip, header spec storage, factory dispatch, partial chunk. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/io/codec/CodecPipelineValidatorTest.java | Tests pipeline validation rules (ordering, type checks, unknown codecs, arg ranges). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/forward/FixedByteChunkSVForwardIndexReaderV7.java | New v7 reader that reads canonical spec from header and decodes chunks via executor. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexReaderFactory.java | Dispatches fixed-width SV version-7 raw indexes to the new v7 reader. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexCreatorFactory.java | Creates codec-pipeline raw forward index creators when codecSpec is present and supported. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/SingleValueFixedByteCodecPipelineIndexCreator.java | New creator wiring v7 writer + executor for INT/LONG SV raw forward indexes. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkForwardIndexWriterV7.java | New v7 writer that embeds canonical spec in header and writes variable-size encoded chunks with long offsets. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/ZstdCodecDefinition.java | Adds ZSTD codec definition with typed options and canonicalization. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/DeltaCodecDefinition.java | Adds DELTA transform codec definition with INT/LONG validation and canonicalization. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/CodecRegistry.java | Introduces immutable default registry + mutable registry for tests/future plugins, with reserved keyword protection. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/CodecPipelineValidator.java | Validates structural rules and per-codec context compatibility for pipelines. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/codec/CodecPipelineExecutor.java | Executes the validated pipeline per chunk (DELTA + ZSTD v1 wiring) and produces canonical spec. |
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
## master #18229 +/- ##
============================================
+ Coverage 63.48% 63.54% +0.05%
- Complexity 1627 1659 +32
============================================
Files 3244 3263 +19
Lines 197365 198427 +1062
Branches 30540 30704 +164
============================================
+ Hits 125306 126098 +792
- Misses 62019 62207 +188
- Partials 10040 10122 +82
```
xiangfu0 left a comment:

Found one high-confidence correctness issue; see inline comment.
```java
codec.validateContext(options, ctx);

if (codec.kind() == CodecKind.COMPRESSION) {
  compressionCount++;
```
This only counts compression stages. The executor collapses transform stages into booleans, so CODEC(DELTA,DELTA) would pass validation, get encoded as a single delta layer, and then be decoded once, silently corrupting values. Please reject duplicate transform stages here or preserve the full ordered stage list in the executor.
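The corruption mode described in this comment can be reproduced with a standalone delta round-trip. The helpers below are illustrative only, not Pinot code: delta is applied twice on the write path (as `CODEC(DELTA,DELTA)` would be) but undone only once on the read path, as a boolean "hasDelta" flag would do.

```java
// Demonstrates why duplicate transform stages must be rejected when the
// executor collapses transforms into booleans: DELTA applied twice but
// decoded once does not round-trip.
public class DoubleDeltaDemo {
  /** In-place style delta encoding: out[i] = in[i] - in[i-1], out[0] = in[0]. */
  static long[] deltaEncode(long[] values) {
    long[] out = values.clone();
    for (int i = out.length - 1; i > 0; i--) {
      out[i] -= out[i - 1];
    }
    return out;
  }

  /** Inverse of deltaEncode: prefix-sum the deltas. */
  static long[] deltaDecode(long[] deltas) {
    long[] out = deltas.clone();
    for (int i = 1; i < out.length; i++) {
      out[i] += out[i - 1];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] values = {10, 12, 15, 19};
    // Write path runs both stages of CODEC(DELTA,DELTA)...
    long[] encoded = deltaEncode(deltaEncode(values));
    // ...but a single boolean flag undoes delta only once on read.
    long[] decoded = deltaDecode(encoded);
    System.out.println(java.util.Arrays.toString(decoded)); // not equal to values
  }
}
```

Decoding twice restores the original values, which is exactly why preserving the full ordered stage list (rather than a flag) would also be a valid fix.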
```java
/**
 * Integration test for the codec pipeline forward index (version 7).
 * Writes an offline table with INT and LONG raw columns encoded via the CODEC(DELTA,ZSTD(3)) pipeline,
```
Please also add a test for STRING columns with raw dictionary + compression under the codec pipeline.
```java
 * <li>{@link #maxCompressedSize}: returns an upper bound on encoded size.</li>
 * </ul>
 */
public final class CodecPipelineExecutor {
```
This executor should be a generic executor initialized with a list of codec implementations; it should be codec-agnostic.
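One possible codec-agnostic shape, sketched under the assumption that each stage exposes a symmetric encode/decode pair; the `StageHandler` interface below is hypothetical and only stands in for the `ChunkCodecHandler` SPI introduced later in this PR.

```java
import java.util.List;

// Sketch: the executor only iterates a list of handlers and never names a
// concrete codec, so new codecs require no executor changes.
public class GenericPipelineExecutor {
  /** Minimal per-stage contract: symmetric encode/decode over byte arrays. */
  public interface StageHandler {
    byte[] encode(byte[] input);
    byte[] decode(byte[] input);
  }

  private final List<StageHandler> _stages;

  public GenericPipelineExecutor(List<StageHandler> stages) {
    _stages = List.copyOf(stages);
  }

  /** Applies stages in declaration order on the write path. */
  public byte[] encode(byte[] input) {
    byte[] data = input;
    for (StageHandler stage : _stages) {
      data = stage.encode(data);
    }
    return data;
  }

  /** Applies stages in reverse order on the read path. */
  public byte[] decode(byte[] input) {
    byte[] data = input;
    for (int i = _stages.size() - 1; i >= 0; i--) {
      data = _stages.get(i).decode(data);
    }
    return data;
  }
}
```

Because decode walks the stage list in reverse, this shape also sidesteps the duplicate-transform collapse noted in the earlier review comment: every stage is undone exactly once.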
Introduces a self-describing codec DSL (DELTA, DELTADELTA, ZSTD(N), LZ4, SNAPPY, GZIP, CODEC(...)) and a new on-disk format version 7 that stores the canonical codec spec in the file header. The pipeline is codec-agnostic: each codec implements the ChunkCodecHandler SPI (encode/decode/decodeInto/maxEncodedSize) and is dispatched polymorphically by CodecPipelineExecutor.

Key components:
- Codec DSL AST + parser (CodecSpecParser) in pinot-segment-spi
- ChunkCodecHandler SPI and CodecRegistry/Validator/Executor in pinot-segment-local
- V7 fixed-byte chunk writer (FixedByteChunkForwardIndexWriterV7) and reader (FixedByteChunkSVForwardIndexReaderV7) for INT/LONG SV columns
- ForwardIndexCreatorFactory + ForwardIndexReaderFactory dispatch for V7
- ForwardIndexConfig.codecSpec field (mutually exclusive with compressionCodec); forces writer version 7
- ForwardIndexHandler V7-to-legacy rollback detection on segment reload
- CompressionCodecMigrator: type-agnostic and schema-aware helpers to translate legacy compressionCodec to codecSpec
- TableConfigUtils.validateCodecSpecIfPresent for table-level validation
- FieldConfig.withCodecSpec builder method
- CodecPipelineIntegrationTest: end-to-end segment build + query test covering INT/LONG codec-pipeline columns and STRING dictionary column

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…remature deprecations

- SnappyCodecDefinition.decodeInto: add dst.clear() to satisfy the ChunkCodecHandler contract (other impls already did this; missing here could corrupt reads on buffer reuse)
- CodecKind Javadoc: correct "any number of TRANSFORM stages" to "at most one" to match the validator's enforced constraint
- FieldConfig.CompressionCodec: remove @deprecated from SNAPPY/LZ4/ZSTANDARD/GZIP since codecSpec only supports SV INT/LONG; these codecs remain the only valid option for STRING, BYTES, and multi-value columns. DELTA/DELTADELTA keep their deprecation (SV INT/LONG only, codecSpec is the correct replacement)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…on dispatch, add totalDocs bounds check

- CodecPipelineExecutor.buildCanonical: reuse already-parsed BoundStage options instead of calling parseOptions a second time per codec during executor construction
- ForwardIndexReaderFactory: replace the confusing >=/>= guard chain with explicit equality checks for V7/V4 and an explicit fallthrough to the old reader for versions < V4, making the dispatch table self-documenting and easy to extend
- FixedByteChunkSVForwardIndexReaderV7: store totalDocs from the header and validate docId bounds in getInt/getLong rather than silently discarding the header field

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ard totalDocs < 0

- Move canonicalize() into BoundStage as an instance method, eliminating the raw cast and @SuppressWarnings in buildCanonical; each BoundStage already holds a typed handler + options
- Move the docId bounds check from getInt/getLong into getChunkBuffer so future accessors are protected by default; also validate totalDocs >= 0 during header parsing
- Add a comment in ForwardIndexReaderFactory clarifying why versions 5–6 throw for fixed-byte SV

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
```java
int compressedSize = encoded.remaining();

// Per-chunk header: compressedSize (int) + uncompressedSize (int)
ByteBuffer chunkHeader = ByteBuffer.allocateDirect(CHUNK_HEADER_BYTES);
```
writeChunk() allocates a new direct ByteBuffer for the per-chunk header on every chunk. For large segments this can create substantial direct-memory allocation/cleanup overhead. Consider reusing a single (heap or direct) 8-byte buffer for the chunk header, or writing the two ints via a reusable ByteBuffer field.
Suggested change:

```diff
- ByteBuffer chunkHeader = ByteBuffer.allocateDirect(CHUNK_HEADER_BYTES);
+ ByteBuffer chunkHeader = ByteBuffer.allocate(CHUNK_HEADER_BYTES);
```
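A minimal sketch of the reuse pattern suggested here, assuming the writer owns a `FileChannel`; the class, field, and method names below are illustrative, not the actual `FixedByteChunkForwardIndexWriterV7`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// One reusable heap buffer for the 8-byte per-chunk header instead of a
// fresh allocateDirect per chunk, avoiding direct-memory churn.
public class ChunkHeaderWriter {
  private static final int CHUNK_HEADER_BYTES = 2 * Integer.BYTES;

  // Allocated once; clear()/flip() reset it for every chunk.
  private final ByteBuffer _chunkHeader = ByteBuffer.allocate(CHUNK_HEADER_BYTES);

  void writeChunkHeader(FileChannel channel, int compressedSize, int uncompressedSize)
      throws IOException {
    _chunkHeader.clear();
    _chunkHeader.putInt(compressedSize);
    _chunkHeader.putInt(uncompressedSize);
    _chunkHeader.flip();
    while (_chunkHeader.hasRemaining()) {
      channel.write(_chunkHeader);
    }
  }
}
```

Since the writer is single-threaded per segment, sharing one mutable buffer across chunks is safe here; the heap buffer also avoids the deallocation cost that direct buffers defer to GC.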
```java
Map<String, ChunkCodecHandler<?>> m = new LinkedHashMap<>();
m.put(DeltaCodecDefinition.INSTANCE.name().toUpperCase(), DeltaCodecDefinition.INSTANCE);
m.put(DeltaDeltaCodecDefinition.INSTANCE.name().toUpperCase(), DeltaDeltaCodecDefinition.INSTANCE);
m.put(ZstdCodecDefinition.INSTANCE.name().toUpperCase(), ZstdCodecDefinition.INSTANCE);
m.put(Lz4CodecDefinition.INSTANCE.name().toUpperCase(), Lz4CodecDefinition.INSTANCE);
m.put(SnappyCodecDefinition.INSTANCE.name().toUpperCase(), SnappyCodecDefinition.INSTANCE);
m.put(GzipCodecDefinition.INSTANCE.name().toUpperCase(), GzipCodecDefinition.INSTANCE);
DEFAULT = new CodecRegistry(Collections.unmodifiableMap(m));
}
```
The PR description says the built-in codec registry contains DELTA/ZSTD/LZ4, but the default registry here also registers DELTADELTA, SNAPPY, and GZIP. Please either update the PR description to match the actual shipped built-ins, or drop these codecs from DEFAULT if they are not intended to be part of the initial contract.
```java
if (!Character.isLetter(first) && first != '_') {
  throw new IllegalArgumentException(
      "Expected codec name starting with letter at position " + _pos + " in: " + _input);
}
```
In CodecSpecParser.parseName(), the error message says the name must start with a letter, but the implementation also allows '_' as the first character. This makes diagnostics misleading for specs like "_FOO". Please update the exception message to match the accepted grammar (letter or underscore).
```java
// Fixed seed for deterministic test data — this test verifies correctness, not random coverage
private static final Random RANDOM = new Random(42L);
```
The test uses a single static shared Random instance. Because it is stateful (and not thread-safe), the generated expected values depend on test execution order and can become flaky under parallel TestNG execution. Use a per-test/per-invocation Random seeded deterministically (e.g., derive a seed from spec+type) or precompute deterministic inputs without shared mutable RNG state.
Suggested change:

```diff
- // Fixed seed for deterministic test data — this test verifies correctness, not random coverage
- private static final Random RANDOM = new Random(42L);
+ // Base seed for deterministic per-invocation test data generation.
+ private static final long RANDOM_SEED = 42L;
+
+ private static Random createRandom(String spec, DataType dataType) {
+   long seed = RANDOM_SEED;
+   seed = 31 * seed + spec.hashCode();
+   seed = 31 * seed + dataType.hashCode();
+   return new Random(seed);
+ }
```
```java
 * rewritten back to the legacy format — enabling rollback to pre-V7 servers.
 */
@Test
public void testComputeOperationNoChangeCompressionForV7CodecPipelineColumn()
```
This test method name is misleading: the Javadoc and assertions verify that a rewrite is scheduled for a V7 codec-pipeline segment when the config is reverted to legacy compressionCodec, but the name says "NoChangeCompression". Please rename the test to reflect the rollback/rewrite behavior being asserted.
Suggested change:

```diff
- public void testComputeOperationNoChangeCompressionForV7CodecPipelineColumn()
+ public void testComputeOperationChangeCompressionForV7CodecPipelineRollbackToLegacyCompressionCodec()
```
```java
_numChunks = dataBuffer.getInt(offset);
offset += Integer.BYTES;

_numDocsPerChunk = dataBuffer.getInt(offset);
offset += Integer.BYTES;
if (_numDocsPerChunk <= 0 || (_numDocsPerChunk & (_numDocsPerChunk - 1)) != 0) {
  throw new IllegalArgumentException(
      "Invalid numDocsPerChunk in forward index header: " + _numDocsPerChunk
          + ". Expected a positive power of two.");
}
_shift = Integer.numberOfTrailingZeros(_numDocsPerChunk);
```
FixedByteChunkSVForwardIndexReaderV7 reads numChunks from the header but never uses it to validate header consistency. If totalDocs, numDocsPerChunk, and numChunks disagree (corrupt or mismatched file), getChunkOffset() can read past the chunk-offset table. Consider validating numChunks == ceil(totalDocs/numDocsPerChunk) during construction and using numChunks to bounds-check chunkId before reading offsets.
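The suggested consistency check might look like the sketch below; the class and method names are hypothetical and only mirror the header fields quoted above, not the actual `FixedByteChunkSVForwardIndexReaderV7` internals.

```java
// Derives the expected chunk count from totalDocs and numDocsPerChunk and
// rejects mismatched headers before any chunk offset is read.
public class HeaderConsistency {
  static void validate(int totalDocs, int numDocsPerChunk, int numChunks) {
    if (totalDocs < 0 || numDocsPerChunk <= 0
        || (numDocsPerChunk & (numDocsPerChunk - 1)) != 0) {
      throw new IllegalArgumentException("Invalid header fields");
    }
    // ceil(totalDocs / numDocsPerChunk) without floating point
    int expectedChunks = (int) (((long) totalDocs + numDocsPerChunk - 1) / numDocsPerChunk);
    if (numChunks != expectedChunks) {
      throw new IllegalArgumentException(
          "numChunks=" + numChunks + " does not match ceil(totalDocs/numDocsPerChunk)="
              + expectedChunks);
    }
  }

  /** Bounds-checked chunk lookup; shift = log2(numDocsPerChunk). */
  static int chunkIdFor(int docId, int shift, int numChunks) {
    int chunkId = docId >>> shift;
    if (chunkId >= numChunks) {
      throw new IllegalArgumentException("docId " + docId + " out of bounds");
    }
    return chunkId;
  }

  public static void main(String[] args) {
    validate(1000, 256, 4);
    System.out.println(chunkIdFor(999, 8, 4)); // 3
  }
}
```

With this in place, a corrupt or truncated chunk-offset table fails fast at reader construction or lookup instead of producing an out-of-bounds read.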
Summary

Introduces a new codec-pipeline forward index format (version 7) for SV fixed-byte raw columns (INT, LONG), replacing the legacy `compressionCodec` field with an extensible `codecSpec` DSL. Both paths remain supported in 1.5.0 for mixed-version cluster safety.

What's in this PR

Codec pipeline SPI (`pinot-segment-spi`):
- `CodecDefinition`, `CodecContext`, `CodecOptions`, `CodecKind`, `CodecInvocation`, `CodecPipeline`, `CodecSpecParser`
- Supported specs: `"DELTA"`, `"ZSTD(3)"`, `"LZ4"`, `"CODEC(DELTA,ZSTD(3))"`, `"CODEC(DELTA,LZ4)"`

Codec pipeline runtime (`pinot-segment-local`):
- `CodecRegistry`: immutable singleton with built-in DELTA, ZSTD, LZ4 codecs
- `CodecPipelineExecutor`: bakes pipeline flags at construction; thread-safe compress/decompress
- `DeltaCodecDefinition`, `ZstdCodecDefinition`, `Lz4CodecDefinition`
- `FixedByteChunkForwardIndexWriterV7`: writes version-7 files with the canonical spec in the header
- `FixedByteChunkSVForwardIndexReaderV7`: reads version-7 files; validates the sizeOfEntry header field; wires `ForwardIndexReaderFactory` dispatch
- `SingleValueFixedByteCodecPipelineIndexCreator`: segment creator integration

FieldConfig / ForwardIndexConfig wiring:
- `codecSpec` field on `FieldConfig` (with `@JsonCreator` on the 10-arg constructor to prevent silent null during ZK/HTTP deserialization)
- `compressionCodec` and `codecSpec` are mutually exclusive (enforced by `Preconditions` and `TableConfigUtils.validateCodecSpecIfPresent`)
- `getCompressionCodec()` / `withCompressionCodec()` deprecated on both `FieldConfig` and `ForwardIndexConfig`

Migration tooling:
- `CompressionCodecMigrator`: maps `LZ4` to `"LZ4"`, `ZSTANDARD` to `"ZSTD(3)"`, `DELTA` to `"CODEC(DELTA,LZ4)"`; annotated `@Deprecated` with a 2.0 removal TODO

Tests:
- `CodecPipelineForwardIndexTest`: INT/LONG round-trips for DELTA, ZSTD, LZ4, CODEC(DELTA,ZSTD), CODEC(DELTA,LZ4); canonical spec header; partial last chunk; `ForwardIndexReaderFactory` dispatch
- `CompressionCodecMigrationRoundtripTest`: legacy codecs (PASS_THROUGH, SNAPPY, LZ4, ZSTANDARD, GZIP) written with the old path, read back with the V4 reader; migratable codecs verified to produce identical values via the new V7 path
- `CompressionCodecMigratorTest`: 14 unit tests covering toCodecSpec, isMigratable, migrate(FieldConfig), migrateTableConfig
- `FieldConfigTest`: guards against `@JsonCreator` regression; mutual-exclusion enforcement
- `ForwardIndexTypeTest`, `TableConfigUtilsTest`, backward-compat integration tests

Compatibility

Test plan
- `./mvnw -pl pinot-segment-local,pinot-segment-spi,pinot-spi -Dtest="CodecPipelineForwardIndexTest,CompressionCodecMigrationRoundtripTest,CompressionCodecMigratorTest,FieldConfigTest" test`
- `pinot-segment-local` `ForwardIndexHandlerTest`

🤖 Generated with Claude Code
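The legacy-to-`codecSpec` mappings listed under Migration tooling can be pictured as a simple lookup table. This is a hypothetical standalone sketch, not the actual `CompressionCodecMigrator` API; only the three mappings themselves come from the PR description.

```java
import java.util.Map;

// Lookup table mirroring the migration mappings described above.
public class MigrationTable {
  private static final Map<String, String> LEGACY_TO_SPEC = Map.of(
      "LZ4", "LZ4",
      "ZSTANDARD", "ZSTD(3)",
      "DELTA", "CODEC(DELTA,LZ4)");

  /** Returns the codecSpec for a legacy codec, or null if not migratable. */
  static String toCodecSpec(String legacyCodec) {
    return LEGACY_TO_SPEC.get(legacyCodec.toUpperCase());
  }

  public static void main(String[] args) {
    System.out.println(toCodecSpec("ZSTANDARD")); // ZSTD(3)
  }
}
```

Codecs absent from the table (e.g. SNAPPY, GZIP in this sketch) stay on the legacy path, which is consistent with `isMigratable` returning false for them in the round-trip test.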