feat(spark-4): add Apache Spark 4.x connector + batch-ingestion plugin (JDK 21+) #18261
xiangfu0 wants to merge 26 commits into apache:master
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #18261 +/- ##
============================================
- Coverage 63.38% 63.38% -0.01%
+ Complexity 1668 1633 -35
============================================
Files 3252 3252
Lines 198661 198661
Branches 30770 30770
============================================
- Hits 125925 125922 -3
- Misses 62666 62667 +1
- Partials 10070 10072 +2
Flags with carried forward coverage won't be shown.
xiangfu0 left a comment
Found a few high-signal correctness issues; see inline comments.
case _: StringStartsWith => true
case _: StringEndsWith => true
case _: StringContains => true
case _: Not => true
Marking compound filters as supported without recursively checking their children can silently drop predicates. For example, an Or/And/Not containing an unsupported child is removed from Spark residual filtering, then compileFilter returns None, so Pinot can return rows that do not satisfy the original Spark filter. Recursively accept compounds only when every child compiles; otherwise return the original filter as residual.
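The recursive acceptance the reviewer asks for can be sketched as follows. This is a minimal Java stand-in (the connector itself is Scala and uses `org.apache.spark.sql.sources.Filter`); all type and method names here are illustrative, not the connector's actual API:

```java
// Simplified stand-ins for Spark's sources.Filter hierarchy (illustrative only).
interface SFilter {}
record EqualTo(String attr, Object value) implements SFilter {}
record StringContains(String attr, String value) implements SFilter {}
record And(SFilter left, SFilter right) implements SFilter {}
record Or(SFilter left, SFilter right) implements SFilter {}
record Not(SFilter child) implements SFilter {}
record UnsupportedLeaf() implements SFilter {} // models any leaf the connector cannot compile

class FilterGate {
    // Accept a compound only when every child would itself compile; otherwise
    // the whole filter stays in Spark's residual list and is applied post-scan,
    // so no predicate is ever silently dropped.
    static boolean isFilterSupported(SFilter f) {
        if (f instanceof And a) return isFilterSupported(a.left()) && isFilterSupported(a.right());
        if (f instanceof Or o)  return isFilterSupported(o.left()) && isFilterSupported(o.right());
        if (f instanceof Not n) return isFilterSupported(n.child());
        return f instanceof EqualTo || f instanceof StringContains;
    }
}
```

An `Or` with one unsupported child is rejected as a whole, which is exactly what keeps it in Spark's residual evaluation.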
case GreaterThanOrEqual(attr, value) => s"${escapeAttr(attr)} >= ${compileValue(value)}"
case IsNull(attr) => s"${escapeAttr(attr)} IS NULL"
case IsNotNull(attr) => s"${escapeAttr(attr)} IS NOT NULL"
case StringStartsWith(attr, value) => s"${escapeAttr(attr)} LIKE '$value%'"
The LIKE pushdowns interpolate the raw Spark value while equality predicates escape string literals. Values containing quotes, percent, or underscore change the Pinot SQL predicate semantics, which can produce silent wrong results. Escape SQL literals and LIKE wildcards for StringStartsWith/StringEndsWith/StringContains before pushing these filters.
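A sketch of the kind of escaping the reviewer asks for, in Java for brevity (the connector is Scala; the helper and method names below are illustrative assumptions, not the connector's actual API). Backslashes must be doubled first, the LIKE wildcards `%` and `_` are then escaped with a backslash, single quotes are doubled for the SQL string literal, and the emitted predicate carries an explicit `ESCAPE '\'` clause so the wildcard escapes take effect:

```java
class LikeEscape {
    // Hypothetical escaping helper: make a raw Spark filter value safe to
    // embed in a Pinot LIKE '...' literal.
    static String escapeLikeLiteral(String value) {
        return value.replace("\\", "\\\\") // escape the escape character first
                    .replace("%", "\\%")   // literal percent
                    .replace("_", "\\_")   // literal underscore
                    .replace("'", "''");   // double quotes for the SQL string literal
    }

    // e.g. the StringStartsWith pushdown
    static String startsWithPredicate(String attr, String value) {
        return attr + " LIKE '" + escapeLikeLiteral(value) + "%' ESCAPE '\\'";
    }
}
```

With this, a value like `100%` matches only the literal string rather than any string starting with `100`.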
override def build(): Write = {
  // TODO: utilize predicates
  new PinotWrite(logicalWriteInfo)
This advertises SupportsOverwriteV2 but drops the overwrite predicates and builds an unconditional writer. Spark overwrite-by-filter callers will believe Pinot replaced the matching data, while this path only writes new segment files, which can leave old matching segments in place and produce duplicate or stale query results. Either implement predicate-aware replacement semantics or reject overwrite predicates explicitly.
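The "reject explicitly" option can be sketched like this (illustrative Java stand-in; the real builder implements Spark's `SupportsOverwriteV2`, and the class shape here is an assumption for demonstration):

```java
// Sketch of rejecting overwrite predicates instead of silently appending.
class PinotWriteBuilderSketch {
    private final String table;
    PinotWriteBuilderSketch(String table) { this.table = table; }

    // Stands in for building the append-only Write instance.
    Object build() { return "append-write:" + table; }

    // Failing fast is the honest alternative: callers of
    // df.writeTo(...).overwrite(...) learn immediately that matching
    // segments will NOT be replaced, instead of getting duplicate data.
    PinotWriteBuilderSketch overwrite(int predicateCount) {
        throw new UnsupportedOperationException(
            "overwrite with " + predicateCount + " predicate(s) is not supported; "
            + "use append plus a REFRESH segment push, or drop-and-replace via the controller REST API");
    }
}
```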
Add a new top-level `pinot-spark-4/` umbrella module containing:
- `pinot-batch-ingestion-spark-4`: Spark 4.x batch ingestion runners (Java), ported from `pinot-batch-ingestion-spark-3` with the package rename `spark3 → spark4` and the Spark 4 JVM flags required on JDK 17+.
- `pinot-spark-4-connector`: Spark 4.x DataSourceV2 read/write connector (Scala 2.13), ported from `pinot-spark-3-connector`. Migrates `PinotWriteBuilder` from the Spark 3 `SupportsOverwrite`/`Filter` API to the Spark 4 `SupportsOverwriteV2`/`Predicate` API; all other source files are verbatim ports with the package rename `v3 → v4`.
The umbrella module is activated only when the active JDK is 21+ via a root-pom profile with `<jdk>[21,)</jdk>` activation. Spark 3 modules are untouched and continue to build on JDK 11/17/21.
Coverage:
- 3 unit tests for the batch-ingestion runners (port of spark-3 tests)
- 25 scalatest cases for the connector (port of spark-3 + a new `PinotDataSourceRegistrationTest` that verifies the DataSourceRegister SPI wiring; the same test is added to the Spark 3 connector for parity)
- A `SparkSegmentMetadataPushIntegrationTest4` end-to-end integration test in `pinot-integration-tests/src/test/java-spark4/`, wired up under a `pinot-spark-4-integration-tests` profile that activates on JDK 21 and overrides `spark-core`/`spark-sql`/`spark-launcher` to Spark 4.x at test scope (provided-scope in the plugin pom does not transit to test scope, and the root pom pins those coordinates to Spark 3 by default).
Packaging:
- Root pom gains `spark4.version=4.0.0` and dependencyManagement entries for the two new jars.
- `pinot-distribution/pinot-assembly.xml` gains a `<fileSet>` entry that conditionally includes the Spark 4 shaded jar when its target directory exists (present on JDK 21 builds, silently absent on JDK 11 builds), so a single descriptor handles both paths without forked variants.
- `pinot-spark-4` is added to the `dependency-verifier` `skipModules` list, matching the pattern used for the other plugin trees.
Build & test verification on JDK 21:
- `./mvnw -pl pinot-spark-4/pinot-batch-ingestion-spark-4,pinot-spark-4/pinot-spark-4-connector -am test` — all tests pass
- `./mvnw -pl pinot-integration-tests test-compile` — the JDK-21 profile picks up the new Spark 4 integration test and compiles cleanly
- `./mvnw -pl ... spotless:apply checkstyle:check license:check` — clean
README files at `pinot-spark-4/pinot-batch-ingestion-spark-4/README.md` and `pinot-spark-4/pinot-spark-4-connector/README.md` document prerequisites, build commands, data-source options, quickstart snippets, and the Spark 3 vs Spark 4 compatibility matrix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate the two Spark 4 modules from the top-level `pinot-spark-4/` umbrella into the same directories as their Spark 3 counterparts:
- `pinot-spark-4/pinot-spark-4-connector/` → `pinot-connectors/pinot-spark-4-connector/`
- `pinot-spark-4/pinot-batch-ingestion-spark-4/` → `pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-4/`
This matches the layout of `pinot-spark-3-connector` / `pinot-batch-ingestion-spark-3` and lets the Spark 4 modules inherit common dependencies (pinot-core as provided, testng as test) through the normal `pinot-plugins` / `pinot-batch-ingestion` parent chain, which the umbrella pom had to declare by hand.
JDK 21 gating moves accordingly:
- `pinot-connectors/pom.xml` and `pinot-plugins/pinot-batch-ingestion/pom.xml` each gain a profile with `<jdk>[21,)</jdk>` activation that registers the corresponding Spark 4 module. On JDK 11/17 the modules are absent from the reactor entirely; on JDK 21 they build alongside Spark 3.
- The root-pom `pinot-spark-4` profile is removed — no longer needed, since gating is now handled by each parent.
- `dependency-verifier` `skipModules` drops the `pinot-spark-4` entry; `pinot-connectors` and `pinot-plugins` already cover the new locations.
- Each Spark 4 module pom carries `<jdk.version>21</jdk.version>` and a maven-enforcer `requireJavaVersion [21,)` + `requireProperty scala.compat.version=2.13` safety net, matching what the old umbrella pom provided when the module was invoked directly.
- `pinot-distribution/pinot-assembly.xml` fileSet directory updated to the new jar location.
- READMEs updated to point to the new locations and parent poms.
All unit + scalatest cases still pass on JDK 21 (3 junit + 25 scalatest). The JDK-21 integration-tests profile (and its `SparkSegmentMetadataPushIntegrationTest4`) is unchanged and continues to pull in `pinot-batch-ingestion-spark-4` from its new location via dependencyManagement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 21 CI
The Spark 3 `SparkSegmentMetadataPushIntegrationTest` was failing in the "Pinot Integration Test Set 2 (temurin-21)" CI job with
    Servlet class org.glassfish.jersey.servlet.ServletContainer is not a jakarta.servlet.Servlet
because the `pinot-spark-4-integration-tests` profile this PR added to `pinot-integration-tests/pom.xml` poisoned the whole module's test classpath on JDK 21: it forced `spark-core`/`spark-sql`/`spark-launcher` to Spark 4.0.0 via `<dependencyManagement>` and pulled in `jakarta.servlet-api:5.0.0`. With those on the classpath, Spark 3's Jersey 2 / javax-servlet stack could no longer register `ServletContainer` at driver startup.
Two tests with conflicting Spark runtime needs cannot share one Maven module's test classpath. Rather than drop more exotic Maven plumbing (per-test classpaths, separate failsafe executions, etc.) into this PR, remove the Spark 4 integration test for now and leave spark-4 coverage to the existing in-process unit tests:
- 3 @Test methods in `SparkSegmentGenerationJobRunnerTest` that exercise the full Spark 4 driver path with a real `SparkContext`
- 25 scalatest cases in `pinot-spark-4-connector` that exercise the DataSourceV2 read/write surface
An end-to-end cluster integration test for Spark 4 can land as a follow-up PR once we pick a home that isolates its classpath -- most likely a new `pinot-spark-4-integration-tests` module, or failsafe configured with a separate classpath realm.
Changes:
- Delete `pinot-integration-tests/src/test/java-spark4/`
- Revert the `pinot-spark-4-integration-tests` profile block in `pinot-integration-tests/pom.xml`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small fixes from the latest code review:
- pinot-integration-tests/pom.xml: revert the one-line whitespace hunk added during the earlier profile edit. File now matches master again.
- pinot-spark-4-connector: replace `scala.collection.JavaConverters._` (deprecated in Scala 2.13) with `scala.jdk.CollectionConverters._` in DataExtractor.scala and PinotWriteTest.scala. Matches the already-correct import in PinotDataSourceRegistrationTest.scala, and is safe because this module is Scala-2.13-only (enforced by the maven-enforcer rule on scala.compat.version).
No behavioral change; unit + scalatest suites continue to pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the latest code review, flag the intentional gap in the module README so it doesn't get forgotten: the Spark 4 counterpart to the Spark 3 `SparkSegmentMetadataPushIntegrationTest` is deferred to a follow-up because Spark 4's Jetty 12 / Jakarta Servlet 5 runtime cannot share `pinot-integration-tests`' classpath with the Spark 3 / Jersey 2 stack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a new tutorial at pinot-connectors/pinot-spark-4-connector/documentation/end-to-end-docker-tutorial.md that walks through the full Spark 4 ↔ Pinot flow validated in this PR: Docker-based Pinot quickstart → build the shaded connector jar → custom Spark 4 + JDK 21 image → pyspark read of the sample baseballStats table → create a target schema/table → pyspark write to segment tars at a savePath → push tars to the controller → verify roundtrip.
Includes a read/write options reference and a "Known gotchas" section capturing every issue I hit during validation (class-file-65 vs JDK 17, commons-lang3 collision, container-internal server address for gRPC, the DataFrameWriter.save() path requirement, the two-step write model, and the cosmetic jline HOME warning).
Also:
- Link the tutorial from the Quick Start section of the connector README so new users see it first.
- Link from the pinot-batch-ingestion-spark-4 README, since the push runners in that module are the production-grade alternative to the curl step in the tutorial.
Every command in the tutorial was run verbatim against apachepinot/pinot:latest and apache/spark:4.0.0 (+JDK 21) before this commit. No code changes; docs only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three correctness fixes flagged by review of
pinot-connectors/pinot-spark-{3,4}-connector:
1. FilterPushDown soundness for compound filters. `acceptFilters` used
to return true for any `Or`, `And`, or `Not` regardless of its
children. When a child was a type the connector does not recognize,
Spark dropped the compound from its residual filter list but
`compileFilter` returned None, so Pinot received no filter at all
and returned unfiltered rows. `isFilterSupported` is now recursive
and accepts compounds only when every child is itself supported.
2. LIKE pushdown escaping for StringStartsWith / StringEndsWith /
StringContains. The raw Spark value was interpolated into the Pinot
`LIKE '...'` literal without escaping. Values containing `'`, `%`,
`_`, or `\` changed the predicate semantics (and could break the
SQL entirely). `escapeLikeLiteral` now doubles backslashes, escapes
`%` and `_` with a backslash, and doubles single-quotes; each LIKE
pushdown gets an explicit `ESCAPE '\'` clause.
3. Spark 4 `SupportsOverwriteV2.overwrite(...)` silently dropped its
predicates and returned an append-only writer. Spark callers using
`df.writeTo(...).overwrite(...)` or `mode("overwrite")` would
believe Pinot replaced matching rows, but only new segments would be
written, leaving stale data queryable. `overwrite` now throws
`UnsupportedOperationException` with a message pointing at the
supported alternatives (drop-and-replace via controller REST, or the
SparkSegment*PushJobRunner with REFRESH / consistent-push enabled).
The `predicates` constructor parameter on `PinotWriteBuilder` is
gone since it was never consulted.
Fixes 1 and 2 are applied to both spark-3 and spark-4 connectors per
the consistency guideline (the code is a verbatim port). Fix 3 only
applies to spark-4 (SupportsOverwriteV2 is a Spark 4 API).
Also fix the CI RAT failure on Spark 4 Connector: add the Apache 2.0
license header to the newly added
pinot-connectors/pinot-spark-4-connector/documentation/end-to-end-docker-tutorial.md.
Test coverage:
- new scalatest cases on both connectors for (a) compound filter with
unsupported child being rejected, (b) compound filter with supported
children still being accepted, (c) LIKE value escaping; all using
`AlwaysTrue` as the unsupported leaf since `Filter` is sealed.
- new scalatest case on the spark-4 connector asserting
`PinotWriteBuilder.overwrite(...)` throws
UnsupportedOperationException with a helpful message.
Test counts now 29 / 29 on both connector modules (was 25 / 25).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review flagged that the spark-4 overwrite-predicate fix was not
applied symmetrically to spark-3, even though spark-3 has the exact same
silent-drop bug via the V1 SupportsOverwrite (Filter-based) API. A user
calling `df.write.mode("overwrite").format("pinot").save(...)` on
spark-3 would receive appended segments rather than overwritten data —
the same correctness regression the spark-4 fix prevents.
Mirror the fix:
- PinotWriteBuilder.overwrite(filters) now throws
UnsupportedOperationException with a message pointing at the supported
alternatives (drop table via controller REST, or pinot-batch-ingestion-
spark-3's SparkSegment*PushJobRunner with REFRESH / consistent-push).
- Drop the unused `filters` constructor parameter that was only there to
let `overwrite(...)` re-construct the builder. PinotWriteBuilder is
now constructed with just LogicalWriteInfo, matching spark-4.
- Update PinotTable.newWriteBuilder accordingly.
- Add a regression test mirroring spark-4's: assert overwrite throws
with a helpful message including the filter count.
Spark 3 connector test count is now 30 (was 29).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two MINOR items from the latest review:
1. PinotDataSourceRegistrationTest in pinot-spark-3-connector imported
`scala.jdk.CollectionConverters._`, which only exists in the Scala
2.13 stdlib. The rest of the spark-3 module uses
`scala.collection.JavaConverters._` for cross-2.12/2.13 compat (the
root pom still ships a `-Pscala-2.12` profile). Switch the import
so a `-Pscala-2.12` build of just the spark-3 connector continues
to compile. The spark-4 module is Scala-2.13-only and continues to
use the modern `scala.jdk.CollectionConverters` import.
2. Document the pre-existing backslash round-trip caveat in the
`escapeLikeLiteral` helper of both connectors. Pinot's
`RegexpPatternConverterUtils#likeToRegexpLike` does not fully
round-trip `\\` (translates to a regex matching two backslashes
rather than one), so Spark predicates like
`col.contains("a\\b")` still won't match rows containing literal
`a\b` until that conversion is fixed in pinot-common. The %, _, and
`'` cases this helper does handle are unaffected. Tracking this in
the docstring rather than fixing the runtime conversion in this PR
per scope discipline.
No behavioral change. Tests still 30 / 29 in spark-3 / spark-4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…in BatchWrite
Two MAJOR findings from the latest review:
1. Backslash in LIKE pushdown silently produced wrong results. Pinot's
   RegexpPatternConverterUtils#likeToRegexpLike does not round-trip `\\`
   correctly (it emits a regex matching two backslashes rather than one).
   The previous fix escaped backslashes at the SQL layer and added a
   docstring caveat, but the predicate was still pushed down and the user
   got silently wrong results.
   Fix: reject StringStartsWith / StringEndsWith / StringContains filters
   whose value contains a literal backslash from isFilterSupported, so
   Spark applies them post-scan instead. The `escapeLikeLiteral` helper
   keeps its backslash-escape branch as a defensive fallback in case it is
   reached via a future code path that does not gate on isFilterSupported.
   Symmetric fix in both Spark 3 and Spark 4 connectors. Regression test
   added on both sides asserting these filters land in the post-scan
   bucket. The previous "should escape" test no longer includes backslash
   inputs since they are now rejected upstream.
2. PinotWrite.commit/abort wrote commit messages to stdout via
   `messages.foreach(println)` (pre-existing, copied verbatim from Spark 3
   into Spark 4). Reviewer separately flagged that abort() does no cleanup
   of leftover segment tars from already-succeeded tasks, which on retry
   produces duplicate segments after push.
   Fix: route both commit() and abort() through SLF4J at info / warn
   levels so messages reach the standard log collection. Add an explicit
   TODO documenting the abort-cleanup gap and pointing at
   PinotWriteBuilder.overwrite(...) as the contract-level guard, tracking
   the runtime-level guard as a follow-up. Symmetric in both Spark 3 and
   Spark 4.
Both connectors now have 30 / 30 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
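The backslash gate from finding 1 is a one-line predicate on the filter's value. A minimal sketch (illustrative Java stand-in for the Scala gate; the method name is an assumption):

```java
class LikeBackslashGate {
    // Per the commit message, Pinot's LIKE-to-regex conversion does not
    // round-trip a literal backslash, so LIKE-style filters whose value
    // contains one are rejected from pushdown and left for Spark to
    // evaluate post-scan.
    static boolean isLikeValuePushable(String value) {
        return value != null && !value.contains("\\");
    }
}
```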
… dirs
Three findings from the latest code review:
1. [CRITICAL] PinotWriteBuilder.overwrite() rejection was bypassed by
SupportsTruncate.truncate(). In both Spark 3 and Spark 4,
SupportsOverwrite[V2] extends SupportsTruncate, and the V2Writes
analyzer dispatches df.write.mode("overwrite") → truncate() rather
than overwrite([AlwaysTrue]). The default truncate() returns `this`,
so build() then silently appended — exactly the silent-overwrite-
then-append bug the overwrite() rejection was meant to prevent.
Fix: override truncate() in both connectors to throw
UnsupportedOperationException with a message explaining that
df.write.mode("overwrite") / INSERT OVERWRITE is not supported and
pointing at df.write.mode("append") / drop-and-re-create /
REFRESH-push as the alternatives. Add a regression test on both
sides calling builder.truncate() directly.
2. [MAJOR] PinotDataWriter leaked the per-partition temp segment
build directory on every successful task. Files.createTempDirectory
in generateSegment() was never deleted by commit()/abort()/close().
On a long-running executor producing thousands of segments this
accumulates uncompressed segment trees under the JVM tmpdir.
Fix: track the temp dir on a module-private field, wrap commit()
in a try/finally that deletes it, and delete it from abort() and
close() as well (idempotent via null-out). Applied symmetrically
to Spark 3 and Spark 4.
3. [MAJOR] Document pinot-fastdev activation behavior. The new
Spark 4 JDK-21 profile in pinot-connectors/pom.xml and
pinot-plugins/pinot-batch-ingestion/pom.xml is activated by JDK
alone, so -Ppinot-fastdev on JDK 21 also builds the Spark 4
modules. This is intentional (both modules are small and their
shade is disabled under -Ppinot-fastdev via their own profile)
but wasn't documented. Add clarifying comments to both parent
pom activation sites, pointing at `-P!pinot-spark-4-connector`
and `-P!pinot-batch-ingestion-spark-4` as the opt-out toggles.
Minor polish also rolled into this commit:
- Spark 4 connector + batch pom comments said "JDK 17/11" but the
enforcer rejects [21,), so update to "any JDK below 21".
Test counts are now 31 / 31 in both spark-3 and spark-4 connectors
(was 30 / 30). The new truncate regression tests pin the fix across
both Spark 3.x SupportsOverwrite and Spark 4 SupportsOverwriteV2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
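The temp-dir lifecycle fix (item 2 above) can be sketched roughly as below — an illustrative Java stand-in for the Scala writer; member names are assumptions, and the real code builds actual segment files where the comment marks the gap:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

// Track the per-task segment build directory in a field; delete it from
// commit()/abort()/close(), idempotently via null-out.
class SegmentWriterSketch {
    private Path segmentOutputDir;

    Path generateSegment() {
        try {
            segmentOutputDir = Files.createTempDirectory("pinot-spark-writer");
            // ... segment files would be built under this directory ...
            return segmentOutputDir;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    private void cleanupTempDir() {
        if (segmentOutputDir == null) return; // already cleaned up: no-op
        try (var paths = Files.walk(segmentOutputDir)) {
            // delete children before parents
            for (Path p : paths.sorted(Comparator.reverseOrder()).toList()) {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            segmentOutputDir = null; // makes later abort()/close() calls no-ops
        }
    }

    void commit() { try { /* tar + upload would happen here */ } finally { cleanupTempDir(); } }
    void abort()  { cleanupTempDir(); }
    void close()  { cleanupTempDir(); }
}
```

The `finally` in `commit()` is what guarantees no leak even when the upload step fails.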
Two MAJOR review findings addressed:
1. PinotWrite.abort() implements best-effort leftover-tar cleanup.
When one task fails the job aborts but the connector previously
only logged the leftover tars at savePath; on retry users would
get duplicate segments after the controller push step. Now abort()
iterates the SuccessWriterCommitMessage entries, derives each
segment tar path under savePath, and deletes via the same Hadoop
FileSystem machinery PinotDataWriter uses for upload. Errors are
swallowed and logged at WARN — the driver already has a failure
to surface, and the user can recover by hand if cleanup misses.
Applied symmetrically in spark-3 and spark-4. The
SuccessWriterCommitMessage class gains a public `val segmentName`
accessor (was a private constructor parameter) so abort() can
reach the segment name without parsing toString.
2. PinotDataWriterTest gains three temp-dir lifecycle tests in both
connectors:
- commit() success path: snapshot diff of tmpdir entries shows no
new pinot-spark-writer leftovers after commit.
- abort() path: a generated segment dir captured before abort()
is gone after.
- close()-after-abort idempotency: explicit `noException should
be thrownBy writer.close()` after a prior abort.
The new tests use the package-private generateSegment() to capture
the temp dir directly, so they pin both the cleanup and the
lifecycle field reset behavior. A future refactor that drops the
finally{} block or forgets to null out segmentOutputDir will fail
these tests.
Test counts now 34 / 34 in both connectors (was 31 / 31).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spark's `BatchWrite.abort(WriterCommitMessage[])` contract allows null
entries — Spark inserts a null for any task that failed before
producing a commit message. The current pattern match in
PinotWrite.abort fell through `case other => logger.warn("unknown
commit message type: {}", other)`, producing one misleading WARN log
line per failed task during a partial-failure abort.
Add an explicit `case null =>` branch (no-op, since there's nothing to
clean up for a writer that never committed). Symmetric in spark-3 and
spark-4. No behavioral change beyond log-noise reduction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
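The null-tolerant message handling described above can be sketched as follows (illustrative Java stand-ins for Spark's `WriterCommitMessage` types; the helper name and record shape are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the commit-message hierarchy (illustrative only).
interface CommitMessage {}
record SuccessCommitMessage(String segmentName) implements CommitMessage {}

class AbortSketch {
    // Spark's BatchWrite.abort contract allows null entries for tasks that
    // failed before producing a commit message; matching null explicitly
    // avoids one spurious "unknown message" warning per failed task.
    static List<String> segmentsToCleanUp(CommitMessage[] messages) {
        List<String> names = new ArrayList<>();
        for (CommitMessage m : messages) {
            if (m == null) continue; // task never committed: nothing to clean up
            if (m instanceof SuccessCommitMessage s) names.add(s.segmentName());
            else System.err.println("unknown commit message type: " + m);
        }
        return names;
    }
}
```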
Reviewer flagged that the abort() runtime cleanup logic added in the previous round had no unit tests despite C6.3 (bug fixes require regression tests). Add 4 cases on each connector covering the branches of the new abort() method:
1. empty savePath — abort short-circuits without any FS call
2. malformed scheme (`not-a-real-scheme://...`) — FileSystem.get throws and abort swallows + logs without rethrowing
3. happy path — a mix of SuccessWriterCommitMessage entries, each corresponding to an existing tar file under savePath, and abort deletes them all
4. null + unknown subclass — abort tolerates Spark's documented null entries (writers that failed before producing a message) and an anonymous WriterCommitMessage subclass without throwing, while still cleaning up the SuccessWriterCommitMessage entries that appear alongside them
Test counts now 38 / 38 in both connectors (was 34 / 34).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two MINOR items from the latest code review:
1. The PinotWriteBuilder Scaladoc claimed "the default truncate()
returns this, so without an explicit override the silent-append
bug is reachable via mode("overwrite") alone." Inspecting the
pinned Spark versions (3.5.8 SupportsOverwrite and 4.0.0
SupportsOverwriteV2) shows the default truncate() actually
delegates to overwrite([AlwaysTrue]) — so the override above
would already throw transitively, even without our explicit
truncate() override. The explicit override is still good
defense (a) for a tailored error message on the
df.write.mode("overwrite") path and (b) against future Spark
default-implementation changes. Update the comment to reflect
this on both spark-3 and spark-4 builders.
2. The new abort() test files declared their imports inside the
class body under a section comment. Move them to the top of
the file alongside the existing imports, matching Pinot Scala
convention (CLAUDE.md "Prefer imports over fully qualified
class names").
No behavior change. Tests still 38 / 38 in both connectors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mantics
Three MAJORs from the latest review (the CRITICAL one — stale rebase — was already addressed by this commit's parent rebase onto current upstream/master).
1. escapeLikeLiteral(null) used to return null, which Scala's s"" interpolation then renders as the four-char string "null". The defensive comment said this was "kept defensively so this helper produces well-formed SQL even if called via a future code path that does not gate on isFilterSupported" — but the actual emitted SQL would be `LIKE 'null%' ESCAPE '\'`, silently matching the literal string "null". Replace the silent-degradation branch with `require(value != null, ...)` so a future caller that bypasses isFilterSupported fails loudly instead.
2. PinotDataWriter had no class-level Javadoc note about thread safety, despite adding a new mutable `segmentOutputDir` field. Per CLAUDE.md the new module's classes should describe behavior and thread-safety. Add a paragraph explaining the Spark DataWriter single-task ownership invariant and the commit/abort/close ordering contract.
3. close()-after-abort idempotence relied on PinotBufferedRecordReader.close() being idempotent, which it currently is, but the contract is not declared. Add a `closed` flag and consolidate abort()/close() into a private `closeOnce()` so future changes to the buffered reader don't silently break the writer's lifecycle. Existing regression test "close() after abort() is idempotent and does not throw" still passes.
Symmetric in both Spark 3 and Spark 4 connectors. Tests still pass at 38 / 38 in both modules.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
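The `closeOnce()` consolidation from item 3 amounts to a small guard. An illustrative Java sketch (the real writer is Scala and closes a PinotBufferedRecordReader; the `Runnable` stand-in and names here are assumptions):

```java
// abort() and close() both funnel through one guarded method, so the
// underlying reader's own close() idempotence stops being a hidden
// dependency of the writer lifecycle.
class LifecycleSketch {
    private final Runnable readerClose; // stands in for the buffered reader's close()
    private boolean closed = false;

    LifecycleSketch(Runnable readerClose) { this.readerClose = readerClose; }

    private void closeOnce() {
        if (!closed) {
            closed = true;       // flip first: re-entrant calls become no-ops
            readerClose.run();
        }
    }

    void abort() { closeOnce(); }
    void close() { closeOnce(); }
}
```

Calling `abort()` then `close()` releases the reader exactly once.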
Reviewer noted the LIKE pushdowns emit `ESCAPE '\'` which Pinot's RequestContextUtils.toFilterContext currently ignores, leaving RegexpPatternConverterUtils.likeToRegexpLike's hardcoded `\` as the de-facto escape character. The pushdown is correct *only because* the connector chose `\` to match Pinot's hardcoded value. Add an inline comment near the LIKE clause documenting this implicit contract, so a future refactor that switches the escape character on either side can't silently produce wrong-row matches. Symmetric in spark-3 and spark-4. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…spath
Reviewer flagged a MAJOR collision risk: pinot-spark-3-connector and
pinot-spark-4-connector both register the same `"pinot"` data-source
short name via META-INF/services. If both jars end up on the same
Spark application classpath (fat-jar bundle, --packages misuse, etc.),
Spark's DataSource.lookupDataSource resolves the format
non-deterministically — a Spark 4 user could silently get a writer
implementing the V1 SupportsOverwrite contract (or vice versa), with
diverging overwrite/truncate semantics that the symmetric overwrite/
truncate rejection in this PR cannot itself protect against.
Fix in pinot-spark-4-connector:
1. PinotDataSource (Spark 4) probes for the v3 class by name via
Class.forName(...) — no compile-time dep is added between the
modules. The probe runs once per JVM, cached in the companion
object. If the v3 class is found, the v4 connector throws
IllegalStateException with a tailored message instructing the
user to remove one of the two jars.
2. README documents the mutual exclusion explicitly under a new
"Mutual exclusion with pinot-spark-3-connector" section, including
the guidance for fat-jar packagers.
3. Three new regression tests:
- isSpark3ConnectorOnClasspath returns false in the default v4-only
test classpath (defends against an accidental v3 module
dependency creeping in)
- PinotDataSource() constructor does not throw under the v4-only
classpath
- The spark3ConflictMessage text contains the actionable strings
("pinot-spark-4-connector", "pinot-spark-3-connector", "Remove")
so a future docstring change can't quietly weaken the diagnostic.
The guard is intentionally one-sided (Spark 4 detects Spark 3, not the
other way around) because Spark 4 is the new arrival; existing Spark 3
users picking up the new v4 jar should hit the failure on their first
spark.read.format("pinot") call rather than after a successful write
that turned out to be wrong.
Test counts: 41 in spark-4 connector (was 38), 38 in spark-3 connector
(unchanged; the guard is v4-only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer flagged a MAJOR race in the new classpath-collision guard: the previous two-flag @volatile pattern flipped `spark3Probed = true` *before* computing `spark3Conflict`, leaving a window where a second thread could observe `spark3Probed=true` and `spark3Conflict=false` (default) and skip the throw — the exact failure mode the guard was meant to prevent.
Replace with a Scala `lazy val`. The compiler emits a synchronized initialization barrier so every caller observes the fully-computed result; there is no half-initialized window. This also simplifies the guard to a single read.
Tests still pass: spark-4 41 / 41, spark-3 38 / 38.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer flagged that Class.forName can throw NoClassDefFoundError / LinkageError (in addition to ClassNotFoundException) when the v3 jar is on the classpath but a transitive dependency is missing. The probe previously only caught ClassNotFoundException, so such an error would escape the PinotDataSource constructor with a message that doesn't explain the conflict scenario.
Conservatively treat LinkageError as "v3 not present" — the worst case is that the guard is bypassed and the user falls back to Spark's own multi-source error, which is still preferable to leaking a bytecode-resolution failure with no context. No behavior change in the dominant case (v3 absent or v3 fully loadable).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
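The once-per-JVM probe plus the widened catch can be sketched together. This is an illustrative Java stand-in: Java's class-initialization locking plays the role of the Scala `lazy val` (the static field is fully computed before any reader can observe it), and the probed class name below is a placeholder, not the real v3 class:

```java
class ConflictGuardSketch {
    // Computed once under the JVM's class-init lock; every reader sees the
    // fully-computed value (the Java analogue of the Scala lazy val fix).
    static final boolean CONFLICTING_CLASS_PRESENT =
        probe("org.example.v3.PinotDataSource"); // placeholder class name

    static boolean probe(String className) {
        try {
            // initialize=false: detect presence without running static initializers
            Class.forName(className, false, Thread.currentThread().getContextClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false; // the normal "not present" case
        } catch (LinkageError e) {
            // jar present but broken: treat as "not present" rather than leak
            // a bare bytecode-resolution failure from the data-source constructor
            return false;
        }
    }
}
```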
…own + abort
Five MAJORs from this round, applied symmetrically to spark-3 and spark-4 unless noted.
1. EqualNullSafe(_, null) and IN(_, array-with-null) silently leaked the literal Java
string "null" into the rendered SQL via compileValue's fallback branch. Pinot would
then parse `attr != null` syntactically rather than as a NULL test. Reject both at
isFilterSupported so Spark evaluates them post-scan with proper three-valued logic.
Compound gating (already in place) propagates the rejection through enclosing
And/Or/Not.
2. escapeAttr previously passed through any column name containing a `"` unmodified.
A name like `weird"col` then flowed into the SQL as raw, broken SQL — and a
sufficiently adversarial catalog could exploit this as a SQL injection vector.
Replace the brittle `contains("\"")` heuristic with a regex that recognizes the
already-escaped dotted-quoted form (`"col"."col"`) and otherwise wraps the name in
`"..."` while doubling embedded quotes. Existing dotted-quoted callers (`"some"."nested"."column"`)
still pass through unchanged; the new test pins the broken-input case.
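The recognizer in item 2 can be sketched as follows — a split-based equivalent of the regex described above, with illustrative naming rather than the connector's actual helper:

```scala
// Already-escaped dotted-quoted names such as "some"."nested"."column"
// pass through; anything else is wrapped in double quotes with embedded
// quotes doubled. Sketch only.
def escapeAttr(attr: String): String = {
  val q = "\""
  val alreadyEscaped =
    attr.length >= 2 && attr.startsWith(q) && attr.endsWith(q) &&
      attr.substring(1, attr.length - 1)  // drop the outer quotes
        .split("\"\\.\"", -1)              // split on the "." separators
        .forall(part => !part.contains(q)) // no stray interior quotes
  if (alreadyEscaped) attr
  else q + attr.replace(q, q + q) + q
}
```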
3. PinotDataSource (spark-4) classpath-conflict probe used only `getClass.getClassLoader`,
which is the v4 jar's own loader — in `--packages`/plugin-classloader Spark deployments
the v3 jar can be visible to Spark's `DataSource.lookupDataSource` (which uses
Thread.currentThread.getContextClassLoader) but invisible to v4's loader. Probe both
loaders so the conflict guard cannot be silently bypassed.
4. PinotWrite.abort log fidelity. fs.delete returns false when the path doesn't exist,
but the previous code logged "cleaned up leftover segment tar at {} (deleted=false)"
at WARN — making "no leftover" indistinguishable from "deleted a real one" in
production logs. Branch on the boolean: WARN only on actual deletions; INFO when no
leftover existed; WARN with the exception when delete throws.
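Item 4's branching can be sketched like this; the `delete` thunk stands in for Hadoop's `FileSystem#delete` (which returns false when the path does not exist), and `warn`/`info` stand in for the SLF4J logger calls:

```scala
import scala.util.control.NonFatal

// Sketch only: branch on the boolean so "deleted a real leftover" (WARN)
// is distinguishable from "nothing to delete" (INFO).
def logAbortCleanup(delete: () => Boolean,
                    warn: String => Unit,
                    info: String => Unit): Unit =
  try {
    if (delete()) warn("cleaned up leftover segment tar") // actual deletion
    else info("no leftover segment tar to clean up")      // path did not exist
  } catch {
    case NonFatal(t) => warn(s"failed to delete leftover tar: ${t.getMessage}")
  }
```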
5. spark-3 PinotWriteBuilder previously silently appended on
df.write.mode("overwrite"); the symmetric fix in this PR now throws
UnsupportedOperationException. That is a backward-incompatible behavior change for
any existing spark-3 user that depended on the (incorrect) silent-append. Add an
explicit "Behavior changes since the previous release" callout at the top of
pinot-spark-3-connector/README.md so the upgrade path is documented.
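The rejection in item 5 can be sketched with stand-in types; the real writer implements Spark's `SupportsOverwrite`/`SupportsTruncate` mixins, and the names below are hypothetical:

```scala
// Stand-in for the write builder. Spark's V2Writes rule routes
// df.write.mode("overwrite") through truncate(), whose inherited default
// returns `this` -- i.e. a silent append. Both paths now fail fast.
trait BuilderSketch {
  def buildAppendWriter(): String = "append"
  def truncate(): BuilderSketch =
    throw new UnsupportedOperationException(
      "Pinot segments are append-only; drop the table or use a REFRESH push runner")
  def overwrite(predicates: Seq[String]): BuilderSketch = truncate()
}
```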
Six new regression tests across both connectors:
- EqualNullSafe(_, null) → post-scan
- IN(_, array-with-null) → post-scan; all-non-null IN still pushed down
- escapeAttr quotes a name containing a stray `"` correctly
Test counts: spark-4 now 44 (was 41), spark-3 now 41 (was 38).
Branch was also rebased onto current upstream/master (was 2 commits behind);
no conflicts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bort + classpath guard

Six MAJORs from the latest review, applied symmetrically v3+v4 unless noted.

1. EqualNullSafe / IN null-leaf rejection lacked compound-context test coverage. Added tests pinning that the rejection propagates through enclosing And/Or/Not (including nested), so a future change to the gate or the compound recursion can't silently re-introduce the null leak through compound predicates.
2. The compileFilter EqualNullSafe branch evaluated `compileValue(value)` three times and `escapeAttr(attr)` four times. Bind each once; the emitted SQL is unchanged, but there is no divergence risk if either helper ever becomes effectful (per C5.10).
3. The In(_, value) gate accepted patterns where `value` itself was null: the conditional `value != null && value.contains(null) => false` doesn't fire on a null `value`, then `_: In => true` accepted it, and compileFilter's `value.isEmpty` would NPE. Tighten to `value == null || value.contains(null) => false`. Test added pinning the null-array rejection.
4. The classpath-collision guard had no escape hatch. Add system property `pinot.spark.connector.skip-conflict-guard=true` that downgrades the throw to a one-time WARN. Default behavior unchanged. Documented in the spark-4 README under the existing Mutual-exclusion section.
5. PinotWrite.abort()'s `case t: Throwable` swallowed fatal errors (OOM, StackOverflow). Switch to `case NonFatal(t)` per Scala best practice. Symmetric in both connectors.
6. The Spark 4 README lacked the "Write semantics / overwrite-rejection" callout that the spark-3 README has. Added under a "Write semantics" section before "Features" so first-time spark-4 users see the contract before reading the Quick Start.

Test counts: spark-4 now 46 (was 44), spark-3 now 43 (was 41) — +2 regression tests each side (compound-null-leaf propagation + In(_, null array) rejection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
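The NonFatal switch from item 5 can be sketched as follows (illustrative names; `logWarn` stands in for the SLF4J call):

```scala
import scala.util.control.NonFatal

// Sketch: abort-time cleanup that swallows only non-fatal throwables;
// OutOfMemoryError and friends still propagate.
def abortQuietly(cleanup: () => Unit, logWarn: String => Unit): Unit =
  try cleanup()
  catch { case NonFatal(t) => logWarn(s"abort cleanup failed: ${t.getMessage}") }
```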
…ment Reviewer noted (and verified against spark-catalyst_2.13-3.5.8.jar) that the comment "SupportsOverwrite extends SupportsTruncate" is incorrect for Spark 3.5.x — the actual chain is SupportsOverwrite extends SupportsOverwriteV2 extends SupportsTruncate. The transitive relationship is the same so the override rationale stands; this is a docstring accuracy fix only. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…close() throws Reviewer noted that closeOnce() set the closed flag and then called bufferedRecordReader.close() before the temp-dir delete. If the reader's close() ever throws (today PinotBufferedRecordReader is guaranteed not to, but the contract is not declared — exactly the rationale the existing comment cites for the closed flag), the closed flag is already set and subsequent close()/abort() calls become no-ops, leaving segmentOutputDir on disk. Wrap the reader close in try/finally so the temp-dir cleanup always runs. Symmetric in spark-3 and spark-4. No behavior change in the current path; this is defensive hardening that matches the same intent the closed flag itself encodes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
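The try/finally hardening above can be sketched with stand-in types (the real class wraps PinotBufferedRecordReader and a segment output directory; names here are illustrative):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// The reader close may throw, but the temp-dir cleanup runs regardless,
// and the closed flag still makes repeat calls no-ops.
final class CloseOnceSketch(reader: AutoCloseable, deleteTempDir: () => Unit) {
  private val closed = new AtomicBoolean(false)
  def closeOnce(): Unit =
    if (closed.compareAndSet(false, true)) {
      try reader.close()      // may throw; contract not declared
      finally deleteTempDir() // always runs, even if close() throws
    }
}
```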
…guard Reviewer pointed out that `lazy val spark3Conflict` is computed exactly once per JVM at first construction, so a v3 jar added to the classpath later in the same session (spark-shell `:require`, custom plugin loader, post-startup mutation) won't be detected. Document the limitation in the companion-object Javadoc and point users at the existing escape hatch / JVM-restart workaround. No behavior change; pure documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ark-3

Pinot's master branch raised the Java baseline to JDK 21 (PR apache#18046), which means the spark-3 jars now compile to class file 65 and pull in JDK-21-only pinot-core / pinot-common transitively. Spark 3.5.x users on JDK 17 (the previously-common Spark 3.5 deployment) cannot load the new jars at all — UnsupportedClassVersionError at first class resolution.

Rather than reverting the JDK 21 baseline for these modules (which would also require pinning pinot-core / pinot-common back to JDK 17, undoing PR apache#18046's intent), declare pinot-spark-3-connector and pinot-batch-ingestion-spark-3 deprecated for one release cycle:

- New users should adopt pinot-spark-4-connector / pinot-batch-ingestion-spark-4 (Spark 4.0.x + JDK 21).
- Existing Spark 3.5.x users on JDK 21 + Spark 3.5.5+ (the first Spark 3.5 patch with official JDK 21 support) can keep using the spark-3 jars during the deprecation window.
- Existing Spark 3.5.x users on JDK 17 stay on Pinot 1.5.x for the connector jar until they upgrade.

The next minor release will remove both spark-3 modules.

Changes:

- pinot-connectors/pinot-spark-3-connector/README.md — top-of-file deprecation banner with concrete runtime requirements + migration path to pinot-spark-4-connector.
- pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/README.md — new file (the module previously had none) with the same deprecation banner and migration path for the batch ingestion runners.
- pom.xml — added a comment on the spark3.version property block recording the deprecation status so a future maintainer reading the pom sees it before bumping the version.

No code changes; pure documentation. Tests still pass at 46 / 43 in the spark-4 / spark-3 connectors respectively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds first-class Apache Spark 4.x support to Pinot, alongside the existing Spark 3 connector and batch-ingestion plugin. The two new modules live next to their Spark 3 siblings:
- `pinot-connectors/pinot-spark-4-connector` (alongside `pinot-spark-3-connector`)
- `pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-4` (alongside `pinot-batch-ingestion-spark-3`)

Both modules are activated only on JDK 21+ via the `pinot-spark-4-connector` / `pinot-batch-ingestion-spark-4` profiles in their respective parent poms (`<jdk>[21,)</jdk>`). On JDK 11/17 the reactor still builds Spark 3 and completely skips Spark 4, so existing Spark 3 users are unaffected.

Why a separate module?
Apache Spark 4.0 requires JDK 17+, is Scala 2.13 only (no more 2.12), ships Jetty 12 / Jakarta Servlet 5, and made a few small DataSourceV2 API tweaks. Running the Spark 3 connector under Spark 4 doesn't work cleanly. The modules sit side-by-side so:
- `pinot-spark-3-connector` stays unchanged for existing users.
- Spark 4 users adopt `pinot-spark-4-connector`.

End-to-end tutorial
The End-to-end Docker tutorial walks through:
- Starting Pinot with `docker run apachepinot/pinot:latest QuickStart`
- Running Spark from the `apache/spark:4.0.0` + JDK 21 image
- Reading `baseballStats` (97,889 rows) from Spark 4 via PySpark
- Writing segments and pushing them via the `SparkSegment*PushJobRunner` runners

Every command was executed against `apachepinot/pinot:latest` and `apache/spark:4.0.0` before being checked in.

Quick usage (read)
Invoke with `spark-submit` (Spark 4 on JDK 21). The prepended `commons-lang3-3.20.0.jar` works around Spark 4 shipping an older version that lacks `ObjectUtils.getIfNull(...)`. See the tutorial's Known gotchas for the full list.

Quick usage (write)
This produces segment tar files at the `save(...)` path. Then push them to the controller — either REST for one-offs or, in production, the runners from `pinot-batch-ingestion-spark-4` via a `SegmentGenerationJobSpec` YAML (see the module's README):

- `SparkSegmentGenerationJobRunner` — build segments from Avro/CSV/Parquet input
- `SparkSegmentTarPushJobRunner` — upload tars to the controller
- `SparkSegmentUriPushJobRunner` — register tars already staged in deep storage (S3/GCS/HDFS)
- `SparkSegmentMetadataPushJobRunner` — metadata-only push for large segments (avoids streaming the whole tar through the controller)

Note that `df.write.mode("overwrite").save(...)`, `df.writeTo(...).overwrite(...)`, and SQL `INSERT OVERWRITE` all fail fast with `UnsupportedOperationException`. Pinot's write path can only append new segments — it cannot atomically drop or replace rows matching a Spark predicate. To replace data, drop the table first (controller REST) or use a push runner with `REFRESH` / consistent-push enabled. Prior to the fixes in this PR, overwrite calls silently appended without replacement, leaving stale data queryable.

Compatibility matrix
| Module | Spark | JDK |
|---|---|---|
| `pinot-spark-3-connector` / `pinot-batch-ingestion-spark-3` | 3.5.x (3.5.5+ for official JDK 21 support) | 21 |
| `pinot-spark-4-connector` / `pinot-batch-ingestion-spark-4` | 4.0.x | 21 |

The Spark 4 jars are compiled with `--release 21` (class file major version 65), so they cannot be loaded on a JDK 17 runtime even though Spark 4 itself supports JDK 17.
Review of the ported Spark 4 code surfaced several pre-existing issues that also exist in the Spark 3 connector. They are fixed symmetrically in both modules (Spark 3 changes are contained to the classes where the bug lives):
1. `FilterPushDown` compound filter soundness. `Or`/`And`/`Not` were marked as unconditionally supported even if a child was a filter type the connector does not recognize. Spark would remove the compound from its residual filter list, while `compileFilter` silently returned `None`, so Pinot received no filter at all and returned unfiltered rows. Fix: `isFilterSupported` is now recursive — compounds are accepted only when every child is supported.
2. LIKE pushdown escaping + backslash rejection. `StringStartsWith`/`StringEndsWith`/`StringContains` interpolated raw user-supplied strings into `LIKE '...'` without escaping `%`, `_`, `\`, or `'`. Values containing these changed the predicate meaning (or produced invalid SQL). Fix: `escapeLikeLiteral` escapes LIKE wildcards, backslashes, and single-quotes; each LIKE predicate now carries an `ESCAPE '\'` clause. Additionally, values containing a literal `\` are rejected upstream (`isFilterSupported` returns `false`) and fall through to Spark's post-scan filter evaluation, because Pinot's runtime `RegexpPatternConverterUtils#likeToRegexpLike` does not round-trip `\\` correctly.
3. `SupportsOverwrite[V2]` silently dropped the overwrite intent. The writer advertised V1 `SupportsOverwrite` (Spark 3) and V2 `SupportsOverwriteV2` (Spark 4) but ignored the filters/predicates and returned an append-only writer. Spark callers using `df.write.mode("overwrite")`, `df.writeTo(...).overwrite(...)`, or SQL `INSERT OVERWRITE` believed matching rows would be replaced while Pinot only appended new segments. Fix: both `overwrite(...)` and `truncate()` now throw `UnsupportedOperationException` with a message pointing at the supported alternatives. The `truncate()` override is load-bearing: `SupportsOverwrite[V2]` extends `SupportsTruncate`, and Spark's V2Writes rule dispatches `df.write.mode("overwrite")` → `truncate()` (which defaulted to returning `this` → silent append) rather than `overwrite([AlwaysTrue])`. Both connectors now reject both paths.
4. `PinotDataWriter` leaked the per-partition temp segment build directory on every successful task. `Files.createTempDirectory` in `generateSegment()` was never deleted, so long-running executors accumulated uncompressed segment trees under the JVM tmpdir. Fix: track the temp dir on a module-private field; wrap `commit()` in a try/finally that deletes it; also delete from `abort()` and `close()` (idempotent via null-out).
5. `PinotWrite.commit`/`abort` routed writer commit messages through `println` instead of SLF4J. Fix: switch to `LoggerFactory.getLogger(...)` so messages reach the standard log collection.
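The `escapeLikeLiteral` escaping described above can be sketched as follows (illustrative; the real helper lives in the connector's filter compiler):

```scala
// Escape order matters: backslash first, then single quote, then wildcards.
def escapeLikeLiteral(value: String): String =
  value
    .replace("\\", "\\\\")
    .replace("'", "''")
    .replace("%", "\\%")
    .replace("_", "\\_")

// Each LIKE predicate carries an ESCAPE clause so the escapes are honored.
def startsWithSql(attr: String, value: String): String =
  s"$attr LIKE '${escapeLikeLiteral(value)}%' ESCAPE '\\'"
```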
Also added an inline TODO documenting the separate gap that `abort()` does not yet delete leftover segment tars from already-succeeded tasks at `savePath` (tracked as a follow-up; the contract-level guard against silent overwrite is provided by the `overwrite`/`truncate` rejection above).
pinot-spark-3-connectorandpinot-spark-4-connector.Test plan
- `./mvnw -pl pinot-connectors/pinot-spark-4-connector,pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-4 -am test` on JDK 21 — 3 junit + 31 scalatest cases, all green.
- `./mvnw -pl pinot-connectors/pinot-spark-3-connector -am test` on JDK 21 — Spark 3 still passes (31 scalatest).
- `./mvnw -pl pinot-integration-tests -am test-compile` — integration tests still compile unchanged.
- `spotless:apply checkstyle:check license:check apache-rat:check` clean on all modified modules.
- Manual end-to-end against `apachepinot/pinot:latest` + `apache/spark:4.0.0` (+ JDK 21): read the 97,889-row `baseballStats` table via gRPC, write a 10-row DataFrame to a fresh OFFLINE table, push via controller REST, verify the roundtrip through both the Pinot SQL broker and the connector's own read path.

Known coverage gaps (tracked for follow-up)
- A `SparkSegmentMetadataPushIntegrationTest` equivalent for Spark 4 was added then dropped from this PR (commit c66b2171aa) because the Spark 4 runtime (Jakarta Servlet 5) conflicts with `pinot-integration-tests`' Spark 3 + javax stack in a shared Maven module. Restoring end-to-end integration coverage requires either a dedicated `pinot-spark-4-integration-tests` module or failsafe with a separate classpath realm. Noted in the `pinot-batch-ingestion-spark-4` README.
- `SparkSegmentMetadataPushJobRunner` / `SparkSegmentTarPushJobRunner` / `SparkSegmentUriPushJobRunner` have no unit tests in either the Spark 3 or the Spark 4 batch-ingestion module today; only `SparkSegmentGenerationJobRunnerTest` exercises the in-process driver path. Adding coverage for the push runners is orthogonal to this PR and would benefit both modules.
- `PinotWrite.abort(...)` leftover tars at `savePath` — on partial task failure the writer leaves already-committed tars at the user-supplied `savePath`, so a retried job after the push step can produce duplicate segments. The contract-level guard against silent overwrite is provided by the `overwrite`/`truncate` rejection; this runtime-level guard is tracked separately. TODO comment added in both connectors' `PinotWrite.scala`.

Notes for reviewers / future follow-ups
- Spark 4 users must prepend `commons-lang3:3.20+` on `spark.{driver,executor}.extraClassPath`. A follow-up PR can relocate `org.apache.commons.lang3.*` in the connector's shade config to eliminate this step.
- `requireProperty scala.compat.version=2.13` enforcer rules make `-Pscala-2.12` builds fail fast with a clear message.
- `-Ppinot-fastdev` — activation is JDK-based, so `-Ppinot-fastdev` on JDK 21 also builds the Spark 4 modules. Intentional; the Spark 4 shade is disabled under `-Ppinot-fastdev` via each module's own profile. Explicit opt-out: `-P!pinot-spark-4-connector` / `-P!pinot-batch-ingestion-spark-4`.

🤖 Generated with Claude Code