
Add experimental columnar indexing api#15990

Open
Tim-Brooks wants to merge 20 commits into apache:main from
Tim-Brooks:parent_field_to_indexing_chain+columns_simpler

Conversation

@Tim-Brooks
Contributor

This commit adds an experimental columnar indexing API to the
IndexWriter. It allows the user to provide Long, Binary, and Vector
columns to the indexing chain, significantly reducing the per-field
overhead when indexing batches of documents.
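To make the columnar idea concrete, here is a rough sketch of what a batch of documents might look like when expressed as columns instead of per-document field lists. All names here (`LongColumn`, `Batch`, `ColumnarSketch`) are illustrative placeholders, not the API proposed in this PR:

```java
import java.util.List;

// Hypothetical sketch only: one array per field for the whole batch, so
// per-field setup happens once per column rather than once per (doc, field).
public class ColumnarSketch {
  /** One field's values for every document in the batch, in docid order. */
  record LongColumn(String field, long[] values) {}

  /** A batch of documents expressed as columns rather than per-doc Documents. */
  record Batch(int docCount, List<LongColumn> longColumns) {}

  public static void main(String[] args) {
    // Three documents with two long fields each.
    Batch batch = new Batch(3, List.of(
        new LongColumn("price", new long[] {10, 20, 30}),
        new LongColumn("timestamp", new long[] {100, 200, 300})));
    for (LongColumn col : batch.longColumns()) {
      System.out.println(col.field() + ": " + col.values().length + " values");
    }
  }
}
```

The point of the shape is that the indexing chain can walk each column once, amortizing field lookup and writer setup across the whole batch.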

@Tim-Brooks
Contributor Author

This is not ready. It is a POC to provide an example to the Lucene mailing list discussion of adding a columnar api. It would need more tests and cursor API refinement before moving forward.

/**
 * <p>The default implementation calls {@link #nextLong()} in a loop. Override to provide a more
 * efficient bulk fill (for example a {@link System#arraycopy} from a backing array).
 */
public void fill(long[] dst, int offset, int length) {
  for (int i = 0; i < length; i++) dst[offset + i] = nextLong();
}
Contributor Author


This would essentially be a fast path users could optionally implement if they wanted to optimize bulk adds from whatever their binary backing source is into the doc values writer. It would probably also make sense to add it for points support.

It makes much less sense for sorted set doc values, etc., where the value widths are variable.

This obviously isn't a requirement, but it was helpful when I was prototyping, as it made a considerable difference in the low-level performance.
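The fast path described above can be sketched as follows. The cursor types here (`LongCursor`, `ArrayLongCursor`) are hypothetical stand-ins for the PR's cursor API, just to show the default loop versus an arraycopy override:

```java
// Hypothetical sketch: a cursor with a default fill() loop and an
// array-backed subclass that overrides it with System.arraycopy.
abstract class LongCursor {
  /** Returns the next value; callers are assumed to know one remains. */
  abstract long nextLong();

  /** Default slow path: one virtual call per value. */
  void fill(long[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) dst[offset + i] = nextLong();
  }
}

class ArrayLongCursor extends LongCursor {
  private final long[] backing;
  private int pos;

  ArrayLongCursor(long[] backing) { this.backing = backing; }

  @Override long nextLong() { return backing[pos++]; }

  /** Fast path: one bulk copy straight from the backing array. */
  @Override void fill(long[] dst, int offset, int length) {
    System.arraycopy(backing, pos, dst, offset, length);
    pos += length;
  }
}

public class FillDemo {
  public static void main(String[] args) {
    LongCursor cursor = new ArrayLongCursor(new long[] {1, 2, 3, 4});
    long[] dst = new long[4];
    cursor.fill(dst, 0, 4); // single arraycopy instead of four nextLong() calls
    System.out.println(java.util.Arrays.toString(dst)); // [1, 2, 3, 4]
  }
}
```

A batch consumer only ever calls `fill`, so a source that happens to be array-backed gets the bulk copy for free while other sources fall back to the loop.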

@github-actions bot added this to the 10.5.0 milestone Apr 28, 2026
@msfroh
Contributor

msfroh commented Apr 28, 2026

Conceptually, I'm really excited about this.

Some colleagues and I have been working on ingesting data into Lucene from Apache Arrow format. IMO, this would make that considerably easier (and ideally more efficient -- imagine if the IndexWriter could read an Arrow RecordBatch without copying).

I'll try to take a look in the coming days. This sounds cool!

@Tim-Brooks
Contributor Author

IMO, this would make that considerably easier (and ideally more efficient

Yes, ideally this would be designed to work nicely with other columnar formats. I don't think it would be zero-copy, as there would still be a copy from the source format into Lucene's buffers. But ideally there would be just one copy, with optimized paths for dense batches that can be copied in bulk. It would be up to the user to implement the bytes -> longs step (endianness, unpacking, etc.) for doc values, and that would have to sit on top of the sort-order encoding for points in binary columns. I haven't really gone too far with points in this PR.
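The user-supplied bytes -> longs step mentioned above might look like the following, using only `java.nio` (no Lucene types). The method name and the little-endian assumption are illustrative; a real source format would dictate its own byte order and packing:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of decoding a little-endian byte payload (e.g. from a columnar wire
// format) into a long[] that could then be handed to a long column.
public class DecodeLongs {
  public static long[] decodeLittleEndian(byte[] raw) {
    ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    long[] out = new long[raw.length / Long.BYTES];
    buf.asLongBuffer().get(out); // one bulk transfer, no per-value loop
    return out;
  }

  public static void main(String[] args) {
    byte[] raw = new byte[16];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).putLong(0, 42L).putLong(8, -1L);
    long[] values = decodeLittleEndian(raw);
    System.out.println(values[0] + " " + values[1]); // 42 -1
  }
}
```

The bulk `LongBuffer.get(long[])` transfer is the kind of dense-batch optimized path the comment describes: one copy, with the endianness handled by the buffer view rather than per-value shifting.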

