
Add experimental columnar indexing api#15990

Open
Tim-Brooks wants to merge 20 commits into apache:main from
Tim-Brooks:parent_field_to_indexing_chain+columns_simpler

Conversation

@Tim-Brooks
Contributor

This commit adds an experimental columnar indexing API to the
IndexWriter. It allows the user to provide Long, Binary, and Vector
columns to the indexing chain, significantly reducing the per-field
overhead when indexing batches of documents.
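To make the columnar idea concrete, here is a rough sketch of what a batch of documents might look like when expressed as columns instead of per-document field lists. All names here (`LongColumn`, `Batch`, `ColumnarSketch`) are illustrative placeholders, not the API proposed in this PR:

```java
import java.util.List;

// Hypothetical sketch only: one array per field for the whole batch, so
// per-field setup happens once per column rather than once per (doc, field).
public class ColumnarSketch {
  /** One field's values for every document in the batch, in docid order. */
  record LongColumn(String field, long[] values) {}

  /** A batch of documents expressed as columns rather than per-doc Documents. */
  record Batch(int docCount, List<LongColumn> longColumns) {}

  public static void main(String[] args) {
    // Three documents with two long fields each.
    Batch batch = new Batch(3, List.of(
        new LongColumn("price", new long[] {10, 20, 30}),
        new LongColumn("timestamp", new long[] {100, 200, 300})));
    for (LongColumn col : batch.longColumns()) {
      System.out.println(col.field() + ": " + col.values().length + " values");
    }
  }
}
```

The point of the shape is that the indexing chain can walk each column once, amortizing field lookup and writer setup across the whole batch.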

@Tim-Brooks
Contributor Author

This is not ready. It is a POC to provide an example to the Lucene mailing list discussion of adding a columnar api. It would need more tests and cursor API refinement before moving forward.

/**
 * <p>The default implementation calls {@link #nextLong()} in a loop. Override to provide a more
 * efficient bulk fill (for example a {@link System#arraycopy} from a backing array).
 */
public void fill(long[] dst, int offset, int length) {
  for (int i = 0; i < length; i++) dst[offset + i] = nextLong();
}
Contributor Author


This would essentially be a fast path users could optionally implement if they wanted to optimize bulk adds from whatever their binary backing source is into the doc values writer. It would probably also make sense to add it for points support.

It makes much less sense for sorted set doc values, etc., where the value widths are variable.

This obviously isn't a requirement, but it was helpful when I was prototyping, as it made a considerable difference in the low-level performance.
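The fast path described above can be sketched as follows. The cursor types here (`LongCursor`, `ArrayLongCursor`) are hypothetical stand-ins for the PR's cursor API, just to show the default loop versus an arraycopy override:

```java
// Hypothetical sketch: a cursor with a default fill() loop and an
// array-backed subclass that overrides it with System.arraycopy.
abstract class LongCursor {
  /** Returns the next value; callers are assumed to know one remains. */
  abstract long nextLong();

  /** Default slow path: one virtual call per value. */
  void fill(long[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) dst[offset + i] = nextLong();
  }
}

class ArrayLongCursor extends LongCursor {
  private final long[] backing;
  private int pos;

  ArrayLongCursor(long[] backing) { this.backing = backing; }

  @Override long nextLong() { return backing[pos++]; }

  /** Fast path: one bulk copy straight from the backing array. */
  @Override void fill(long[] dst, int offset, int length) {
    System.arraycopy(backing, pos, dst, offset, length);
    pos += length;
  }
}

public class FillDemo {
  public static void main(String[] args) {
    LongCursor cursor = new ArrayLongCursor(new long[] {1, 2, 3, 4});
    long[] dst = new long[4];
    cursor.fill(dst, 0, 4); // single arraycopy instead of four nextLong() calls
    System.out.println(java.util.Arrays.toString(dst)); // [1, 2, 3, 4]
  }
}
```

A batch consumer only ever calls `fill`, so a source that happens to be array-backed gets the bulk copy for free while other sources fall back to the loop.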

@github-actions bot added this to the 10.5.0 milestone Apr 28, 2026
@msfroh
Contributor

msfroh commented Apr 28, 2026

Conceptually, I'm really excited about this.

Some colleagues and I have been working on ingesting data into Lucene from Apache Arrow format. IMO, this would make that considerably easier (and ideally more efficient -- imagine if the IndexWriter could read an Arrow RecordBatch without copying).

I'll try to take a look in the coming days. This sounds cool!

@Tim-Brooks
Contributor Author

IMO, this would make that considerably easier (and ideally more efficient

Yes, ideally this would be designed to work nicely with other columnar formats. I don't think it would be zero-copy, as there would still be a copy from the source format into Lucene's buffers. But ideally there would be just one copy, with optimized paths for dense batches that can be copied in bulk. It would be up to the user to implement the bytes -> longs step (endianness, unpacking, etc.) for doc values, and that would have to sit on top of the sort-order encoding for points in binary columns. I haven't really gone too far with points in this PR.
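The user-supplied bytes -> longs step mentioned above might look like the following, using only `java.nio` (no Lucene types). The method name and the little-endian assumption are illustrative; a real source format would dictate its own byte order and packing:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of decoding a little-endian byte payload (e.g. from a columnar wire
// format) into a long[] that could then be handed to a long column.
public class DecodeLongs {
  public static long[] decodeLittleEndian(byte[] raw) {
    ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    long[] out = new long[raw.length / Long.BYTES];
    buf.asLongBuffer().get(out); // one bulk transfer, no per-value loop
    return out;
  }

  public static void main(String[] args) {
    byte[] raw = new byte[16];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).putLong(0, 42L).putLong(8, -1L);
    long[] values = decodeLittleEndian(raw);
    System.out.println(values[0] + " " + values[1]); // 42 -1
  }
}
```

The bulk `LongBuffer.get(long[])` transfer is the kind of dense-batch optimized path the comment describes: one copy, with the endianness handled by the buffer view rather than per-value shifting.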

