Add experimental columnar indexing API #15990
Tim-Brooks wants to merge 20 commits into apache:main from
…exing_chain+columns
This is not ready. It is a POC to provide an example for the Lucene mailing list discussion about adding a columnar API. It would need more tests and cursor API refinement before moving forward.
```java
 * <p>The default implementation calls {@link #nextLong()} in a loop. Override to provide a more
 * efficient bulk fill (for example a {@link System#arraycopy} from a backing array).
 */
public void fill(long[] dst, int offset, int length) {
```
This would essentially be a fast path users could optionally implement if they wanted to optimize bulk adds from whatever their binary backing source is -> the doc value writer. It would probably also make sense to add this for points support.
It makes much less sense for sorted set doc values, etc., where the value widths are variable.
This obviously isn't a requirement, but it was helpful when I was prototyping, as it made a considerable difference in low-level performance.
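To illustrate the fast path described above, here is a minimal sketch of a cursor whose default `fill()` loops over `nextLong()`, plus an array-backed subclass that overrides it with a single `System.arraycopy`. The class and method names here are illustrative, not the PR's actual API.

```java
// Hypothetical cursor sketch: the default fill() pays one virtual call
// per value, while an array-backed source can bulk-copy instead.
abstract class LongCursor {
  /** Returns the next value from the column. */
  abstract long nextLong();

  /** Default implementation: calls nextLong() in a loop. */
  public void fill(long[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) {
      dst[offset + i] = nextLong();
    }
  }
}

final class ArrayLongCursor extends LongCursor {
  private final long[] values;
  private int pos;

  ArrayLongCursor(long[] values) {
    this.values = values;
  }

  @Override
  long nextLong() {
    return values[pos++];
  }

  /** Fast path: one bulk copy straight from the backing array. */
  @Override
  public void fill(long[] dst, int offset, int length) {
    System.arraycopy(values, pos, dst, offset, length);
    pos += length;
  }
}
```

The point of the override is that a dense, fixed-width backing source can skip the per-value virtual dispatch entirely, which matches the "considerable difference in low-level performance" observation above.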
Conceptually, I'm really excited about this. Some colleagues and I have been working on ingesting data into Lucene from the Apache Arrow format. IMO, this would make that considerably easier (and ideally more efficient -- imagine if the …). I'll try to take a look in the coming days. This sounds cool!
Yes, ideally this would be designed to work nicely with other columnar formats. I don't think it would be zero-copy, as there would still be a copy from the source format -> Lucene's buffers. But ideally there would be just one copy, with optimized paths for dense batches that can be copied in bulk. It would be up to the user to implement the bytes -> longs step (endianness, unpacking, etc.) for doc values. And it would have to sit on top of the sort-order encoding for points in binary columns. I haven't really gone very far with points in this PR.
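As a concrete sketch of the single-copy bytes -> longs step mentioned above: a dense batch of little-endian 64-bit values from a foreign columnar buffer (an Arrow value buffer, say) can be decoded into a `long[]` in one bulk operation, with the user-supplied code handling endianness. The class and method names here are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical decode step: one bulk copy from raw column bytes into
// Lucene-ready longs, with endianness handled by the ByteBuffer view.
final class ColumnDecode {
  static long[] decodeLongs(byte[] columnBytes, int count) {
    long[] out = new long[count];
    ByteBuffer.wrap(columnBytes)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asLongBuffer()
        .get(out); // single bulk copy: bytes -> longs
    return out;
  }
}
```

Variable-width or bit-packed encodings would need a real unpacking loop here; this only shows the dense fixed-width case that can be copied in bulk.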
This commit adds an experimental columnar indexing API to the
IndexWriter. It allows the user to provide Long, Binary, and Vector
columns to the indexing chain to significantly reduce the per-field
overhead when indexing batches of documents.
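The per-field overhead reduction the commit message describes can be sketched with a toy model: document-at-a-time indexing resolves per-field state once per document, while a column covering a whole batch resolves it once. The `LongColumn` and writer names below are illustrative only, not the PR's actual API.

```java
// Toy sketch (not the PR's API) of why columnar adds cut per-field
// overhead: field state is set up once per batch, not once per document.
record LongColumn(String field, long[] values) {}

final class ToyColumnarWriter {
  int fieldSetups; // counts how often per-field state had to be resolved

  /** Doc-at-a-time: per-field setup repeats for every document. */
  void addDocuments(String field, long[] perDocValues) {
    for (int i = 0; i < perDocValues.length; i++) {
      fieldSetups++; // lookup/setup once per document
    }
  }

  /** Column-at-a-time: per-field setup happens once for the batch. */
  void addColumn(LongColumn column) {
    fieldSetups++; // setup once, then bulk-consume column.values()
  }
}
```

With a batch of N documents this is N field setups versus one, which is where the savings would come from when indexing large dense batches.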