Skip to content

Add LongConsumer progress callback to Operations.determinize for memo…#15937

Open
drempapis wants to merge 2 commits intoapache:mainfrom
drempapis:feature/determinize-progress-callback
Open

Add LongConsumer progress callback to Operations.determinize for memo…#15937
drempapis wants to merge 2 commits intoapache:mainfrom
drempapis:feature/determinize-progress-callback

Conversation

@drempapis
Copy link
Copy Markdown

Problem

Lucene's Operations.determinize uses powerset construction to convert an NFA into a DFA. For patterns with combinatorial structure (e.g. abcdef*), this can cause exponential blowup in the number of DFA states, each carrying its own FrozenIntSet snapshot, HashMap entry, and backing arrays. The existing workLimit parameter bounds CPU effort but provides no memory signal, the JVM can OOM long before the work limit is reached. There is currently no way for callers to observe or react to memory growth during determinization.

Update

This PR adds a new overload to Lucene's determinization API

public static Automaton determinize(Automaton a, int workLimit, LongConsumer progressCallback)

When a non-null callback is provided, Lucene periodically invokes it with an accumulated estimate of bytes allocated since the last report. The caller can use this signal to enforce memory policies (e.g. throw a circuit-breaker exception to abort determinization)

The estimate is built from the allocations directly attributable to each newly discovered DFA state during powerset construction. Rather than invoking the callback on every new state (which would add overhead proportional to state count), estimates are accumulated and reported in chunks once a configurable byte threshold is crossed. Any remainder is flushed once at the end

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Apr 7, 2026

I'm concerned about two parameters on determinize() to control the resource usage. The PR just adds it in one place, but then its only a matter of time before bugs are filed because the additional parameter isn't exposed in WildcardQuery, RegexpQuery, and then everywhere else (tokenizers, whatever). so I think the API needs to be re-thought.

@drempapis
Copy link
Copy Markdown
Author

Thanks @rmuir , this is a fair concern.

My intent with the callback was to support callers that already control the determinization call site directly, not to immediately thread a new parameter through WildcardQuery, RegexpQuery, tokenizers, etc.

That said, I agree with your point that for a stable Lucene API, if we introduce this as a public control, we should likely propagate it consistently through higher-level entry points (similar to how determinizeWorkLimit is exposed today).

Would you prefer we move in that propagation direction, or do you have a different API shape in mind? I’m happy to adjust accordingly.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Apr 7, 2026

My intent with the callback was to support callers that already control the determinization call site directly, not to immediately thread a new parameter through WildcardQuery, RegexpQuery, tokenizers, etc.

Well, I think lucene shouldn't have even a single parameter for this resource control, and definitely not two parameters. Such parameters don't make sense to me and only come from people who think they can expose exponential algorithms safely to hostile user inputs. You can't: so just don't do that: it isn't safe and it isn't lucene's job to make it safe. Adding more resource controls and making the api even worse, will not change that situation.

I'd like to remove determinization from WildcardQuery/RegexpQuery at some point. If the input is a DFA, then you get a DFA execution, otherwise an NFA execution. The code is already there for this. We just need to flip the switch.

@drempapis
Copy link
Copy Markdown
Author

Thank you @rmuir , this makes sense to me. I agree that adding more resource-control knobs around determinization feels like the wrong direction for Lucene.

I’ll experiment with the alternative path you suggested, removing upfront determinization from WildcardQuery and RegexpQuery, and relying on existing execution-path selection. Does this look good to you?

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Apr 7, 2026

I’ll experiment with the alternative path you suggested, removing upfront determinization from WildcardQuery and RegexpQuery, and relying on existing execution-path selection. Does this look good to you?

+1, we should really try to do this. In many cases you'll get a DFA in practice today... even often a minimal one, right from these parsers, unless you are using kleene star, which is going to be slow anyway. So I think what is fast will stay fast and what is slow will stay slow :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants