Add LongConsumer progress callback to Operations.determinize for memory-aware determinization #15937

drempapis wants to merge 2 commits into apache:main

Conversation
I'm concerned about two parameters on determinize() to control the resource usage. The PR just adds it in one place, but then it's only a matter of time before bugs are filed because the additional parameter isn't exposed in WildcardQuery, RegexpQuery, and then everywhere else (tokenizers, whatever). So I think the API needs to be re-thought.
Thanks @rmuir, this is a fair concern. My intent with the callback was to support callers that already control the determinization call site directly, not to immediately thread a new parameter through WildcardQuery, RegexpQuery, tokenizers, etc. That said, I agree that if we introduce this as a public control in a stable Lucene API, we should likely propagate it consistently through the higher-level entry points. Would you prefer we move in that propagation direction, or do you have a different API shape in mind? I'm happy to adjust accordingly.
Well, I think Lucene shouldn't have even a single parameter for this resource control, and definitely not two. Such parameters don't make sense to me and only come from people who think they can expose exponential algorithms safely to hostile user inputs. You can't, so just don't do that: it isn't safe, and it isn't Lucene's job to make it safe. Adding more resource controls and making the API even worse will not change that situation.

I'd like to remove determinization from WildcardQuery/RegexpQuery at some point. If the input is a DFA, then you get a DFA execution; otherwise, an NFA execution. The code is already there for this. We just need to flip the switch.
Thank you @rmuir, this makes sense to me. I agree that adding more resource-control knobs around determinization is the wrong direction for Lucene. I'll experiment with the alternative path you suggested: removing upfront determinization from WildcardQuery and RegexpQuery and relying on the existing execution-path selection. Does this look good to you?
+1, we should really try to do this. In many cases you'll get a DFA in practice today, often even a minimal one, right from these parsers, unless you are using Kleene star, which is going to be slow anyway. So I think what is fast will stay fast and what is slow will stay slow :)
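The "exponential algorithms" concern raised above can be made concrete without any Lucene code. The classic example is the language `(a|b)*a(a|b)^(n-1)` ("the n-th symbol from the end is `a`"): its NFA has n+1 states, but the equivalent DFA needs 2^n states. The following self-contained sketch (illustrative only, not from the PR or from Lucene) runs subset construction on that NFA and counts the resulting DFA states:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class PowersetBlowup {

  // Counts DFA states produced by subset construction for the (n+1)-state NFA
  // of (a|b)*a(a|b)^(n-1): state 0 loops on both letters and also branches to
  // state 1 on 'a'; states 1..n-1 advance on either letter; state n accepts.
  static int dfaStateCount(int n) {
    Set<Set<Integer>> seen = new HashSet<>();
    Deque<Set<Integer>> work = new ArrayDeque<>();
    Set<Integer> start = Set.of(0);
    seen.add(start);
    work.add(start);
    while (!work.isEmpty()) {
      Set<Integer> cur = work.poll();
      for (char c : new char[] {'a', 'b'}) {
        Set<Integer> next = new HashSet<>();
        for (int s : cur) {
          if (s == 0) {
            next.add(0);                 // self-loop on both letters
            if (c == 'a') next.add(1);   // nondeterministic branch on 'a'
          } else if (s < n) {
            next.add(s + 1);             // advance toward the accept state
          }
          // state n has no outgoing transitions
        }
        if (!next.isEmpty() && seen.add(next)) {
          work.add(next);
        }
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int n = 2; n <= 12; n++) {
      // state count grows as 2^n
      System.out.println("n=" + n + " -> " + dfaStateCount(n) + " DFA states");
    }
  }
}
```

Each of those 2^n DFA states is a distinct set of NFA states, which is exactly the per-state allocation (set snapshot plus map entry) that the memory discussion below is about.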
Problem

Lucene's `Operations.determinize` uses powerset construction to convert an NFA into a DFA. For patterns with combinatorial structure (e.g. `abcdef*`), this can cause exponential blowup in the number of DFA states, each carrying its own `FrozenIntSet` snapshot, `HashMap` entry, and backing arrays. The existing `workLimit` parameter bounds CPU effort but provides no memory signal; the JVM can OOM long before the work limit is reached. There is currently no way for callers to observe or react to memory growth during determinization.

Update
This PR adds a new overload to Lucene's determinization API that accepts a `LongConsumer` progress callback.
When a non-null callback is provided, Lucene periodically invokes it with an accumulated estimate of bytes allocated since the last report. The caller can use this signal to enforce memory policies (e.g. throw a circuit-breaker exception to abort determinization).
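To illustrate the circuit-breaker idea, here is a sketch of the kind of callback a caller might supply. `MemoryBudget` and its members are illustrative names invented here, not part of the PR; only the `LongConsumer` contract (deltas of estimated bytes since the last report) comes from the description above:

```java
import java.util.function.LongConsumer;

// Hypothetical caller-side circuit breaker: accumulates the byte deltas the
// determinization callback reports and aborts (by throwing from accept()) once
// a configured budget is exceeded.
public class MemoryBudget implements LongConsumer {
  private final long budgetBytes;
  private long usedBytes;

  public MemoryBudget(long budgetBytes) {
    this.budgetBytes = budgetBytes;
  }

  @Override
  public void accept(long deltaBytes) {
    usedBytes += deltaBytes; // delta = estimated bytes since the last report
    if (usedBytes > budgetBytes) {
      throw new IllegalStateException(
          "determinization estimated to exceed " + budgetBytes + " bytes");
    }
  }

  public long used() {
    return usedBytes;
  }
}
```

A caller would pass an instance of this as the callback and catch the exception at the call site to fail the operation gracefully instead of risking an OOM.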
The estimate is built from the allocations directly attributable to each newly discovered DFA state during powerset construction. Rather than invoking the callback on every new state (which would add overhead proportional to state count), estimates are accumulated and reported in chunks once a configurable byte threshold is crossed. Any remainder is flushed once at the end.
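The chunked reporting scheme described above can be sketched as follows. This is an illustrative stand-alone reduction of the mechanism, not the PR's actual code; `ChunkedReporter`, `onNewState`, and `finish` are hypothetical names:

```java
import java.util.function.LongConsumer;

// Sketch of threshold-based chunked reporting: per-state byte estimates are
// accumulated locally and forwarded to the callback only once a threshold is
// crossed, with a final flush for any remainder when construction ends.
public class ChunkedReporter {
  private final LongConsumer callback;
  private final long thresholdBytes;
  private long pending; // bytes accumulated since the last report

  public ChunkedReporter(LongConsumer callback, long thresholdBytes) {
    this.callback = callback;
    this.thresholdBytes = thresholdBytes;
  }

  // Called once per newly discovered DFA state with its estimated footprint.
  public void onNewState(long estimatedBytes) {
    pending += estimatedBytes;
    if (pending >= thresholdBytes) {
      callback.accept(pending); // report one chunk, then reset
      pending = 0;
    }
  }

  // Called once when determinization completes: flush the remainder.
  public void finish() {
    if (pending > 0) {
      callback.accept(pending);
      pending = 0;
    }
  }
}
```

The design trade-off is that a larger threshold means fewer callback invocations (lower overhead) but a coarser signal: the caller's budget can be overshot by up to one threshold's worth of allocation before it gets a chance to abort.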