Add LongConsumer progress callback to Operations.determinize for memory-aware determinization #15937

drempapis wants to merge 2 commits into apache:main

Conversation
I'm concerned about two parameters on determinize() to control the resource usage. The PR just adds it in one place, but then it's only a matter of time before bugs are filed because the additional parameter isn't exposed in WildcardQuery, RegexpQuery, and then everywhere else (tokenizers, whatever). So I think the API needs to be re-thought.
Thanks @rmuir, this is a fair concern. My intent with the callback was to support callers that already control the determinization call site directly, not to immediately thread a new parameter through WildcardQuery, RegexpQuery, tokenizers, etc. That said, I agree that if we introduce this as a public control in a stable Lucene API, we should likely propagate it consistently through the higher-level entry points. Would you prefer we move in that propagation direction, or do you have a different API shape in mind? I'm happy to adjust accordingly.
Well, I think Lucene shouldn't have even a single parameter for this resource control, and definitely not two. Such parameters don't make sense to me and only come from people who think they can expose exponential algorithms safely to hostile user inputs. You can't, so just don't do that: it isn't safe, and it isn't Lucene's job to make it safe. Adding more resource controls and making the API even worse will not change that situation.

I'd like to remove determinization from WildcardQuery/RegexpQuery at some point. If the input is a DFA, then you get a DFA execution; otherwise, an NFA execution. The code is already there for this. We just need to flip the switch.
Thank you @rmuir, this makes sense to me. I agree that adding more resource-control knobs around determinization is the wrong direction for Lucene. I'll experiment with the alternative path you suggested: removing upfront determinization from WildcardQuery and RegexpQuery and relying on the existing execution-path selection. Does this look good to you?
+1, we should really try to do this. In many cases you'll get a DFA in practice today, often even a minimal one, right from these parsers, unless you are using Kleene star, which is going to be slow anyway. So I think what is fast will stay fast and what is slow will stay slow :)
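The "exponential algorithms" concern raised above can be made concrete without any Lucene code. The classic example is the language `(a|b)*a(a|b)^(n-1)` ("the n-th symbol from the end is `a`"): its NFA has n+1 states, but the equivalent DFA needs 2^n states. The following self-contained sketch (illustrative only, not from the PR or from Lucene) runs subset construction on that NFA and counts the resulting DFA states:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class PowersetBlowup {

  // Counts DFA states produced by subset construction for the (n+1)-state NFA
  // of (a|b)*a(a|b)^(n-1): state 0 loops on both letters and also branches to
  // state 1 on 'a'; states 1..n-1 advance on either letter; state n accepts.
  static int dfaStateCount(int n) {
    Set<Set<Integer>> seen = new HashSet<>();
    Deque<Set<Integer>> work = new ArrayDeque<>();
    Set<Integer> start = Set.of(0);
    seen.add(start);
    work.add(start);
    while (!work.isEmpty()) {
      Set<Integer> cur = work.poll();
      for (char c : new char[] {'a', 'b'}) {
        Set<Integer> next = new HashSet<>();
        for (int s : cur) {
          if (s == 0) {
            next.add(0);                 // self-loop on both letters
            if (c == 'a') next.add(1);   // nondeterministic branch on 'a'
          } else if (s < n) {
            next.add(s + 1);             // advance toward the accept state
          }
          // state n has no outgoing transitions
        }
        if (!next.isEmpty() && seen.add(next)) {
          work.add(next);
        }
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int n = 2; n <= 12; n++) {
      // state count grows as 2^n
      System.out.println("n=" + n + " -> " + dfaStateCount(n) + " DFA states");
    }
  }
}
```

Each of those 2^n DFA states is a distinct set of NFA states, which is exactly the per-state allocation (set snapshot plus map entry) that the memory discussion below is about.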
Problem

Lucene's `Operations.determinize` uses powerset construction to convert an NFA into a DFA. For patterns with combinatorial structure (e.g. `abcdef*`), this can cause exponential blowup in the number of DFA states, each carrying its own `FrozenIntSet` snapshot, `HashMap` entry, and backing arrays. The existing `workLimit` parameter bounds CPU effort but provides no memory signal; the JVM can OOM long before the work limit is reached. There is currently no way for callers to observe or react to memory growth during determinization.

Update
This PR adds a new overload to Lucene's determinization API that accepts a `LongConsumer` progress callback.
When a non-null callback is provided, Lucene periodically invokes it with an accumulated estimate of bytes allocated since the last report. The caller can use this signal to enforce memory policies (e.g. throw a circuit-breaker exception to abort determinization).
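To illustrate the circuit-breaker idea, here is a sketch of the kind of callback a caller might supply. `MemoryBudget` and its members are illustrative names invented here, not part of the PR; only the `LongConsumer` contract (deltas of estimated bytes since the last report) comes from the description above:

```java
import java.util.function.LongConsumer;

// Hypothetical caller-side circuit breaker: accumulates the byte deltas the
// determinization callback reports and aborts (by throwing from accept()) once
// a configured budget is exceeded.
public class MemoryBudget implements LongConsumer {
  private final long budgetBytes;
  private long usedBytes;

  public MemoryBudget(long budgetBytes) {
    this.budgetBytes = budgetBytes;
  }

  @Override
  public void accept(long deltaBytes) {
    usedBytes += deltaBytes; // delta = estimated bytes since the last report
    if (usedBytes > budgetBytes) {
      throw new IllegalStateException(
          "determinization estimated to exceed " + budgetBytes + " bytes");
    }
  }

  public long used() {
    return usedBytes;
  }
}
```

A caller would pass an instance of this as the callback and catch the exception at the call site to fail the operation gracefully instead of risking an OOM.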
The estimate is built from the allocations directly attributable to each newly discovered DFA state during powerset construction. Rather than invoking the callback on every new state (which would add overhead proportional to state count), estimates are accumulated and reported in chunks once a configurable byte threshold is crossed. Any remainder is flushed once at the end.
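The chunked reporting scheme described above can be sketched as follows. This is an illustrative stand-alone reduction of the mechanism, not the PR's actual code; `ChunkedReporter`, `onNewState`, and `finish` are hypothetical names:

```java
import java.util.function.LongConsumer;

// Sketch of threshold-based chunked reporting: per-state byte estimates are
// accumulated locally and forwarded to the callback only once a threshold is
// crossed, with a final flush for any remainder when construction ends.
public class ChunkedReporter {
  private final LongConsumer callback;
  private final long thresholdBytes;
  private long pending; // bytes accumulated since the last report

  public ChunkedReporter(LongConsumer callback, long thresholdBytes) {
    this.callback = callback;
    this.thresholdBytes = thresholdBytes;
  }

  // Called once per newly discovered DFA state with its estimated footprint.
  public void onNewState(long estimatedBytes) {
    pending += estimatedBytes;
    if (pending >= thresholdBytes) {
      callback.accept(pending); // report one chunk, then reset
      pending = 0;
    }
  }

  // Called once when determinization completes: flush the remainder.
  public void finish() {
    if (pending > 0) {
      callback.accept(pending);
      pending = 0;
    }
  }
}
```

The design trade-off is that a larger threshold means fewer callback invocations (lower overhead) but a coarser signal: the caller's budget can be overshot by up to one threshold's worth of allocation before it gets a chance to abort.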