I tried to create a Java UDAF to perform custom theta sketch operations, but the final sketch is always empty. I get a correct result once I force single-stage aggregation. Below is an AI-generated summary of the observed issue.
Summary
Java UDAFs can return incorrect results when StarRocks plans aggregation as multi-stage (update serialize + merge finalize), while the same UDAF works correctly when forced to single-stage aggregation.
Observed behavior
- With the default planner, grouped aggregation uses a two-stage plan:
  - AGGREGATE (update serialize)
  - AGGREGATE (merge finalize)
- The final result is incorrect.
- In the reported theta-sketch repro, the output collapses to an empty sketch (AQMDAAAeAAA=).
- Forcing single-stage aggregation (new_planner_agg_stage = 1) returns the correct result.
Expected behavior
Multi-stage aggregation should produce the same result as single-stage aggregation for a Java UDAF that correctly implements:
create
update
serializeLength
serialize
merge
finalize
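As a point of reference for the expectation above, here is a minimal sketch of a UDAF with that method set, plus a plain-Java simulation of the two plans. The class name SumIntUdaf, the destroy method, the placement of serializeLength on the state class, and the singleStage/twoStage helpers are assumptions made for illustration; the simulation does not go through StarRocks at all, it only shows the equivalence a correct engine should preserve.

```java
import java.nio.ByteBuffer;

// Hypothetical UDAF that sums integers; all names here are illustrative.
class SumIntUdaf {
    // Aggregation state. serializeLength() reports how many bytes serialize() writes.
    public static class State {
        long sum = 0;
        public int serializeLength() { return Long.BYTES; }
    }

    public State create() { return new State(); }
    public void destroy(State state) { }
    public void update(State state, Integer value) {
        if (value != null) state.sum += value;
    }
    public void serialize(State state, ByteBuffer buffer) { buffer.putLong(state.sum); }
    public void merge(State state, ByteBuffer buffer) { state.sum += buffer.getLong(); }
    public Long finalize(State state) { return state.sum; }

    // Single-stage plan: update every row into one state, then finalize.
    static long singleStage(int[] rows) {
        SumIntUdaf udaf = new SumIntUdaf();
        State state = udaf.create();
        for (int row : rows) udaf.update(state, row);
        return udaf.finalize(state);
    }

    // Two-stage plan: two partial states are updated, serialized,
    // merged into a fresh state, and only then finalized.
    static long twoStage(int[] rows, int split) {
        SumIntUdaf udaf = new SumIntUdaf();
        State left = udaf.create();
        State right = udaf.create();
        for (int i = 0; i < rows.length; i++) {
            udaf.update(i < split ? left : right, rows[i]);
        }
        State merged = udaf.create();
        for (State partial : new State[] {left, right}) {
            ByteBuffer buffer = ByteBuffer.allocate(partial.serializeLength());
            udaf.serialize(partial, buffer);
            buffer.flip(); // switch from writing to reading
            udaf.merge(merged, buffer);
        }
        return udaf.finalize(merged);
    }

    public static void main(String[] args) {
        int[] rows = {1, 2, 3, 4, 5};
        System.out.println(singleStage(rows) == twoStage(rows, 2));
    }
}
```

If a UDAF passes this kind of local check but still returns different results under the two plans, the difference is more likely to come from how the engine builds the serialize/merge arguments than from the UDAF itself.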
Why this looks like a StarRocks bug
1. Wrong Java array element type in repeated-object helper
File: be/src/udf/java/java_udf.cpp
Function: JVMFunctionHelper::create_object_array()
Issue:
- This helper creates a Java array using _object_array_class as the element class.
- That means it builds an array of Object[] elements rather than an array of Object elements.
- However, callers use it to create repeated arrays of:
  - ByteBuffer
  - Java UDAF state objects
- This helper is used in Java UDAF multi-stage paths, including:
  - be/src/exprs/agg/java_udaf_function.h
    - convert_to_serialize_format()
    - batch_serialize()
    - merge_batch_single_state()
This likely breaks batched serialize/merge argument construction.
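The element-type mismatch described above can be illustrated in pure Java. This is only an analogy, not the StarRocks code: the fillRepeated helper is hypothetical and uses reflection where the backend uses JNI, and how the native call behaves with a mismatched element class is not shown here. What it does show is that an array whose element class is Object[] cannot legally hold a ByteBuffer.

```java
import java.lang.reflect.Array;
import java.nio.ByteBuffer;

class ArrayElementTypeDemo {
    // Creates an array whose element class is elementClass and tries to
    // store the same value in every slot. Returns false if the JVM rejects the store.
    static boolean fillRepeated(Class<?> elementClass, Object value, int length) {
        Object[] array = (Object[]) Array.newInstance(elementClass, length);
        try {
            for (int i = 0; i < length; i++) {
                array[i] = value;
            }
            return true;
        } catch (ArrayStoreException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        // Element class Object: the result is an Object[], which accepts a ByteBuffer.
        System.out.println(fillRepeated(Object.class, buffer, 3));   // prints true
        // Element class Object[]: the result is an Object[][], which does not.
        System.out.println(fillRepeated(Object[].class, buffer, 3)); // prints false
    }
}
```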
2. batch_create_bytebuf() ignores the slice start offset
File: be/src/udf/java/java_udf.cpp
Function: JVMFunctionHelper::batch_create_bytebuf(unsigned char* ptr, const uint32_t* offset, int begin, int end)
Issue:
- The function accepts begin/end, but copies offsets starting from offset instead of offset + begin.
- As a result, when a partial range is requested, the generated direct byte buffers can point to the wrong serialized slices.
This affects batched merge logic in:
- be/src/exprs/agg/java_udaf_function.h
  - _merge_batch_process()
  - merge_batch_single_state()
  - merge_batch()
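To make the mis-slicing concrete, here is a small Java model of the offset handling described above. It is a sketch under assumptions, not a port of batch_create_bytebuf: the slices helper, its honorBegin flag, and the layout (row i occupies bytes offsets[i] up to offsets[i + 1]) are hypothetical stand-ins for the native logic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SliceOffsetDemo {
    // Returns one buffer per row in [begin, end). Row i is assumed to span
    // data[offsets[i]] up to (not including) data[offsets[i + 1]].
    // honorBegin = false models copying offsets from index 0 instead of begin.
    static ByteBuffer[] slices(byte[] data, int[] offsets, int begin, int end, boolean honorBegin) {
        int count = end - begin;
        int[] window = new int[count + 1];
        System.arraycopy(offsets, honorBegin ? begin : 0, window, 0, count + 1);
        ByteBuffer[] out = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            out[i] = ByteBuffer.wrap(data, window[i], window[i + 1] - window[i]).slice();
        }
        return out;
    }

    // Decodes a slice as ASCII without disturbing its position.
    static String text(ByteBuffer buffer) {
        return StandardCharsets.US_ASCII.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] data = "aabbbcccc".getBytes(StandardCharsets.US_ASCII);
        int[] offsets = {0, 2, 5, 9}; // three rows: "aa", "bbb", "cccc"
        // Request rows 1 and 2 only.
        for (ByteBuffer b : slices(data, offsets, 1, 3, true)) {
            System.out.println("with begin:    " + text(b));
        }
        for (ByteBuffer b : slices(data, offsets, 1, 3, false)) {
            System.out.println("without begin: " + text(b));
        }
    }
}
```

In this model, a request for rows 1 and 2 yields "bbb" and "cccc" when begin is honored, and "aa" and "bbb" when it is not, so a merge over a partial range would read the serialized state of the wrong rows.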
Why this matches the user-visible symptom
- Single-stage aggregation mostly uses direct update/finalize execution and avoids the intermediate serialize/merge path.
- Multi-stage aggregation relies on the Java UDAF batched serialization and merge helpers above.
- If those helpers build invalid argument arrays or mis-slice serialized buffers, merge can consume incorrect input and produce empty/corrupted final state.
Reproduction outline
- Register a Java UDAF whose state is serialized to ByteBuffer and merged from ByteBuffer
- Use grouped aggregation on multiple rows per group
- Observe:
  - incorrect result under the default multi-stage plan
  - correct result with set new_planner_agg_stage = 1