Java UDAF returns incorrect results in multi-stage aggregation due to broken batched serialize/merge argument handling #71544

@j4k4

I tried to create a Java UDAF to perform custom theta sketch operations, but the final sketch is always empty. I get a correct result once I force single-stage aggregation. Below is an AI-generated summary of the observed issue.

Summary

Java UDAFs can return incorrect results when StarRocks plans aggregation as multi-stage (update serialize + merge finalize), while the same UDAF works correctly when forced to single-stage aggregation.

Observed behavior

  • With the default planner, grouped aggregation uses a two-stage plan:
    • AGGREGATE (update serialize)
    • AGGREGATE (merge finalize)
  • The final result is incorrect.
  • In the reported theta-sketch repro, the output collapses to an empty sketch (AQMDAAAeAAA=).
  • Forcing single-stage aggregation (new_planner_agg_stage = 1) returns the correct result.

Expected behavior

Multi-stage aggregation should produce the same result as single-stage aggregation for a Java UDAF that correctly implements:

  • create
  • update
  • serializeLength
  • serialize
  • merge
  • finalize
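To make the expectation concrete, here is a minimal sketch of the invariant, not the actual theta-sketch UDAF. It assumes the method shape StarRocks documents for Java UDAFs (a nested State class with serializeLength(), plus create/update/serialize/merge/finalize). The class names SumUdaf and TwoStageCheck, and the round-robin split into partial states, are hypothetical and only emulate what a two-stage plan does.

```java
import java.nio.ByteBuffer;

class TwoStageCheck {
    // Hypothetical UDAF (a simple sum) following the documented Java UDAF method shape.
    public static class SumUdaf {
        public static class State {
            long sum = 0;
            public int serializeLength() { return Long.BYTES; }
        }
        public State create() { return new State(); }
        public void destroy(State state) {}
        public void update(State state, Long val) { if (val != null) state.sum += val; }
        public void serialize(State state, ByteBuffer buf) { buf.putLong(state.sum); }
        public void merge(State state, ByteBuffer buf) { state.sum += buf.getLong(); }
        public Long finalize(State state) { return state.sum; }
    }

    // Single stage: update + finalize on one state, no serialize/merge.
    static long singleStage(long[] rows) {
        SumUdaf udaf = new SumUdaf();
        SumUdaf.State state = udaf.create();
        for (long r : rows) udaf.update(state, r);
        return udaf.finalize(state);
    }

    // Two stages: update + serialize per partial state, then merge + finalize.
    static long twoStage(long[] rows, int partitions) {
        SumUdaf udaf = new SumUdaf();
        SumUdaf.State[] partial = new SumUdaf.State[partitions];
        for (int p = 0; p < partitions; p++) partial[p] = udaf.create();
        for (int i = 0; i < rows.length; i++) udaf.update(partial[i % partitions], rows[i]);
        SumUdaf.State fin = udaf.create();
        for (SumUdaf.State s : partial) {
            ByteBuffer buf = ByteBuffer.allocate(s.serializeLength());
            udaf.serialize(s, buf);
            buf.flip(); // switch the buffer from writing to reading
            udaf.merge(fin, buf);
        }
        return udaf.finalize(fin);
    }

    public static void main(String[] args) {
        long[] rows = {1, 2, 3, 4, 5};
        // prints "single=15 two=15"
        System.out.println("single=" + singleStage(rows) + " two=" + twoStage(rows, 2));
    }
}
```

A UDAF that passes this kind of check in plain Java but fails under the default plan points at the engine-side serialize/merge plumbing rather than at the UDAF itself.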

Why this looks like a StarRocks bug

1. Wrong Java array element type in repeated-object helper

File:

  • be/src/udf/java/java_udf.cpp

Function:

  • JVMFunctionHelper::create_object_array()

Issue:

  • This helper creates a Java array using _object_array_class as the element class.
  • That means it builds an array of Object[] elements rather than an array of Object elements.
  • However, callers use it to create repeated arrays of:
    • ByteBuffer
    • Java UDAF state objects
  • This helper is used in Java UDAF multi-stage paths, including:
    • be/src/exprs/agg/java_udaf_function.h
      • convert_to_serialize_format()
      • batch_serialize()
      • merge_batch_single_state()

This likely breaks batched serialize/merge argument construction.
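A pure-Java analogue of the suspected element-type problem, as a rough illustration only. It does not reproduce the JNI code in java_udf.cpp. It just shows that an array whose element class is Object[] rejects a ByteBuffer, while one whose element class is Object accepts it. The class name ArrayElementTypeDemo and the canStore helper are hypothetical.

```java
import java.lang.reflect.Array;
import java.nio.ByteBuffer;

class ArrayElementTypeDemo {
    // Creates a one-element array with the given element class and tries to store a value in it.
    static boolean canStore(Class<?> elementClass, Object value) {
        Object[] arr = (Object[]) Array.newInstance(elementClass, 1);
        try {
            arr[0] = value; // the JVM checks the runtime element type on every store
            return true;
        } catch (ArrayStoreException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        // Element class Object: a ByteBuffer fits.
        System.out.println(canStore(Object.class, buf));   // prints "true"
        // Element class Object[]: the array is really Object[][], so a ByteBuffer is rejected.
        System.out.println(canStore(Object[].class, buf)); // prints "false"
    }
}
```

If the native helper behaves the same way, stores of ByteBuffer or state objects into the array would fail or leave elements unset. Whether that is what happens in the JNI path is an assumption that needs checking against the actual code.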

2. batch_create_bytebuf() ignores the slice start offset

File:

  • be/src/udf/java/java_udf.cpp

Function:

  • JVMFunctionHelper::batch_create_bytebuf(unsigned char* ptr, const uint32_t* offset, int begin, int end)

Issue:

  • The function accepts begin/end, but copies offsets starting from offset instead of offset + begin.
  • As a result, when a partial range is requested, the generated direct byte buffers can point to the wrong serialized slices.
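The effect of ignoring begin can be sketched in plain Java. This is an illustration under assumptions, not the StarRocks code: it assumes the offsets array holds n + 1 cumulative positions for n serialized rows, and the names OffsetSliceDemo and slices are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class OffsetSliceDemo {
    // Returns the serialized slices for rows [begin, end).
    // honorBegin = false mimics reading offsets from index 0 instead of index begin.
    static List<String> slices(byte[] data, int[] offsets, int begin, int end, boolean honorBegin) {
        int base = honorBegin ? begin : 0;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < end - begin; i++) {
            int start = offsets[base + i];
            int length = offsets[base + i + 1] - start;
            out.add(new String(data, start, length, StandardCharsets.US_ASCII));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three serialized states packed back to back: "aa", "bbb", "cccc".
        byte[] data = "aabbbcccc".getBytes(StandardCharsets.US_ASCII);
        int[] offsets = {0, 2, 5, 9};
        System.out.println(slices(data, offsets, 1, 3, true));  // prints "[bbb, cccc]"
        System.out.println(slices(data, offsets, 1, 3, false)); // prints "[aa, bbb]"
    }
}
```

With begin = 0 both variants agree, which would explain why the problem only shows up when a partial range is merged.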

This affects batched merge logic in:

  • be/src/exprs/agg/java_udaf_function.h
    • _merge_batch_process()
    • merge_batch_single_state()
    • merge_batch()

Why this matches the user-visible symptom

  • Single-stage aggregation mostly uses direct update/finalize execution and avoids the intermediate serialize/merge path.
  • Multi-stage aggregation relies on the Java UDAF batched serialization and merge helpers above.
  • If those helpers build invalid argument arrays or mis-slice serialized buffers, merge can consume incorrect input and produce empty/corrupted final state.

Reproduction outline

  • Register a Java UDAF whose state is serialized to ByteBuffer and merged from ByteBuffer
  • Use grouped aggregation on multiple rows per group
  • Observe:
    • incorrect result under the default multi-stage plan
    • correct result with set new_planner_agg_stage = 1
