feat: support binary and string types for concat UDFs #22244

Open: theirix wants to merge 1 commit into apache:main from theirix:concat-binary-udf

@theirix (Contributor) commented May 16, 2026

Which issue does this PR close?

Rationale for this change

While #21883 introduced binary argument support for the pipe operator, this PR extends that support to three UDFs (concat, concat_ws, and Spark's concat) and harmonises their behaviour.

We decided to support only string+string and binary+binary and to reject mixed operations, matching the behaviour of the majority of engines. Previously, mixed operations were allowed in #20787.
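The rule above can be illustrated with a minimal, std-only sketch. `ArgKind` and `resolve_concat_kind` are hypothetical names invented for this illustration and are not taken from the PR; the real implementation works on Arrow data types, and the treatment of an empty argument list here is an assumption.

```rust
/// Hypothetical, simplified stand-in for the argument types involved.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgKind {
    Str,
    Bin,
}

/// Returns the result kind of a concat call, or an error for mixed input.
/// Only string+string and binary+binary combinations are accepted.
fn resolve_concat_kind(args: &[ArgKind]) -> Result<ArgKind, String> {
    // Assumption: an empty argument list is treated as an error here.
    let (&first, rest) = args
        .split_first()
        .ok_or_else(|| "concat needs at least one argument".to_string())?;
    for (offset, &kind) in rest.iter().enumerate() {
        if kind != first {
            return Err(format!(
                "argument {} is {:?} but argument 0 is {:?}: mixed string/binary concat is rejected",
                offset + 1,
                kind,
                first
            ));
        }
    }
    Ok(first)
}
```

With this sketch, `[Str, Str]` resolves to `Str`, `[Bin, Bin]` resolves to `Bin`, and `[Str, Bin]` produces an error instead of being silently coerced to a string.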

What changes are included in this PR?

  • Added support for binary types (four variants)
  • Disallowed mixed string/binary operations
  • Fixed edge cases where binary->string type coercion overrode the UDF rules and let mixed calls through
  • Refactored the three UDFs by extracting duplicate code into shared helpers to keep the logic centralised
  • Refined concat_ws behaviour for different separator types

The diff is quite large. Detailed code changes:

  • Added a trait ConcatBuilder to abstract string/binary/view/array operations
  • Abstracted StringArray/LargeStringArray into a generic ConcatGenericStringBuilder - less duplication
  • Introduced mirrored builders for binary types in a new file binaries.rs
  • Introduced more ColumnarValueRef variants to handle nullable and non-nullable binary types
  • Extracted the ColumnarValue -> ColumnarValueRef builder to from_columnar_value - simplified call sites. The scalar code path stays mostly the same
  • Switched from Signature::variadic to Signature::UserDefined to allow different argument types. Variadic required every argument (including binaries) to be coerced to the same string type, so the UDF could not distinguish between binary and string inputs. This is a relatively uncommon choice - happy to discuss
  • Simplified Spark's concat significantly by reusing the concat implementation, so it now handles only the Spark-specific null handling
  • Moved SLTs from Spark SLT (spark/concat.slt) to a generic SLT, so we test both generic and Spark-specific behaviour
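
The builder abstraction and the reuse by Spark's concat can be pictured with a std-only sketch. The trait below borrows only the name ConcatBuilder from the description; its methods, the `StringRows` and `BinaryRows` types, and the `any_null_is_null` flag are hypothetical and do not reflect the PR's actual signatures. It also assumes that the generic concat skips NULL arguments and that the Spark-specific rule is "any NULL argument yields NULL".

```rust
/// Hypothetical builder abstraction: a row is assembled from fragments,
/// whether the payload is text (`str`) or raw bytes (`[u8]`).
trait ConcatBuilder: Default {
    type Item: ?Sized;
    type Owned;
    /// Appends one fragment to the row currently being built.
    fn write(&mut self, fragment: &Self::Item);
    /// Closes the current row; `valid == false` records a NULL row.
    fn finish_row(&mut self, valid: bool);
    fn finish(self) -> Vec<Option<Self::Owned>>;
}

#[derive(Default)]
struct StringRows {
    current: String,
    rows: Vec<Option<String>>,
}

impl ConcatBuilder for StringRows {
    type Item = str;
    type Owned = String;
    fn write(&mut self, fragment: &str) {
        self.current.push_str(fragment);
    }
    fn finish_row(&mut self, valid: bool) {
        let row = std::mem::take(&mut self.current);
        self.rows.push(if valid { Some(row) } else { None });
    }
    fn finish(self) -> Vec<Option<String>> {
        self.rows
    }
}

/// Mirrored builder for binary payloads.
#[derive(Default)]
struct BinaryRows {
    current: Vec<u8>,
    rows: Vec<Option<Vec<u8>>>,
}

impl ConcatBuilder for BinaryRows {
    type Item = [u8];
    type Owned = Vec<u8>;
    fn write(&mut self, fragment: &[u8]) {
        self.current.extend_from_slice(fragment);
    }
    fn finish_row(&mut self, valid: bool) {
        let row = std::mem::take(&mut self.current);
        self.rows.push(if valid { Some(row) } else { None });
    }
    fn finish(self) -> Vec<Option<Vec<u8>>> {
        self.rows
    }
}

/// Shared row-wise concat. With `any_null_is_null == false`, NULL inputs are
/// skipped (assumed generic behaviour); with `true`, a NULL in any argument
/// makes the whole row NULL (assumed Spark behaviour).
fn concat_rows<B: ConcatBuilder>(
    rows: &[Vec<Option<&B::Item>>],
    any_null_is_null: bool,
) -> Vec<Option<B::Owned>> {
    let mut builder = B::default();
    for row in rows {
        let has_null = row.iter().any(|fragment| fragment.is_none());
        for fragment in row.iter().copied().flatten() {
            builder.write(fragment);
        }
        builder.finish_row(!(any_null_is_null && has_null));
    }
    builder.finish()
}
```

The point of the design is that the row loop is written once: the string and binary paths differ only in the builder type, and the Spark variant differs only in how it decides row validity.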

Are these changes tested?

  • Added more unit tests for previously uncovered major code paths
  • Added more SLTs, especially for type coercion

Are there any user-facing changes?

No

@github-actions (bot) added the logical-expr, sqllogictest, functions, and spark labels May 16, 2026
@theirix marked this pull request as ready for review May 16, 2026 10:01
Development

Successfully merging this pull request may close these issues.

Binary string (BYTEA, Binary) concatenation
