-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add type coercion rule for concat
and concat_ws
#3721
Conversation
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Thanks @HaoYang670. This is looking good. I ran the example query in Postgres and compared it with DataFusion: Postgres
DataFusion
I have two observations:
|
Hmm, this is also the behavior of the cast kernel in arrow-rs: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/cast.rs#L704 |
I think 1 and 0 makes sense for Arrow. In DataFusion we'll need to add additional logic to match Postgres. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @HaoYang670
Postgres produces t and f when casting bool to string and we are producing 1 and 0
Hmm, this is also the behavior of the cast kernel in arrow-rs:
I think 1 and 0 makes sense for Arrow. In DataFusion we'll need to add additional logic to match Postgres.
I don't think this PR changes (or should change) the behavior of how booleans are cast to strings. I recommend we file a follow on issue / PR to sort that out.
Expr::ScalarFunction { fun, args } => match fun { | ||
BuiltinScalarFunction::Concat | ||
| BuiltinScalarFunction::ConcatWithSeparator => { | ||
let new_args = args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should do something with LargeUtf8
?
Also, would it make sense to check the types before clone()
ing them to do a cast that might not be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to check the types before clone()ing them to do a cast that might not be needed?
I think this has been done in the cast_to
function:
fn cast_to<S: ExprSchema>(self, cast_to_type: &DataType, schema: &S) -> Result<Expr> {
// TODO(kszucs): most of the operations do not validate the type correctness
// like all of the binary expressions below. Perhaps Expr should track the
// type of the expression?
let this_type = self.get_type(schema)?;
if this_type == *cast_to_type {
Ok(self)
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should do something with LargeUtf8?
This is a good suggestion. My opinion is that we could use LargeUtf8
if one of the arguments has this type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there type coercion rule for the function Concat
or ConcatWithSeparator
?
Now the type coercion are not supported in the logical phase for some expr which is Expr::ScalarFunction
, Expr::AggregateFunction
,Expr::WindowFunction
and Expr::AggregateUDF
in the follow-up pr for this #3582 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think after moving the type coercion
rule to the logical phase, this issue can be resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean about "type coercion rule for the function Concat or ConcatWithSeparator"
Since it is a Expr::ScalarFunction { fun, args }
it currently gets coerced using data_types
https://github.com/apache/arrow-datafusion/blob/3eb55e9a0510d872f6f7765b1a5f17db46486e45/datafusion/expr/src/type_coercion.rs#L44-L47
Are you suggesting we move the logic that picks what argument types (in this case string) for concat into data_types
? (I think this is a good idea, for the record)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can do it in the #3582 (comment)
And this pr can be merged first.
Signed-off-by: remzi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
But I will move the type coercion
for Expr:: ScalarFunction
to logical phase, and remove the type coercion
in the physical phase.
@HaoYang670 Sorry for that commit aa0d14c |
I pushed c6e1208 both to fix this PR as well as test to see if github actions are fixed yet. Seems they are not :( |
c6e1208
to
63c2587
Compare
Thanks again @HaoYang670 |
Benchmark runs are scheduled for baseline = fef45e7 and contender = d863853. d863853 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Signed-off-by: remzi [email protected]
Which issue does this PR close?
Closes #3720.
Rationale for this change
Before
After
What changes are included in this PR?
Are there any user-facing changes?