Add type coercion rule for concat and concat_ws #3721

Merged: 6 commits into apache:master from the 3720_type_coercion_concat branch, Oct 7, 2022

Conversation

HaoYang670
Contributor

@HaoYang670 HaoYang670 commented Oct 5, 2022

Signed-off-by: remzi [email protected]

Which issue does this PR close?

Closes #3720.

Rationale for this change

Before

❯ create table t as select 1 as a;

❯ explain verbose select concat(a, 'utf8', true, false, null, 1, 0.2) from t;
+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                                                                                                        |
+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Projection: concat(t.a, Utf8("utf8"), Boolean(true), Boolean(false), NULL, Int64(1), Float64(0.2))                                                                                                                                          |
|                                                            |   TableScan: t                                                                                                                                                                                                                              |
| logical_plan after type_coercion                           | SAME TEXT AS ABOVE                                                                                                                                                                                                                          |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                                                                                                                                                                                          |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE 
...

After

❯ create table t as select 1 as a;

❯ explain verbose select concat(a, 'utf8', true, false, null, 1, 0.2) from t;
+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                                                              |
+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Projection: concat(t.a, Utf8("utf8"), Boolean(true), Boolean(false), NULL, Int64(1), Float64(0.2))                                                                                                |
|                                                            |   TableScan: t                                                                                                                                                                                    |
| logical_plan after type_coercion                           | Projection: concat(CAST(t.a AS Utf8), Utf8("utf8"), CAST(Boolean(true) AS Utf8), CAST(Boolean(false) AS Utf8), CAST(NULL AS Utf8), CAST(Int64(1) AS Utf8), CAST(Float64(0.2) AS Utf8))            |
|                                                            |   TableScan: t                                                                                                                                                                                    |
| logical_plan after simplify_expressions                    | Projection: concat(CAST(t.a AS Utf8), Utf8("utf8"), Utf8("1"), Utf8("0"), Utf8(NULL), Utf8("1"), Utf8("0.2")) AS concat(t.a,Utf8("utf8"),Boolean(true),Boolean(false),NULL,Int64(1),Float64(0.2)) |
|                                                            |   TableScan: t                                                                                                                                                                                    |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE
...
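
The rewrite visible in the "After" plan (every non-string argument of concat wrapped in a cast to Utf8, string arguments left alone) can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: `DataType`, `Expr`, and `coerce_concat_args` are hypothetical, simplified stand-ins for the DataFusion types and the optimizer rule.

```rust
// Hypothetical, simplified stand-ins for DataFusion's `DataType` and `Expr`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Int64,
    Float64,
    Utf8,
    Null,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A column reference with a known type.
    Column(String, DataType),
    /// A literal, kept as its textual form plus its type.
    Literal(String, DataType),
    /// An explicit cast of an inner expression to a target type.
    Cast(Box<Expr>, DataType),
}

impl Expr {
    /// The type this expression evaluates to.
    fn data_type(&self) -> DataType {
        match self {
            Expr::Column(_, t) | Expr::Literal(_, t) | Expr::Cast(_, t) => t.clone(),
        }
    }
}

/// Wrap every non-Utf8 argument of `concat` in a cast to Utf8.
/// (Assumption: Utf8 is always the target type, as in the plan above.)
fn coerce_concat_args(args: Vec<Expr>) -> Vec<Expr> {
    args.into_iter()
        .map(|arg| {
            if arg.data_type() == DataType::Utf8 {
                arg // already a string: leave untouched
            } else {
                Expr::Cast(Box::new(arg), DataType::Utf8)
            }
        })
        .collect()
}
```

Applied to the arguments `t.a` (Int64), `'utf8'` (Utf8), and `true` (Boolean), this sketch casts the first and third and leaves the second unchanged, mirroring the `logical_plan after type_coercion` row.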

What changes are included in this PR?

Are there any user-facing changes?

@HaoYang670 HaoYang670 marked this pull request as draft October 5, 2022 10:50
@github-actions github-actions bot added the optimizer Optimizer rules label Oct 5, 2022
@HaoYang670 HaoYang670 marked this pull request as ready for review October 5, 2022 12:06
@andygrove
Member

Thanks @HaoYang670. This is looking good.

I ran the example query in Postgres and compared it with DataFusion:

Postgres

postgres=# select concat(a, 'utf8', true, false, null, 1, 0.2) from t;
   concat    
-------------
 1utf8tf10.2

DataFusion

❯ select concat(a, 'utf8', true, false, null, 1, 0.2) from t;
+----------------------------------------------------------------------------------+
| concat(t.a,Utf8("utf8"),Boolean(true),Boolean(false),NULL,Int64(1),Float64(0.2)) |
+----------------------------------------------------------------------------------+
| 1utf81010.2                                                                      |
+----------------------------------------------------------------------------------+

I have two observations:

  • Postgres produces t and f when casting bool to string and we are producing 1 and 0
  • The schema name for the expression looks much nicer in Postgres. Perhaps we can change the Expr::name implementation for concat or scalar functions in general - that is unrelated to this issue though, so I filed Use simpler schema names for expressions #3722 for this
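
To make the first observation concrete, here is a small sketch that reproduces both result strings from the same argument list, differing only in how booleans are rendered. `Value`, `BoolStyle`, and `concat` are hypothetical names invented for this illustration; they are not DataFusion, Arrow, or Postgres APIs.

```rust
/// A hypothetical scalar value, standing in for a SQL literal or column value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Boolean(bool),
    Int64(i64),
    Float64(f64),
    Utf8(String),
}

/// How booleans are rendered when cast to a string.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BoolStyle {
    /// "1" / "0", as in the DataFusion output above.
    Numeric,
    /// "t" / "f", as in the Postgres output above.
    Letter,
}

/// Concatenate the string forms of all values, skipping NULLs.
fn concat(values: &[Value], style: BoolStyle) -> String {
    let mut out = String::new();
    for v in values {
        match v {
            Value::Null => {} // NULL arguments contribute nothing
            Value::Boolean(b) => out.push_str(match (style, *b) {
                (BoolStyle::Numeric, true) => "1",
                (BoolStyle::Numeric, false) => "0",
                (BoolStyle::Letter, true) => "t",
                (BoolStyle::Letter, false) => "f",
            }),
            Value::Int64(i) => out.push_str(&i.to_string()),
            Value::Float64(f) => out.push_str(&f.to_string()),
            Value::Utf8(s) => out.push_str(s),
        }
    }
    out
}
```

With the arguments from the example query (1, 'utf8', true, false, NULL, 1, 0.2), `BoolStyle::Numeric` yields `1utf81010.2` and `BoolStyle::Letter` yields `1utf8tf10.2`, so the two results differ only in the two boolean positions.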

@HaoYang670
Contributor Author

> Postgres produces t and f when casting bool to string and we are producing 1 and 0

Hmm, this is also the behavior of the cast kernel in arrow-rs: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/cast.rs#L704

@andygrove
Member

> Postgres produces t and f when casting bool to string and we are producing 1 and 0
>
> Hmm, this is also the behavior of the cast kernel in arrow-rs: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/cast.rs#L704

I think 1 and 0 makes sense for Arrow. In DataFusion we'll need to add additional logic to match Postgres.

Contributor

@alamb alamb left a comment

Thank you @HaoYang670

> Postgres produces t and f when casting bool to string and we are producing 1 and 0
>
> Hmm, this is also the behavior of the cast kernel in arrow-rs:
>
> I think 1 and 0 makes sense for Arrow. In DataFusion we'll need to add additional logic to match Postgres.

I don't think this PR changes (or should change) how booleans are cast to strings. I recommend we file a follow-on issue / PR to sort that out.

Expr::ScalarFunction { fun, args } => match fun {
    BuiltinScalarFunction::Concat
    | BuiltinScalarFunction::ConcatWithSeparator => {
        let new_args = args
Contributor

I wonder if we should do something with LargeUtf8?

Also, would it make sense to check the types before clone()ing them to do a cast that might not be needed?

Contributor Author

> would it make sense to check the types before clone()ing them to do a cast that might not be needed?

I think this has been done in the cast_to function:

fn cast_to<S: ExprSchema>(self, cast_to_type: &DataType, schema: &S) -> Result<Expr> {
    // TODO(kszucs): most of the operations do not validate the type correctness
    // like all of the binary expressions below. Perhaps Expr should track the
    // type of the expression?
    let this_type = self.get_type(schema)?;
    if this_type == *cast_to_type {
        Ok(self)
...
}

Contributor Author

> I wonder if we should do something with LargeUtf8?

This is a good suggestion. My opinion is that we could use LargeUtf8 if one of the arguments has this type.
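
One possible reading of this suggestion, sketched under assumptions: pick LargeUtf8 as the coercion target if any argument already has that type, and fall back to Utf8 otherwise. `DataType` here is a hypothetical, trimmed-down enum and `concat_target_type` is an invented helper; neither is taken from this PR.

```rust
// Hypothetical, trimmed-down stand-in for Arrow's `DataType`.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Boolean,
    Int64,
    Float64,
    Utf8,
    LargeUtf8,
}

/// Choose the string type that all `concat` arguments would be cast to:
/// LargeUtf8 if any argument is already LargeUtf8, otherwise Utf8.
fn concat_target_type(arg_types: &[DataType]) -> DataType {
    if arg_types.contains(&DataType::LargeUtf8) {
        DataType::LargeUtf8
    } else {
        DataType::Utf8
    }
}
```

Under this sketch, `[Int64, Utf8]` coerces to Utf8, while `[Int64, LargeUtf8, Utf8]` coerces to LargeUtf8, so a single wide argument widens the whole call.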

Contributor

Is there a type coercion rule for the functions Concat or ConcatWithSeparator?

Currently, type coercion in the logical phase is not supported for some expressions: Expr::ScalarFunction, Expr::AggregateFunction, Expr::WindowFunction and Expr::AggregateUDF. These are to be handled in the follow-up PR mentioned in #3582 (comment).

Contributor

I think after moving the type coercion rule to the logical phase, this issue can be resolved

Contributor

I am not sure what you mean by "type coercion rule for the function Concat or ConcatWithSeparator".

Since it is an Expr::ScalarFunction { fun, args }, it currently gets coerced using data_types: https://github.com/apache/arrow-datafusion/blob/3eb55e9a0510d872f6f7765b1a5f17db46486e45/datafusion/expr/src/type_coercion.rs#L44-L47

Are you suggesting we move the logic that picks the argument types (in this case string) for concat into data_types? (I think this is a good idea, for the record.)

Contributor

Yes, we can do it in #3582 (comment).

This PR can be merged first.

Contributor

@liukun4515 liukun4515 left a comment

LGTM.
But I will move the type coercion for Expr::ScalarFunction to the logical phase and remove the type coercion in the physical phase.

@liukun4515
Contributor

@HaoYang670 Sorry about commit aa0d14c.
Can you help fix this?

@alamb
Contributor

alamb commented Oct 6, 2022

I pushed c6e1208 both to fix this PR and to test whether GitHub Actions are fixed yet.

Seems they are not :(

@HaoYang670 HaoYang670 force-pushed the 3720_type_coercion_concat branch from c6e1208 to 63c2587 Compare October 6, 2022 23:53
@alamb
Contributor

alamb commented Oct 7, 2022

Thanks again @HaoYang670

@alamb alamb merged commit d863853 into apache:master Oct 7, 2022
@ursabot

ursabot commented Oct 7, 2022

Benchmark runs are scheduled for baseline = fef45e7 and contender = d863853. d863853 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@HaoYang670 HaoYang670 deleted the 3720_type_coercion_concat branch October 7, 2022 23:26
Development

Successfully merging this pull request may close these issues.

Add type coercion rule for CONCAT and CONCAT_WS
5 participants