Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement substrait for LIKE/ILIKE expr #6840

Merged
merged 7 commits into from
Jul 14, 2023

Conversation

waynexia
Copy link
Member

@waynexia waynexia commented Jul 4, 2023

Which issue does this PR close?

Closes #6731.

Rationale for this change

Support more expressions in substrait

What changes are included in this PR?

Implement Expr::Like and Expr::ILike for datafusion-substrait. This patch doesn't use this definition as our Expr::Like has one more field escape_char (but it's not used/supported in the physical plan lol).

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added the substrait Changes to the substrait crate label Jul 4, 2023
@@ -1120,6 +1151,70 @@ fn make_substrait_window_function(
}
}

#[allow(deprecated)]
fn make_substrait_like_expr(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can combine Expr::Like and Expr::ILike by adding a field ignore_case? cc @alamb

Comment on lines +1199 to +1202
if negated {
let function_anchor = _register_function("not".to_string(), extension_info);

Ok(Expression {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is very common. I'm thinking of wrapping it into something like LogicalPlanBuilder

@@ -1329,3 +1272,66 @@ fn from_substrait_null(null_type: &Type) -> Result<ScalarValue> {
))
}
}

async fn make_datafusion_like(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's time to refactor these large producer.rs and consumer.rs into several small files and organize them by functionality (e.g., produce like and consume like)

@crepererum
Copy link
Contributor

This patch doesn't use this definition as our Expr::Like has one more field escape_char (but it's not used/supported in the physical plan lol).

IIRC I've ran into this issue before as well. I think we should just remove that field and bail out during the SQL->Logical lowering if the SQL parser encounters that (because it seems the logical expr. is modeled after SQL and the physical after whats possible in arrow at the moment). Or we decide that this is a feature we actually want and fix the logical->physical lowering.

Maybe we can combine Expr::Like and Expr::ILike by adding a field ignore_case?

Yes please.

Signed-off-by: Ruihang Xia <[email protected]>
@waynexia
Copy link
Member Author

waynexia commented Jul 4, 2023

This patch doesn't use this definition as our Expr::Like has one more field escape_char (but it's not used/supported in the physical plan lol).

IIRC I've ran into this issue before as well. I think we should just remove that field and bail out during the SQL->Logical lowering if the SQL parser encounters that (because it seems the logical expr. is modeled after SQL and the physical after whats possible in arrow at the moment). Or we decide that this is a feature we actually want and fix the logical->physical lowering.

This is also a solution. I have a little background about this feature. But this is supported in PG (doc), so maybe we are going to implement it in the future?

Maybe we can combine Expr::Like and Expr::ILike by adding a field ignore_case?

Yes please.

#6841

@andygrove
Copy link
Member

@nseekhao fyi

Copy link
Contributor

@nseekhao nseekhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from a few minor comments, everything LGTM!

pattern,
*escape_char,
schema,
col_ref_offset,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[For future improvement. Not completely related to this PR]

I think having to carry around col_ref_offset for any expression-related functions unnecessarily overcrowds the code. Once we have SubqueryAlias support implemented, this should not be necessary anymore. I'll refactor the code when that happens.

@waynexia waynexia requested a review from nseekhao July 9, 2023 12:45
let mut args: Vec<Expr> = vec![];
for arg in f.arguments.iter() {
ScalarFunctionType::Not => {
let arg = f.arguments.first().ok_or_else(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that f.arguments.len() == 1 is true, since first().ok_or_else() would only give us an error if the vector is empty? Or do we not care if there's more than one argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. In the previous code (like deserializing ReadRel) we don't care about extra information from the proto. I.e., we only care about if we can get the data necessary to construct our plan, and just ignore other extra things. But on the other hand, redundant things sometimes indicate an error or unexpected behavior (like finding a remaining screw after recovering something 🫣).

return Err(DataFusionError::NotImplemented(
format!("Invalid arguments type for `{}` expr", fn_name)
))
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if

format!("Invalid arguments type for `{}` expr", fn_name)

or

format!("Invalid arguments type for `{fn_name}` expr")

is preferred in Rust, but maybe we should choose one to be consistent here? It may be confusing to the reader if different syntaxes are used to implement the same semantic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay 👍

BTW, clippy has a lint about this style uninlined_format_args. (but maybe it's unnecessary to enable it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, clippy has a lint about this style uninlined_format_args. (but maybe it's unnecessary to enable it?

Oh thanks for the reference! Inlining the args make sense.

Signed-off-by: Ruihang Xia <[email protected]>
Copy link
Contributor

@nseekhao nseekhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

@waynexia waynexia merged commit 16303ad into apache:main Jul 14, 2023
@waynexia waynexia deleted the substrait-like branch July 14, 2023 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Like and other familiar exprs in substrait
4 participants