-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simplify regex expressions #4646
Conversation
cbf81d2
to
c719afa
Compare
c719afa
to
d29576b
Compare
@@ -45,6 +45,7 @@ datafusion-expr = { path = "../expr", version = "15.0.0" } | |||
datafusion-physical-expr = { path = "../physical-expr", version = "15.0.0" } | |||
hashbrown = { version = "0.13", features = ["raw"] } | |||
log = "^0.4" | |||
regex-syntax = "0.6.28" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically not a new dependency (as datafusion-physical-expr already depends on regex
which depends on regex-syntax
). However, regex is an optional dependency whereas this is not optional.
I think it would be better to move this dependency into datafusion-physical-expr
so regex can still be optional. I will try to do so over the next few days
|
||
// OR-chain | ||
assert_change( | ||
regex_match(col("c1"), lit("foo|bar|baz")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to convert this to an IN <list>
which has a bunch of optimizations https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/expressions/in_list.rs
I remember a discussion about IN <list>
vs OR
chains -- @tustvold do you have any opinion on which is better in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can express it as an IN that will be fastest, but there isn't a IN LIKE AFAIK (although we could implement such a thing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can extend the rewrite logic at one point to spot fully anchored regex expressions (i.e. the ones that span the whole string, not any substring). In this case we could emit EQ/NEQ/IN.
); | ||
} | ||
|
||
#[track_caller] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expr::Like(Like { | ||
negated: false, | ||
expr: Box::new(expr), | ||
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), | |
pattern: Box::new(lit(pattern)), |
Expr::Like(Like { | ||
negated: true, | ||
expr: Box::new(expr), | ||
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), | |
pattern: Box::new(lit(pattern)), |
Expr::ILike(Like { | ||
negated: false, | ||
expr: Box::new(expr), | ||
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), | |
pattern: Box::new(lit(pattern)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in 197e1cf
}) | ||
} | ||
|
||
fn like(expr: Expr, pattern: &str) -> Expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are already functions like this : https://docs.rs/datafusion/15.0.0/datafusion/prelude/enum.Expr.html#method.like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 but it turns out that those make Expr::Like
rather than BinaryOperator::Like
🤔 -- I am not sure why there is duplicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #4765 which I think is fairly serious
I still need to review the rewrite logic carefully |
I wonder if we could eventually push some of this logic into the arrow-rs kernel itself 🤔 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the lowering logic and it looks very nice to me. Thank you @crepererum
|
||
if let Expr::Literal(ScalarValue::Utf8(Some(pattern))) = right.as_ref() { | ||
match regex_syntax::Parser::new().parse(pattern) { | ||
Ok(hir) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relevant docs in case anyone else is looking: https://docs.rs/regex-syntax/0.6.28/regex_syntax/struct.Parser.html#method.parse
.and(not_like(col("c1"), "%baz%")), | ||
); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any tests here for MAX_REGEX_ALTERNATIONS_EXPENSION
-- maybe we could add one more check for foo|bar|baz|blarg|bozo|etc
or something to show it wasn't rewritten
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a conflict with datafusion-cli/Cargo.lock which I resolved.
Since I had to push changes anyways, I also cleaned up the tests and added coverage.
}) | ||
} | ||
|
||
fn like(expr: Expr, pattern: &str) -> Expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 but it turns out that those make Expr::Like
rather than BinaryOperator::Like
🤔 -- I am not sure why there is duplicate
Expr::ILike(Like { | ||
negated: false, | ||
expr: Box::new(expr), | ||
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in 197e1cf
Benchmark runs are scheduled for baseline = ad0459e and contender = d4934e8. d4934e8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4370.
Rationale for this change
Regex expressions are expensive to evaluate. If we can, we shall use our blazing fast arrow
like
kernels instead.What changes are included in this PR?
New expression optimizer rule to lower plan-time-known regexes to
LIKE
expressions.Are these changes tested?
test_simplify_regex
Are there any user-facing changes?
Plans run faster now.