Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify regex expressions #4646

Merged
merged 5 commits into from
Dec 29, 2022
Merged

Conversation

crepererum
Copy link
Contributor

@crepererum crepererum commented Dec 15, 2022

Which issue does this PR close?

Closes #4370.

Rationale for this change

Regex expressions are expensive to evaluate. If we can, we shall use our blazing fast arrow like kernels instead.

What changes are included in this PR?

New expression optimizer rule to lower plan-time-known regexes to LIKE expressions.

Are these changes tested?

test_simplify_regex

Are there any user-facing changes?

Plans run faster now.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Dec 15, 2022
@crepererum crepererum force-pushed the crepererum/issue4370c branch from cbf81d2 to c719afa Compare December 15, 2022 12:04
@crepererum crepererum force-pushed the crepererum/issue4370c branch from c719afa to d29576b Compare December 16, 2022 11:41
@crepererum crepererum marked this pull request as ready for review December 16, 2022 11:42
@github-actions github-actions bot removed logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) physical-expr Physical Expressions core Core DataFusion crate labels Dec 16, 2022
@@ -45,6 +45,7 @@ datafusion-expr = { path = "../expr", version = "15.0.0" }
datafusion-physical-expr = { path = "../physical-expr", version = "15.0.0" }
hashbrown = { version = "0.13", features = ["raw"] }
log = "^0.4"
regex-syntax = "0.6.28"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically not a new dependency (as datafusion-physical-expr already depends on regex which depends on regex-syntax). However, regex is an optional dependency whereas this is not optional.

I think it would be better to move this dependency into datafusion-physical-expr so regex can still be optional. I will try to do so over the next few days


// OR-chain
assert_change(
regex_match(col("c1"), lit("foo|bar|baz")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to convert this to an IN <list> which has a bunch of optimizations https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/expressions/in_list.rs

I remember a discussion about IN <list> vs OR chains -- @tustvold do you have any opinion on which is better in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can express it as an IN that will be fastest, but there isn't a IN LIKE AFAIK (although we could implement such a thing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can extend the rewrite logic at one point to spot fully anchored regex expressions (i.e. the ones that span the whole string, not any substring). In this case we could emit EQ/NEQ/IN.

);
}

#[track_caller]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expr::Like(Like {
negated: false,
expr: Box::new(expr),
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
pattern: Box::new(lit(pattern)),

Expr::Like(Like {
negated: true,
expr: Box::new(expr),
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
pattern: Box::new(lit(pattern)),

Expr::ILike(Like {
negated: false,
expr: Box::new(expr),
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
pattern: Box::new(lit(pattern)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 197e1cf

})
}

fn like(expr: Expr, pattern: &str) -> Expr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 but it turns out that those make Expr::Like rather than BinaryOperator::Like 🤔 -- I am not sure why there is duplicate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #4765 which I think is fairly serious

@alamb
Copy link
Contributor

alamb commented Dec 17, 2022

I still need to review the rewrite logic carefully

@tustvold
Copy link
Contributor

I wonder if we could eventually push some of this logic into the arrow-rs kernel itself 🤔

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the lowering logic and it looks very nice to me. Thank you @crepererum


if let Expr::Literal(ScalarValue::Utf8(Some(pattern))) = right.as_ref() {
match regex_syntax::Parser::new().parse(pattern) {
Ok(hir) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.and(not_like(col("c1"), "%baz%")),
);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any tests here for MAX_REGEX_ALTERNATIONS_EXPENSION -- maybe we could add one more check for foo|bar|baz|blarg|bozo|etc or something to show it wasn't rewritten

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a conflict with datafusion-cli/Cargo.lock which I resolved.

Since I had to push changes anyways, I also cleaned up the tests and added coverage.

})
}

fn like(expr: Expr, pattern: &str) -> Expr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 but it turns out that those make Expr::Like rather than BinaryOperator::Like 🤔 -- I am not sure why there is duplicate

Expr::ILike(Like {
negated: false,
expr: Box::new(expr),
pattern: Box::new(Expr::Literal(ScalarValue::Utf8(Some(pattern.to_owned())))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 197e1cf

@ursabot
Copy link

ursabot commented Dec 29, 2022

Benchmark runs are scheduled for baseline = ad0459e and contender = d4934e8. d4934e8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite simple regex expressions
4 participants