fix subquery where exists distinct #3732

Merged
merged 3 commits into apache:master
Oct 7, 2022

Conversation

b41sh
Contributor

@b41sh b41sh commented Oct 6, 2022

Which issue does this PR close?

Closes #3724

Rationale for this change

What changes are included in this PR?

If the subquery plan is a Distinct, get the Filter from the Projection beneath it.

Are there any user-facing changes?

Contributor

@alamb alamb left a comment

Thanks @b41sh -- I can't argue with the results of this PR and the coverage, but I do wonder if there is a more fundamental problem we are missing. 🤔

@@ -137,8 +137,14 @@ fn optimize_exists(
    let subqry_inputs = query_info.query.subquery.inputs();
    let subqry_input = only_or_err(subqry_inputs.as_slice())
        .map_err(|e| context!("single expression projection required", e))?;
    let subqry_filter = Filter::try_from_plan(subqry_input)
        .map_err(|e| context!("cannot optimize non-correlated subquery", e))?;
    let subqry_filter = match subqry_input {

Contributor

I don't understand why this fixes the error -- what was in the projection in the subquery that caused the problem?

Member

I have been looking at this as well. The existing code works for a very simple projection but does not work if the projection is wrapped in any other operator, such as Distinct, Filter, Limit, Sort, and so on.
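To make that failure mode concrete, here is a minimal sketch using a toy `Plan` enum. The enum, `inputs`, and `filter_below_root` are hypothetical stand-ins and not DataFusion's API; the sketch only assumes that the old logic took the root's single input and required it to be a Filter.

```rust
// Toy plan nodes for illustration only; NOT DataFusion's `LogicalPlan`.
#[derive(Debug)]
enum Plan {
    Scan,
    Filter(Box<Plan>),
    Projection(Box<Plan>),
    Distinct(Box<Plan>),
}

impl Plan {
    /// Direct children of this node, regardless of what kind of node it is.
    fn inputs(&self) -> Vec<&Plan> {
        match self {
            Plan::Scan => vec![],
            Plan::Filter(i) | Plan::Projection(i) | Plan::Distinct(i) => vec![&**i],
        }
    }
}

/// Mimics the assumed old logic: take the root's only input and demand a Filter,
/// without checking which operator the root itself is.
fn filter_below_root(root: &Plan) -> Result<&Plan, String> {
    let inputs = root.inputs();
    if inputs.len() != 1 {
        return Err(format!("expected exactly one input, found {}", inputs.len()));
    }
    let only = inputs[0];
    if matches!(only, Plan::Filter(_)) {
        Ok(only)
    } else {
        Err(format!("expected Filter, found {:?}", only))
    }
}
```

With `Projection(Filter(Scan))` the function finds the Filter, but with `Distinct(Projection(Filter(Scan)))` the root's only input is the Projection, so the lookup fails even though a Filter exists one level further down.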

Member

Although in this case the code is now looking for a Projection wrapping a Filter and isn't looking for a Distinct, so I am also confused.

Member

I understand what is happening now and have some suggestions for improving this rule.

This line of code looks at inputs of the subquery and does not care what type of operator the subquery is. Previously this was assumed to be a Projection but now it could be a Projection or a Distinct, or something else ... I think we should add some pattern matching here.

let subqry_inputs = query_info.query.subquery.inputs();

We are then matching on this input, which was previously expected to be a Filter but could now be a Projection containing a Filter, because the root Distinct shifts everything down by one level.
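The one-level shift described above can be sketched with a small self-contained model. Everything here (the `Plan` enum, its fields, and `subquery_filter`) is a hypothetical illustration rather than DataFusion code; it assumes only two supported shapes, Projection over Filter, optionally wrapped in a single root Distinct.

```rust
// Toy stand-in for a logical plan; NOT DataFusion's `LogicalPlan`.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan(String),
    Filter { predicate: String, input: Box<Plan> },
    Projection { input: Box<Plan> },
    Distinct { input: Box<Plan> },
}

/// Return the filter predicate only for the two shapes this sketch supports:
/// Projection -> Filter, or Distinct -> Projection -> Filter.
fn subquery_filter(subquery: &Plan) -> Result<&str, String> {
    // Peel off at most one root Distinct, which shifts everything down a level.
    let below_distinct: &Plan = match subquery {
        Plan::Distinct { input } => input.as_ref(),
        other => other,
    };
    // Require a Projection sitting directly on top of a Filter.
    match below_distinct {
        Plan::Projection { input } => match input.as_ref() {
            Plan::Filter { predicate, .. } => Ok(predicate.as_str()),
            other => Err(format!("expected Filter below Projection, found {:?}", other)),
        },
        other => Err(format!("expected Projection, found {:?}", other)),
    }
}
```

Because every other shape falls through to an explicit error arm, an unexpected operator such as a Distinct directly over a Filter is rejected instead of being silently treated as a supported case.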

Member

I would like to see something with explicit pattern matching to make sure we are only supporting intended cases. Here is my attempt:

fn optimize_exists(
    query_info: &SubqueryInfo,
    outer_input: &LogicalPlan,
    outer_other_exprs: &[Expr],
) -> datafusion_common::Result<LogicalPlan> {
    // Match only the supported subquery shapes explicitly:
    // Projection -> Filter, or Distinct -> Projection -> Filter.
    let subqry_filter = match query_info.query.subquery.as_ref() {
        LogicalPlan::Distinct(subqry_distinct) => match subqry_distinct.input.as_ref() {
            LogicalPlan::Projection(subqry_proj) => Filter::try_from_plan(&*subqry_proj.input),
            _ => Err(DataFusionError::NotImplemented("todo: error message".to_string())),
        },
        LogicalPlan::Projection(subqry_proj) => Filter::try_from_plan(&*subqry_proj.input),
        _ => Err(DataFusionError::NotImplemented("todo: error message".to_string())),
    }
    .map_err(|e| context!("cannot optimize non-correlated subquery", e))?;

Contributor Author

done, thanks for your advice. @andygrove

Contributor

alamb commented Oct 6, 2022

labeler CI failure is unrelated #3743

Member

@andygrove andygrove left a comment

LGTM. Thanks for making those changes @b41sh

@andygrove andygrove merged commit 1e1de82 into apache:master Oct 7, 2022

ursabot commented Oct 7, 2022

Benchmark runs are scheduled for baseline = de9c7c5 and contender = 1e1de82. 1e1de82 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


Successfully merging this pull request may close these issues.

Add support for DISTINCT projections in decorrelate_where_exists
4 participants