-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[iteration #1] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait #14553
Conversation
…. Do not implicitly add any expressions when building the LogicalPlan.
a4030e9
to
cc0fee8
Compare
self._aggregate(group_expr, aggr_expr, false) | ||
} | ||
|
||
fn _aggregate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super new to Rust -- is this an okay / conventional way to name private helpers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's need for _
since the function is already private (by virtue of not being pub fn
). Something like aggregate_inner
I think is used quite a lot.
Alternatively, given the logicalplanbuilder for aggregate doesn't do that much, we could also just inline it into the substrait consumer. That way it's not changing the LogicalPlanBuilder api, which might be easier.
Or maybe this whole add_group_by_exprs_from_dependencies
thing should move from the plan builder into the analyzer/optimizer? Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute. But that might be a bigger undertaking, so I'd be quite fine with this PR or the alternative above first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute.
We're on the same page here. My first approach actually was to move this out of the LogicalPlanBuilder and into an Analyzer rule.
However, I abandoned that because it would break the supported functionality that allows you to project unique expressions that are not part of the grouping expressions set.
Analyzer rules are run after the logical plan has been constructed. The checks in place to validate projection references (https://github.com/apache/datafusion/blob/main/datafusion/sql/src/select.rs#L803) happens before that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also just inline it into the Substrait consumer.
I also took a stab at that before 😢. But the plan
Arc is private and inaccessible from the Substrait consumer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed the function to aggregate_inner
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, the plan
inside the LogicalPlanBuilder? You could just skip the builder completely:
let input = consumer.consume_rel(input).await?;
...
let group_exprs = normalize_cols(group_exprs, &input)?;
let aggr_exprs = normalize_cols(aggr_exprs, &input)?;
Ok(LogicalPlan::Aggregate(Aggregate::try_new(
Arc::new(input),
group_exprs,
aggr_exprs,
)?))
But either is fine by me. @alamb do you have preferences, or thoughts on this overall (I feel it's weird the LogicalPlanBuilder::aggregate does this magic, but changing that is break, but also adding the aggregate_without_implicit_group_by_exprs
feels a bit sad API...
Thanks, seems like a clear enough bug, appreciate both the report and the PR to fix it! |
Minor code syntax change to maintain variable immutability.
self.aggregate_inner(group_expr, aggr_expr, true) | ||
} | ||
|
||
pub fn aggregate_without_implicit_group_by_exprs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree adding LogicalPlanBuilder::aggregate_without_implicit_group_by_exprs
is not good (especially without documentation explaining the difference)
What I suggest we do (perhaps as a different PR) is to add a flag to the builder to control this behavior
struct LogicalPlanBuilder {
...
/// Should the plan builder add implicit group bys to the plan based on constraints
add_implicit_group_by_exprs: bool,
}
Then when that behavior is needed (in the sql planner) it could be enabled like
input
.with_add_implicit_group_by_exprs(true) // new method to see the flag
.aggregate(group_exprs, aggr_exprs)?
.build()
Is this something you would be willing to try @anlinc or @Blizzara ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm taking a look now!
It does indeed make sense to have this disabled by default, and enabled only on the SQL path.
I also want to experiment with @Blizzara's suggestion -- we could inline the additional expressions change on the SQL plan path instead. Part of why we may not want a variable is:
- It really only applies to one construct in the builder (aggregations).
- It's probably not a popular configuration to use.
Which issue does this PR close?
Closes #14348
Rationale for this change
Substrait plans are intended to be interpreted literally. When you see plan nodes like:
The output mapping (e.g.
[0, 3]
) contains ordinals representing the offset of the target expression(s) within the [input, output] list. If the DataFusion LogicalPlanBuilder is introducing additional input expressions, this violates the plan's intent and will produce the incorrect output mappings. Please see the issue for a concrete example.What changes are included in this PR?
In the Substrait path, do not add additional grouping expressions derived from functional dependencies.
Are these changes tested?
Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.
There should be no changes in the non-Substrait path.
Are there any user-facing changes?
No.