[iteration #1] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait #14553

anlinc · 2025-02-08T00:33:37Z

Which issue does this PR close?

Rationale for this change

Substrait plans are intended to be interpreted literally. When you see plan nodes like:

"project": {
  "common": {
    "emit": {
      "outputMapping": [0, 3]
    }
  },
...
}

The output mapping (e.g. [0, 3]) contains ordinals, each giving the offset of a target expression within the concatenated [input, output] list. If the DataFusion LogicalPlanBuilder introduces additional input expressions, it violates the plan's intent and shifts those offsets, so the output mappings resolve to the wrong expressions. Please see the issue for a concrete example.
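To make the offset problem concrete, here is a minimal sketch in plain Rust. It does not use the Substrait or DataFusion APIs; `apply_emit` and the column names are hypothetical and only model how ordinals are resolved against the concatenated [input, output] list.

```rust
/// Resolve an emit `outputMapping` against the concatenated [input, output] list.
/// Hypothetical helper for illustration; panics if an ordinal is out of range.
fn apply_emit<'a>(input: &[&'a str], output: &[&'a str], mapping: &[usize]) -> Vec<&'a str> {
    // Input fields come first, followed by the expressions the node itself produces.
    let combined: Vec<&'a str> = input.iter().chain(output.iter()).copied().collect();
    mapping.iter().map(|&i| combined[i]).collect()
}

fn main() {
    // What the plan producer intended: three input columns plus one new expression.
    let intended = apply_emit(&["a", "b", "c"], &["a + b"], &[0, 3]);
    println!("{:?}", intended);

    // If the consumer's input silently gains an extra column, ordinal 3 no longer
    // points at the new expression.
    let shifted = apply_emit(&["a", "b", "c", "implicit"], &["a + b"], &[0, 3]);
    println!("{:?}", shifted);
}
```

In the second call the same mapping selects the implicitly added column instead of `a + b`, which is the kind of mismatch described above.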

What changes are included in this PR?

In the Substrait path, do not add additional grouping expressions derived from functional dependencies.
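As a rough illustration of what "grouping expressions derived from functional dependencies" means, the sketch below uses a toy model. `FunctionalDependency` and `group_by_columns` are hypothetical names, not DataFusion types, and the real logic in the plan builder is more involved.

```rust
/// Toy functional dependency: `determinant` uniquely determines every column in `dependents`.
struct FunctionalDependency {
    determinant: &'static str,
    dependents: Vec<&'static str>,
}

/// Returns the grouping columns, optionally extended with columns that the
/// grouping columns determine. With `add_implicit == false` the list is returned as given.
fn group_by_columns(
    group_by: &[&'static str],
    deps: &[FunctionalDependency],
    add_implicit: bool,
) -> Vec<&'static str> {
    let mut result = group_by.to_vec();
    if add_implicit {
        for dep in deps {
            // Only dependencies whose determinant is already grouped on apply.
            if group_by.contains(&dep.determinant) {
                for col in &dep.dependents {
                    if !result.contains(col) {
                        result.push(*col);
                    }
                }
            }
        }
    }
    result
}

fn main() {
    let deps = vec![FunctionalDependency { determinant: "id", dependents: vec!["name"] }];
    // Implicit behavior: grouping on `id` also groups on `name`.
    println!("{:?}", group_by_columns(&["id"], &deps, true));
    // Literal behavior, as a Substrait plan expects: only `id`.
    println!("{:?}", group_by_columns(&["id"], &deps, false));
}
```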

Are these changes tested?

Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.

There should be no changes in the non-Substrait path.

Are there any user-facing changes?

No.


anlinc · 2025-02-10T22:36:39Z

datafusion/expr/src/logical_plan/builder.rs

+        self._aggregate(group_expr, aggr_expr, false)
+    }
+
+    fn _aggregate(


Super new to Rust -- is this an okay / conventional way to name private helpers?

I don't think there's a need for the _ prefix, since the function is already private (by virtue of not being pub fn). Something like aggregate_inner is used quite a lot, I think.

Alternatively, given that LogicalPlanBuilder::aggregate doesn't do that much, we could also just inline it into the Substrait consumer. That way it doesn't change the LogicalPlanBuilder API, which might be easier.

Or maybe this whole add_group_by_exprs_from_dependencies thing should move from the plan builder into the analyzer/optimizer? Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute. But that might be a bigger undertaking, so I'd be quite fine with this PR or the alternative above first.

> Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute.

We're on the same page here. My first approach actually was to move this out of the LogicalPlanBuilder and into an Analyzer rule.

However, I abandoned that because it would break the supported functionality that allows you to project unique expressions that are not part of the grouping expressions set.

Analyzer rules run after the logical plan has been constructed. The checks in place to validate projection references (https://github.com/apache/datafusion/blob/main/datafusion/sql/src/select.rs#L803) happen before that.

> We could also just inline it into the Substrait consumer.

I also took a stab at that before 😢. But the plan Arc is private and inaccessible from the Substrait consumer.

I renamed the function to aggregate_inner :)
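The resulting shape, two public entry points delegating to one private helper, can be sketched as follows. This is a simplified stand-in, not the actual LogicalPlanBuilder: `PlanBuilder`, its `steps` field, and the string-based plan are hypothetical, while the method names follow the diff in this PR.

```rust
/// Simplified stand-in for a plan builder; it records each step as a string.
pub struct PlanBuilder {
    steps: Vec<String>,
}

impl PlanBuilder {
    pub fn new() -> Self {
        Self { steps: Vec::new() }
    }

    /// Default entry point: implicit grouping expressions are allowed.
    pub fn aggregate(self, group_expr: &[&str]) -> Self {
        self.aggregate_inner(group_expr, true)
    }

    /// Literal entry point for callers such as a Substrait consumer.
    pub fn aggregate_without_implicit_group_by_exprs(self, group_expr: &[&str]) -> Self {
        self.aggregate_inner(group_expr, false)
    }

    // Private because it is not `pub`; no leading underscore is needed.
    fn aggregate_inner(mut self, group_expr: &[&str], add_implicit: bool) -> Self {
        self.steps.push(format!(
            "Aggregate(group_by=[{}], implicit={})",
            group_expr.join(", "),
            add_implicit
        ));
        self
    }

    pub fn steps(&self) -> &[String] {
        &self.steps
    }
}

fn main() {
    let builder = PlanBuilder::new().aggregate_without_implicit_group_by_exprs(&["a", "b"]);
    println!("{:?}", builder.steps());
}
```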

Hm, the plan inside the LogicalPlanBuilder? You could just skip the builder completely:

let input = consumer.consume_rel(input).await?;
...
let group_exprs = normalize_cols(group_exprs, &input)?;
let aggr_exprs = normalize_cols(aggr_exprs, &input)?;
Ok(LogicalPlan::Aggregate(Aggregate::try_new(
    Arc::new(input),
    group_exprs,
    aggr_exprs,
)?))

But either is fine by me. @alamb do you have preferences, or thoughts on this overall? (I feel it's weird that LogicalPlanBuilder::aggregate does this magic, but changing that is a breaking change, and adding aggregate_without_implicit_group_by_exprs also feels like a bit of a sad API...)

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

datafusion/expr/src/logical_plan/builder.rs

Blizzara · 2025-02-12T15:12:58Z

Thanks, seems like a clear enough bug, appreciate both the report and the PR to fix it!


alamb · 2025-02-14T15:11:53Z

datafusion/expr/src/logical_plan/builder.rs

+        self.aggregate_inner(group_expr, aggr_expr, true)
+    }
+
+    pub fn aggregate_without_implicit_group_by_exprs(


Yeah, I agree that adding LogicalPlanBuilder::aggregate_without_implicit_group_by_exprs is not good (especially without documentation explaining the difference).

What I suggest we do (perhaps as a different PR) is to add a flag to the builder to control this behavior

struct LogicalPlanBuilder {
    ...
    /// Should the plan builder add implicit group bys to the plan based on constraints
    add_implicit_group_by_exprs: bool,
}

Then when that behavior is needed (in the sql planner) it could be enabled like

input
    .with_add_implicit_group_by_exprs(true) // new method to set the flag
    .aggregate(group_exprs, aggr_exprs)?
    .build()

Is this something you would be willing to try @anlinc or @Blizzara ?
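A self-contained sketch of the flag-based design suggested above might look like this. `ToyPlanBuilder` and `implied_columns` are hypothetical stand-ins for the real builder and its constraint handling; only the flag and method names come from the suggestion, and the flag is assumed to default to off.

```rust
/// Toy builder, not the real LogicalPlanBuilder.
struct ToyPlanBuilder {
    /// Should the builder add implicit group bys based on constraints
    add_implicit_group_by_exprs: bool,
    /// Columns the input's constraints would allow the builder to add
    /// (a stand-in for functional dependencies).
    implied_columns: Vec<String>,
}

impl ToyPlanBuilder {
    fn new(implied_columns: &[&str]) -> Self {
        Self {
            // Assumed default: off, so plans are built literally unless a caller opts in.
            add_implicit_group_by_exprs: false,
            implied_columns: implied_columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    /// Opt in or out of implicit grouping expressions (e.g. from a SQL planner).
    fn with_add_implicit_group_by_exprs(mut self, enabled: bool) -> Self {
        self.add_implicit_group_by_exprs = enabled;
        self
    }

    /// Returns the grouping columns the aggregate would end up with.
    fn aggregate(&self, group_exprs: &[&str]) -> Vec<String> {
        let mut out: Vec<String> = group_exprs.iter().map(|c| c.to_string()).collect();
        if self.add_implicit_group_by_exprs {
            for col in &self.implied_columns {
                if !out.contains(col) {
                    out.push(col.clone());
                }
            }
        }
        out
    }
}

fn main() {
    // Literal path: the flag is left at its default.
    let literal = ToyPlanBuilder::new(&["name"]).aggregate(&["id"]);
    // Opt-in path: the flag is enabled explicitly.
    let sql = ToyPlanBuilder::new(&["name"])
        .with_add_implicit_group_by_exprs(true)
        .aggregate(&["id"]);
    println!("{:?} vs {:?}", literal, sql);
}
```

Compared with a second public method, a flag keeps a single aggregate entry point, at the cost of carrying state that only affects one builder operation.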

I'm taking a look now!

It does indeed make sense to have this disabled by default, and enabled only on the SQL path.

I also want to experiment with @Blizzara's suggestion -- we could inline the additional expressions change on the SQL plan path instead. Part of why we may not want a variable is:

It really only applies to one construct in the builder (aggregations).

It's probably not a popular configuration to use.

anlinc · 2025-02-24T21:14:38Z

@Blizzara @alamb I am closing this in favor of the latest iteration here: #14860, which addresses the discussions in this PR.

github-actions bot added logical-expr Logical plan and expressions substrait labels Feb 8, 2025

anlinc changed the title ~~fix: Do not add implicit groupBy expressions when building logical plans from Substrait~~ fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait Feb 10, 2025

fix(substrait): Substrait input plans should be interpreted literally. Do not implicitly add any expressions when building the LogicalPlan.

cc0fee8

anlinc force-pushed the anlinc/fix_logical_agg_substrait branch from a4030e9 to cc0fee8 Compare February 10, 2025 22:26

anlinc commented Feb 10, 2025


anlinc marked this pull request as ready for review February 10, 2025 22:39

Blizzara reviewed Feb 12, 2025

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

Blizzara reviewed Feb 12, 2025

datafusion/expr/src/logical_plan/builder.rs (outdated)

Rename _aggregate helper to aggregate_inner.

ab20e44

Minor code syntax change to maintain variable immutability.

anlinc requested a review from Blizzara February 12, 2025 23:15

alamb reviewed Feb 14, 2025


anlinc mentioned this pull request Feb 24, 2025

fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait #14860

Open

anlinc closed this Feb 24, 2025

anlinc changed the title ~~fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait~~ [iteration #1] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait Feb 24, 2025
