Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[iteration #1] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait #14553

Closed
wants to merge 2 commits into from

Conversation

anlinc
Copy link

@anlinc anlinc commented Feb 8, 2025

Which issue does this PR close?

Closes #14348

Rationale for this change

Substrait plans are intended to be interpreted literally. When you see plan nodes like:

"project": {
  "common": {
    "emit": {
      "outputMapping": [0, 3]
    }
  },
...
}

The output mapping (e.g. [0, 3]) contains ordinals representing the offset of the target expression(s) within the [input, output] list. If the DataFusion LogicalPlanBuilder is introducing additional input expressions, this violates the plan's intent and will produce the incorrect output mappings. Please see the issue for a concrete example.

What changes are included in this PR?

In the Substrait path, do not add additional grouping expressions derived from functional dependencies.

Are these changes tested?

Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.

There should be no changes in the non-Substrait path.

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions substrait labels Feb 8, 2025
@anlinc anlinc changed the title fix: Do not add implicit groupBy expressions when building logical plans from Substrait fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait Feb 10, 2025
…. Do not implicitly add any expressions when building the LogicalPlan.
@anlinc anlinc force-pushed the anlinc/fix_logical_agg_substrait branch from a4030e9 to cc0fee8 Compare February 10, 2025 22:26
self._aggregate(group_expr, aggr_expr, false)
}

fn _aggregate(
Copy link
Author

@anlinc anlinc Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super new to Rust -- is this an okay / conventional way to name private helpers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's need for _ since the function is already private (by virtue of not being pub fn). Something like aggregate_inner I think is used quite a lot.

Alternatively, given the logicalplanbuilder for aggregate doesn't do that much, we could also just inline it into the substrait consumer. That way it's not changing the LogicalPlanBuilder api, which might be easier.

Or maybe this whole add_group_by_exprs_from_dependencies thing should move from the plan builder into the analyzer/optimizer? Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute. But that might be a bigger undertaking, so I'd be quite fine with this PR or the alternative above first.

Copy link
Author

@anlinc anlinc Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute.

We're on the same page here. My first approach actually was to move this out of the LogicalPlanBuilder and into an Analyzer rule.

However, I abandoned that because it would break the supported functionality that allows you to project unique expressions that are not part of the grouping expressions set.

Analyzer rules are run after the logical plan has been constructed. The checks in place to validate projection references (https://github.com/apache/datafusion/blob/main/datafusion/sql/src/select.rs#L803) happens before that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also just inline it into the Substrait consumer.

I also took a stab at that before 😢. But the plan Arc is private and inaccessible from the Substrait consumer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the function to aggregate_inner :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, the plan inside the LogicalPlanBuilder? You could just skip the builder completely:

        let input = consumer.consume_rel(input).await?;
        ...
        let group_exprs = normalize_cols(group_exprs, &input)?;
        let aggr_exprs = normalize_cols(aggr_exprs, &input)?;
        
        Ok(LogicalPlan::Aggregate(Aggregate::try_new(
            Arc::new(input),
            group_exprs,
            aggr_exprs,
        )?))

But either is fine by me. @alamb do you have preferences, or thoughts on this overall (I feel it's weird the LogicalPlanBuilder::aggregate does this magic, but changing that is break, but also adding the aggregate_without_implicit_group_by_exprs feels a bit sad API...

@anlinc anlinc marked this pull request as ready for review February 10, 2025 22:39
@Blizzara
Copy link
Contributor

Thanks, seems like a clear enough bug, appreciate both the report and the PR to fix it!

Minor code syntax change to maintain variable immutability.
@anlinc anlinc requested a review from Blizzara February 12, 2025 23:15
self.aggregate_inner(group_expr, aggr_expr, true)
}

pub fn aggregate_without_implicit_group_by_exprs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree adding LogicalPlanBuilder::aggregate_without_implicit_group_by_exprs is not good (especially without documentation explaining the difference)

What I suggest we do (perhaps as a different PR) is to add a flag to the builder to control this behavior

struct LogicalPlanBuilder { 
...
  /// Should the plan builder add implicit group bys to the plan based on constraints
  add_implicit_group_by_exprs: bool,
}

Then when that behavior is needed (in the sql planner) it could be enabled like

        input
            .with_add_implicit_group_by_exprs(true) // new method to see the flag
            .aggregate(group_exprs, aggr_exprs)?
            .build()

Is this something you would be willing to try @anlinc or @Blizzara ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking a look now!

It does indeed make sense to have this disabled by default, and enabled only on the SQL path.

I also want to experiment with @Blizzara's suggestion -- we could inline the additional expressions change on the SQL plan path instead. Part of why we may not want a variable is:

  • It really only applies to one construct in the builder (aggregations).
  • It's probably not a popular configuration to use.

@anlinc
Copy link
Author

anlinc commented Feb 24, 2025

@Blizzara @alamb I am closing this in favor of the latest iteration here: #14860, which addresses the discussions in this PR.

@anlinc anlinc closed this Feb 24, 2025
@anlinc anlinc changed the title fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [iteration #1] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[substrait] Synthetically added grouping expressions in Aggregates can cause mismatched output columns
3 participants