Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add one more projection to recover output schema #4733

Merged
merged 7 commits into from
Dec 28, 2022

Conversation

waynexia
Copy link
Member

Which issue does this PR close?

Closes #4575 .

Rationale for this change

common_subexpression_eliminate optimizer doesn't consider the situation that some plans (like Sort or Filter) rely on the input's schema. When introducing an intermediate projection schema that evaluates the common expressions, it also changes the "input schema" for those optimized plans. This patch will check if the output schema changed, and add another projection to recover the output schema if so.

What changes are included in this PR?

Except the extra projection described above, this patch also fixes the problem that the optimizer used to build Projection plan manually, rather than through Projection::try_new() constructor. This will bypass the schema check process and may produce a wrong projection plan, whose schema doesn't match its exprs.

Are these changes tested?

Yes, add a new case filter_schema_changed to cover

Are there any user-facing changes?

no

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Dec 26, 2022
Ok(())
}

#[test]
fn redundant_project_fields() {
let table_scan = test_table_scan().unwrap();
let affected_id: BTreeSet<Identifier> =
["c+a".to_string(), "d+a".to_string()].into_iter().collect();
let expr_set = [
["c+a".to_string(), "b+a".to_string()].into_iter().collect();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes for this and the next test case's changes:

The mock table scan doesn't have a column named "d", so I changed it to "b". And "a", "b" are ambiguous when join, thus I add qualifiers to them.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @waynexia

cc @liukun4515 and @jackwener

@@ -143,10 +143,15 @@ impl OptimizerRule for CommonSubexprEliminate {
)?;

if let Some(predicate) = pop_expr(&mut new_expr)?.pop() {
Ok(Some(LogicalPlan::Filter(Filter::try_new(
let filter = LogicalPlan::Filter(Filter::try_new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for using try_new

if sort.schema() == &input_schema {
Ok(Some(sort))
} else {
Ok(Some(build_recover_project_plan(&input_schema, sort)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be possible to reduce the replication as well as be slightly more "defensive" coding style if instead of build_recover_project_plan on each path, it could be called once to ensure the schema was always the same

Something like

impl OptimizerRule for CommonSubexprEliminate {
    fn try_optimize(
        &self,
        plan: &LogicalPlan,
        config: &dyn OptimizerConfig,
    ) -> Result<Option<LogicalPlan>> {
        let mut expr_set = ExprSet::new();
        let input_schema = Arc::clone(&plan.schema());
 
        let new_plan =  match plan {
          ...
         }?;

         if new_plan.schema() == &input_schema {
           Ok(Some(new_plan))
         } else {
           Ok(Some(build_recover_project_plan(&input_schema, new_plan)))
        }
  }

Copy link
Member Author

@waynexia waynexia Dec 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! It's really a good practice 👍 I change this in 162838c

waynexia and others added 3 commits December 27, 2022 11:10
};

// add an additional projection if the output schema changed.
if optimized_plan.schema() != &original_schema {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 326117c into apache:master Dec 28, 2022
@ursabot
Copy link

ursabot commented Dec 28, 2022

Benchmark runs are scheduled for baseline = cd4fd80 and contender = 326117c. 326117c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@waynexia waynexia deleted the fix-4575 branch December 29, 2022 02:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

common_sub_expression_eliminate exists bug
3 participants