Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose sql_to_statement and statement_to_plan on SessionState #4958

Merged
merged 1 commit into from
Jan 18, 2023

Conversation

avantgardnerio
Copy link
Contributor

Which issue does this PR close?

Closes #4957.

Rationale for this change

Described in issue.

What changes are included in this PR?

Splitting planning and parsing.

Are these changes tested?

Transitively, yes.

Are there any user-facing changes?

New API methods, no breaking changes.

@github-actions github-actions bot added the core Core DataFusion crate label Jan 17, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this API makes a lot of sense.

cc @tustvold who I think had some thoughts in this area, but I believe this PR keeps the same spirit of encapsulation.

Comment on lines +1663 to +1666
use crate::catalog::information_schema::INFORMATION_SCHEMA_TABLES;
use datafusion_sql::parser::Statement as DFStatement;
use sqlparser::ast::*;
use std::collections::hash_map::Entry;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These use statements seem out of place 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed? They just "came along for the ride" following the "do no harm" principle. That's not a great excuse though.

I think they are in the function because this file was becoming very long, and I suspect collisions with parser::Statement and ast::Statement and probably others was becoming untenable so at some point someone just said "I'll keep them local to this function as to avoid having to deal with all that".

I suspect the ultimate solution would be one or more of:

  1. Split this file up into multiple more reasonably sized ones
  2. Rename things such that it is easier to avoid collisions, or keep namespace names short and fully qualify them

@alamb alamb changed the title Split function Expose sql_to_statement and statement_to_plan on SessionContext Jan 17, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

@tustvold tustvold changed the title Expose sql_to_statement and statement_to_plan on SessionContext Expose sql_to_statement and statement_to_plan on SessionState Jan 18, 2023
Comment on lines +1747 to +1751
pub async fn create_logical_plan(&self, sql: &str) -> Result<LogicalPlan> {
let statement = self.sql_to_statement(sql)?;
let plan = self.statement_to_plan(statement).await?;
Ok(plan)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be too late by now, but I have a slightly different proposal for this issue (thanks for addressing it btw!).

Namely, while this solution handles the stated case well it doesn't address the general problem of SqlToRel now being hard/impossible to instantiate properly (with the new trait bound for ContextProvider, which is only implemented for a private struct, and the entire information_schema module also being private further complicating a custom implementation of ContextProvider). This in turn forbids usage of other useful public SqlToRel methods, such as sql_to_expr.

My proposal would be to simply rename statement_to_plan to statement_to_planner, and then return SqlToRel::new(&provider) at line 1740. Subsequently, to address the original issue just revise this part:

Suggested change
pub async fn create_logical_plan(&self, sql: &str) -> Result<LogicalPlan> {
let statement = self.sql_to_statement(sql)?;
let plan = self.statement_to_plan(statement).await?;
Ok(plan)
}
pub async fn create_logical_plan(&self, sql: &str) -> Result<LogicalPlan> {
let statement = self.sql_to_statement(sql)?;
let planner = self.statement_to_planner(statement).await?;
let plan = planner.statement_to_plan(statement);
Ok(plan)
}

while anyone else can then get easy access to the planner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gruuya your suggestion makes sense and I'd be amenable to merging a PR containing the changes you described, but in the interest of expedience, given the approvals on this PR, I'd like to merge it as-is.

If you file an issue, I can take a look next time I'm free.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namely, while this solution handles the stated case well it doesn't address the general problem of SqlToRel now being hard/impossible to instantiate properly (with the new trait bound for ContextProvider, which is only implemented for a private struct, and the entire information_schema module also being private further complicating a custom implementation of ContextProvider). This in turn forbids usage of other useful public SqlToRel methods, such as sql_to_expr.

@gruuya I don't think we have contemplated using SqlToRel directly in the past. If that is a usecase that is important, what I found works well is to make an example demonstrating what you are doing. Here are some examples:

An example has several advantages:

  1. It ensures your usecase is clearly documented so people who are changing DataFusion know it is used and don't break it accidentally
  2. It then gives you an example to start from if the interface changes (you can look at the example to see how the interface may have changed)
  3. It helps others who might have the same need

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point; let me see if I can make a self-contained example for the usecase we have and then try my luck with a PR.

Namely, we use sql_to_expr to convert the WHERE clauses in UPDATE and DELETE statements from sqlparser Exprs into Datafusion Exprs during logical planning. Subsequently, during physical planning of our custom logical extension nodes for UPDATE and DELETE we employ these expressions in order to achieve the desired action using available Datafusion primitives (pruning, scanning, filtering, etc.).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In parallel you may be interested in some work @avantgardnerio is working on here #4902 and likely elsewhere to add better INSERT/UPDATE/DELETE support directly to DataFusion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting; will check it out thanks!

@avantgardnerio avantgardnerio merged commit 2fdc7b8 into apache:master Jan 18, 2023
@avantgardnerio avantgardnerio deleted the bg_split_plan branch January 18, 2023 16:22
@ursabot
Copy link

ursabot commented Jan 18, 2023

Benchmark runs are scheduled for baseline = 7062c2e and contender = 2fdc7b8. 2fdc7b8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Split create_logical_plan() into parsing and planning
5 participants