-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: sql planner creates cross join instead of inner join from select predicates #1566
Conversation
|
Currently, query 8 can't seem to pass due to select
o_year,
from
(
select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
n2.n_name as nation
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL'
) as all_nations
group by
o_year
order by
o_year; master
xudong963:fix_cross_join
|
Thank you @xudong963 -- this looks very cool. I'll try and review it carefully tomorrow 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you possibly provide some tests @xudong963 ?
I don't fully understand how this works -- I was expecting to see code that basically applied an algebraic transformation on predicates like:
(A AND X) OR (A and Y) OR (A and Z)
And rewrote them to
A AND (X OR Y OR Z)
Update:
I was totally confused with q19: https://github.com/xudong963/arrow-datafusion/blob/fix_cross_join/benchmarks/queries/q19.sql
please ignore 🤦
Sure. The resulting correctness test can be overridden by the current test. I can add a test about the logical plan.
The ticket doesn't do the transformation. It does the following thing. First of all, let's see the example:
https://github.com/apache/arrow-datafusion/blob/6f7b2d25fb75c843efed67fbd72d09b2c2d6c2eb/datafusion/src/sql/planner.rs#L718 In the ticket, we can push
|
@xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table. |
Good catch, thanks @pjmore. I think we should process the case because we can't guarantee what kind of queries the user will write 😁 |
fd62c56
to
7d46587
Compare
|
||
#[test] | ||
fn cross_join_not_to_inner_join() { | ||
let sql = "select person.id from person, orders, lineitem where person.id = person.age;"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here @pjmore
datafusion/src/logical_plan/plan.rs
Outdated
@@ -214,8 +220,14 @@ pub struct Extension { | |||
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>, | |||
} | |||
|
|||
impl PartialEq for Extension { | |||
fn eq(&self, _other: &Self) -> bool { | |||
todo!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if need to do something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for implementing PartialEq for all the plan nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you leave todo!()
here it means the code will assert for plans with UserDefinedPlanNode
s and cross joins (which is likely not possible given the inputs are from the FROM
list, and thus shoudl be Select
(for subqueries) or TableScan
for table references
For the future, I think you can change
#[derive(Clone)]
pub struct Extension {
/// The runtime extension operator
pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
}
to the following (aka force the UserDefinedLogicalNode
to implement PartialEq
):
#[derive(Clone, PartialEq)]
pub struct Extension {
/// The runtime extension operator
pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the future, I think you can change
PartialEq
isn't auto-trait, so it can't be used as an additional trait in a trait object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice performance gain @xudong963 👍
datafusion/src/logical_plan/plan.rs
Outdated
@@ -214,8 +220,14 @@ pub struct Extension { | |||
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>, | |||
} | |||
|
|||
impl PartialEq for Extension { | |||
fn eq(&self, _other: &Self) -> bool { | |||
todo!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for implementing PartialEq for all the plan nodes?
datafusion/src/sql/planner.rs
Outdated
.cross_join(&mut_plans[idx])? | ||
.build()?; | ||
} else { | ||
mut_plans.push(mut_plans[idx].clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be helpful to add a simple comment here to explain why pushing the same plan to the end helps. my understanding is we want to retry the join key matching after we go through all the plan once just in case this plan can be indirectly joined with left through other plans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you understand exactly right 👍
As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe API users can benefit from it as well. |
datafusion/src/sql/planner.rs
Outdated
left = | ||
LogicalPlanBuilder::from(left).cross_join(right)?.build()?; | ||
// check [1..idx] if contains current plan to avoid infinite loop | ||
if mut_plans[1..idx].contains(&mut_plans[idx]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we use contains()
which requests Logical Plan
implement PartialEq
trait @houqp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I am thinking that we might be able to avoid this extra PartialEq and contains check overhead by breaking this up into two loops:
- extract the loop logic into its own function
- first loop through
plans
once, for each plan that doesn't have join key matches, we push its index inplans
into a retry vector - second loop to go through the retry vector
This way, we can guarantee every plan in the original plans
vector is evaluated at most twice without getting into an infinite loop while avoiding particleq compare and clones. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh never mind, just noticed @alamb already implemented a similar approach at xudong963#2 :D nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll look it carefully! 👍
@houqp what do you have in mind? With the The only thing I can think of would be to extend So like
To
And
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 -- this is looking much better.
I am a little worried about having to implement PartialEq
for all nodes -- I think we might be able to avoid doing so with a worklist type algorithm.
Here is a sketch of doing so xudong963#2 as perhaps some inspiration
datafusion/src/logical_plan/plan.rs
Outdated
@@ -214,8 +220,14 @@ pub struct Extension { | |||
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>, | |||
} | |||
|
|||
impl PartialEq for Extension { | |||
fn eq(&self, _other: &Self) -> bool { | |||
todo!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you leave todo!()
here it means the code will assert for plans with UserDefinedPlanNode
s and cross joins (which is likely not possible given the inputs are from the FROM
list, and thus shoudl be Select
(for subqueries) or TableScan
for table references
For the future, I think you can change
#[derive(Clone)]
pub struct Extension {
/// The runtime extension operator
pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
}
to the following (aka force the UserDefinedLogicalNode
to implement PartialEq
):
#[derive(Clone, PartialEq)]
pub struct Extension {
/// The runtime extension operator
pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
}
Also, @xudong963 thank you for taking on this task. Really neat to see |
Yes, this is exactly what I have in mind. With this optimizer rule, the sql planner can just naively plan these types of queries as cross joins and let the optimizer rewrite them into inner joins whenever applicable. And dataframe api users can write dumb cross join queries that run as fast as optimized inner join queries. |
Thanks, @alamb ❤️. I look at xudong963#2 carefully and I think the implementation is very elegant, so I directly use it with only minor fixes. |
4923cb5
to
c330dc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again, @alamb PTAL. BTW, the annotation is very nice, taking you as the model!
datafusion/src/sql/planner.rs
Outdated
// Search all remaining plans for the next to | ||
// join. Prefer the first one that has a join | ||
// predicate in the predicate lists | ||
let plan_with_idx = remaining_plans.iter().enumerate().find(|(_idx, plan)| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I use plan_with_idx
to replace idx
if join_keys.is_empty() { | ||
left = | ||
LogicalPlanBuilder::from(left).cross_join(right)?.build()?; | ||
assert!(plan_with_idx.is_none()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very tight!
c330dc0
to
47ef59e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me (though I am somewhat biased :)) @houqp would you like to take another look / review ?
Nice work @xudong963 @alamb ! |
Which issue does this PR close?
Closes #1293
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?