-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redundant join elimination #3808
Redundant join elimination #3808
Conversation
I think this is ready to look at. Maybe some more clippy transgressions to amend or junk like that, but it is (was?) passing all local slt tests, though we might want a larger run just to be sure. The transformation isn't mind-bending, I don't believe, but it isn't nearly as local as other transformations, and it would be good to ensure that it is well-explained. No tests at the moment, and no rush to write any. I'm not sure that they would be anything other than the existing plans we have that have their explanations wiggled around. I'm more concerned about ensuring that we test correctness, and I'm hoping that the full SLT covers that some. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general thrust of this looks great. The approach makes sense to me, and the various match arms that I reviewed seemed sensible. What I haven't done is to build up enough understanding of this transformation myself to be comfortable asserting that it is sound in all possible cases. If you need me to do that, I can, but.. would take a good chunk of time.
Instead I'm going to see about getting the full SLT suite working again. We stopped getting notifications about it when we renamed the main branch, and looks like things have actually been busted for a while before that.
test/sqllogictest/topk.slt
Outdated
query T multiline | ||
EXPLAIN PLAN FOR SELECT state, name FROM | ||
(SELECT DISTINCT state FROM cities) grp | ||
LEFT JOIN LATERAL (SELECT name, pop FROM cities where cities.state = grp.state ORDER BY pop DESC LIMIT 3) ON state = grp.state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ON
clause in this join looks like it might still be tautological?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this, and to recap: it seems like we want true
as the join condition to test left joins as well, is that right? This still requires the additional logic that may introduce nulls, but the joins should still be optimized out (the thing this tests for).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right to me!
src/transform/src/redundant_join.rs
Outdated
// Negate changes the sign on its multiplicities, | ||
// which means "distinct" counts would now be -1. | ||
// We set `exact` to false to inhibit the optimization, | ||
// but should probably fix `.keys` instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this hard to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, probably not. I mean, easy to break it, and I doubt we have anyone relying on the keys of x.negate().negate()
. I polled Slack for takes on this and folks seemed to go in a different direction, discussion-wise.
I'm fine with either punting on that or, better imo, getting @justinj to get his head around it too. I think it would be best if we have at least two people who understand each of the weird things in the code, going forward. |
b2a6e3e
to
78bcd48
Compare
test/sqllogictest/topk.slt
Outdated
query T multiline | ||
EXPLAIN PLAN FOR SELECT state, name FROM | ||
(SELECT DISTINCT state FROM cities) grp | ||
LEFT JOIN LATERAL (SELECT name, pop FROM cities where cities.state = grp.state ORDER BY pop DESC LIMIT 3) ON state = grp.state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 13 unresolved discussions (waiting on @benesch, @frankmcsherry, and @justinj)
src/transform/src/redundant_join.rs, line 79 at r16 (raw file):
} RelationExpr::Get { id, typ } => { // Extract the value provenance, or an empty list if unavailable.
Would this ever be unavailable if the tree is well-formed?
src/transform/src/redundant_join.rs, line 154 at r16 (raw file):
// When we reach the removed relation, we should introduce // references to the columns that are meant to replace these. // This should happen only once, and `.drain(..)` could work.
I don't understand what this comment is saying, what is meant by "could work"?
src/transform/src/redundant_join.rs, line 316 at r16 (raw file):
RelationExpr::Negate { input } => { // Negate does not guarantee that the multiplicity of // each source record it at least one. This could have
nit: is at least one
src/transform/src/redundant_join.rs, line 319 at r16 (raw file):
// been a problem in `Union`, where we might report // that the union of positive and negative records is // "exact": cancelations would make this false.
nit: cancellation
src/transform/src/redundant_join.rs, line 379 at r16 (raw file):
/// Attempts to find column bindings that make `input` redundant. /// /// This method attempts to find evidence that `input` may be redundant by searching
I don't understand this phrasing, doesn't it attempt to find proof that input
is redundant?
src/transform/src/redundant_join.rs, line 386 at r16 (raw file):
/// /// In these circumstances, the claim is that because the key columns are equated and /// determine non key columns (the meaning of a key), any matches between `input` and
nit: don't we use key more strongly than this, to mean each row has multiplicity 1 (which can be false while still determining non-key columns)?
src/transform/src/redundant_join.rs, line 388 at r16 (raw file):
/// determine non key columns (the meaning of a key), any matches between `input` and /// `other` will neither introduce new information to `other`, nor restrict the rows /// of `other`, nor alter their multplicity.
nit: multiplicity
src/transform/src/redundant_join.rs, line 397 at r16 (raw file):
input_prov: &[Vec<ProvInfo>], ) -> Option<Vec<usize>> { for provenance in input_prov[input].iter() {
I find this function pretty hard to follow, it seems like there's a lot going on, I think it probably be made more comprehensible by pulling out a notion of "does this other
imply input
" and then deferring to that?
src/transform/src/redundant_join.rs, line 413 at r16 (raw file):
} if bindings.len() == input_arities[input] { for key in keys.iter() {
I feel like this could do with being more abstract too, it would make more sense to me to be written something like
if keys.iter().any(|key| key.is_implied_by(other_cols, equivalences)) { ... }
Ideally no. But I didn't want to make it this optimization's job to enforce policy on that.
It's meant in that sense. But specifically we want to return the evidence, not the proof. Logic re-organized by extracting a closure. I don't think clarity improves with a free-standing method announcing the closure parameters, but feel free to disagree! |
LGTM |
This PR means to generalize #3783, which digs deeper into redundant join elimination. This PR goes a bit further, and tracks provenance of each collection with respect to local and global identifiers. Each collection can have sets of columns that derive from each identifier, with the guarantee that projected on to these columns they will be contained in the corresponding projection of the identified collection. In some cases the sets of columns are identical.
If we join two relations
left
andright
, andleft
is 1. distinct, 2. has all of its columns drawn from some identifier, andright
contains the same columns and they are equated by the join, then we can replace the join byright
followed by some projection nonsense.This recovers the observed
TopK
optimization intopk.slt
, and does a few interesting things intpch.slt
andchbench.slt
that I still need to look at. One thing it does not do is optimize query 04 in either, which would be a good target. There it seems that the "branch key optimization", which reduces the columns ofleft
down to those inspected in the subquery, requires the join byleft
to recover the columns. This might be "correct", in that the subquery might be better computed only on the distinct set of relevant identifiers, rather than e.g. maintain rendundant aggregates decorated with the attendant columns. In cases where this transformation would apply we know that the join is on keys that are distinct inleft
so there wouldn't be an asymptotic loss to disable the optimization, but it's all complicated.This change is