Redundant join elimination #3808

frankmcsherry · 2020-07-30T22:58:13Z

This PR means to generalize #3783, which digs deeper into redundant join elimination. It goes a bit further, and tracks the provenance of each collection with respect to local and global identifiers. Each collection can have sets of columns that derive from each identifier, with the guarantee that, projected onto these columns, the collection is contained in the corresponding projection of the identified collection. In some cases the sets of columns are identical.
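To make the containment guarantee concrete, here is a minimal sketch that checks it on literal data. The `ColumnPair` struct, the `provenance_holds` function, and the row encoding are all hypothetical illustrations, not the types used in the PR.

```rust
use std::collections::HashSet;

type Row = Vec<i64>;

/// Hypothetical provenance fact: column `dst` of the derived collection
/// is drawn from column `src` of the identified collection.
struct ColumnPair {
    src: usize,
    dst: usize,
}

/// Checks the guarantee on concrete data: the derived collection, projected
/// onto its `dst` columns, must be contained in the identified collection
/// projected onto the matching `src` columns.
fn provenance_holds(derived: &[Row], identified: &[Row], pairs: &[ColumnPair]) -> bool {
    let source: HashSet<Row> = identified
        .iter()
        .map(|row| pairs.iter().map(|p| row[p.src]).collect::<Row>())
        .collect();
    derived
        .iter()
        .all(|row| source.contains(&pairs.iter().map(|p| row[p.dst]).collect::<Row>()))
}

fn main() {
    // Identified collection with columns (state, pop).
    let cities: Vec<Row> = vec![vec![1, 10], vec![1, 20], vec![2, 30]];
    // Derived collection: filter on pop > 10, then swap the columns to (pop, state).
    let derived: Vec<Row> = cities
        .iter()
        .filter(|r| r[1] > 10)
        .map(|r| vec![r[1], r[0]])
        .collect();
    let pairs = vec![ColumnPair { src: 0, dst: 1 }, ColumnPair { src: 1, dst: 0 }];
    assert!(provenance_holds(&derived, &cities, &pairs));
}
```

Filters and projections preserve the guarantee, which is why it can be tracked without looking at the data; the check above only illustrates what is being claimed.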

If we join two relations left and right, where left (1) is distinct and (2) has all of its columns drawn from some identifier, and right contains the same columns and the join equates them, then we can replace the join with right followed by some projection nonsense.
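A small sketch of why the replacement is safe, using a naive nested-loop join over literal rows. It assumes left is exactly the distinct projection of the same collection that right is drawn from (as in the topk.slt case); the helper names `distinct_project`, `join`, and `replacement` are made up for illustration.

```rust
use std::collections::BTreeSet;

type Row = Vec<&'static str>;

/// Distinct projection of `rows` onto a single column.
fn distinct_project(rows: &[Row], col: usize) -> Vec<Row> {
    let set: BTreeSet<Row> = rows.iter().map(|r| vec![r[col]]).collect();
    set.into_iter().collect()
}

/// Naive equi-join on `left[0] == right[right_col]`, emitting left ++ right columns.
fn join(left: &[Row], right: &[Row], right_col: usize) -> Vec<Row> {
    let mut out = Vec::new();
    for l in left {
        for r in right {
            if l[0] == r[right_col] {
                let mut row = l.clone();
                row.extend(r.iter().copied());
                out.push(row);
            }
        }
    }
    out
}

/// The replacement: `right` alone, with the equated column copied into the
/// position that `left` used to occupy.
fn replacement(right: &[Row], right_col: usize) -> Vec<Row> {
    right
        .iter()
        .map(|r| {
            let mut row = vec![r[right_col]];
            row.extend(r.iter().copied());
            row
        })
        .collect()
}

fn main() {
    // right has columns (state, name); left is the distinct set of states.
    let right: Vec<Row> = vec![
        vec!["NY", "New York"],
        vec!["NY", "Buffalo"],
        vec!["CA", "Los Angeles"],
    ];
    let left = distinct_project(&right, 0);
    let mut joined = join(&left, &right, 0);
    let mut replaced = replacement(&right, 0);
    joined.sort();
    replaced.sort();
    assert_eq!(joined, replaced);
}
```

Each row of right matches exactly one row of left, so the join neither drops rows, nor duplicates them, nor adds information.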

This recovers the observed TopK optimization in topk.slt, and does a few interesting things in tpch.slt and chbench.slt that I still need to look at. One thing it does not do is optimize query 04 in either, which would be a good target. There it seems that the "branch key optimization", which reduces the columns of left down to those inspected in the subquery, requires the join by left to recover the columns. This might be "correct", in that the subquery might be better computed only on the distinct set of relevant identifiers, rather than e.g. maintaining redundant aggregates decorated with the attendant columns. In cases where this transformation would apply we know that the join is on keys that are distinct in left, so there wouldn't be an asymptotic loss from disabling the optimization, but it's all complicated.

test/sqllogictest/chbench.slt

test/sqllogictest/tpch.slt

frankmcsherry · 2020-07-31T13:33:25Z

I think this is ready to look at. Maybe some more clippy transgressions to amend or junk like that, but it is (was?) passing all local slt tests, though we might want a larger run just to be sure. The transformation isn't mind-bending, I don't believe, but it isn't nearly as local as other transformations, and it would be good to ensure that it is well-explained.

No tests at the moment, and no rush to write any. I'm not sure that they would be anything other than the existing plans we have that have their explanations wiggled around. I'm more concerned about ensuring that we test correctness, and I'm hoping that the full SLT covers that some.

benesch

The general thrust of this looks great. The approach makes sense to me, and the various match arms that I reviewed seemed sensible. What I haven't done is to build up enough understanding of this transformation myself to be comfortable asserting that it is sound in all possible cases. If you need me to do that, I can, but.. would take a good chunk of time.

Instead I'm going to see about getting the full SLT suite working again. We stopped getting notifications about it when we renamed the main branch, and looks like things have actually been busted for a while before that.

benesch · 2020-07-31T17:00:49Z

test/sqllogictest/topk.slt

+query T multiline
+EXPLAIN PLAN FOR SELECT state, name FROM
+    (SELECT DISTINCT state FROM cities) grp
+    LEFT JOIN LATERAL (SELECT name, pop FROM cities  where cities.state = grp.state ORDER BY pop DESC LIMIT 3) ON state = grp.state


The ON clause in this join looks like it might still be tautological?

We discussed this, and to recap: it seems like we want true as the join condition to test left joins as well, is that right? This still requires the additional logic that may introduce nulls, but the joins should still be optimized out (the thing this tests for).

That sounds right to me!

benesch · 2020-07-31T17:02:51Z

src/transform/src/redundant_join.rs

+                // Negate changes the sign on its multiplicities,
+                // which means "distinct" counts would now be -1.
+                // We set `exact` to false to inhibit the optimization,
+                // but should probably fix `.keys` instead.


Is this hard to do?

No, probably not. I mean, easy to break it, and I doubt we have anyone relying on the keys of x.negate().negate(). I polled Slack for takes on this and folks seemed to go in a different direction, discussion-wise.
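For readers following along, a minimal sketch of the multiplicity issue the code comment describes. It models a collection as a map from record to signed count; the `Collection` alias and the `negate` and `union` helpers are illustrative assumptions, not the PR's code.

```rust
use std::collections::HashMap;

/// A collection modelled as record -> signed multiplicity (an assumption
/// made for this sketch).
type Collection = HashMap<&'static str, i64>;

/// Flips the sign of every multiplicity.
fn negate(input: &Collection) -> Collection {
    input.iter().map(|(k, m)| (*k, -*m)).collect()
}

/// Adds multiplicities record by record, dropping records that cancel to zero.
fn union(a: &Collection, b: &Collection) -> Collection {
    let mut out = a.clone();
    for (k, m) in b {
        *out.entry(*k).or_insert(0) += *m;
    }
    out.retain(|_, m| *m != 0);
    out
}

fn main() {
    // A "distinct" collection: every record has multiplicity exactly 1.
    let distinct: Collection = vec![("NY", 1), ("CA", 1)].into_iter().collect();

    // After negation every count is -1, so the input's keys no longer
    // describe records that are present once.
    let negated = negate(&distinct);
    assert!(negated.values().all(|m| *m == -1));

    // Unioning positive and negative records cancels "NY" entirely, so the
    // result does not contain every record of its positive input.
    let positive: Collection = vec![("NY", 1)].into_iter().collect();
    let merged = union(&positive, &negated);
    assert_eq!(merged.get("NY"), None);
    assert_eq!(merged.get("CA"), Some(&-1));
}
```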

frankmcsherry · 2020-08-03T12:48:02Z

What I haven't done is to build up enough understanding of this transformation myself to be comfortable asserting that it is sound in all possible cases. If you need me to do that, I can, but.. would take a good chunk of time.

I'm fine with either punting on that or, better imo, getting @justinj to get his head around it too. I think it would be best if we have at least two people who understand each of the weird things in the code, going forward.

benesch · 2020-08-03T16:31:21Z

test/sqllogictest/topk.slt

That sounds right to me!

justinj

Reviewable status: 0 of 5 files reviewed, 13 unresolved discussions (waiting on @benesch, @frankmcsherry, and @justinj)

src/transform/src/redundant_join.rs, line 79 at r16 (raw file):

            }
            RelationExpr::Get { id, typ } => {
                // Extract the value provenance, or an empty list if unavailable.

Would this ever be unavailable if the tree is well-formed?

src/transform/src/redundant_join.rs, line 154 at r16 (raw file):

                            // When we reach the removed relation, we should introduce
                            // references to the columns that are meant to replace these.
                            // This should happen only once, and `.drain(..)` could work.

I don't understand what this comment is saying. What is meant by "could work"?

src/transform/src/redundant_join.rs, line 316 at r16 (raw file):

            RelationExpr::Negate { input } => {
                // Negate does not guarantee that the multiplicity of
                // each source record it at least one. This could have

nit: is at least one

src/transform/src/redundant_join.rs, line 319 at r16 (raw file):

                // been a problem in `Union`, where we might report
                // that the union of positive and negative records is
                // "exact": cancelations would make this false.

nit: cancellation

src/transform/src/redundant_join.rs, line 379 at r16 (raw file):

/// Attempts to find column bindings that make `input` redundant.
///
/// This method attempts to find evidence that `input` may be redundant by searching

I don't understand this phrasing, doesn't it attempt to find proof that input is redundant?

src/transform/src/redundant_join.rs, line 386 at r16 (raw file):

///
/// In these circumstances, the claim is that because the key columns are equated and
/// determine non key columns (the meaning of a key), any matches between `input` and

nit: don't we use key more strongly than this, to mean each row has multiplicity 1 (which can be false while still determining non-key columns)?

src/transform/src/redundant_join.rs, line 388 at r16 (raw file):

/// determine non key columns (the meaning of a key), any matches between `input` and
/// `other` will neither introduce new information to `other`, nor restrict the rows
/// of `other`, nor alter their multplicity.

nit: multiplicity

src/transform/src/redundant_join.rs, line 397 at r16 (raw file):

    input_prov: &[Vec<ProvInfo>],
) -> Option<Vec<usize>> {
    for provenance in input_prov[input].iter() {

I find this function pretty hard to follow; it seems like there's a lot going on. I think it could probably be made more comprehensible by pulling out a notion of "does this other imply input" and then deferring to that?

src/transform/src/redundant_join.rs, line 413 at r16 (raw file):

                    }
                    if bindings.len() == input_arities[input] {
                        for key in keys.iter() {

I feel like this could do with being more abstract too; it would make more sense to me if it were written something like

if keys.iter().any(|key| key.is_implied_by(other_cols, equivalences)) { ... }
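One way the suggested shape could be fleshed out, as a hedged sketch only. The `Column` encoding, the free function `key_is_implied_by`, and the layout of `bindings` and `equivalences` are assumptions made for illustration and are not taken from redundant_join.rs.

```rust
/// Hypothetical encoding: a column is (relation index, column index).
type Column = (usize, usize);

/// True if every column of `key` (a key of relation `input`) is equated by the
/// join to the column of `other` that `bindings` pairs it with.
/// `bindings[c]` is assumed to be the column of `other` with the same
/// provenance as column `c` of `input`; each inner `Vec` of `equivalences`
/// is assumed to be one class of columns the join equates.
fn key_is_implied_by(
    key: &[usize],
    input: usize,
    other: usize,
    bindings: &[usize],
    equivalences: &[Vec<Column>],
) -> bool {
    key.iter().all(|&c| {
        equivalences.iter().any(|class| {
            class.contains(&(input, c)) && class.contains(&(other, bindings[c]))
        })
    })
}

fn main() {
    // Relation 0 is `input`, relation 1 is `other`; the join equates their first columns.
    let equivalences = vec![vec![(0, 0), (1, 0)]];
    let bindings = vec![0, 1];
    let keys: Vec<Vec<usize>> = vec![vec![1], vec![0]];

    // The key on column 0 is fully equated, so `input` is a removal candidate.
    let redundant = keys
        .iter()
        .any(|key| key_is_implied_by(key, 0, 1, &bindings, &equivalences));
    assert!(redundant);
}
```

Writing the check as a predicate over a single key keeps the outer loop down to one `any` call, which is the readability gain being asked for.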

frankmcsherry · 2020-08-04T16:59:05Z

Would this ever be unavailable if the tree is well-formed?

Ideally no. But I didn't want to make it this optimization's job to enforce policy on that.

I don't understand this phrasing, doesn't it attempt to find proof that input is redundant?

It's meant in that sense. But specifically we want to return the evidence, not the proof.

Logic re-organized by extracting a closure. I don't think clarity improves with a free-standing method announcing the closure parameters, but feel free to disagree!

justinj · 2020-08-04T17:33:28Z

LGTM

frankmcsherry requested review from justinj and benesch July 31, 2020 13:30

benesch mentioned this pull request Jul 31, 2020

transform: remove redundant joins in idiomatic top-k query #3783

Closed

benesch reviewed Jul 31, 2020

frankmcsherry added 8 commits August 3, 2020 11:35

remove redundancy using provenance

4b05420

diagnostics

06a8323

only require keys to be equated

a903de9

reverse order of selection to minimize churn

3083430

union provenance as the meet of input provenance

fb5e8be

improve documentation, formatting

0653275

implement projection

140eada

update negate explanation

78bcd48

frankmcsherry force-pushed the redundancy_elim branch from b2a6e3e to 78bcd48 Compare August 3, 2020 15:35

benesch approved these changes Aug 3, 2020

update left lateral join test

bf18045

justinj reviewed Aug 4, 2020

re-organization

f9fc6f7

frankmcsherry merged commit d4eb6a1 into MaterializeInc:main Aug 5, 2020

frankmcsherry deleted the redundancy_elim branch March 8, 2022 13:26
