Redundant join elimination (#3808)
This PR means to generalize #3783, which digs deeper into redundant join elimination. It goes a bit further and tracks the provenance of each collection with respect to local and global identifiers. Each collection can have sets of columns that derive from each identifier, with the guarantee that the collection, projected onto these columns, is contained in the corresponding projection of the identified collection. In some cases the sets of columns are identical.
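As a rough illustration of the containment guarantee described above, here is a minimal Python sketch. It is not the code from this PR: the helper names `project` and `respects_provenance`, the `orders` data, and the choice of set (rather than multiset) containment are all assumptions made for the example.

```python
def project(rows, columns):
    """Project each row onto the given column positions, as a set of tuples."""
    return {tuple(row[c] for c in columns) for row in rows}


def respects_provenance(derived, derived_cols, source, source_cols):
    """Check the containment guarantee (assumed here to be set containment):
    `derived` projected onto `derived_cols` must be a subset of
    `source` projected onto `source_cols`."""
    return project(derived, derived_cols) <= project(source, source_cols)


# Hypothetical identified collection: (order_id, customer, amount).
orders = [(1, "a", 10), (2, "b", 20), (3, "a", 30)]

# A collection derived from it by a filter and a column reordering:
# (amount, order_id).
big_orders = [(amount, oid) for (oid, _cust, amount) in orders if amount >= 20]

# Columns (1, 0) of big_orders derive from columns (0, 2) of orders.
assert respects_provenance(big_orders, (1, 0), orders, (0, 2))
```

The filter drops rows, so the derived projection is strictly smaller than the source projection here; containment still holds, which is all the guarantee asks for.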

If we join two relations left and right, where left (1) is distinct and (2) has all of its columns drawn from some identifier, and right contains the same columns and they are equated by the join, then we can replace the join by right followed by a projection that reproduces the join's column layout.
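The rewrite can be checked on a toy example. The following is a sketch under stated assumptions, not the transformation as implemented in this PR: relations are lists of tuples, `join` and `eliminated` are hypothetical helpers, and left is taken to be exactly the distinct projection of the identified collection, so every key in right is guaranteed to appear in left.

```python
def join(left, right, right_cols):
    """Equijoin in which ALL columns of `left` (in order) are equated with
    `right_cols` of `right`. Output rows are left columns then right columns."""
    out = []
    for l in left:
        for r in right:
            if l == tuple(r[j] for j in right_cols):
                out.append(l + r)
    return out


def eliminated(right, right_cols):
    """The replacement plan: `right` followed by a projection that copies the
    equated columns into the positions the left columns would have occupied."""
    return [tuple(r[j] for j in right_cols) + r for r in right]


# Hypothetical identified collection: (order_id, customer, amount).
orders = [(1, "a", 10), (2, "b", 20), (3, "a", 30)]

# left: distinct, and all of its columns (customer) are drawn from orders.
left = sorted({(cust,) for (_oid, cust, _amt) in orders})

# right: derived from the same collection, still containing the customer column.
right = [row for row in orders if row[2] >= 20]
right_cols = (1,)  # position of customer in right

# With left distinct and every right key present in left, the plans agree.
assert sorted(join(left, right, right_cols)) == sorted(eliminated(right, right_cols))
```

Both conditions matter in this sketch. If left held duplicate rows, the join would emit each matching right row more than once, and if left were missing a key that right has, the join would drop rows that the replacement plan keeps.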

This recovers the observed TopK optimization in topk.slt, and does a few interesting things in tpch.slt and chbench.slt that I still need to look at. One thing it does not do is optimize query 04 in either, which would be a good target. There it seems that the "branch key optimization", which reduces the columns of left down to those inspected in the subquery, requires the join with left to recover the columns. This might be "correct", in that the subquery might be better computed only on the distinct set of relevant identifiers, rather than, e.g., maintaining redundant aggregates decorated with the attendant columns. In cases where this transformation would apply we know that the join is on keys that are distinct in left, so there wouldn't be an asymptotic loss from disabling the optimization, but it's all complicated.
frankmcsherry authored Aug 5, 2020
1 parent 0a2f1b0 commit d4eb6a1
Showing 5 changed files with 472 additions and 232 deletions.