This PR means to generalize #3783, which digs deeper into redundant join elimination. It goes a bit further, and tracks the provenance of each collection with respect to local and global identifiers. Each collection can have sets of columns that derive from each identifier, with the guarantee that the collection, projected onto those columns, is contained in the corresponding projection of the identified collection. In some cases the sets of columns are identical.

If we join two relations left and right, where left (1) is distinct and (2) has all of its columns drawn from some identifier, and right contains the same columns and the join equates them, then we can replace the join with right followed by a projection that reproduces the join's output columns.

This recovers the observed TopK optimization in topk.slt, and does a few interesting things in tpch.slt and chbench.slt that I still need to look at.

One thing it does not do is optimize query 04 in either file, which would be a good target. There it seems that the "branch key optimization", which reduces the columns of left down to those inspected in the subquery, requires the join with left to recover the columns. This might be "correct", in that the subquery might be better computed only on the distinct set of relevant identifiers, rather than e.g. maintaining redundant aggregates decorated with the attendant columns. In cases where this transformation would apply we know that the join is on keys that are distinct in left, so there wouldn't be an asymptotic loss from disabling the optimization, but it's all complicated.
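
To make the join replacement concrete, here is a minimal sketch in Python rather than in the project's own code. It models collections as multisets of tuples and checks on one small input that the join and the rewritten plan agree. The helper names (`join`, `project`, `distinct`) and the sample data are hypothetical, and the sketch assumes the containment runs from right into left: every key that appears in right also appears in left.

```python
from collections import Counter

def project(coll, cols):
    """Project a multiset of tuples onto the given column indices."""
    out = Counter()
    for row, count in coll.items():
        out[tuple(row[i] for i in cols)] += count
    return out

def distinct(coll):
    """Keep each row exactly once."""
    return Counter({row: 1 for row in coll})

def join(left, right, left_cols, right_cols):
    """Equijoin; each output row is the left row followed by the right row."""
    out = Counter()
    for l, lc in left.items():
        for r, rc in right.items():
            if tuple(l[i] for i in left_cols) == tuple(r[j] for j in right_cols):
                out[l + r] += lc * rc
    return out

# Hypothetical identified collection with columns (key, value).
base = Counter({(1, "a"): 2, (1, "b"): 1, (2, "c"): 1})

# left: distinct, and all of its columns are drawn from `base`.
left = distinct(project(base, [0]))

# right: derived from `base` (think of a per-key TopK), so its key column
# is contained in left's. This containment is the assumption stated above.
right = Counter({(1, "a"): 2, (2, "c"): 1})

# Original plan: join on the shared key column.
joined = join(left, right, [0], [0])

# Rewritten plan: right alone, with the key column duplicated so the
# output has the same layout as the join (left columns, then right columns).
rewritten = project(right, [0, 0, 1])

assert joined == rewritten
```

Both conditions matter: if left were not distinct, the join would multiply the counts of rows in right, and if right held a key missing from left, the join would drop that row while the rewritten plan would keep it.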