Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize EXISTS subquery expressions by rewriting as semi-join #2351

Closed
andygrove opened this issue Apr 27, 2022 · 2 comments
Closed

Optimize EXISTS subquery expressions by rewriting as semi-join #2351

andygrove opened this issue Apr 27, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like an optimizer rule that rewrites queries with EXISTS subqueries as semi-joins See discussion in #2344.

Describe the solution you'd like
See discussion in #2344 for an example.

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added the enhancement New feature or request label Apr 27, 2022
@andygrove andygrove mentioned this issue Apr 27, 2022
15 tasks
@jon-chuang
Copy link

jon-chuang commented May 2, 2022

I'm currently halfway through implementing this, using heuristic rewrite rules, but I've decided to try to make it more generic as per DuckDB/Hyper's (and also apprently Materialize's) approach, which is to pushdown the dependent join through the subquery until all dependent/correlated predicates have been exfiltrated from the subquery.

This was the intuition which occured to me as the most powerful and complete way after a bit of fiddling around, and thankfully there are trailblazers here.

This is the best resource, DuckDB's dependent join pushdown logic for various logical operators:
https://github.com/duckdb/duckdb/blob/bee8017bdcc5e652aee26ce8cfb260990cf6a369/src/planner/subquery/flatten_dependent_join.cpp#L72


Edit: I think the more general strategy (dependent join pushdown) may be more complicated and potentially more costly than the alternative. Hence, I think I will try to do the simpler strategy first (filter pullup + pattern match against Exists(Filter(..))) properly, and leave dependent join pushdown to a future endeavour.

Here is Spark's Catalyst optimizer doing predicate pullup: https://github.com/apache/spark/blob/d3af3e5d73f8fabaa9cb320d485962d49a46f8bf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

The full subquery decorrelation for Catalyst is found here (same technique as DuckDB/Hyper): https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala#L308

@andygrove
Copy link
Member Author

I think this can be closed now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants