Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Sliding Windows Joins with Symmetric Hash Join (SHJ) #5321

Closed
metesynnada opened this issue Feb 17, 2023 · 0 comments · Fixed by #5322
Closed

Support for Sliding Windows Joins with Symmetric Hash Join (SHJ) #5321

metesynnada opened this issue Feb 17, 2023 · 0 comments · Fixed by #5322
Labels
enhancement New feature or request

Comments

@metesynnada
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The ordinary hash join (OHJ) is a great solution when one side of the data is static and can fit into memory. However, the sort-merge join (SMJ) is more effective when keys in the join condition are already sorted. If the join filter expression has order guarantees, but not the join key, both OHJ, and SMJ can result in suboptimal performance.

This is where Symmetric Hash Join (SHJ) comes in. SHJ addresses the gap in join use cases by introducing support for filter expressions with order guarantees, such as sliding windows.

For example, consider the following query:

SELECT * FROM left_table, right_table
WHERE
  left_key = right_key AND
  a > b + 3 AND
  a < b + 10

In this scenario, the columns a and b are sorted. In this case, SMJ wouldn't be effective and OHJ may struggle with low cardinality join keys.

SHJ extends the join capabilities of Datafusion by handling such use cases efficiently. While ordinary hash join typically remains the preferable option when both sources are finite, the join type can be changed to SHJ using a PipelineFixer sub-rule when both sources are unbounded.

Describe the solution you'd like
At skeleton implementation of SHJ that can be improved on, maybe first limited to partitioned mode only and lacking full support for output order information, but extensible enough so that these capabilities can be implemented later on. In detail:

  • Provide a way to support sliding window semantics in PhysicalExprs
  • Add a sub-rule to PipelineFixer to replace the HashJoin if necessary.

Describe alternatives you've considered
NA

Additional context
NA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant