You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The ordinary hash join (OHJ) is a great solution when one side of the data is static and can fit into memory. However, the sort-merge join (SMJ) is more effective when keys in the join condition are already sorted. If the join filter expression has order guarantees, but not the join key, both OHJ, and SMJ can result in suboptimal performance.
This is where Symmetric Hash Join (SHJ) comes in. SHJ addresses the gap in join use cases by introducing support for filter expressions with order guarantees, such as sliding windows.
For example, consider the following query:
SELECT*FROM left_table, right_table
WHERE
left_key = right_key AND
a > b +3AND
a < b +10
In this scenario, the columns a and b are sorted. In this case, SMJ wouldn't be effective and OHJ may struggle with low cardinality join keys.
SHJ extends the join capabilities of Datafusion by handling such use cases efficiently. While ordinary hash join typically remains the preferable option when both sources are finite, the join type can be changed to SHJ using a PipelineFixer sub-rule when both sources are unbounded.
Describe the solution you'd like
At skeleton implementation of SHJ that can be improved on, maybe first limited to partitioned mode only and lacking full support for output order information, but extensible enough so that these capabilities can be implemented later on. In detail:
Provide a way to support sliding window semantics in PhysicalExprs
Add a sub-rule to PipelineFixer to replace the HashJoin if necessary.
Describe alternatives you've considered
NA
Additional context
NA
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The ordinary hash join (OHJ) is a great solution when one side of the data is static and can fit into memory. However, the sort-merge join (SMJ) is more effective when keys in the join condition are already sorted. If the join filter expression has order guarantees, but not the join key, both OHJ, and SMJ can result in suboptimal performance.
This is where Symmetric Hash Join (SHJ) comes in. SHJ addresses the gap in join use cases by introducing support for filter expressions with order guarantees, such as sliding windows.
For example, consider the following query:
In this scenario, the columns
a
andb
are sorted. In this case, SMJ wouldn't be effective and OHJ may struggle with low cardinality join keys.SHJ extends the join capabilities of Datafusion by handling such use cases efficiently. While ordinary hash join typically remains the preferable option when both sources are finite, the join type can be changed to SHJ using a
PipelineFixer
sub-rule when both sources are unbounded.Describe the solution you'd like
At skeleton implementation of SHJ that can be improved on, maybe first limited to partitioned mode only and lacking full support for output order information, but extensible enough so that these capabilities can be implemented later on. In detail:
PhysicalExpr
sPipelineFixer
to replace theHashJoin
if necessary.Describe alternatives you've considered
NA
Additional context
NA
The text was updated successfully, but these errors were encountered: