RFC: Band Join #32

chenzl25 · 2022-12-26T10:13:11Z

No description provided.

yuhao-su · 2022-12-26T11:51:18Z

The implementation proposed in this RFC may point the way toward a temporal join.

Also, it could be implemented with a shared arrangement, which might avoid the broadcast.

chenzl25 · 2022-12-27T03:35:51Z

> The implementation proposed in this RFC may point the way toward a temporal join.
>
> Also, it could be implemented with a shared arrangement, which might avoid the broadcast.

I think it is actually more closely related to interval joins.

CAJan93 · 2022-12-27T17:37:42Z

rfcs/0032-band-join.md

+select * from A join B on A.p between B.start and B.end.
+```
+
+The band join has a nice property: its range condition only involves two columns, one from the LHS and the other from the RHS, so we can always reverse the condition. For example, `A.p between B.d - 10 and B.d + 20` can be converted into `B.d between A.p - 20 and A.p + 10`. This is a crucial property for streaming queries, as we need to treat both sides of the join as logically equivalent (both sides need to be built and probed).


> This is a crucial property for streaming queries, as we need to treat both sides of the join as logically equivalent (both sides need to be built and probed).

Why do we need to probe both sides? Why is this different in streaming vs. batch?

For streaming, updates from either the left side or the right side need to be reflected in the join output. Think about StreamHashJoin: we need to build a hash table for both sides, and if an update comes from one side, we use it to probe the other side to find the matching rows. The same applies to the stream band join. So in streaming we need to build an index structure for both sides, while in batch we only need to build one for a single side (the build side), because the input size is bounded in batch.
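To make the build-and-probe symmetry concrete, here is a minimal in-memory sketch. It assumes a single band condition `A.p between B.d - 10 and B.d + 20`, integer row ids, insert-only input, and no equality key. `SortedSide`, `SymmetricBandJoin`, and `reverse_band` are hypothetical names used for illustration only, not RisingWave APIs, and a sorted Python list stands in for the internal state table.

```python
from bisect import bisect_left, bisect_right, insort


def reverse_band(lo, hi):
    """`x between y + lo and y + hi` is equivalent to `y between x - hi and x - lo`."""
    return -hi, -lo


class SortedSide:
    """Rows of one input, sorted by band column (stand-in for a state table)."""

    def __init__(self):
        self.rows = []  # sorted list of (band_value, row_id); row ids are ints

    def insert(self, value, row_id):
        insort(self.rows, (value, row_id))

    def range(self, low, high):
        """Return row ids whose band value lies in [low, high]."""
        start = bisect_left(self.rows, (low,))
        end = bisect_right(self.rows, (high, float("inf")))
        return [rid for _, rid in self.rows[start:end]]


class SymmetricBandJoin:
    """Joins on `A.p between B.d + lo and B.d + hi`; both sides are built and probed."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.a_side = SortedSide()  # indexed on A.p
        self.b_side = SortedSide()  # indexed on B.d

    def on_insert_b(self, d, rid):
        # Build B's index, then probe A with the original condition.
        self.b_side.insert(d, rid)
        return [(a, rid) for a in self.a_side.range(d + self.lo, d + self.hi)]

    def on_insert_a(self, p, rid):
        # Build A's index, then probe B with the reversed condition.
        self.a_side.insert(p, rid)
        rlo, rhi = reverse_band(self.lo, self.hi)
        return [(rid, b) for b in self.b_side.range(p + rlo, p + rhi)]


join = SymmetricBandJoin(-10, 20)
join.on_insert_a(205, 1)         # nothing on the B side yet
print(join.on_insert_b(200, 7))  # [(1, 7)]: 205 lies in [190, 220]
print(join.on_insert_a(220, 3))  # [(3, 7)]: 200 lies in [200, 230]
```

A batch join could drop `b_side` entirely and stream B past a fully built `a_side`; the streaming version cannot, because a later A row may still match an earlier B row.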


liurenjie1024 · 2022-12-30T06:16:21Z

rfcs/0032-band-join.md

+
+- If we have an equality condition `A.a = B.b` and its selectivity is low, using `HashJoin` is enough.
+- If we have an equality condition `A.a = B.b` and its selectivity is high, we can use this condition to distribute the data and gain parallelism. For `A.p between B.d - 10 and B.d + 20`, we can construct an internal table with order key `A.a, A.p, A.rid` for the A side and order key `B.b, B.d, B.rid` for the B side. For `A.q between B.e - 10 and B.e + 20`, we can construct an internal table with order key `A.a, A.q, A.rid` for the A side and order key `B.b, B.e, B.rid` for the B side. When a row comes from B with (B.b, B.d, B.e) = (100, 200, 300), we can look up row ids in A's first internal table with a range query: A between (A.a = 100, A.p = 200 - 10 = 190) and (A.a = 100, A.p = 200 + 20 = 220). We then look up row ids in A's other internal table with a range query: A between (A.a = 100, A.q = 300 - 10 = 290) and (A.a = 100, A.q = 300 + 20 = 320). Finally, we intersect the two sets of row ids to get the matching A rows. When a row comes from A, we first reverse the range condition as mentioned before and then apply the same logic as for a row from B. Deleting a row follows the same logic as inserting one, but with the opposite operation. An update can be handled as a delete followed by an insert.
+- If we don't have an equality condition `A.a = B.b`, we can only use a singleton for both input sides or broadcast one side to the other. The rest of the logic is the same as in the example above, without `A.a` and `B.b` as prefix keys.
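The lookup for the B row (100, 200, 300) can be sketched as follows, assuming the two band conditions exactly as stated (`A.p between B.d - 10 and B.d + 20` and `A.q between B.e - 10 and B.e + 20`). The sample rows and the helper names `build_index`, `range_rids`, and `probe_a` are made up for illustration, and sorted lists of tuples stand in for the two internal tables of A.

```python
from bisect import bisect_left, bisect_right

# Hypothetical contents of table A.
A_ROWS = [
    {"rid": 1, "a": 100, "p": 195, "q": 300},
    {"rid": 2, "a": 100, "p": 210, "q": 400},  # q out of range
    {"rid": 3, "a": 100, "p": 100, "q": 310},  # p out of range
    {"rid": 4, "a": 101, "p": 200, "q": 300},  # equality key differs
    {"rid": 5, "a": 100, "p": 220, "q": 290},  # both bounds hit exactly
]


def build_index(rows, band_col):
    """Internal table with order key (A.a, band column, A.rid)."""
    return sorted((r["a"], r[band_col], r["rid"]) for r in rows)


def range_rids(index, a, low, high):
    """Row ids with equality key `a` and band value in [low, high]."""
    start = bisect_left(index, (a, low))
    end = bisect_right(index, (a, high, float("inf")))
    return {rid for _, _, rid in index[start:end]}


def probe_a(index_p, index_q, b, d, e):
    """Find A rows matching a B row (B.b, B.d, B.e) by intersecting row ids."""
    by_p = range_rids(index_p, b, d - 10, d + 20)  # (b, 190) .. (b, 220) for d = 200
    by_q = range_rids(index_q, b, e - 10, e + 20)  # (b, 290) .. (b, 320) for e = 300
    return by_p & by_q


index_p = build_index(A_ROWS, "p")
index_q = build_index(A_ROWS, "q")
print(sorted(probe_a(index_p, index_q, 100, 200, 300)))  # [1, 5]
```

Each range query returns a superset of the final result (`{1, 2, 5}` and `{1, 3, 5}` here), so the intersection does real work whenever the two conditions are not strongly correlated.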


We should be careful with broadcast, since it duplicates the state table of the broadcast side. It would be better to do it with a user hint.

fuyufjh · 2022-12-30T06:21:36Z

rfcs/0032-band-join.md

+```
+
+- If we have an equality condition `A.a = B.b` and its selectivity is low, using `HashJoin` is enough.
+- If we have an equality condition `A.a = B.b` and its selectivity is high, we can use this condition to distribute the data and gain parallelism. For `A.p between B.d - 10 and B.d + 20`, we can construct an internal table with order key `A.a, A.p, A.rid` for the A side and order key `B.b, B.d, B.rid` for the B side. For `A.q between B.e - 10 and B.e + 20`, we can construct an internal table with order key `A.a, A.q, A.rid` for the A side and order key `B.b, B.e, B.rid` for the B side. When a row comes from B with (B.b, B.d, B.e) = (100, 200, 300), we can look up row ids in A's first internal table with a range query: A between (A.a = 100, A.p = 200 - 10 = 190) and (A.a = 100, A.p = 200 + 20 = 220). We then look up row ids in A's other internal table with a range query: A between (A.a = 100, A.q = 300 - 10 = 290) and (A.a = 100, A.q = 300 + 20 = 320). Finally, we intersect the two sets of row ids to get the matching A rows. When a row comes from A, we first reverse the range condition as mentioned before and then apply the same logic as for a row from B. Deleting a row follows the same logic as inserting one, but with the opposite operation. An update can be handled as a delete followed by an insert.


In streaming I guess most band join conditions are on timestamp columns, so it seems fine to only support one band condition.

Yes, if there is more than one band condition, we can always choose one of them to construct the internal state, and the other band conditions can simply be treated as ordinary join conditions.
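A minimal sketch of that simplification, assuming the band on `p` is the one chosen for the index and the band on `q` is evaluated as a residual predicate on the candidate rows. The sample rows and the names `probe` and `q_band` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical contents of table A.
A_ROWS = [
    {"rid": 1, "a": 100, "p": 195, "q": 300},
    {"rid": 2, "a": 100, "p": 210, "q": 400},
    {"rid": 3, "a": 100, "p": 100, "q": 310},
    {"rid": 4, "a": 101, "p": 200, "q": 300},
    {"rid": 5, "a": 100, "p": 220, "q": 290},
]
ROWS_BY_RID = {r["rid"]: r for r in A_ROWS}

# Only one internal table is kept: order key (A.a, A.p, A.rid).
INDEX_P = sorted((r["a"], r["p"], r["rid"]) for r in A_ROWS)


def probe(b, d, residual):
    """Range-scan the single index, then filter candidates with the other conditions."""
    start = bisect_left(INDEX_P, (b, d - 10))
    end = bisect_right(INDEX_P, (b, d + 20, float("inf")))
    candidates = (ROWS_BY_RID[rid] for _, _, rid in INDEX_P[start:end])
    return [row["rid"] for row in candidates if residual(row)]


def q_band(e):
    """The second band condition, `A.q between B.e - 10 and B.e + 20`, as a plain filter."""
    return lambda row: e - 10 <= row["q"] <= e + 20


print(probe(100, 200, q_band(300)))  # [1, 5]
```

This keeps one state table per side instead of one per band condition, at the cost of scanning candidates that the second condition then rejects (row 2 here).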

fuyufjh · 2022-12-30T06:59:38Z

rfcs/0032-band-join.md

+
+## Future possibilities
+
+If you are familiar with Flink, you may know that it has an interval join, which is just a special case of `BandJoin`. Interval join requires an equality condition, and its range condition looks like `b.timestamp ∈ [a.timestamp + lowerBound; a.timestamp + upperBound]`. Interval join also requires the input streams to be append-only and can cooperate with the watermark to prune old state. We can use `BandJoin` to implement interval join in the future.


If we use the time-bound condition to do state cleaning, there is an unresolved problem: should the time column be placed before or after the join key?

1. Time column before join key: state cleaning is easy (one range delete), but random access by join key becomes impossible.
2. Join key before time column: join key access is fast as usual, but state cleaning costs O(N), where N is the number of join keys.

Alternatively, use two separate state tables, but this would multiply the IO cost.

Neither sounds like a good solution. We need to take a look at Flink's implementation.

I took a look at Flink's TimeIntervalJoin and found that they use alternative two. First, they use the join equality keys to partition the join state. Second, they maintain a leftCache and a rightCache of type `MapState<Long, List<Tuple2<RowData, Boolean>>>` for each join key, where the `Long` key is the timestamp. BTW, they suffer from the lack of a MapState with ordered keys. The state cleaning mechanism of Flink's TimeIntervalJoin works like this: it registers a cleanup timer with its time service. The timestamp at which the timer fires is calculated from the row timestamp, the input watermark, and the time interval provided by the user. As soon as the cleanup timer fires, it iterates over leftCache and rightCache, deletes expired rows, and registers another cleanup timer based on the earliest valid row timestamp that remains. As we can see, Flink uses a bunch of point deletes triggered by the cleanup timer, rather than a range delete.

https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/join/interval/TimeIntervalJoin.java
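The cleanup pattern described above can be sketched roughly as follows. This is a much-simplified illustration, not Flink's actual code: `KeyedTimeCache`, `cleanup`, and `expire_before` are hypothetical names, the timer service is left out, and the expiry bound (which Flink derives from the watermark and the interval) is simply passed in by the caller.

```python
from collections import defaultdict


class KeyedTimeCache:
    """Per-join-key cache mapping timestamp -> rows, with unordered keys per join key."""

    def __init__(self):
        self.state = defaultdict(dict)  # join_key -> {timestamp: [rows]}
        self.point_deletes = 0          # counts individual deletes, for illustration

    def add(self, key, ts, row):
        self.state[key].setdefault(ts, []).append(row)

    def cleanup(self, key, expire_before):
        """Delete entries of one join key older than `expire_before`.

        Returns the earliest remaining timestamp, which would be the basis
        for the next cleanup timer, or None if nothing is left.
        """
        cache = self.state[key]
        # The map is not ordered by timestamp, so every entry has to be inspected.
        expired = [ts for ts in cache if ts < expire_before]
        for ts in expired:
            del cache[ts]  # one point delete per expired timestamp
            self.point_deletes += 1
        return min(cache) if cache else None


cache = KeyedTimeCache()
for key, ts in [("k1", 10), ("k1", 20), ("k1", 30), ("k2", 5), ("k2", 40)]:
    cache.add(key, ts, {"key": key, "ts": ts})

# Cleaning is done per join key, so the total cost grows with the number of keys.
next_timers = {key: cache.cleanup(key, expire_before=25) for key in list(cache.state)}
print(next_timers)          # {'k1': 30, 'k2': 40}
print(cache.point_deletes)  # 3
```

With the time column placed before the join key instead, the three deletes would collapse into a single range delete over all keys, which is the trade-off raised in the comment above.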

> If we use the time-bound condition to do state cleaning, there is an unresolved problem: should the time column be placed before or after the join key?
>
> 1. Time column before join key: state cleaning is easy (one range delete), but random access by join key becomes impossible.
> 2. Join key before time column: join key access is fast as usual, but state cleaning costs O(N), where N is the number of join keys.
>
> Alternatively, use two separate state tables, but this would multiply the IO cost.
>
> Neither sounds like a good solution. We need to take a look at Flink's implementation.

I think the first solution is ridiculous. We do not need fast state cleaning.

fuyufjh

Shall we merge this?

chenzl25 · 2023-08-14T09:26:57Z

> Shall we merge this?

Yes, because we now support Interval Join, which can at least clean up the state.

chenzl25 added 3 commits December 26, 2022 18:12

- band join (4098807)
- rename (9acf27f)
- add example (6ecebf7)

CAJan93 reviewed Dec 27, 2022

liurenjie1024 reviewed Dec 30, 2022

fuyufjh reviewed Dec 30, 2022

fuyufjh mentioned this pull request Jan 6, 2023

perf: nexmark q7 become slower and slower risingwavelabs/risingwave#7244

Open

lmatz mentioned this pull request Jan 10, 2023

Tracking: Nexmark queries optimization risingwavelabs/risingwave#7289

Open

54 tasks

fuyufjh mentioned this pull request Feb 21, 2023

RFC: Stream Executor with Emit on Window Close Semantics #51

Merged

soundOfDestiny mentioned this pull request Mar 9, 2023

band join risingwavelabs/risingwave#8454

Closed

4 tasks

soundOfDestiny mentioned this pull request Apr 17, 2023

feat(hash join): trivial interval join (close #9228) risingwavelabs/risingwave#9229

Closed

6 tasks

fuyufjh approved these changes Aug 14, 2023


chenzl25 merged commit b8cf1f8 into main Aug 14, 2023



