feat(hash join): trivial interval join (close #9228) #9229

Closed
wants to merge 6 commits into from

Contributor

@soundOfDestiny soundOfDestiny commented Apr 17, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Leverage band conditions to speed up interval joins.
See risingwavelabs/rfcs#32 for details.
Refactor the pk of the state table and the degree table in hash join to:
join key || band key || input pk
With this ordering, HashJoinExecutor can skip rows during the join matching process.
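
A minimal sketch of why this key order helps, not the code in this PR. `StateKey`, its field types, and `band_keys_for` are hypothetical names chosen for illustration, and the band key is assumed to be an integer such as a timestamp. Because the band key sorts directly after the join key, all rows for one join key are ordered by band key, so the rows satisfying a band condition form one contiguous run.

```rust
use std::collections::BTreeSet;

/// Hypothetical composite key mirroring the layout described above:
/// join key || band key || input pk. Field order matters, because the
/// derived `Ord` compares fields from top to bottom.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct StateKey {
    join_key: i64,
    band_key: i64, // assumed to be something like an event timestamp
    input_pk: u64,
}

/// Collect the band keys stored under `join_key`, in iteration order.
fn band_keys_for(state: &BTreeSet<StateKey>, join_key: i64) -> Vec<i64> {
    state
        .iter()
        .filter(|k| k.join_key == join_key)
        .map(|k| k.band_key)
        .collect()
}

fn main() {
    let mut state = BTreeSet::new();
    // Insert rows in arbitrary order; the set keeps them sorted by key.
    for (j, b, pk) in [(1, 30, 7), (2, 5, 8), (1, 10, 9), (1, 20, 3)] {
        state.insert(StateKey { join_key: j, band_key: b, input_pk: pk });
    }
    // Rows sharing a join key come out sorted by band key, so a band
    // predicate selects one contiguous run of rows.
    println!("{:?}", band_keys_for(&state, 1));
}
```

If the input pk came before the band key instead, rows for one join key would be ordered by pk and the band condition would have to be checked against every row.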

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

  • My PR DOES NOT contain user-facing changes.
Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note


codecov bot commented Apr 17, 2023

Codecov Report

Merging #9229 (e5b1b8e) into main (9da2607) will decrease coverage by 0.05%.
The diff coverage is 55.23%.

@@            Coverage Diff             @@
##             main    #9229      +/-   ##
==========================================
- Coverage   70.82%   70.78%   -0.05%     
==========================================
  Files        1218     1218              
  Lines      202489   202886     +397     
==========================================
+ Hits       143419   143617     +198     
- Misses      59070    59269     +199     
Flag Coverage Δ
rust 70.78% <55.23%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/stream/src/executor/managed_state/join/mod.rs 91.48% <0.00%> (-0.77%) ⬇️
src/stream/src/from_proto/hash_join.rs 0.00% <0.00%> (ø)
src/stream/src/from_proto/mod.rs 0.00% <ø> (ø)
src/stream/src/executor/hash_join.rs 93.40% <23.47%> (-3.27%) ⬇️
...rc/executor/managed_state/join/join_entry_state.rs 59.82% <31.48%> (-26.39%) ⬇️
src/frontend/src/expr/mod.rs 79.72% <71.42%> (-0.31%) ⬇️
...ontend/src/optimizer/plan_node/stream_hash_join.rs 91.80% <90.05%> (-0.99%) ⬇️
...c/frontend/src/optimizer/plan_node/logical_join.rs 89.60% <100.00%> (+<0.01%) ⬆️
src/frontend/src/optimizer/plan_node/stream.rs 14.41% <100.00%> (+0.19%) ⬆️

... and 7 files with indirect coverage changes

Contributor Author

soundOfDestiny commented Apr 18, 2023

micro benchmark result: (nexmark q7)

PR:

Benchmark Name: nexmark-q7-blackhole-medium-1cn
Metrics Name: avg-source-output-rows-per-second
Result: NEGATIVE
Fluctuation: -19.662403% 
Result Value: 459602.015512
Baseline Value: 572088.325862
Execution ID: 2581
Baseline Execution ID: 2480

main:

Benchmark Name: nexmark-q7-blackhole-medium-1cn
Metrics Name: avg-source-output-rows-per-second
Result: NEGATIVE
Fluctuation: -19.442841% 
Result Value: 460858.099444
Baseline Value: 572088.325862
Execution ID: 2586
Baseline Execution ID: 2480

@ice1000 ice1000 removed their request for review April 18, 2023 05:24
Comment on lines 1076 to 1077
// TODO: We can use binary search to start matching with
// `match_band_res==true`.
Contributor

IIUC, this PR changes the join state's pk encoding, but it doesn't seem to utilize it properly. In my mind, we should use the range() interface of the BTreeMap to fetch the exact data according to the band condition.

Contributor Author

> IIUC, this PR changes the join state's pk encoding, but it doesn't seem to utilize it properly. In my mind, we should use the range() interface of the BTreeMap to fetch the exact data according to the band condition.

Fixed, though in an ugly way.
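
The range() suggestion above can be sketched as follows. This is an illustration under assumptions, not the executor's actual code: the `Key` tuple, the `String` row type, and `rows_in_band` are hypothetical, and the band condition is assumed to be an inclusive interval on an integer band key. Only `BTreeMap::range` is a real standard-library call.

```rust
use std::collections::BTreeMap;

// Hypothetical key layout: (join key, band key, input pk).
type Key = (i64, i64, u64);

/// Return the input pks of rows under `join_key` whose band key lies in
/// the inclusive interval [lo, hi], visiting only the matching entries.
fn rows_in_band(state: &BTreeMap<Key, String>, join_key: i64, lo: i64, hi: i64) -> Vec<u64> {
    // `BTreeMap::range` panics when the start bound exceeds the end
    // bound, so an empty band is handled up front.
    if lo > hi {
        return Vec::new();
    }
    // u64::MIN / u64::MAX pad the pk component so that every pk at the
    // boundary band keys is included.
    let start = (join_key, lo, u64::MIN);
    let end = (join_key, hi, u64::MAX);
    state.range(start..=end).map(|(k, _)| k.2).collect()
}

fn main() {
    let mut state = BTreeMap::new();
    state.insert((1, 10, 100), "a".to_string());
    state.insert((1, 20, 101), "b".to_string());
    state.insert((1, 30, 102), "c".to_string());
    state.insert((2, 20, 103), "d".to_string());
    // Only rows with join key 1 and band key in [15, 30] are visited.
    println!("{:?}", rows_in_band(&state, 1, 15, 30));
}
```

Compared with iterating over every row for the join key and testing the band condition on each, the range lookup seeks to the first matching key and stops after the last one.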

Comment on lines 308 to 448
} else if self.clean_left_state_conjunction_idx.is_some()
&& self.clean_right_state_conjunction_idx.is_some()
{
} else if self.band_condition.is_some() {
Contributor

From the user's perspective, StreamIntervalJoin means the join can utilize the input's watermarks through band conditions. If there are no watermarks, there is no difference for users, so let's change it back to StreamHashJoin. BandJoin is just an optimization in our implementation.

Contributor Author

> From the user's perspective, StreamIntervalJoin means the join can utilize the input's watermarks through band conditions. If there are no watermarks, there is no difference for users, so let's change it back to StreamHashJoin. BandJoin is just an optimization in our implementation.

fixed.

@@ -306,6 +313,7 @@ message HashJoinNode {
// Whether to optimize for append only stream.
// It is true when the input is append-only
bool is_append_only = 14;
BandJoinCondition band_condition = 15;
Contributor

We have another field inequality_pairs which has similar functionality to band_condition. Is it possible to merge them?

Contributor Author

> We have another field inequality_pairs which has similar functionality to band_condition. Is it possible to merge them?

too hard for me 😭

Contributor

😭 Will take a look in detail later.

Contributor Author

> 😭 Will take a look in detail later.

better not. The code is too ugly 😭

@chenzl25
Contributor

> micro benchmark result: (nexmark q7)

Actually, I haven't seen a performance improvement in the nexmark q7. Maybe we need another case to make this feature more convincing.

@soundOfDestiny
Contributor Author

> > micro benchmark result: (nexmark q7)
>
> Actually, I haven't seen a performance improvement in the nexmark q7. Maybe we need another case to make this feature more convincing.

The equality condition in q7 is price == max(price), whose selectivity is quite low.
However, there are no other band queries in nexmark.

@soundOfDestiny
Contributor Author

Closed after a conversation with chenzl25 and the discussion in the all-hands meeting.

@soundOfDestiny soundOfDestiny deleted the zl_interval_join branch May 19, 2023 01:46