
[Multi-stage] Only track max joined rows within each block #13981

Merged
1 commit merged into apache:master on Sep 12, 2024

Conversation

Jackie-Jiang
Contributor

In #13922 we added support for applying a max-rows limit to joined rows.
The intention is to protect the operator from OOM on a large CROSS JOIN, so we want to limit the number of rows held in memory (similar to the protection on the in-memory hash table).
This PR changes the logic to track joined rows per block instead of globally, so that memory is still protected but large joins can complete.
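The per-block idea can be sketched as follows. This is a hypothetical illustration, not the actual `HashJoinOperator` code: the class name, method names, and reset-per-block structure are assumptions made for clarity.

```java
// Hypothetical sketch of per-block joined-row tracking (not the actual
// Pinot implementation): the counter is reset for every new input block,
// so the limit bounds memory without capping the total join size.
public class PerBlockJoinLimit {
    private final int maxRowsPerBlock;
    private int joinedRowsInBlock; // reset on every new input block

    public PerBlockJoinLimit(int maxRowsPerBlock) {
        this.maxRowsPerBlock = maxRowsPerBlock;
    }

    /** Called when the operator starts processing a new input block. */
    public void onNewBlock() {
        joinedRowsInBlock = 0;
    }

    /** Returns true once the per-block limit is exceeded. */
    public boolean incrementAndCheckLimit() {
        return ++joinedRowsInBlock > maxRowsPerBlock;
    }
}
```

With a global counter, a join producing more rows than the limit would be cut off entirely; with a per-block counter, each block stays bounded while the join as a whole can emit arbitrarily many rows.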

@Jackie-Jiang added the bugfix and multi-stage (Related to the multi-stage query engine) labels on Sep 11, 2024

codecov-commenter commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 82.60870% with 4 lines in your changes missing coverage. Please review.

Project coverage is 57.89%. Comparing base (59551e4) to head (845e9f6).
Report is 1022 commits behind head on master.

Files with missing lines Patch % Lines
...pinot/query/runtime/operator/HashJoinOperator.java 82.60% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13981      +/-   ##
============================================
- Coverage     61.75%   57.89%   -3.86%     
- Complexity      207      219      +12     
============================================
  Files          2436     2612     +176     
  Lines        133233   143202    +9969     
  Branches      20636    21985    +1349     
============================================
+ Hits          82274    82905     +631     
- Misses        44911    53819    +8908     
- Partials       6048     6478     +430     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 57.84% <82.60%> (-3.87%) ⬇️
java-21 57.78% <82.60%> (-3.85%) ⬇️
skip-bytebuffers-false 57.88% <82.60%> (-3.86%) ⬇️
skip-bytebuffers-true 57.73% <82.60%> (+30.01%) ⬆️
temurin 57.89% <82.60%> (-3.86%) ⬇️
unittests 57.88% <82.60%> (-3.86%) ⬇️
unittests1 40.76% <82.60%> (-6.13%) ⬇️
unittests2 27.93% <0.00%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown.


Comment on lines -401 to -403
if (incrementJoinedRowsAndCheckLimit()) {
break;
}
Collaborator
Why don't we need the rows limit check for semi and anti joins?

Collaborator
Is it because we want this protection mainly for cross joins and other similar join conditions where the number of joined rows can be much more than the sum of individual rows from the left and right blocks?

Contributor
We are going to apply the limit per block, right? Semi and anti joins (and I guess inner) cannot produce more rows than the ones received (and I guess we assume each received block has an acceptable size).

Contributor Author
That is correct. We should never run with a setting that allows fewer rows than a single block.
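The point about join fan-out above can be made concrete. This is a standalone illustration with assumed names (`semiJoinCount`, `crossJoinCount`), not code from the PR: a semi join emits each left row at most once, so its output is bounded by its input, while a cross join can multiply row counts and is the case the limit protects against.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class JoinFanout {
    // Semi join: each left row is emitted at most once, so the
    // output size can never exceed the left input size.
    public static int semiJoinCount(List<Integer> left, Set<Integer> rightKeys) {
        int count = 0;
        for (int key : left) {
            if (rightKeys.contains(key)) {
                count++;
            }
        }
        return count;
    }

    // Cross join: output is the product of the input sizes and can
    // far exceed either input, which is why a row limit is needed.
    public static int crossJoinCount(List<Integer> left, List<Integer> right) {
        return left.size() * right.size();
    }
}
```

For example, with a left side of 4 rows and a right side of 4 rows, the semi join emits at most 4 rows, while the cross join emits 16.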


@Jackie-Jiang Jackie-Jiang merged commit c3fc1b9 into apache:master Sep 12, 2024
21 checks passed
@Jackie-Jiang Jackie-Jiang deleted the join_rows_limit branch September 12, 2024 18:37