Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper null handling in SELECT, ORDER BY, DISTINCT, and GROUP BY #8927

Merged
merged 25 commits into from
Jul 19, 2022

Conversation

nizarhejazi
Copy link
Contributor

Proper NULL handling in:

  • SELECT with and without ORDER BY
  • DISTINCT
  • GROUP BY
  • DISTINCT

For Single-valued columns of all data types with:

  • Raw encoding
  • Dictionary encoding

Unit tests:

Updated BigDecimalQueriesTest to work with null values.
Added BooleanNullEnabledNoDictQueriesTest.
Added AllNullQueriesTest where all values are null.
Added NullEnabledQueriesTest where some values are nulls.

@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2022

Codecov Report

Merging #8927 (b661f25) into master (84478b6) will decrease coverage by 6.44%.
The diff coverage is 79.72%.

❗ Current head b661f25 differs from pull request most recent head 85e806a. Consider uploading reports for the commit 85e806a to get more accurate results

@@             Coverage Diff              @@
##             master    #8927      +/-   ##
============================================
- Coverage     70.13%   63.68%   -6.45%     
+ Complexity     4741     4719      -22     
============================================
  Files          1831     1784      -47     
  Lines         96382    94815    -1567     
  Branches      14408    14371      -37     
============================================
- Hits          67594    60381    -7213     
- Misses        24125    30101    +5976     
+ Partials       4663     4333     -330     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 66.96% <79.72%> (+0.11%) ⬆️
unittests2 15.36% <0.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../java/org/apache/pinot/common/utils/DataTable.java 95.45% <ø> (ø)
...che/pinot/core/common/datablock/BaseDataBlock.java 77.41% <0.00%> (-0.28%) ⬇️
...che/pinot/core/common/datatable/BaseDataTable.java 78.94% <ø> (ø)
...reaming/StreamingSelectionOnlyCombineOperator.java 0.00% <0.00%> (-65.31%) ⬇️
...ator/streaming/StreamingSelectionOnlyOperator.java 0.00% <0.00%> (-90.00%) ⬇️
...ctionaryBasedSingleColumnDistinctOnlyExecutor.java 0.00% <0.00%> (ø)
...e/pinot/core/query/reduce/RowBasedBlockValSet.java 15.94% <0.00%> (-0.24%) ⬇️
...main/java/org/apache/pinot/spi/data/FieldSpec.java 82.03% <ø> (ø)
...va/org/apache/pinot/spi/utils/CommonConstants.java 28.57% <ø> (ø)
.../raw/RawBytesSingleColumnDistinctOnlyExecutor.java 42.10% <28.57%> (-37.90%) ⬇️
... and 480 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84478b6...85e806a. Read the comment docs.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In high level, we should not decide whether to enable query null handling based on the whether there is null value vector. This flag should be passed from the broker as a query option, and we can introduce another table config to enable it. This can ensure all the segments are executed using the same mode.

@@ -50,6 +53,26 @@ public ProjectionBlockValSet(DataBlockCache dataBlockCache, String column, DataS
_dataBlockCache = dataBlockCache;
_column = column;
_dataSource = dataSource;
NullValueVectorReader nullValueReader = _dataSource == null ? null : _dataSource.getNullValueVector();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data source should never be null. The computation should be lazy, and the operator should decide whether to read it based on if null handling is enabled

Copy link
Contributor Author

@nizarhejazi nizarhejazi Jul 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to do computation lazily, and removed null check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the logic to getNullBitmap() method. Please note that this method is called only when nullHandlingEnabled is set to true.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are plenty of comments, but the PR is in good shape overall. Good job!

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good.
cc @siddharthteotia to also take a look since this change might have performance impact

@Jackie-Jiang Jackie-Jiang merged commit 8acb17f into apache:master Jul 19, 2022
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Jul 19, 2022
@Jackie-Jiang
Copy link
Contributor

Related to #8697

Comment on lines +256 to +264
for (Object[] row : rows) {
for (int i = 0; i < numColumns; i++) {
Object columnValue = row[i];
if (columnValue == null) {
row[i] = colDefaultNullValues[i];
nullBitmaps[i].add(rowId);
}
}
rowId++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we merge this part with the main loop below? any specific reason we had to loop through the rows one more time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature null support release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants