Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes filtered agg result column naming and filtered agg order-by compat #10092

Merged
merged 15 commits into from
Jan 13, 2023

Conversation

egalpin
Copy link
Member

@egalpin egalpin commented Jan 10, 2023

This PR follows up on #7916 #10000 (relates to #7519). This PR fixes a bug with Filtered Aggregations where the result column name was previously unclear, and ordering by filtered aggregations previously resulted in NPE. Both are patched via this PR.

Current behaviour:
SELECT max(ArrDelay) filter (where DaysSinceEpoch >= 16090) would have a result column named:

max(ArrDelay)

After this PR:
SELECT max(ArrDelay) filter (where DaysSinceEpoch >= 16090) would have a result column named:

max(ArrDelay) filter (where DaysSinceEpoch >= 16090)

Examples:
Filtered agg name correction without grouping/ordering:
Screen Shot 2023-01-10 at 10 41 48
Filtered agg name correction with grouping/ordering:
Screen Shot 2023-01-09 at 16 00 09

@egalpin
Copy link
Member Author

egalpin commented Jan 10, 2023

cc @Jackie-Jiang

@@ -69,7 +84,14 @@ public DataSchema getDataSchema(QueryContext queryContext) {
ColumnDataType[] columnDataTypes = new ColumnDataType[numColumns];
for (int i = 0; i < numColumns; i++) {
AggregationFunction aggregationFunction = _aggregationFunctions[i];
columnNames[i] = aggregationFunction.getColumnName();
String columnName = aggregationFunction.getResultColumnName();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this is a very important change to call out, as I'm likely lacking context to understand all the implications. I know that at least one unit test failed initially due to this change where the resulting columns were named count_star previously, vs count(*) now. This may be an undesirable change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and we are not using the column name returned from the server in AggregationDataTableReducer. You may verify that by reverting the changes in this class and see if all the new tests still pass. If that is the case, we can also consider dropping the changes in this class to simplify the changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had mistakenly assumed that this change was needed to ensure that non-group-by filtered aggs had their result column names accurately reflected. That appears to not be the case having confirmed that tests still pass after removing these changes.

Is it at all problematic that this section of code would be "out of sync" with other sections of code as it pertains to column naming?

@egalpin
Copy link
Member Author

egalpin commented Jan 10, 2023

I ran the previously failing tests locally, and they pass. Are the tests in org.apache.pinot.segment.local.segment.index.loader.ForwardIndexHandlerTest known to be flaky?
Screen Shot 2023-01-10 at 14 44 23

…est and InterSegmentAggregationMultiValueRawQueriesTest
@codecov-commenter
Copy link

codecov-commenter commented Jan 11, 2023

Codecov Report

Merging #10092 (04ee3d5) into master (04d09e5) will increase coverage by 40.74%.
The diff coverage is 97.72%.

@@              Coverage Diff              @@
##             master   #10092       +/-   ##
=============================================
+ Coverage     27.87%   68.62%   +40.74%     
- Complexity       53     5737     +5684     
=============================================
  Files          1979     1995       +16     
  Lines        107281   108088      +807     
  Branches      16323    16424      +101     
=============================================
+ Hits          29909    74176    +44267     
+ Misses        74417    28649    -45768     
- Partials       2955     5263     +2308     
Flag Coverage Δ
integration1 24.65% <50.00%> (+0.01%) ⬆️
integration2 ?
unittests1 68.02% <97.72%> (?)
unittests2 13.71% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/pinot/core/data/table/TableResizer.java 89.71% <92.85%> (+22.71%) ⬆️
...ore/operator/blocks/results/ResultsBlockUtils.java 91.52% <100.00%> (+2.63%) ⬆️
...t/core/operator/query/FilteredGroupByOperator.java 71.08% <100.00%> (+71.08%) ⬆️
...va/org/apache/pinot/core/plan/GroupByPlanNode.java 89.79% <100.00%> (+29.37%) ⬆️
...aggregation/function/AggregationFunctionUtils.java 75.53% <100.00%> (+39.97%) ⬆️
...core/query/reduce/AggregationDataTableReducer.java 68.42% <100.00%> (-1.58%) ⬇️
...pinot/core/data/manager/realtime/TimerService.java 0.00% <0.00%> (-100.00%) ⬇️
...t/core/plan/StreamingInstanceResponsePlanNode.java 0.00% <0.00%> (-100.00%) ⬇️
...ore/operator/streaming/StreamingResponseUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...ager/realtime/PeerSchemeSplitSegmentCommitter.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1478 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good. Thanks for adding the tests!

@@ -69,7 +84,14 @@ public DataSchema getDataSchema(QueryContext queryContext) {
ColumnDataType[] columnDataTypes = new ColumnDataType[numColumns];
for (int i = 0; i < numColumns; i++) {
AggregationFunction aggregationFunction = _aggregationFunctions[i];
columnNames[i] = aggregationFunction.getColumnName();
String columnName = aggregationFunction.getResultColumnName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and we are not using the column name returned from the server in AggregationDataTableReducer. You may verify that by reverting the changes in this class and see if all the new tests still pass. If that is the case, we can also consider dropping the changes in this class to simplify the changes.

@@ -60,8 +61,10 @@ public class FilteredGroupByOperator extends BaseOperator<GroupByResultsBlock> {
private long _numEntriesScannedPostFilter;
private final DataSchema _dataSchema;
private final QueryContext _queryContext;
private final IdentityHashMap<AggregationFunction, Integer> _resultHolderIndexMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably can keep it as local variable. Don't see it being used in other places

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought on this change was to avoid the need to recompute this map on every call to getNextBlock and instead compute once on instantiation and reuse the map across calls to getNextBlock.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this operator, getNextBlock can only be called once, so it should be fine. For the same reason, FilteredAggregationOperator is also keeping it local

@@ -59,6 +59,7 @@ public Operator<GroupByResultsBlock> run() {
assert _queryContext.getGroupByExpressions() != null;

if (_queryContext.hasFilteredAggregations()) {
assert _queryContext.getFilteredAggregationFunctions() != null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the context, this is never null even if there is no filter on aggregation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated QueryContext annotation for this field to be Nonnull rather than Nullable as this has caught me a few times

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot annotate it as Nonnull because it is Nonnull only for aggregation queries. Basically _queryContext.getAggregationFunctions() and _queryContext.getFilteredAggregationFunctions() will both be null or non-null.

@egalpin
Copy link
Member Author

egalpin commented Jan 12, 2023

@Jackie-Jiang thanks for the review! I've made all the updates I believe, so this is ready to be reviewed again.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jackie-Jiang Jackie-Jiang merged commit 36307cb into apache:master Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants