Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the multi-stage query engine #14689

Merged
merged 2 commits into from
Dec 21, 2024

Conversation

yashmayya
Copy link
Collaborator

  • Fixes Issues with min/max over big_decimal field when using Multi-stage engine #12705.
  • As discussed in the above issue, the problem is that in final aggregates the actual return type from the aggregation function will be a double value (from MIN / MAX / SUM / AVG) and this is attempted to be cast to BigDecimal when constructing the data block to be sent over the wire. The cast is attempted because the return type for MIN / MAX / SUM / AVG is inferred as DECIMAL when the input operand type is DECIMAL (or BIG_DECIMAL in Pinot) due to the use of the standard operators in Calcite. We don't want to change this type inference logic because it's backward incompatible (the actual return type for users would change if we change this) and also because we want to move towards actual polymorphic aggregation functions (i.e., not just the appearance of polymorphism through casting). Note that this isn't an issue in the leaf aggregation because the return type there is determined as DOUBLE through the AggregationFunctionType (see here).
  • The type conversion logic added to TypeUtils::convert (that handles single-stage -> multi-stage type conversion) includes a type check before converting because there are some aggregation functions like SUMPRECISION that actually return BigDecimal values and we don't want to lose precision there by converting through doubles.
  • In the long term, we want to support truly polymorphic aggregation functions in the multi-stage query engine (i.e., where the MAX aggregation function would return a value of the same type as the input operand type - right now it's always DOUBLE). However, this has backward compatibility implications and we need to be careful about not breaking the v1 engine. Right now, the output type to end users will still be the same as the input operand type in the MSQE but there can be loss of precision (MAX(BIG_DECIMAL col) -> DOUBLE output -> cast to BIG_DECIMAL (loss of precision) for instance).

@yashmayya yashmayya added bugfix multi-stage Related to the multi-stage query engine labels Dec 20, 2024
@codecov-commenter
Copy link

codecov-commenter commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.02%. Comparing base (59551e4) to head (0c365d9).
Report is 1492 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14689      +/-   ##
============================================
+ Coverage     61.75%   64.02%   +2.26%     
- Complexity      207     1608    +1401     
============================================
  Files          2436     2706     +270     
  Lines        133233   149161   +15928     
  Branches      20636    22862    +2226     
============================================
+ Hits          82274    95495   +13221     
- Misses        44911    46671    +1760     
- Partials       6048     6995     +947     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.97% <100.00%> (+2.26%) ⬆️
java-21 63.91% <100.00%> (+2.28%) ⬆️
skip-bytebuffers-false 64.01% <100.00%> (+2.27%) ⬆️
skip-bytebuffers-true 63.85% <100.00%> (+36.12%) ⬆️
temurin 64.02% <100.00%> (+2.26%) ⬆️
unittests 64.01% <100.00%> (+2.26%) ⬆️
unittests1 56.32% <100.00%> (+9.42%) ⬆️
unittests2 34.46% <0.00%> (+6.73%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@gortiz gortiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried that it worked using MultistageEngineQuickStart using the following query:

select min(CAST (ArrTime as DECIMAL)) from airlineStats limit 10

Maybe we can also add some query like that in an integration test as a regression test

@yashmayya yashmayya marked this pull request as ready for review December 20, 2024 17:35
@yashmayya
Copy link
Collaborator Author

Maybe we can also add some query like that in an integration test as a regression test

The test case added in ResourceBasedQueriesTest does actually exercise that logic through the machinery in QueryRunnerTestBase but I've gone ahead and added a small actual integration test as well just in case.

@yashmayya yashmayya merged commit 616d691 into apache:master Dec 21, 2024
21 checks passed
zeronerdzerogeekzerocool pushed a commit to zeronerdzerogeekzerocool/pinot that referenced this pull request Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix multi-stage Related to the multi-stage query engine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues with min/max over big_decimal field when using Multi-stage engine
3 participants