Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the multi-stage query engine #14689

yashmayya · 2024-12-20T09:35:09Z

Fixes Issues with min/max over big_decimal field when using Multi-stage engine #12705.
As discussed in the above issue, the problem is that in final aggregates the actual return type from the aggregation function will be a double value (from MIN / MAX / SUM / AVG) and this is attempted to be cast to BigDecimal when constructing the data block to be sent over the wire. The cast is attempted because the return type for MIN / MAX / SUM / AVG is inferred as DECIMAL when the input operand type is DECIMAL (or BIG_DECIMAL in Pinot) due to the use of the standard operators in Calcite. We don't want to change this type inference logic because it's backward incompatible (the actual return type for users would change if we change this) and also because we want to move towards actual polymorphic aggregation functions (i.e., not just the appearance of polymorphism through casting). Note that this isn't an issue in the leaf aggregation because the return type there is determined as DOUBLE through the AggregationFunctionType (see here).
The type conversion logic added to TypeUtils::convert (that handles single-stage -> multi-stage type conversion) includes a type check before converting because there are some aggregation functions like SUMPRECISION that actually return BigDecimal values and we don't want to lose precision there by converting through doubles.
In the long term, we want to support truly polymorphic aggregation functions in the multi-stage query engine (i.e., where the MAX aggregation function would return a value of the same type as the input operand type - right now it's always DOUBLE). However, this has backward compatibility implications and we need to be careful about not breaking the v1 engine. Right now, the output type to end users will still be the same as the input operand type in the MSQE but there can be loss of precision (MAX(BIG_DECIMAL col) -> DOUBLE output -> cast to BIG_DECIMAL (loss of precision) for instance).

…lti-stage query engine

codecov-commenter · 2024-12-20T10:14:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.02%. Comparing base (59551e4) to head (0c365d9).
Report is 1492 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14689      +/-   ##
============================================
+ Coverage     61.75%   64.02%   +2.26%     
- Complexity      207     1608    +1401     
============================================
  Files          2436     2706     +270     
  Lines        133233   149161   +15928     
  Branches      20636    22862    +2226     
============================================
+ Hits          82274    95495   +13221     
- Misses        44911    46671    +1760     
- Partials       6048     6995     +947

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.97% <100.00%> (+2.26%)`	⬆️
java-21	`63.91% <100.00%> (+2.28%)`	⬆️
skip-bytebuffers-false	`64.01% <100.00%> (+2.27%)`	⬆️
skip-bytebuffers-true	`63.85% <100.00%> (+36.12%)`	⬆️
temurin	`64.02% <100.00%> (+2.26%)`	⬆️
unittests	`64.01% <100.00%> (+2.26%)`	⬆️
unittests1	`56.32% <100.00%> (+9.42%)`	⬆️
unittests2	`34.46% <0.00%> (+6.73%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

gortiz

I've tried that it worked using MultistageEngineQuickStart using the following query:

select min(CAST (ArrTime as DECIMAL)) from airlineStats limit 10

Maybe we can also add some query like that in an integration test as a regression test

yashmayya · 2024-12-20T17:41:28Z

Maybe we can also add some query like that in an integration test as a regression test

The test case added in ResourceBasedQueriesTest does actually exercise that logic through the machinery in QueryRunnerTestBase but I've gone ahead and added a small actual integration test as well just in case.

…lti-stage query engine (apache#14689)

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the mu…

040d4a1

…lti-stage query engine

yashmayya requested review from gortiz and Jackie-Jiang December 20, 2024 09:35

yashmayya added bugfix multi-stage Related to the multi-stage query engine labels Dec 20, 2024

gortiz approved these changes Dec 20, 2024

View reviewed changes

yashmayya marked this pull request as ready for review December 20, 2024 17:35

Add integration test

0c365d9

yashmayya merged commit 616d691 into apache:master Dec 21, 2024
21 checks passed

zeronerdzerogeekzerocool pushed a commit to zeronerdzerogeekzerocool/pinot that referenced this pull request Feb 20, 2025

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the mu…

2609bff

…lti-stage query engine (apache#14689)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the multi-stage query engine #14689

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the multi-stage query engine #14689

yashmayya commented Dec 20, 2024

codecov-commenter commented Dec 20, 2024 •

edited

Loading

gortiz left a comment

yashmayya commented Dec 20, 2024

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the multi-stage query engine #14689

Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG) in the multi-stage query engine #14689

Conversation

yashmayya commented Dec 20, 2024

codecov-commenter commented Dec 20, 2024 • edited Loading

Codecov Report

gortiz left a comment

Choose a reason for hiding this comment

yashmayya commented Dec 20, 2024

codecov-commenter commented Dec 20, 2024 •

edited

Loading