perf: Optimize decimal precision check in decimal aggregates (sum and avg) #952
Conversation
Benchmark results: 10 runs of TPC-H q1 @ 100 GB, main branch vs. this PR (benchmark charts not shown).
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@             Coverage Diff              @@
##               main     #952      +/-  ##
============================================
+ Coverage     33.80%   33.81%   +0.01%
+ Complexity      852      851       -1
============================================
  Files           112      112
  Lines         43276    43286      +10
  Branches       9572     9572
============================================
+ Hits          14629    14639      +10
  Misses        25634    25634
  Partials       3013     3013
```
I was a bit surprised to see a performance win from changing an if-else to a compound boolean expression, since this seems like something an optimizing compiler should handle well. I think I confirmed that part by putting the example code above in the Rust playground, and in Release mode the two forms compiled to the same code.

My next guess was that hoisting the implementation from arrow-rs into Comet enabled better inlining opportunities. My ARM assembly reading is not as proficient as my x86, but I believe the relevant bits are below. This is the current main branch disassembly (listing not shown):

There's a branch and link (`bl`) that isn't present in this PR's disassembly: the compiler is able to better inline the hoisted code. I am not as familiar with Rust's build environment, so I'm not sure if this is expected when calling into code from other crates. I see Comet currently does …
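The cross-crate inlining point above can be sketched as follows. This is an illustration, not the actual arrow-rs or Comet code: the function names and bounds-check logic here are hypothetical. A plain, non-generic function exported by another crate is normally compiled into that crate and reached through a call instruction (the `bl` seen in the disassembly), unless it is marked `#[inline]` or the build uses cross-crate LTO; a copy hoisted into the calling crate is fully visible to the compiler and can be inlined.

```rust
// Illustrative stand-in for a function defined in an upstream crate
// (e.g. arrow-rs). Without #[inline] or LTO, downstream crates call it
// rather than inlining it.
pub fn validate_in_upstream_crate(value: i128, max: i128) -> bool {
    value <= max && value >= -max
}

// Hoisting a copy of the body into the calling crate, as this PR does,
// gives the compiler full visibility and lets it inline the check.
#[inline]
fn validate_hoisted(value: i128, max: i128) -> bool {
    value <= max && value >= -max
}

fn main() {
    assert!(validate_hoisted(99, 100));
    assert!(!validate_in_upstream_crate(101, 100));
    println!("ok");
}
```

Enabling `lto = true` in the Cargo release profile is another way to get cross-crate inlining without copying code, at the cost of longer link times.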
LGTM
```diff
@@ -127,19 +127,22 @@ impl AggregateUDFImpl for SumDecimal {
     fn reverse_expr(&self) -> ReversedUDAF {
         ReversedUDAF::Identical
     }
+
+    fn is_nullable(&self) -> bool {
+        // SumDecimal is always nullable because overflows can cause null values
+        true
+    }
```
Wondering if this is true for ANSI.
It looks like the previous code also hardcodes true, but this may be a good time to file an issue if there is not one yet.
Good point. I filed #961
Thanks @mbutrovich. This is very insightful. I'd like to go ahead and merge this PR, but I would also like to have a better understanding of why this is actually faster.
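One way to probe the "why is this actually faster" question locally is a microbenchmark. The sketch below is hypothetical (the function names and the check are stand-ins, not Comet code): `#[inline(never)]` simulates the un-inlined cross-crate call, while `#[inline(always)]` simulates the hoisted copy. Compile with `--release`; in a debug build the difference is noise.

```rust
use std::time::Instant;

// Simulates the upstream function reached via a call instruction.
#[inline(never)]
fn check_called(v: i128, max: i128) -> bool {
    v <= max && v >= -max
}

// Simulates the hoisted copy that the compiler can inline at the call site.
#[inline(always)]
fn check_inlined(v: i128, max: i128) -> bool {
    v <= max && v >= -max
}

fn main() {
    let max = 10i128.pow(18) - 1;
    let values: Vec<i128> = (0..1_000_000).map(|i| (i as i128) * 1_000).collect();

    let t = Instant::now();
    let called_ok = values.iter().filter(|&&v| check_called(v, max)).count();
    println!("called:  {called_ok} valid in {:?}", t.elapsed());

    let t = Instant::now();
    let inlined_ok = values.iter().filter(|&&v| check_inlined(v, max)).count();
    println!("inlined: {inlined_ok} valid in {:?}", t.elapsed());

    // Both variants must agree on the result; only the call overhead differs.
    assert_eq!(called_ok, inlined_ok);
}
```

A criterion-based benchmark would give more stable numbers, but this keeps the example dependency-free.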
Which issue does this PR close?
Part of #951
Builds on #948
Rationale for this change
I noticed two areas of overhead in the current approach to verifying decimal precision in the decimal aggregates `sum` and `avg`. I tested the following variations of the decimal precision check in the Rust playground: `validate_decimal_precision1` avoids a `memcpy` that appears in `validate_decimal_precision2`.
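The playground code itself is not preserved here, so the following is only a plausible reconstruction of the kind of variations described, with an illustrative (truncated) maxima table rather than the real arrow-rs implementation, which covers precisions 1 through 38.

```rust
// Illustrative maxima for precisions 1..=3 only; the real table in
// arrow-rs covers all Decimal128 precisions.
const MAX_FOR_PRECISION: [i128; 3] = [9, 99, 999];

// Variation 1: compound boolean on the hot path; the error string is
// built only when the check fails, so the Ok path copies nothing.
fn validate_decimal_precision1(value: i128, precision: u8) -> Result<(), String> {
    let max = MAX_FOR_PRECISION[precision as usize - 1];
    if value <= max && value >= -max {
        Ok(())
    } else {
        Err(format!("{value} exceeds precision {precision}"))
    }
}

// Variation 2: separate if/else branches over the two bounds;
// semantically identical, but structured so the result value is
// assembled on every path.
fn validate_decimal_precision2(value: i128, precision: u8) -> Result<(), String> {
    let max = MAX_FOR_PRECISION[precision as usize - 1];
    if value > max {
        Err(format!("{value} exceeds precision {precision}"))
    } else if value < -max {
        Err(format!("{value} exceeds precision {precision}"))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(validate_decimal_precision1(999, 3).is_ok());
    assert!(validate_decimal_precision2(-1000, 3).is_err());
    println!("ok");
}
```

Whether a given shape produces a `memcpy` depends on the error type's size and how the compiler lays out the `Result`, which is why the PR author verified the generated code in the playground rather than reasoning from the source alone.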
What changes are included in this PR?

Uses a decimal precision check that constructs the `Err` only on failure and avoids a `memcpy`.
How are these changes tested?