
perf: Optimize decimal precision check in decimal aggregates (sum and avg) #952

Merged · 22 commits into apache:main · Sep 24, 2024

Conversation

@andygrove (Member) commented Sep 18, 2024

Which issue does this PR close?

Part of #951

Builds on #948

Rationale for this change

I noticed two areas of overhead in the current approach to verifying decimal precision in the decimal sum and avg aggregates:

  1. In the event of an overflow, we build a formatted string for an error message, which is then discarded (this pattern is sketched just after this list)
  2. The range-check code introduces a memcpy that can be avoided
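
A minimal sketch of the Result-based pattern from point 1, using the lookup tables from arrow-rs. This is illustrative only: the function name checked_precision is made up for this sketch, and the real code is arrow-rs's validate_decimal_precision.

use arrow_data::decimal::{MAX_DECIMAL_FOR_EACH_PRECISION, MIN_DECIMAL_FOR_EACH_PRECISION};

// On overflow, a String is allocated and formatted even when the caller
// immediately discards the Err and simply marks the value as null.
fn checked_precision(value: i128, precision: u8) -> Result<(), String> {
    let idx = usize::from(precision) - 1; // assumes 1 <= precision <= 38
    if value < MIN_DECIMAL_FOR_EACH_PRECISION[idx] || value > MAX_DECIMAL_FOR_EACH_PRECISION[idx] {
        return Err(format!(
            "{value} is out of range for a Decimal128 of precision {precision}"
        ));
    }
    Ok(())
}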

I tested the following variations of the decimal precision check in the Rust Playground:

// Assumed imports for these playground snippets; the lookup tables and the
// maximum precision constant come from arrow-rs.
use arrow_data::decimal::{MAX_DECIMAL_FOR_EACH_PRECISION, MIN_DECIMAL_FOR_EACH_PRECISION};
use arrow_schema::DECIMAL128_MAX_PRECISION;

fn validate_decimal_precision1(value: i128, precision: u8) -> bool {
    if precision > DECIMAL128_MAX_PRECISION {
        return false;
    }
    let idx = usize::from(precision) - 1; // assumes precision >= 1
    value >= MIN_DECIMAL_FOR_EACH_PRECISION[idx] && value <= MAX_DECIMAL_FOR_EACH_PRECISION[idx]
}

// based on arrow-rs version
fn validate_decimal_precision2(value: i128, precision: u8) -> bool {
    if precision > DECIMAL128_MAX_PRECISION {
        return false;
    }

    let max = MAX_DECIMAL_FOR_EACH_PRECISION[usize::from(precision) - 1];
    let min = MIN_DECIMAL_FOR_EACH_PRECISION[usize::from(precision) - 1];

    if value > max {
        false
    } else if value < min {
        false
    } else {
        true
    }
}

validate_decimal_precision1 avoids a memcpy that appears in validate_decimal_precision2 (the 608-byte copy below corresponds to one [i128; 38] lookup table being copied onto the stack: 38 × 16 bytes = 608):

playground::validate_decimal_precision1:
	subq	$1304, %rsp
	movb	%dl, %al
	movb	%al, 23(%rsp)
	movq	%rsi, 24(%rsp)
	movq	%rdi, 32(%rsp)
	movq	%rdi, 1264(%rsp)
	movq	%rsi, 1272(%rsp)
	movb	%al, 1287(%rsp)
	cmpb	$38, %al
	ja	.LBB9_2
	movb	23(%rsp), %al
	movb	%al, 1303(%rsp)
	movzbl	%al, %eax
	movq	%rax, %rcx
	subq	$1, %rcx
	movq	%rcx, 8(%rsp)
	cmpq	$1, %rax
	jb	.LBB9_4
	jmp	.LBB9_3

playground::validate_decimal_precision2:
	subq	$1368, %rsp
	movb	%dl, %al
	movb	%al, 55(%rsp)
	movq	%rsi, 56(%rsp)
	movq	%rdi, 64(%rsp)
	movq	%rdi, 1296(%rsp)
	movq	%rsi, 1304(%rsp)
	movb	%al, 1327(%rsp)
	cmpb	$38, %al
	ja	.LBB10_2
	leaq	80(%rsp), %rdi
	leaq	.L__unnamed_5(%rip), %rsi
	movl	$608, %edx
	callq	memcpy@PLT   <----- MEMCPY HERE
	movb	55(%rsp), %al
	movb	%al, 1367(%rsp)
	movzbl	%al, %eax
	movq	%rax, %rcx
	subq	$1, %rcx
	movq	%rcx, 40(%rsp)
	cmpq	$1, %rax
	jb	.LBB10_4
	jmp	.LBB10_3

What changes are included in this PR?

  • New precision check that returns bool instead of Err and avoids a memcpy (see the usage sketch after this list)
  • Added a Rust unit test for SumDecimal to make profiling easier
  • Minor code improvements
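
A minimal sketch of how a bool-returning check is consumed in the hot loop. The type and field names here (SumState, is_valid) are illustrative, not the exact Comet code:

struct SumState {
    sums: Vec<i128>,
    is_valid: Vec<bool>, // a group that overflows becomes NULL
    precision: u8,
}

impl SumState {
    fn update_single(&mut self, group_index: usize, value: i128) {
        let sum = self.sums[group_index].wrapping_add(value);
        self.sums[group_index] = sum;
        // Just a bool on the hot path: no Err, no formatted error string.
        if !validate_decimal_precision1(sum, self.precision) {
            self.is_valid[group_index] = false;
        }
    }
}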

How are these changes tested?

  • New benchmark (a Criterion skeleton is sketched after this list)
  • Existing tests
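
For reference, a skeleton of a Criterion benchmark that would produce the aggregate/... benchmark IDs shown in the results below. The function name and body here are assumptions; the real benchmark lives in the Comet repo:

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_decimal_aggregates(c: &mut Criterion) {
    let mut group = c.benchmark_group("aggregate");
    group.bench_function("sum_decimal_comet", |b| {
        b.iter(|| {
            // feed a batch of Decimal128 values through the SumDecimal
            // accumulator (details elided)
        })
    });
    group.finish();
}

criterion_group!(benches, bench_decimal_aggregates);
criterion_main!(benches);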

@andygrove (Member, Author) commented:

Benchmark results:

aggregate/avg_decimal_comet
                        time:   [905.95 µs 907.59 µs 909.32 µs]
                        change: [-4.4075% -4.0254% -3.6624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

aggregate/sum_decimal_comet
                        time:   [781.78 µs 786.56 µs 791.40 µs]
                        change: [-15.686% -15.052% -14.420%] (p = 0.00 < 0.05)
                        Performance has improved.

@andygrove (Member, Author) commented:

10 runs of TPC-H q1 @ 100 GB:

main branch


 "1": [
        13.25664234161377,
        11.419574975967407,
        11.131508111953735,
        11.158952474594116,
        11.132988929748535,
        11.101924657821655,
        11.077952146530151,
        11.080657720565796,
        11.011901378631592,
        11.053855419158936
    ]

this PR

"1": [
        12.989612579345703,
        10.942750453948975,
        10.66578459739685,
        10.707809686660767,
        10.62846064567566,
        10.636594295501709,
        10.699990034103394,
        10.733628034591675,
        10.689185380935669,
        10.626504182815552
    ]

@andygrove (Member, Author) commented:

aggregate/avg_decimal_datafusion
                        time:   [648.82 µs 651.35 µs 654.29 µs]
aggregate/avg_decimal_comet
                        time:   [882.83 µs 883.65 µs 884.48 µs]
aggregate/sum_decimal_datafusion
                        time:   [695.71 µs 696.20 µs 696.70 µs]
aggregate/sum_decimal_comet
                        time:   [784.45 µs 786.38 µs 788.42 µs]

@codecov-commenter commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.81%. Comparing base (cbaf1c2) to head (5381327).
Report is 5 commits behind head on main.

@@             Coverage Diff              @@
##               main     #952      +/-   ##
============================================
+ Coverage     33.80%   33.81%   +0.01%     
+ Complexity      852      851       -1     
============================================
  Files           112      112              
  Lines         43276    43286      +10     
  Branches       9572     9572              
============================================
+ Hits          14629    14639      +10     
  Misses        25634    25634              
  Partials       3013     3013              


@andygrove marked this pull request as ready for review on September 19, 2024 at 14:18
@andygrove (Member, Author) commented:

Comparison between main branch and this PR:

[charts: tpch_allqueries, tpch_queries_speedup_rel]

@mbutrovich (Contributor) commented:

I was a bit surprised to see a performance win from changing an if-else to a compound boolean expression, since this seems like something that an optimizing compiler should handle well. I think I confirmed that part by putting the example code above in the Rust playground, and in Release mode the memcpy doesn't exist. If I leave it in Debug mode, I see the memcpy the same as described above. That doesn't explain the real performance win in Comet, however.

My next guess was that hoisting the implementation from arrow-rs into Comet enabled better inlining opportunities. I'm not as proficient reading ARM assembly as x86, but I believe the relevant bit is below. This is the current main-branch disassembly for comet::execution::datafusion::expressions::avg_decimal::AvgDecimalGroupsAccumulator::update_single:

bl arrow_data::decimal::validate_decimal_precision

There's a branch and link that isn't present in this PR's disassembly. The compiler is able to better inline the hoisted code.

I am not as familiar with Rust's build environment, so I'm not sure if this is expected when calling into code from other crates. I see Comet currently uses thin LTO. I am curious whether full LTO would enable better inlining of functions like this, or if we're just at the limits of what the compiler can do. In general, this makes me curious about the performance limits of arrow-rs kernels in hot code paths, and it may guide our future optimizations.
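
As background (general Rust behavior, not verified against this PR's build): without LTO, rustc only inlines across crate boundaries when the callee is generic or carries an #[inline] hint, so hoisting the check into Comet's own crate removes that barrier; full LTO (lto = "fat" in the release profile) is the other lever. A sketch of the attribute-based approach, reusing the arrow-rs constants shown earlier and an illustrative function name:

// #[inline] publishes the function body to other codegen units and crates,
// so call sites like the `bl` above can be inlined without fat LTO.
#[inline]
fn is_valid_decimal_precision(value: i128, precision: u8) -> bool {
    precision >= 1
        && precision <= DECIMAL128_MAX_PRECISION
        && value >= MIN_DECIMAL_FOR_EACH_PRECISION[usize::from(precision) - 1]
        && value <= MAX_DECIMAL_FOR_EACH_PRECISION[usize::from(precision) - 1]
}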

@andygrove changed the title from "perf: Avoid memcpy during decimal precision check in decimal aggregates (sum and avg)" to "perf: Optimize decimal precision check in decimal aggregates (sum and avg)" on Sep 22, 2024
@kazuyukitanimura (Contributor) left a review:

LGTM

@@ -127,19 +127,22 @@ impl AggregateUDFImpl for SumDecimal {
    fn reverse_expr(&self) -> ReversedUDAF {
        ReversedUDAF::Identical
    }

    fn is_nullable(&self) -> bool {
        // SumDecimal is always nullable because overflows can cause null values
        true
    }
@kazuyukitanimura (Contributor) commented on is_nullable:

Wondering if this is true for ANSI mode. It looks like the previous code also hardcodes true, but this may be a good time to file an issue if there isn't one yet.

@andygrove (Member, Author) replied:

Good point. I filed #961

@andygrove (Member, Author) commented:

> [quoting @mbutrovich's comment above in full]

Thanks @mbutrovich. This is very insightful. I'd like to go ahead and merge this PR but would also like to have a better understanding of why this is actually faster.

@andygrove merged commit 50517f6 into apache:main on Sep 24, 2024
75 checks passed
@andygrove deleted the is_valid_decimal_precision branch on September 24, 2024 at 14:35