Write faster kernel for is_distinct #4560

comphead · 2022-12-09T00:20:32Z

Which issue does this PR close?

Closes #4482.

Rationale for this change

See #4482

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead · 2022-12-09T00:25:45Z

Hi @Dandandan

This change includes optimization for is_distinct/is_not_distinct.

1B array processed with avg 170s before, and 80s now.

Also using our own conversion to BooleanArray instead of BooleanArray::from_slice, as their implementation likely loses ticks on derefencing in loop. This part allowed to save 20s. Perhaps we can fix it in arrow-rs as well.

alamb · 2022-12-09T11:54:02Z

@tustvold do you have any additional suggestions for speeding up this kernel?

tustvold

I'm not sure this handles nulls correctly, FWIW the next release of arrow (due today) adds a BooleanArray::from_binary which will be significantly faster than this as it vectorises properly. In the short-term you could copy its implementation - https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/boolean_array.rs#L232

The key primitive that allows it to vectorise correctly is MutableBuffer::collect_bool.

Edit: Correctly handling nulls is a bit of a pain, let me bash something out for you

Dandandan · 2022-12-09T12:50:31Z

datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs

+    let array_len = left.data().len().min(right.data().len());
+    let mut arr = vec![false; array_len].into_boxed_slice();
+
+    let left_values = left.values();


It should also look at the null bitmap (if available).

tustvold · 2022-12-09T14:45:19Z

It needs proper testing but the following I think is correct

pub(crate) fn is_distinct_from<A>(left: A, right: A) -> Result<BooleanArray>
where
    A: ArrayAccessor,
    A::Item: PartialEq,
{
    let ld = left.data();
    let rd = right.data();
    assert_eq!(ld.len(), rd.len());

    let mut distinct = MutableBuffer::collect_bool(left.len(), |i| unsafe {
        // SAFETY: i in range 0..len
        left.value_unchecked(i) != right.value_unchecked(i)
    });

    if let (Some(l), Some(r)) = (ld.null_buffer(), rd.null_buffer()) {
        let left = BitChunks::new(l.as_slice(), ld.offset(), ld.len());
        let right = BitChunks::new(r.as_slice(), rd.offset(), rd.len());

        // Pad to multiple of u64
        distinct.resize(bit_util::ceil(ld.len(), 8), 0);

        // Distinct so long as one or both are valid
        distinct
            .typed_data_mut::<u64>()
            .iter_mut()
            .zip(left)
            .zip(right)
            .for_each(|((out, left), right)| *out = *out & (left | right))
    }

    let builder = ArrayDataBuilder::new(DataType::Boolean)
        .len(left.len())
        .add_buffer(distinct.into());

    Ok(BooleanArray::from(unsafe { builder.build_unchecked() }))
}

I'll work on getting it added to arrow-rs

comphead · 2022-12-09T18:21:01Z

Thanks @tustvold @Dandandan for the feedback. The same 1B perf test for the code above is 90s.

Will is_distinct_from<A>/is_not_distinct_from<A> be a part of arrow-rs and datafusion adopts it with new arrow-rs version?

tustvold · 2022-12-09T18:25:20Z

Yes, I believe there is a ticket already to add it

comphead · 2022-12-09T18:42:50Z

Yes, I believe there is a ticket already to add it

Thanks, please ping the arrow-rs ticket for tracking and we close this issue once adopt arrow-rs new implementation

alamb · 2022-12-09T18:42:58Z

Yes, I believe there is a ticket already to add it

apache/arrow-rs#960

Dandandan · 2022-12-09T22:40:29Z

Looks correct to me and a really clever implementation!

Dandandan · 2022-12-09T22:48:41Z

Ah on further look - I think it might not be correct for when the left or right side doesn't contain a null bitmap.

comphead · 2022-12-09T22:53:19Z

Ah on further look - I think it might not be correct for when the left or right side doesn't contain a null bitmap.

@Dandandan please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap?

tustvold · 2022-12-09T23:10:59Z

please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap?

The iterators consult the bitmap and return Option<T> and so the x == y is actually comparing Option and not the bare values.

I think it might not be correct for when the left or right side doesn't contain a null bitmap.

Assuming *out & (left | right) is correct (which I think it is) if either side is missing a null bitmap it simplifies to *out & (left | TRUE) i.e. *out.

Edit: I think it might need to be (*out & left & right) | (left ^ right) 🤔

Dandandan · 2022-12-10T08:15:33Z

please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap?

The iterators consult the bitmap and return Option<T> and so the x == y is actually comparing Option and not the bare values.

I think it might not be correct for when the left or right side doesn't contain a null bitmap.

Assuming *out & (left | right) is correct (which I think it is) if either side is missing a null bitmap it simplifies to *out & (left | TRUE) i.e. *out.

Edit: I think it might need to be (*out & left & right) | (left ^ right) 🤔

The main thing I think is wrong is that left.value_unchecked(i) != right.value_unchecked(i) might return true or false regardless of validity for each i. If only the left side contains a bitmap, it should also consider this left bitmap for the results (always distinct if left side is null value regardless of values or distinct if valid and having different values)`.

tustvold · 2022-12-10T08:18:54Z

Agreed, I got the boolean expression wrong, I think the corrected form I posted is correct, and would require handling the single buffer case?

comphead · 2022-12-10T18:00:03Z

@Dandandan when you say null bitmap, do you mean this structure
https://github.com/apache/arrow-datafusion/blob/master/datafusion/common/src/from_slice.rs#L105?

Dandandan · 2022-12-10T19:46:06Z

I mean the validity/null data structure that can optionally be present in arrow arrays. This uses a bitmap to encode false/true values (not valid / valid).

You can see it is accessed as .null_buffer() from the code from @tustvold .

alamb · 2022-12-12T14:34:58Z

Thank you @comphead

Here is what I recommend:

Add tests to this PR (or point at other pre-existing tests) that show it works correctly with NULL values
Merge this PR and we can file a ticket to switch to the arrow-rs kernel if/when that is available.

tustvold · 2022-12-12T14:39:32Z

Some example cases to try out might be:

NULL is distinct from the 0 value
NULL is not distinct from NULL even if the values in the corresponding null slots are distinct
NULL is distinct from an array containing no nulls

comphead · 2022-12-13T22:28:34Z

Avg time now 70s

@alamb added tests for nulls
@tustvold iterating through .values() is significantly faster than accessing value_unchecked(i)
@Dandandan you mentioned null bitmap is optional, is it for some specific primitiveArray types?

tustvold · 2022-12-13T22:36:48Z

datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs

@@ -515,4 +513,40 @@ mod tests {
        let err = modulus_decimal_scalar(&left_decimal_array, 0).unwrap_err();
        assert_eq!("Arrow error: Divide by zero error", err.to_string());
    }
+
+    #[test]
+    fn is_distinct_from_nulls() -> Result<()> {


tustvold · 2022-12-13T22:39:18Z

iterating through .values() is significantly faster than accessing value_unchecked(i)

Yeah, I've seen this before. Last time it required some hackery with inline(never) to make it do the right thing. It is frustrating playing LLVM bingo 😆

Avg time now 70s

Very cool, perhaps you could add the benchmark code so that we can catch regressions in this? I'm also curious if this benchmark contains any null values?

you mentioned null bitmap is optional, is it for some specific primitiveArray types?

Arrow arrays will only contain a null bitmap if they have a non-zero null count, otherwise it will be empty. No null buffer means all positions are valid, i.e. non-null.

alamb

THanks @comphead

datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs

alamb · 2022-12-14T10:55:04Z

I think this PR is ready to merge. Please let me know if there any concerns with doing so otherwise I will do so tomorrow

comphead · 2022-12-14T16:36:38Z

Thanks @alamb Added 1 more test to cover Arrow arrays will only contain a null bitmap if they have a non-zero null count, otherwise it will be empty. No null buffer means all positions are valid, i.e. non-null.

For bench I'll create a separate PR

alamb · 2022-12-14T20:28:28Z

Thanks @comphead

Dandandan · 2022-12-14T20:31:11Z

Nice, thank you @comphead !

ursabot · 2022-12-14T20:32:43Z

Benchmark runs are scheduled for baseline = 40e6a67 and contender = 4542031. 4542031 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liukun4515 · 2022-12-15T02:47:51Z

Do we need to move the is distinct and is not distinct to arrow-rs?
cc @alamb @alamb

comphead · 2022-12-15T04:47:37Z

@liukun4515 thats a great question and the ticket still exists apache/arrow-rs#960
But it requires a bit more work on arrow side, planning to take care on this ticket

alamb · 2022-12-15T18:30:45Z

Do we need to move the is distinct and is not distinct to arrow-rs?

I agree -- we should move the kernel to arrow and that work is tracked by apache/arrow-rs#960 (which is also referenced in the code) https://github.com/apache/arrow-datafusion/pull/4560/files#diff-133624520f20c20b33c2c54ec6ae565f09b058ef317ede0e768bd23bae30941dR26

Optimize is_distinct

c3b4e56

github-actions bot added the physical-expr Physical Expressions label Dec 9, 2022

tustvold reviewed Dec 9, 2022

View reviewed changes

Dandandan reviewed Dec 9, 2022

View reviewed changes

comphead mentioned this pull request Dec 9, 2022

Consider adding is_distinct_from kernels apache/arrow-rs#960

Closed

Merge branch 'apache:master' into opt_isdistinct

b30fb38

comphead marked this pull request as draft December 13, 2022 16:22

added tests

1ea2451

tustvold reviewed Dec 13, 2022

View reviewed changes

alamb approved these changes Dec 14, 2022

View reviewed changes

datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs Outdated Show resolved Hide resolved

alamb marked this pull request as ready for review December 14, 2022 10:54

added 1 more test

4b592bd

alamb merged commit 4542031 into apache:master Dec 14, 2022

comphead deleted the opt_isdistinct branch December 14, 2022 21:50

tustvold mentioned this pull request Aug 18, 2023

Is Distinct From Incorrectly Handles Masked Nulls #7332

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write faster kernel for is_distinct #4560

Write faster kernel for is_distinct #4560

comphead commented Dec 9, 2022

comphead commented Dec 9, 2022

alamb commented Dec 9, 2022

tustvold left a comment •

edited

Loading

Dandandan Dec 9, 2022

tustvold commented Dec 9, 2022

comphead commented Dec 9, 2022

tustvold commented Dec 9, 2022

comphead commented Dec 9, 2022

alamb commented Dec 9, 2022

Dandandan commented Dec 9, 2022

Dandandan commented Dec 9, 2022 •

edited

Loading

comphead commented Dec 9, 2022

tustvold commented Dec 9, 2022 •

edited

Loading

Dandandan commented Dec 10, 2022 •

edited

Loading

tustvold commented Dec 10, 2022

comphead commented Dec 10, 2022

Dandandan commented Dec 10, 2022

alamb commented Dec 12, 2022

tustvold commented Dec 12, 2022 •

edited

Loading

comphead commented Dec 13, 2022

tustvold Dec 13, 2022

tustvold commented Dec 13, 2022 •

edited

Loading

alamb left a comment

alamb commented Dec 14, 2022

comphead commented Dec 14, 2022

alamb commented Dec 14, 2022

Dandandan commented Dec 14, 2022

ursabot commented Dec 14, 2022

liukun4515 commented Dec 15, 2022

comphead commented Dec 15, 2022

alamb commented Dec 15, 2022

Write faster kernel for is_distinct #4560

Write faster kernel for is_distinct #4560

Conversation

comphead commented Dec 9, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead commented Dec 9, 2022

alamb commented Dec 9, 2022

tustvold left a comment • edited Loading

Choose a reason for hiding this comment

Dandandan Dec 9, 2022

Choose a reason for hiding this comment

tustvold commented Dec 9, 2022

comphead commented Dec 9, 2022

tustvold commented Dec 9, 2022

comphead commented Dec 9, 2022

alamb commented Dec 9, 2022

Dandandan commented Dec 9, 2022

Dandandan commented Dec 9, 2022 • edited Loading

comphead commented Dec 9, 2022

tustvold commented Dec 9, 2022 • edited Loading

Dandandan commented Dec 10, 2022 • edited Loading

tustvold commented Dec 10, 2022

comphead commented Dec 10, 2022

Dandandan commented Dec 10, 2022

alamb commented Dec 12, 2022

tustvold commented Dec 12, 2022 • edited Loading

comphead commented Dec 13, 2022

tustvold Dec 13, 2022

Choose a reason for hiding this comment

tustvold commented Dec 13, 2022 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

alamb commented Dec 14, 2022

comphead commented Dec 14, 2022

alamb commented Dec 14, 2022

Dandandan commented Dec 14, 2022

ursabot commented Dec 14, 2022

liukun4515 commented Dec 15, 2022

comphead commented Dec 15, 2022

alamb commented Dec 15, 2022

tustvold left a comment •

edited

Loading

Dandandan commented Dec 9, 2022 •

edited

Loading

tustvold commented Dec 9, 2022 •

edited

Loading

Dandandan commented Dec 10, 2022 •

edited

Loading

tustvold commented Dec 12, 2022 •

edited

Loading

tustvold commented Dec 13, 2022 •

edited

Loading