Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write faster kernel for is_distinct #4560

Merged
merged 4 commits into from
Dec 14, 2022
Merged

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Dec 9, 2022

Which issue does this PR close?

Closes #4482.

Rationale for this change

See #4482

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 9, 2022
@comphead
Copy link
Contributor Author

comphead commented Dec 9, 2022

Hi @Dandandan

This change includes optimization for is_distinct/is_not_distinct.

1B array processed with avg 170s before, and 80s now.

Also using our own conversion to BooleanArray instead of BooleanArray::from_slice, as their implementation likely loses ticks on derefencing in loop. This part allowed to save 20s. Perhaps we can fix it in arrow-rs as well.

@alamb
Copy link
Contributor

alamb commented Dec 9, 2022

@tustvold do you have any additional suggestions for speeding up this kernel?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this handles nulls correctly, FWIW the next release of arrow (due today) adds a BooleanArray::from_binary which will be significantly faster than this as it vectorises properly. In the short-term you could copy its implementation - https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/boolean_array.rs#L232

The key primitive that allows it to vectorise correctly is MutableBuffer::collect_bool.

Edit: Correctly handling nulls is a bit of a pain, let me bash something out for you

let array_len = left.data().len().min(right.data().len());
let mut arr = vec![false; array_len].into_boxed_slice();

let left_values = left.values();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also look at the null bitmap (if available).

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2022

It needs proper testing but the following I think is correct

pub(crate) fn is_distinct_from<A>(left: A, right: A) -> Result<BooleanArray>
where
    A: ArrayAccessor,
    A::Item: PartialEq,
{
    let ld = left.data();
    let rd = right.data();
    assert_eq!(ld.len(), rd.len());

    let mut distinct = MutableBuffer::collect_bool(left.len(), |i| unsafe {
        // SAFETY: i in range 0..len
        left.value_unchecked(i) != right.value_unchecked(i)
    });

    if let (Some(l), Some(r)) = (ld.null_buffer(), rd.null_buffer()) {
        let left = BitChunks::new(l.as_slice(), ld.offset(), ld.len());
        let right = BitChunks::new(r.as_slice(), rd.offset(), rd.len());

        // Pad to multiple of u64
        distinct.resize(bit_util::ceil(ld.len(), 8), 0);

        // Distinct so long as one or both are valid
        distinct
            .typed_data_mut::<u64>()
            .iter_mut()
            .zip(left)
            .zip(right)
            .for_each(|((out, left), right)| *out = *out & (left | right))
    }

    let builder = ArrayDataBuilder::new(DataType::Boolean)
        .len(left.len())
        .add_buffer(distinct.into());

    Ok(BooleanArray::from(unsafe { builder.build_unchecked() }))
}

I'll work on getting it added to arrow-rs

@comphead
Copy link
Contributor Author

comphead commented Dec 9, 2022

Thanks @tustvold @Dandandan for the feedback. The same 1B perf test for the code above is 90s.

Will is_distinct_from<A>/is_not_distinct_from<A> be a part of arrow-rs and datafusion adopts it with new arrow-rs version?

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2022

Yes, I believe there is a ticket already to add it

@comphead
Copy link
Contributor Author

comphead commented Dec 9, 2022

Yes, I believe there is a ticket already to add it

Thanks, please ping the arrow-rs ticket for tracking and we close this issue once adopt arrow-rs new implementation

@alamb
Copy link
Contributor

alamb commented Dec 9, 2022

Yes, I believe there is a ticket already to add it

apache/arrow-rs#960

@Dandandan
Copy link
Contributor

Looks correct to me and a really clever implementation!

@Dandandan
Copy link
Contributor

Dandandan commented Dec 9, 2022

Ah on further look - I think it might not be correct for when the left or right side doesn't contain a null bitmap.

@comphead
Copy link
Contributor Author

comphead commented Dec 9, 2022

Ah on further look - I think it might not be correct for when the left or right side doesn't contain a null bitmap.

@Dandandan please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap?

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2022

please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap?

The iterators consult the bitmap and return Option<T> and so the x == y is actually comparing Option and not the bare values.

I think it might not be correct for when the left or right side doesn't contain a null bitmap.

Assuming *out & (left | right) is correct (which I think it is) if either side is missing a null bitmap it simplifies to *out & (left | TRUE) i.e. *out.

Edit: I think it might need to be (*out & left & right) | (left ^ right) 🤔

@Dandandan
Copy link
Contributor

Dandandan commented Dec 10, 2022

please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap?

The iterators consult the bitmap and return Option<T> and so the x == y is actually comparing Option and not the bare values.

I think it might not be correct for when the left or right side doesn't contain a null bitmap.

Assuming *out & (left | right) is correct (which I think it is) if either side is missing a null bitmap it simplifies to *out & (left | TRUE) i.e. *out.

Edit: I think it might need to be (*out & left & right) | (left ^ right) 🤔

The main thing I think is wrong is that left.value_unchecked(i) != right.value_unchecked(i) might return true or false regardless of validity for each i. If only the left side contains a bitmap, it should also consider this left bitmap for the results (always distinct if left side is null value regardless of values or distinct if valid and having different values)`.

@tustvold
Copy link
Contributor

Agreed, I got the boolean expression wrong, I think the corrected form I posted is correct, and would require handling the single buffer case?

@comphead
Copy link
Contributor Author

@Dandandan
Copy link
Contributor

I mean the validity/null data structure that can optionally be present in arrow arrays. This uses a bitmap to encode false/true values (not valid / valid).

You can see it is accessed as .null_buffer() from the code from @tustvold .

@alamb
Copy link
Contributor

alamb commented Dec 12, 2022

Thank you @comphead

Here is what I recommend:

  1. Add tests to this PR (or point at other pre-existing tests) that show it works correctly with NULL values
  2. Merge this PR and we can file a ticket to switch to the arrow-rs kernel if/when that is available.

@tustvold
Copy link
Contributor

tustvold commented Dec 12, 2022

Some example cases to try out might be:

  • NULL is distinct from the 0 value
  • NULL is not distinct from NULL even if the values in the corresponding null slots are distinct
  • NULL is distinct from an array containing no nulls

@comphead comphead marked this pull request as draft December 13, 2022 16:22
@comphead
Copy link
Contributor Author

Avg time now 70s

@alamb added tests for nulls
@tustvold iterating through .values() is significantly faster than accessing value_unchecked(i)
@Dandandan you mentioned null bitmap is optional, is it for some specific primitiveArray types?

@@ -515,4 +513,40 @@ mod tests {
let err = modulus_decimal_scalar(&left_decimal_array, 0).unwrap_err();
assert_eq!("Arrow error: Divide by zero error", err.to_string());
}

#[test]
fn is_distinct_from_nulls() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tustvold
Copy link
Contributor

tustvold commented Dec 13, 2022

iterating through .values() is significantly faster than accessing value_unchecked(i)

Yeah, I've seen this before. Last time it required some hackery with inline(never) to make it do the right thing. It is frustrating playing LLVM bingo 😆

Avg time now 70s

Very cool, perhaps you could add the benchmark code so that we can catch regressions in this? I'm also curious if this benchmark contains any null values?

you mentioned null bitmap is optional, is it for some specific primitiveArray types?

Arrow arrays will only contain a null bitmap if they have a non-zero null count, otherwise it will be empty. No null buffer means all positions are valid, i.e. non-null.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks @comphead

@alamb alamb marked this pull request as ready for review December 14, 2022 10:54
@alamb
Copy link
Contributor

alamb commented Dec 14, 2022

I think this PR is ready to merge. Please let me know if there any concerns with doing so otherwise I will do so tomorrow

@comphead
Copy link
Contributor Author

Thanks @alamb Added 1 more test to cover Arrow arrays will only contain a null bitmap if they have a non-zero null count, otherwise it will be empty. No null buffer means all positions are valid, i.e. non-null.

For bench I'll create a separate PR

@alamb alamb merged commit 4542031 into apache:master Dec 14, 2022
@alamb
Copy link
Contributor

alamb commented Dec 14, 2022

Thanks @comphead

@Dandandan
Copy link
Contributor

Nice, thank you @comphead !

@ursabot
Copy link

ursabot commented Dec 14, 2022

Benchmark runs are scheduled for baseline = 40e6a67 and contender = 4542031. 4542031 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@comphead comphead deleted the opt_isdistinct branch December 14, 2022 21:50
@liukun4515
Copy link
Contributor

Do we need to move the is distinct and is not distinct to arrow-rs?
cc @alamb @alamb

@comphead
Copy link
Contributor Author

@liukun4515 thats a great question and the ticket still exists apache/arrow-rs#960
But it requires a bit more work on arrow side, planning to take care on this ticket

@alamb
Copy link
Contributor

alamb commented Dec 15, 2022

Do we need to move the is distinct and is not distinct to arrow-rs?

I agree -- we should move the kernel to arrow and that work is tracked by apache/arrow-rs#960 (which is also referenced in the code) https://github.com/apache/arrow-datafusion/pull/4560/files#diff-133624520f20c20b33c2c54ec6ae565f09b058ef317ede0e768bd23bae30941dR26

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize is_distinct_from / is_not_distinct_from
6 participants