-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write faster kernel for is_distinct #4560
Conversation
Hi @Dandandan This change includes optimization for 1B array processed with avg 170s before, and 80s now. Also using our own conversion to |
@tustvold do you have any additional suggestions for speeding up this kernel? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this handles nulls correctly, FWIW the next release of arrow (due today) adds a BooleanArray::from_binary
which will be significantly faster than this as it vectorises properly. In the short-term you could copy its implementation - https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/boolean_array.rs#L232
The key primitive that allows it to vectorise correctly is MutableBuffer::collect_bool
.
Edit: Correctly handling nulls is a bit of a pain, let me bash something out for you
let array_len = left.data().len().min(right.data().len()); | ||
let mut arr = vec![false; array_len].into_boxed_slice(); | ||
|
||
let left_values = left.values(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should also look at the null bitmap (if available).
It needs proper testing but the following I think is correct
I'll work on getting it added to arrow-rs |
Thanks @tustvold @Dandandan for the feedback. The same 1B perf test for the code above is 90s. Will |
Yes, I believe there is a ticket already to add it |
Thanks, please ping the arrow-rs ticket for tracking and we close this issue once adopt arrow-rs new implementation |
|
Looks correct to me and a really clever implementation! |
Ah on further look - I think it might not be correct for when the left or right side doesn't contain a null bitmap. |
@Dandandan please correct me if I'm wrong but current implementation in datafusion also lacks null bitmap? |
The iterators consult the bitmap and return
Assuming Edit: I think it might need to be |
The main thing I think is wrong is that |
Agreed, I got the boolean expression wrong, I think the corrected form I posted is correct, and would require handling the single buffer case? |
@Dandandan when you say null bitmap, do you mean this structure |
I mean the validity/null data structure that can optionally be present in arrow arrays. This uses a bitmap to encode false/true values (not valid / valid). You can see it is accessed as |
Thank you @comphead Here is what I recommend:
|
Some example cases to try out might be:
|
Avg time now 70s @alamb added tests for nulls |
@@ -515,4 +513,40 @@ mod tests { | |||
let err = modulus_decimal_scalar(&left_decimal_array, 0).unwrap_err(); | |||
assert_eq!("Arrow error: Divide by zero error", err.to_string()); | |||
} | |||
|
|||
#[test] | |||
fn is_distinct_from_nulls() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Yeah, I've seen this before. Last time it required some hackery with
Very cool, perhaps you could add the benchmark code so that we can catch regressions in this? I'm also curious if this benchmark contains any null values?
Arrow arrays will only contain a null bitmap if they have a non-zero null count, otherwise it will be empty. No null buffer means all positions are valid, i.e. non-null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THanks @comphead
datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs
Outdated
Show resolved
Hide resolved
I think this PR is ready to merge. Please let me know if there any concerns with doing so otherwise I will do so tomorrow |
Thanks @alamb Added 1 more test to cover For bench I'll create a separate PR |
Thanks @comphead |
Nice, thank you @comphead ! |
Benchmark runs are scheduled for baseline = 40e6a67 and contender = 4542031. 4542031 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
@liukun4515 thats a great question and the ticket still exists apache/arrow-rs#960 |
I agree -- we should move the kernel to arrow and that work is tracked by apache/arrow-rs#960 (which is also referenced in the code) https://github.com/apache/arrow-datafusion/pull/4560/files#diff-133624520f20c20b33c2c54ec6ae565f09b058ef317ede0e768bd23bae30941dR26 |
Which issue does this PR close?
Closes #4482.
Rationale for this change
See #4482
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?