-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flesh out NullBuffer abstraction (#3880) #3885
Conversation
use arrow_schema::ArrowError; | ||
use std::sync::Arc; | ||
|
||
#[inline] | ||
unsafe fn build_primitive_array<O: ArrowPrimitiveType>( | ||
len: usize, | ||
buffer: Buffer, | ||
null_count: usize, | ||
null_buffer: Option<Buffer>, | ||
nulls: Option<NullBuffer>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The astute will notice this is precisely the PrimitiveArray::new_unchecked
that #3880 proposes to add 🎉
vec![result.finish()], | ||
vec![], | ||
) | ||
ArrayDataBuilder::new(DataType::Boolean) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A future BooleanArray::new_unchecked
should clean this up 😄
@@ -23,7 +23,7 @@ pub use data::*; | |||
mod equal; | |||
pub mod transform; | |||
|
|||
pub mod bit_iterator; | |||
pub use arrow_buffer::bit_iterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved to arrow_buffer
.as_ref() | ||
.map(|x| len - x.count_set_bits_offset(0, len)) | ||
.unwrap_or_default(); | ||
let nulls = NullBuffer::union(a.nulls(), b.nulls()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only is this less code, but it can now preserve the source null buffers even if they have an offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me -- the code looks much nicer to me
Please run benchmarks prior to merging this -- while it shouldn't make any difference I can imagine silly things like cross crate inlining or whatever might cause unexpected performance slowdown
cc @viirya
@@ -391,17 +372,12 @@ where | |||
if a.null_count() == 0 && b.null_count() == 0 { | |||
try_binary_no_nulls_mut(len, a, b, op) | |||
} else { | |||
let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len); | |||
let null_count = null_buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is nice that the null count now is encapsulated in NullBuffer
rather than needing to be calculated explicitly (and potentially inconsistently)
.null_count(null_count); | ||
|
||
let array_data = unsafe { array_builder.build_unchecked() }; | ||
let array_builder = builder.finish().into_data().into_builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why we need to make a new builder (rather than just update the one we had) -- though I see this is what the previous code did too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrimitiveBuilder vs ArrayBuilder, this should be greatly simplified with the ongoing work in #3880
@@ -145,3 +146,41 @@ impl BooleanBuffer { | |||
self.buffer | |||
} | |||
} | |||
|
|||
impl BitAnd<&BooleanBuffer> for &BooleanBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very nice 👍 -- it will be really nice to apply such bit operations on BooleanBuffer / NullBuffers directly and not via some function
fn bitand(self, rhs: &BooleanBuffer) -> Self::Output { | ||
assert_eq!(self.len, rhs.len); | ||
BooleanBuffer { | ||
buffer: buffer_bin_and( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed above that buffers are is combined using combine_option_bitmap
which is different than this.
However, when I double checked that combine_option_bitmap
calls eventually to buffer_bin_and
arrow-rs/arrow-data/src/bit_mask.rs
Lines 71 to 98 in e41e0ca
pub fn combine_option_bitmap( | |
arrays: &[&ArrayData], | |
len_in_bits: usize, | |
) -> Option<Buffer> { | |
let (buffer, offset) = arrays | |
.iter() | |
.map(|array| match array.nulls() { | |
Some(n) => (Some(n.buffer().clone()), n.offset()), | |
None => (None, 0), | |
}) | |
.reduce(|acc, buffer_and_offset| match (acc, buffer_and_offset) { | |
((None, _), (None, _)) => (None, 0), | |
((Some(buffer), offset), (None, _)) | ((None, _), (Some(buffer), offset)) => { | |
(Some(buffer), offset) | |
} | |
((Some(buffer_left), offset_left), (Some(buffer_right), offset_right)) => ( | |
Some(buffer_bin_and( | |
&buffer_left, | |
offset_left, | |
&buffer_right, | |
offset_right, | |
len_in_bits, | |
)), | |
0, | |
), | |
})?; | |
Some(buffer?.bit_slice(offset, len_in_bits)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. It looks much simple and clear.
@@ -50,6 +51,18 @@ impl NullBuffer { | |||
Self { buffer, null_count } | |||
} | |||
|
|||
/// Computes the union of two optional [`NullBuffer`] | |||
pub fn union( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is bit and, but I think union
sounds like bit or? Maybe bitand
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its union of the nulls, its confusing because a 0 bit is a null... I'll add further docs
I've confirmed this to have no major impact on benchmarks |
* Flesh out NullBuffer abstraction (apache#3880) * Review feedback
Which issue does this PR close?
Part of #3880
Rationale for this change
This continues the process of moving to more expressive buffer abstractions
What changes are included in this PR?
Are there any user-facing changes?