Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flesh out NullBuffer abstraction (#3880) #3885

Merged
merged 3 commits into from
Mar 21, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Part of #3880

Rationale for this change

This continues the process of moving to more expressive buffer abstractions

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 17, 2023
use arrow_schema::ArrowError;
use std::sync::Arc;

#[inline]
unsafe fn build_primitive_array<O: ArrowPrimitiveType>(
len: usize,
buffer: Buffer,
null_count: usize,
null_buffer: Option<Buffer>,
nulls: Option<NullBuffer>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The astute will notice this is precisely the PrimitiveArray::new_unchecked that #3880 proposes to add 🎉

vec![result.finish()],
vec![],
)
ArrayDataBuilder::new(DataType::Boolean)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A future BooleanArray::new_unchecked should clean this up 😄

@@ -23,7 +23,7 @@ pub use data::*;
mod equal;
pub mod transform;

pub mod bit_iterator;
pub use arrow_buffer::bit_iterator;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved to arrow_buffer

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 17, 2023
.as_ref()
.map(|x| len - x.count_set_bits_offset(0, len))
.unwrap_or_default();
let nulls = NullBuffer::union(a.nulls(), b.nulls());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only is this less code, but it can now preserve the source null buffers even if they have an offset

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- the code looks much nicer to me

Please run benchmarks prior to merging this -- while it shouldn't make any difference I can imagine silly things like cross crate inlining or whatever might cause unexpected performance slowdown

cc @viirya

@@ -391,17 +372,12 @@ where
if a.null_count() == 0 && b.null_count() == 0 {
try_binary_no_nulls_mut(len, a, b, op)
} else {
let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len);
let null_count = null_buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is nice that the null count now is encapsulated in NullBuffer rather than needing to be calculated explicitly (and potentially inconsistently)

.null_count(null_count);

let array_data = unsafe { array_builder.build_unchecked() };
let array_builder = builder.finish().into_data().into_builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we need to make a new builder (rather than just update the one we had) -- though I see this is what the previous code did too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimitiveBuilder vs ArrayBuilder, this should be greatly simplified with the ongoing work in #3880

@@ -145,3 +146,41 @@ impl BooleanBuffer {
self.buffer
}
}

impl BitAnd<&BooleanBuffer> for &BooleanBuffer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice 👍 -- it will be really nice to apply such bit operations on BooleanBuffer / NullBuffers directly and not via some function

fn bitand(self, rhs: &BooleanBuffer) -> Self::Output {
assert_eq!(self.len, rhs.len);
BooleanBuffer {
buffer: buffer_bin_and(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed above that buffers are is combined using combine_option_bitmap which is different than this.

However, when I double checked that combine_option_bitmap calls eventually to buffer_bin_and

pub fn combine_option_bitmap(
arrays: &[&ArrayData],
len_in_bits: usize,
) -> Option<Buffer> {
let (buffer, offset) = arrays
.iter()
.map(|array| match array.nulls() {
Some(n) => (Some(n.buffer().clone()), n.offset()),
None => (None, 0),
})
.reduce(|acc, buffer_and_offset| match (acc, buffer_and_offset) {
((None, _), (None, _)) => (None, 0),
((Some(buffer), offset), (None, _)) | ((None, _), (Some(buffer), offset)) => {
(Some(buffer), offset)
}
((Some(buffer_left), offset_left), (Some(buffer_right), offset_right)) => (
Some(buffer_bin_and(
&buffer_left,
offset_left,
&buffer_right,
offset_right,
len_in_bits,
)),
0,
),
})?;
Some(buffer?.bit_slice(offset, len_in_bits))

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. It looks much simple and clear.

@@ -50,6 +51,18 @@ impl NullBuffer {
Self { buffer, null_count }
}

/// Computes the union of two optional [`NullBuffer`]
pub fn union(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bit and, but I think union sounds like bit or? Maybe bitand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its union of the nulls, its confusing because a 0 bit is a null... I'll add further docs

@tustvold
Copy link
Contributor Author

I've confirmed this to have no major impact on benchmarks

@tustvold tustvold merged commit 67fe807 into apache:master Mar 21, 2023
spebern pushed a commit to spebern/arrow-rs that referenced this pull request Mar 25, 2023
* Flesh out NullBuffer abstraction (apache#3880)

* Review feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants