GH-41016: [C++] Fix null count check in BooleanArray.true_count() #41070
+1
@@ -56,7 +56,7 @@ int64_t BooleanArray::false_count() const {
 }
 
 int64_t BooleanArray::true_count() const {
-  if (data_->null_count.load() != 0) {
+  if (data_->GetNullCount() != 0) {
So the problem is that the validity buffer might not exist, which could cause a segmentation fault here?
Yes. (In general, I think we should never use `ArrayData.null_count` directly, except in low-level code that deals with it directly, like the implementation of `GetNullCount`.)
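To illustrate the point about not reading the raw field: below is a minimal sketch using a hypothetical `ToyArrayData` struct, which is a simplified stand-in and not Arrow's real `ArrayData`. It assumes the sentinel value is -1 (named `kUnknownNullCount` here) and models a missing validity buffer as an empty vector. The raw field can hold the sentinel, which compares unequal to 0 even when there are no nulls, while an accessor in the style of `GetNullCount` resolves it first.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for arrow::ArrayData (not the real class).
constexpr int64_t kUnknownNullCount = -1;

struct ToyArrayData {
  int64_t length = 0;
  // -1 means "not computed yet"; mutable so a const accessor can cache it.
  mutable std::atomic<int64_t> null_count{kUnknownNullCount};
  // An empty vector models a missing validity buffer (all values valid).
  std::vector<bool> validity;

  // Resolves the sentinel before returning, so callers never observe -1.
  int64_t GetNullCount() const {
    int64_t count = null_count.load();
    if (count == kUnknownNullCount) {
      count = 0;
      // With no validity buffer the loop body never runs and count stays 0.
      for (bool valid : validity) {
        if (!valid) ++count;
      }
      null_count.store(count);
    }
    return count;
  }
};
```

With a default-constructed `ToyArrayData`, `null_count.load() != 0` is true even though nothing is null, which is the situation that sends a caller down the "has nulls" path without a validity buffer to read.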
Will approve once #41070 (comment) is fixed.
By the way, should this be in 16.0?
Given it fixes a segfault, it might be good to tag it as 16.0, yes.
-  if (data_->null_count.load() != 0) {
+  if (data_->GetNullCount() != 0) {
Use `data_->MayHaveNulls()` instead. It will do the right thing without triggering the bitmap scan; the scan would be extra work that doesn't even speed up the rest of this function (except maybe by warming up the cache).
https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/data.h#L287-L291
Aha, so it's because a boolean check is always faster than a count?
Thanks @felipecrv!
> Aha, so it's because a boolean check is always faster than a count?

@mapleFU `GetNullCount` might scan the entire bitmap if `null_count == kUnknownNullCount`. It's O(length), while `MayHaveNulls()` is an O(1) check.
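The difference can be sketched with a toy model. Everything below is hypothetical (`ToyBooleanData` and `TrueCount` are illustrative names, not Arrow's API), and the `MayHaveNulls` condition is an assumption about what "the right thing" means here: report possible nulls only when the count is not known to be zero and a validity buffer actually exists. The check inspects two fields and never walks the bitmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kUnknownNullCount = -1;

// Hypothetical toy model of a boolean array; not Arrow's actual layout.
struct ToyBooleanData {
  int64_t null_count = kUnknownNullCount;
  std::vector<bool> validity;  // empty models "no validity buffer"
  std::vector<bool> values;

  // O(1): no bitmap scan, and false whenever there is no validity buffer.
  bool MayHaveNulls() const { return null_count != 0 && !validity.empty(); }
};

// Counts the values that are both valid and true.
int64_t TrueCount(const ToyBooleanData& data) {
  int64_t count = 0;
  if (data.MayHaveNulls()) {
    // Safe: the validity bitmap is guaranteed to exist on this path.
    for (std::size_t i = 0; i < data.values.size(); ++i) {
      if (data.validity[i] && data.values[i]) ++count;
    }
  } else {
    // No validity information to consult: every slot is treated as valid.
    for (bool v : data.values) {
      if (v) ++count;
    }
  }
  return count;
}
```

In this sketch an unknown null count with no validity buffer takes the second branch, so the missing buffer is never dereferenced and no exact null count is ever computed.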
Yeah, what I mean is that it doesn't need the exact count here, just a quick check of whether it may have nulls.
cpp/src/arrow/array/array_test.cc
Outdated
// GH-41016 true_count() with array without validity buffer with null_count of -1
auto data = ArrayFromJSON(boolean(), "[true, false, true]")->data();
data->null_count = -1;
auto arr_unknown_null_count = std::make_shared<BooleanArray>(data);
Not sure why we're redoing this after `ArrayFromJSON` already returned an array. Can't you just reuse the original array?
Ah, yes, of course. For some reason I had it in my head that I needed to recreate an array object with the changed null count, but I can of course just update the null_count of the array in place. Updated.
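Why the in-place update is enough can be shown with a small sketch. `ToyData` and `ToyArray` are hypothetical types, and the sketch assumes the array keeps a shared pointer to its data and returns that same pointer from `data()`, so a write through the returned pointer is visible to the existing array object.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy types; the names do not correspond to Arrow's real classes.
struct ToyData {
  int64_t null_count = 0;
  std::vector<bool> values;
};

class ToyArray {
 public:
  explicit ToyArray(std::shared_ptr<ToyData> data) : data_(std::move(data)) {}

  // Hands out the same shared object that the array itself reads from.
  const std::shared_ptr<ToyData>& data() const { return data_; }

  int64_t null_count() const { return data_->null_count; }

 private:
  std::shared_ptr<ToyData> data_;
};
```

Under that assumption, `arr.data()->null_count = -1;` changes what `arr.null_count()` reports, so a test can set the sentinel on the original array without constructing a second one.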
@github-actions crossbow submit -g cpp
Revision: ffb346d Submitted crossbow builds: ursacomputing/crossbow @ actions-fab48da664
…1070) ### Rationale for this change Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case. ### What changes are included in this PR? Use `data->MayHaveNulls()` instead of `data->null_count.load()` ### Are these changes tested? Yes * GitHub Issue: #41016 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 729dcb8. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change
Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?
Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?
Yes