Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41016: [C++] Fix null count check in BooleanArray.true_count() #41070

Merged
merged 4 commits into from
Apr 15, 2024

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 8, 2024

Rationale for this change

Loading the null_count attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

What changes are included in this PR?

Use data->MayHaveNulls() instead of data->null_count.load()

Are these changes tested?

Yes

Copy link

github-actions bot commented Apr 8, 2024

⚠️ GitHub issue #41016 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

cpp/src/arrow/array/array_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Apr 8, 2024
@@ -56,7 +56,7 @@ int64_t BooleanArray::false_count() const {
}

int64_t BooleanArray::true_count() const {
if (data_->null_count.load() != 0) {
if (data_->GetNullCount() != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the problem is that validity buffer might not exists, which might cause segment fault here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (in general, I think we should never use ArrayData.null_count directly, unless in some low level code that deals directly with that, like the implementation of GetNullCount)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will approve after #41070 (comment) fixed

By the way, should this be in 16.0?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge awaiting changes Awaiting changes labels Apr 9, 2024
@jorisvandenbossche
Copy link
Member Author

By the way, should this be in 16.0?

Given it fixes a segfault, might be good to tag as 16.0 yes.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 9, 2024
if (data_->null_count.load() != 0) {
if (data_->GetNullCount() != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use data_->MayHaveNulls() instead. It will do the right thing without triggering the bitmap scan -- that would be extra work that doesn't even speed-up the rest of this function (except for maybe warming up the cache).

https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/data.h#L287-L291

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, so it's because a boolean check always faster than count?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @felipecrv!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, so it's because a boolean check always faster than count?

@mapleFU GetNullCount might scan the entire bitmap if null_count == kUnknownNullCount. It's O(length) while MayHaveNulls() is a O(1) check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that I mean it doesn't need to get the exactly count here, just quick check whether it "MayHasNull"

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 10, 2024
// GH-41016 true_count() with array without validity buffer with null_count of -1
auto data = ArrayFromJSON(boolean(), "[true, false, true]")->data();
data->null_count = -1;
auto arr_unknown_null_count = std::make_shared<BooleanArray>(data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we're redoing this after ArrayFromJSON already returned an array. Can't you just reuse the original array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes of course. For some reason I had in my head I needed to recreate an array object with the changed null count, but I can of course just update the null_count of the the array inplace. Updated.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 11, 2024
@pitrou
Copy link
Member

pitrou commented Apr 15, 2024

@github-actions crossbow submit -g cpp

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 15, 2024
Copy link

Revision: ffb346d

Submitted crossbow builds: ursacomputing/crossbow @ actions-fab48da664

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou pitrou merged commit 729dcb8 into apache:main Apr 15, 2024
33 of 35 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Apr 15, 2024
raulcd pushed a commit that referenced this pull request Apr 15, 2024
…1070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: #41016

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@jorisvandenbossche jorisvandenbossche deleted the gh-41016 branch April 15, 2024 16:25
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 729dcb8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…() (apache#41070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: apache#41016

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…() (apache#41070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: apache#41016

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants