GH-43573: [C++] Copy bitmap when casting from string-view to offset string and binary types #44822

Open
wants to merge 7 commits into
base: main

Contributor

@CrystalZhou0529 CrystalZhou0529 commented Nov 24, 2024

Rationale for this change

Use CopyBitmap to optimize performance in string casting from string-view to offset string.

What changes are included in this PR?

Originally, the bitmap was built by appending one bit at a time, which is slow. Since casting should not change the values in the bitmap, this change takes advantage of CopyBitmap to create the entire bitmap at once.

Then, to create offsets and buffer array, I use TypedBufferBuilder as suggested in the original comment #43302 (comment).
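
To illustrate the bitmap part of this change, here is a minimal standalone sketch that contrasts a per-bit copy with a byte-wise copy of an LSB-first bitmap. It does not use Arrow: `GetBit`, `SetBitTo`, `CopyBitByBit`, and `CopyBitmapBulk` are hypothetical names for this example, and the real `CopyBitmap` may be implemented differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers (not Arrow's API). Bit i lives in byte i / 8 at
// position i % 8, least-significant bit first.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

inline void SetBitTo(uint8_t* bits, int64_t i, bool value) {
  const uint8_t mask = static_cast<uint8_t>(1u << (i % 8));
  if (value) {
    bits[i / 8] |= mask;
  } else {
    bits[i / 8] &= static_cast<uint8_t>(~mask);
  }
}

// Baseline: copy `length` bits starting at `src_offset`, one bit at a time.
std::vector<uint8_t> CopyBitByBit(const uint8_t* src, int64_t src_offset,
                                  int64_t length) {
  std::vector<uint8_t> out((length + 7) / 8, 0);
  for (int64_t i = 0; i < length; ++i) {
    SetBitTo(out.data(), i, GetBit(src, src_offset + i));
  }
  return out;
}

// Bulk version: produce one whole output byte per step.
std::vector<uint8_t> CopyBitmapBulk(const uint8_t* src, int64_t src_offset,
                                    int64_t length) {
  assert(src_offset >= 0 && length >= 0);
  const int64_t num_bytes = (length + 7) / 8;
  std::vector<uint8_t> out(num_bytes, 0);
  if (length == 0) return out;

  const uint8_t* start = src + src_offset / 8;
  const int shift = static_cast<int>(src_offset % 8);
  if (shift == 0) {
    // Byte-aligned source: a plain memcpy is enough.
    std::memcpy(out.data(), start, static_cast<size_t>(num_bytes));
  } else {
    // Unaligned source: stitch each output byte from two adjacent input bytes.
    const int64_t src_bytes = (shift + length + 7) / 8;
    for (int64_t i = 0; i < num_bytes; ++i) {
      const uint8_t lo = static_cast<uint8_t>(start[i] >> shift);
      const uint8_t hi = (i + 1 < src_bytes)
                             ? static_cast<uint8_t>(start[i + 1] << (8 - shift))
                             : uint8_t{0};
      out[i] = static_cast<uint8_t>(lo | hi);
    }
  }
  // Zero the bits past `length` in the last byte.
  const int trailing = static_cast<int>(length % 8);
  if (trailing != 0) {
    out[num_bytes - 1] &= static_cast<uint8_t>((1u << trailing) - 1);
  }
  return out;
}
```

Both functions return the same bytes for any offset and length. The bulk version does a constant amount of work per output byte rather than per bit.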

Are these changes tested?

The original unit tests have passed.

Are there any user-facing changes?

No, the casting behavior should remain unchanged.

closes #43573

⚠️ GitHub issue #43573 has been automatically assigned in GitHub to PR creator.

@CrystalZhou0529 CrystalZhou0529 marked this pull request as draft November 24, 2024 03:11
@github-actions github-actions bot added the awaiting review label Nov 24, 2024
@CrystalZhou0529 CrystalZhou0529 marked this pull request as ready for review November 24, 2024 03:12
cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated)
if (input.offset == output->offset) {
output->buffers[0] = input.GetBuffer(0);
} else {
if (input.buffers[0].data != NULLPTR) {
Member

    // When the offsets are different (e.g., due to slice operation), we need to check if
    // the null bitmap buffer is not null before copying it. The null bitmap buffer can be
    // null if the input array value does not contain any null value.

Do we also need a comment here?

Contributor Author

Added.

Contributor

Can you copy and paste this utility function [1] to this compilation unit and call it from here instead?

[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L42

(Later the utility could be moved to a .h so it's callable from anywhere and inlinable. But I'm suggesting a copy because it's tricky to name this function in an informative and non-error-prone way.)
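
As a rough illustration of the cases such a helper has to cover (no validity buffer, matching offsets, differing offsets), here is a standalone sketch. `Buffer`, `ArrayView`, and `ValidityForOutput` are hypothetical stand-ins invented for this example. They are not the utility linked above, and the linked function may differ in signature and behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in types for illustration only; these are not Arrow classes.
using Buffer = std::vector<uint8_t>;

struct ArrayView {
  std::shared_ptr<Buffer> validity;  // nullptr means the array has no nulls
  int64_t offset = 0;                // logical offset, in elements
  int64_t length = 0;                // number of elements
};

// Hypothetical helper: choose the validity bitmap for an output array whose
// logical offset is `out_offset`.
std::shared_ptr<Buffer> ValidityForOutput(const ArrayView& in,
                                          int64_t out_offset) {
  assert(out_offset >= 0 && in.length >= 0);
  // Case 1: no validity buffer at all, so there is nothing to copy.
  if (in.validity == nullptr) return nullptr;
  // Case 2: same offset, so the existing buffer can be shared as-is.
  if (in.offset == out_offset) return in.validity;
  // Case 3: offsets differ (e.g. after a slice), so the bits must be moved.
  // A per-bit loop keeps this sketch short; a bulk copy would be faster.
  auto out = std::make_shared<Buffer>((out_offset + in.length + 7) / 8, 0);
  for (int64_t i = 0; i < in.length; ++i) {
    const int64_t src = in.offset + i;
    const int64_t dst = out_offset + i;
    const bool valid = ((*in.validity)[src / 8] >> (src % 8)) & 1;
    if (valid) (*out)[dst / 8] |= static_cast<uint8_t>(1u << (dst % 8));
  }
  return out;
}
```

Keeping the three cases in one function means the call site in the cast kernel does not have to repeat the null check.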

cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated)
}

// Set up offset and data buffer
DataBuilder data_builder(ctx->memory_pool());
Member

The previous implementation calculated sum_of_binary_view_sizes and called ReserveData with it. Why doesn't this version reserve the data the same way? Would appending with a capacity check on every call be faster?

Contributor Author

@CrystalZhou0529 CrystalZhou0529 Nov 30, 2024

Sorry, I missed this optimization in the original code. I just added it back. With this change, I am also updating the data_builder.Append call to UnsafeAppend.
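
The reserve-then-unchecked-append pattern discussed here can be sketched roughly as follows. `ByteBuilder` and `ConcatWithSingleReserve` are hypothetical names for this example, not Arrow's builder classes. The sketch only assumes that the total size of all views is computed up front, as the reviewer describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a byte builder; not Arrow's TypedBufferBuilder.
class ByteBuilder {
 public:
  // Make room for `additional` more bytes beyond the current length.
  void Reserve(std::size_t additional) {
    if (length_ + additional > storage_.size()) {
      storage_.resize(length_ + additional);
    }
  }

  // Checked append: performs the capacity check on every call.
  void Append(std::string_view s) {
    Reserve(s.size());
    UnsafeAppend(s);
  }

  // Unchecked append: the caller must have reserved enough space already.
  void UnsafeAppend(std::string_view s) {
    assert(length_ + s.size() <= storage_.size());
    if (!s.empty()) std::memcpy(storage_.data() + length_, s.data(), s.size());
    length_ += s.size();
  }

  std::size_t length() const { return length_; }
  std::string_view view() const {
    return std::string_view(storage_.data(), length_);
  }

 private:
  std::vector<char> storage_;
  std::size_t length_ = 0;
};

// Sum the sizes once, reserve once, then append without per-call checks.
ByteBuilder ConcatWithSingleReserve(const std::vector<std::string_view>& views) {
  std::size_t total = 0;
  for (std::string_view v : views) total += v.size();

  ByteBuilder builder;
  builder.Reserve(total);
  for (std::string_view v : views) builder.UnsafeAppend(v);
  return builder;
}
```

The first pass over the views costs one extra loop, but it replaces a possible reallocation and a capacity check per value with a single allocation.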

@github-actions github-actions bot added awaiting committer review and removed awaiting review labels Nov 24, 2024
@pitrou pitrou requested a review from felipecrv November 27, 2024 13:29
Contributor

@felipecrv felipecrv left a comment

This looks great. Asking for some tweaks.

@github-actions github-actions bot added awaiting changes and removed awaiting committer review labels Dec 10, 2024
Member

@mapleFU mapleFU left a comment

General LGTM, thanks!

RETURN_NOT_OK(VisitArraySpanInline<I>(
batch[0].array,
[&](std::string_view s) {
// for non-null value, append string view to buffer and calculate offset
Member

What does "calculate offset" mean?

Contributor

taking it from the length of the buffer builder after the append
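
A small standalone sketch of that idea: after each value is appended, the next offset is simply the current length of the data buffer, and a null value repeats the previous offset because nothing was appended. `Packed` and `BuildOffsets` are hypothetical names, and `std::optional` stands in for Arrow's valid/null visitor callbacks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result type: offsets plus concatenated data, as in an
// offset-based string layout.
struct Packed {
  std::vector<int32_t> offsets;
  std::string data;
};

Packed BuildOffsets(const std::vector<std::optional<std::string_view>>& values) {
  Packed out;
  out.offsets.reserve(values.size() + 1);
  out.offsets.push_back(0);  // the first value always starts at byte 0
  for (const auto& v : values) {
    // Non-null: append the bytes. Null: append nothing.
    if (v.has_value()) out.data.append(v->data(), v->size());
    // 32-bit offsets cannot address more than INT32_MAX bytes.
    assert(out.data.size() <= static_cast<std::size_t>(INT32_MAX));
    // The end offset is the data length after the (possible) append.
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  }
  return out;
}
```

For the input `"ab"`, null, `"cde"` this yields offsets `0, 2, 2, 5` and data `"abcde"`.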

@felipecrv
Contributor

@mapleFU don't you want to take this to the finish line? Unless @CrystalZhou0529 is available to implement the final changes.

@CrystalZhou0529
Contributor Author

Thanks for reviewing it! Sorry for falling behind on this PR. I will implement the final changes now.

@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Feb 7, 2025
@CrystalZhou0529
Contributor Author

Hi @felipecrv @mapleFU, I have just committed the requested changes. Please take another look and let me know if it looks reasonable!

Member

@mapleFU mapleFU left a comment

General LGTM!

return in_array.GetBuffer(0);
}

// If a non-zero offset, we need to shift the bitmap
Member

@felipecrv Just curious: could this use a sliced buffer instead when offset % 8 == 0?

Successfully merging this pull request may close these issues.

[C++] Copy bitmap all at once when casting from string-view to offset string and binary types
3 participants