Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43573: [C++] Copy bitmap when casting from string-view to offset string and binary types #44822

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
58 changes: 36 additions & 22 deletions cpp/src/arrow/compute/kernels/scalar_cast_string.cc
Original file line number Diff line number Diff line change
Expand Up @@ -314,7 +314,9 @@ BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* ou
template <typename O, typename I>
enable_if_t<is_binary_view_like_type<I>::value && is_base_binary_type<O>::value, Status>
BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
using OutputBuilderType = typename TypeTraits<O>::BuilderType;
using offset_type = typename O::offset_type;
using DataBuilder = TypedBufferBuilder<uint8_t>;
using OffsetBuilder = TypedBufferBuilder<offset_type>;
const CastOptions& options = checked_cast<const CastState&>(*ctx->state()).options;
const ArraySpan& input = batch[0].array;

Expand All @@ -327,31 +329,43 @@ BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* ou
}
}

const int64_t sum_of_binary_view_sizes = util::SumOfBinaryViewSizes(
input.GetValues<BinaryViewType::c_type>(1), input.length);

// TODO(GH-43573): A more efficient implementation that copies the validity
// bitmap all at once is possible, but would mean we don't delegate all the
// building logic to the ArrayBuilder implementation for the output type.
OutputBuilderType builder(options.to_type.GetSharedPtr(), ctx->memory_pool());
RETURN_NOT_OK(builder.Resize(input.length));
RETURN_NOT_OK(builder.ReserveData(sum_of_binary_view_sizes));
arrow::internal::ArraySpanInlineVisitor<I> visitor;
RETURN_NOT_OK(visitor.VisitStatus(
input,
[&](std::string_view v) {
// Append valid string view
return builder.Append(v);
ArrayData* output = out->array_data().get();
output->length = input.length;
output->SetNullCount(input.null_count);

// Set up bitmap
CrystalZhou0529 marked this conversation as resolved.
Show resolved Hide resolved
if (input.offset == output->offset) {
output->buffers[0] = input.GetBuffer(0);
} else {
if (input.buffers[0].data != NULLPTR) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    // When the offsets are different (e.g., due to slice operation), we need to check if
    // the null bitmap buffer is not null before copying it. The null bitmap buffer can be
    // null if the input array value does not contain any null value.

Do we also need a comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you copy and paste this utility function [1] to this compilation unit and call it from here instead?

[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L42

(Later the utility could be moved to a .h so it's callable from anywhere and inlinable. But I'm suggesting a copy because it's tricky to name this function in an informative and non-error-prone way.)

ARROW_ASSIGN_OR_RAISE(
output->buffers[0],
arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
input.offset, input.length));
}
}

// Set up offset and data buffer
DataBuilder data_builder(ctx->memory_pool());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation caculate a sum_of_binary_view_sizes, and ReserveData for it. Why did here doesn't use same way to reserve data? Would blindly check append the buffer being faster?

Copy link
Contributor Author

@CrystalZhou0529 CrystalZhou0529 Nov 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this optimization in the original code. Just added it back. With this change, I am also updating the data_builder.Append call to UnsafeAppend

OffsetBuilder offset_builder(ctx->memory_pool());
RETURN_NOT_OK(offset_builder.Reserve(batch.length + 1));
CrystalZhou0529 marked this conversation as resolved.
Show resolved Hide resolved
offset_builder.UnsafeAppend(0); // offsets start at 0
RETURN_NOT_OK(VisitArraySpanInline<I>(
batch[0].array,
CrystalZhou0529 marked this conversation as resolved.
Show resolved Hide resolved
[&](std::string_view s) {
// for non-null value, append string view to buffer and calculate offset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does calculate offset means?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taking it from the length of the buffer builder after the append

ARROW_RETURN_NOT_OK(data_builder.Append(
reinterpret_cast<const uint8_t*>(s.data()), static_cast<int64_t>(s.size())));
offset_builder.UnsafeAppend(static_cast<offset_type>(data_builder.length()));
return Status::OK();
},
[&]() {
// Append null
builder.UnsafeAppendNull();
// for null value, no need to update buffer
CrystalZhou0529 marked this conversation as resolved.
Show resolved Hide resolved
offset_builder.UnsafeAppend(static_cast<offset_type>(data_builder.length()));
return Status::OK();
}));

std::shared_ptr<ArrayData> output_array;
RETURN_NOT_OK(builder.FinishInternal(&output_array));
out->value = std::move(output_array);
RETURN_NOT_OK(data_builder.Finish(&output->buffers[2]));
RETURN_NOT_OK(offset_builder.Finish(&output->buffers[1]));
return Status::OK();
}

Expand Down
Loading