GH-39332: [C++] Explicit error in ExecBatchBuilder when appending var length data exceeds offset limit (int32 max) #39383
Conversation
Thanks for the PR @zanmato1984. I would ask you to rebase this once #39585 is merged.
Force-pushed from 4acb497 to e972f3e.
Thanks for looking. Rebase done.
cpp/src/arrow/compute/light_array.cc (Outdated)
int64_t sum = num_rows_before == 0 ? 0 : offsets[num_rows_before];
Visit(source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
        offsets[num_rows_before + i] = num_bytes;
Is there a possibility for `num_bytes` to be greater than `std::numeric_limits<int32_t>::max()`? Perhaps we should change `Visit` to pass an `int32_t num_bytes` instead.
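As a minimal sketch of the suggested change, based on the lambda shape quoted above (not the full `Visit` implementation):

```cpp
// Current: the callback receives an unsigned 32-bit length.
Visit(source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
        offsets[num_rows_before + i] = num_bytes;
      });

// Suggested: pass a signed 32-bit length instead, matching Arrow's
// binary offset type, so the int32 range is enforced at the interface.
Visit(source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, int32_t num_bytes) {
        offsets[num_rows_before + i] = num_bytes;
      });
```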
If `source` is a well-constructed array, then `num_bytes` shouldn't be greater than int32 max. But yeah, changing `num_bytes` to `int32_t` will be more consistent with other parts. I'll change it.
Done.
cpp/src/arrow/compute/light_array.cc (Outdated)
for (int i = 0; i < num_rows_to_append; ++i) {
  int32_t length = offsets[num_rows_before + i];
For the record, I find it amusing that we're looping over the offsets a second time here. Perhaps `Visit` for variable-length types isn't useful?
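For readers following along, the two-pass pattern being questioned looks roughly like this (a simplified sketch assembled from the hunks quoted in this review, not the exact implementation):

```cpp
// The running sum starts from the current end offset of the buffer.
int64_t sum = num_rows_before == 0 ? 0 : offsets[num_rows_before];

// Pass 1: Visit stores each appended row's length in its offsets slot.
Visit(source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, int32_t num_bytes) {
        offsets[num_rows_before + i] = num_bytes;
      });

// Pass 2: a second loop converts the stored lengths into absolute
// offsets by accumulating the running sum.
for (int i = 0; i < num_rows_to_append; ++i) {
  int32_t length = offsets[num_rows_before + i];
  offsets[num_rows_before + i] = static_cast<int32_t>(sum);
  sum += length;
}
```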
I'm sorry I don't quite follow. Could you elaborate more on this? Thanks.
cpp/src/arrow/compute/light_array.cc (Outdated)
offsets[num_rows_before + i] = static_cast<int32_t>(sum);
if (ARROW_PREDICT_FALSE(sum + length > BinaryBuilder::memory_limit())) {
This is correct, but an alternative would be to use `AddWithOverflow` instead of a 64-bit addition. Both will probably have the same performance, of course.
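For reference, a sketch of that alternative, assuming `arrow::internal::AddWithOverflow` from `arrow/util/int_util_overflow.h` (it returns `true` when the addition overflows):

```cpp
#include "arrow/status.h"
#include "arrow/util/int_util_overflow.h"
#include "arrow/util/macros.h"

// Accumulate a length into a running 32-bit offset, failing on overflow
// instead of comparing a 64-bit sum against the limit.
arrow::Status AccumulateOffset(int32_t* sum, int32_t length) {
  if (ARROW_PREDICT_FALSE(arrow::internal::AddWithOverflow(*sum, length, sum))) {
    return arrow::Status::Invalid("Offset overflow in ExecBatchBuilder");
  }
  return arrow::Status::OK();
}
```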
Good to know there is such a handy function!
Done.
cpp/src/arrow/compute/light_array.cc (Outdated)
@@ -19,7 +19,9 @@
 #include <type_traits>
+
+#include "arrow/array/builder_binary.h"
It's a bit overkill to include this just for the binary size limit. You can instead use `std::numeric_limits<int32_t>::max()`.
Well, there is a pitfall about `BaseBinaryBuilder::memory_limit()`: it's `std::numeric_limits<int32_t>::max() - 1`. I only noticed it when my initial test failed. So I thought it might be necessary to keep using `BaseBinaryBuilder::memory_limit()` to avoid the potential mismatch. Does that make sense?
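To spell out the mismatch being described (the limit value is taken from this discussion; the actual definition lives in `arrow/array/builder_binary.h`):

```cpp
#include <cstdint>
#include <limits>

// Per the discussion: BaseBinaryBuilder::memory_limit() is int32 max - 1.
constexpr std::int64_t kInt32Max = std::numeric_limits<std::int32_t>::max();
constexpr std::int64_t kBuilderLimit = kInt32Max - 1;  // 2GB - 2B

// A size of exactly int32 max would pass a check against int32 max but
// exceed the builder's limit by one byte: the potential mismatch.
static_assert(kInt32Max > kBuilderLimit);
```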
Hmm, it might be an artifact of how `BinaryBuilder` is implemented, but it is not being used here.
OK, I got the point. They are indeed not necessarily related. I'll update.
Done.
auto num_rows = std::numeric_limits<int32_t>::max() / eight_mb;
StringBuilder values_ref_builder;
ASSERT_OK(values_ref_builder.Reserve(num_rows + 1));
You could also call `ReserveData` to avoid cascading resizes to grow the data buffer.
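A sketch of the suggestion, using the builder from the quoted test (`total_data_bytes` is a hypothetical name for the summed string lengths):

```cpp
StringBuilder values_ref_builder;
// Reserve slots for all appended strings...
ASSERT_OK(values_ref_builder.Reserve(num_rows + 1));
// ...and pre-allocate the value data buffer in one go, so repeated
// Append calls don't trigger cascading resizes of the data buffer.
ASSERT_OK(values_ref_builder.ReserveData(total_data_bytes));
```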
Good to know! Thanks for pointing this out. I'll change it.
Done.
@@ -407,6 +408,63 @@ TEST(ExecBatchBuilder, AppendValuesBeyondLimit) {
   ASSERT_EQ(0, pool->bytes_allocated());
 }
 
+TEST(ExecBatchBuilder, AppendVarLengthBeyondLimit) {
Can you add a comment referring to the GH issue?
Sure, will do.
Done.
std::string str_8mb(eight_mb, 'a');
std::string str_8mb_minus_1(eight_mb - 1, 'b');
std::string str_8mb_minus_2(eight_mb - 2,
                            'b');  // BaseBinaryBuilder::memory_limit()
I'm not sure what this string has to do with `BaseBinaryBuilder::memory_limit()`?
I was trying to use this string as the last element of an array that occupies exactly `memory_limit()` bytes. Unfortunately it is not a good comment. My bad. I'll change it.
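For context, the arithmetic behind that intent (a hedged sketch; `kEightMB` mirrors the test's `eight_mb` constant):

```cpp
#include <cstdint>
#include <limits>

// 255 strings of 8MB plus one string of (8MB - 2) bytes total exactly
// int32 max - 1 bytes, i.e. BaseBinaryBuilder::memory_limit().
constexpr std::int64_t kEightMB = 8 * 1024 * 1024;  // 8388608
constexpr std::int64_t kNumRows =
    std::numeric_limits<std::int32_t>::max() / kEightMB;  // 255
static_assert(kNumRows * kEightMB + (kEightMB - 2) ==
              std::numeric_limits<std::int32_t>::max() - 1);
```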
Done.
ExecBatchBuilder builder;
ASSERT_OK(builder.AppendSelected(pool, batch_8mb, num_rows, first_set_row_ids.data(),
                                 /*num_cols=*/1));
ASSERT_RAISES(Invalid, builder.AppendSelected(pool, batch_8mb_minus_1, 1,
You could perhaps use `ASSERT_RAISES_WITH_MESSAGE` to make sure that the error message is as expected.
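A sketch of the suggested assertion; the expected message text here is illustrative, not the exact error string raised by the builder, and `row_ids` stands in for whatever selection vector the test uses at that call site:

```cpp
// ASSERT_RAISES_WITH_MESSAGE checks the full status string, so the
// expectation includes the "Invalid: " prefix.
ASSERT_RAISES_WITH_MESSAGE(
    Invalid, "Invalid: <expected error message here>",
    builder.AppendSelected(pool, batch_8mb_minus_1, 1, row_ids.data(),
                           /*num_cols=*/1));
```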
Good to know! I'll change it.
Done.
ASSERT_OK(values_ref_builder.Append(str_8mb_minus_2));
ASSERT_OK_AND_ASSIGN(auto values_ref, values_ref_builder.Finish());
auto values_ref_1 = values_ref->Slice(0, num_rows);
ExecBatch batch_ref_1({values_ref_1}, num_rows);
This one isn't used, is it?
Sorry, my bad. Thanks for pointing this out.
Done.
Thank you @pitrou so much for the comprehensive review comments! Learned a lot. I will make the change later.
@@ -480,7 +480,7 @@ void ExecBatchBuilder::Visit(const std::shared_ptr<ArrayData>& column, int num_r
   const uint8_t* field_ptr =
       column->buffers[1]->data() +
       (column->offset + row_id) * static_cast<int64_t>(metadata.fixed_length);
-  process_value_fn(i, field_ptr, metadata.fixed_length);
+  process_value_fn(i, field_ptr, static_cast<int32_t>(metadata.fixed_length));
I wanted to change the type of `metadata.fixed_length` to `int32_t`, but that would bring in a large amount of related changes, overwhelming for this small PR. So I tend to leave it as is and do a simple cast here. Changing `fixed_length` of `RowTableMetadata` and `KeyColumnMetaData` to a signed type could be a future enhancement.
> Changing `fixed_length` of `RowTableMetadata` and `KeyColumnMetaData` to a signed type could be a future enhancement.
Cool, thank you!
A couple of compiler errors appeared on CI, can you take a look?

A namespace resolution issue. Fixed.

@github-actions crossbow submit -g cpp
…ffset limit (int32 max)
Co-authored-by: Antoine Pitrou <[email protected]>
I saw the test fail with OOM in this run because the test uses about 2-3GB of RAM. Will this be an issue?
It's not an OOM, it's a 32-bit build that cannot go past 2GB of memory per process.
@github-actions crossbow submit -g cpp

Got it. And thanks for the modification to the test code.

Revision: 34d2d41

Submitted crossbow builds: ursacomputing/crossbow @ actions-9312816a35
Ok, the remaining CI failure is unrelated, so I'll merge. Thanks again for this @zanmato1984!

@raulcd This is another candidate for 15.0.0 if you cut another RC.

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a2aa1c4. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

When appending var-length data in `ExecBatchBuilder`, the offset can overflow if the batch contains 4GB of data or more. This may further result in a segmentation fault during the subsequent copying of the data content. For details, please refer to this comment: #39332 (comment).

The solution is for the user to switch to the "large" counterpart data type to avoid the overflow, but we still need an explicit error when such an overflow happens.
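As an illustration of the failure mode (hypothetical sizes, not the builder's actual code path): with 32-bit offsets, appending 4GB of values wraps the running offset back to zero, after which copying through the bogus offsets can fault.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical: 1024 values of 4MB each, i.e. 4GB in total.
  std::uint32_t offset = 0;
  const std::uint32_t value_size = 4 * 1024 * 1024;
  for (int i = 0; i < 1024; ++i) {
    offset += value_size;  // silently wraps modulo 2^32
  }
  std::cout << offset << std::endl;  // prints 0, not 4294967296
}
```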
### What changes are included in this PR?

1. Detect the offset overflow when appending data in `ExecBatchBuilder` and raise an explicit error.
2. Change the offset type from `uint32_t` to `int32_t` in `ExecBatchBuilder`, and respect `BinaryBuilder::memory_limit()` (which is `2GB - 2B`) like the rest of the codebase.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

No.

* Closes: #39332