
GH-39332: [C++] Explicit error in ExecBatchBuilder when appending var length data exceeds offset limit (int32 max) #39383

Merged
merged 8 commits into apache:main on Jan 18, 2024

Conversation

zanmato1984
Contributor

@zanmato1984 zanmato1984 commented Dec 28, 2023

Rationale for this change

When appending var length data in ExecBatchBuilder, the offset may overflow if the batch contains 4GB of data or more. This can then cause a segmentation fault during the subsequent copy of the data contents. For details, please refer to this comment: #39332 (comment).

The solution is for the user to switch to the "large" counterpart data type to avoid the overflow, but we should raise an explicit error when such an overflow happens.

What changes are included in this PR?

  1. Detect the offset overflow when appending data in ExecBatchBuilder and raise an explicit error (a sketch of such a check follows this list).
  2. Change the offset type in ExecBatchBuilder from uint32_t to int32_t, and respect BinaryBuilder::memory_limit() (2GB - 2B) like the rest of the codebase.
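
A minimal, self-contained sketch of the check described in item 1, assuming a plain int32_t offsets buffer; the helper name, parameters, and error message below are illustrative, not the actual ExecBatchBuilder code:

```cpp
#include <cstdint>
#include <limits>

#include "arrow/status.h"

// Hypothetical helper: accumulate var-length element sizes into 32-bit
// offsets, failing loudly instead of silently wrapping around.
arrow::Status AppendVarLengthOffsets(int32_t* offsets, int num_rows_before,
                                     int num_rows_to_append,
                                     const int32_t* lengths) {
  // Keep the running sum in 64 bits so the comparison itself cannot overflow.
  int64_t sum = num_rows_before == 0 ? 0 : offsets[num_rows_before];
  constexpr int64_t kOffsetLimit = std::numeric_limits<int32_t>::max();
  for (int i = 0; i < num_rows_to_append; ++i) {
    if (sum + lengths[i] > kOffsetLimit) {
      return arrow::Status::Invalid(
          "Appended var length data exceeds the int32 offset limit; "
          "consider using the \"large\" variant of the type");
    }
    offsets[num_rows_before + i] = static_cast<int32_t>(sum);
    sum += lengths[i];
  }
  offsets[num_rows_before + num_rows_to_append] = static_cast<int32_t>(sum);
  return arrow::Status::OK();
}
```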

Are these changes tested?

UT included.

Are there any user-facing changes?

No.


⚠️ GitHub issue #39332 has been automatically assigned in GitHub to PR creator.

@zanmato1984 zanmato1984 changed the title GH-39332: [C++] Give explicit error in ExecBatchBuilder when appending data exceeds offset limit (int32 max) GH-39332: [C++] Explicit error in ExecBatchBuilder when appending data exceeds offset limit (int32 max) Dec 28, 2023
@zanmato1984 zanmato1984 changed the title GH-39332: [C++] Explicit error in ExecBatchBuilder when appending data exceeds offset limit (int32 max) GH-39332: [C++] Explicit error in ExecBatchBuilder when appending var length data exceeds offset limit (int32 max) Jan 15, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review January 15, 2024 08:50
@pitrou
Member

pitrou commented Jan 17, 2024

Thanks for the PR @zanmato1984 . I would ask you to rebase this once #39585 is merged.

@zanmato1984
Contributor Author

> Thanks for the PR @zanmato1984 . I would ask you to rebase this once #39585 is merged.

Thanks for looking. Rebase done.

int64_t sum = num_rows_before == 0 ? 0 : offsets[num_rows_before];
Visit(source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
        offsets[num_rows_before + i] = num_bytes;
Member

Is there a possibility for num_bytes to be greater than std::numeric_limits<int32_t>::max()?

Perhaps we should change Visit to pass a int32_t num_bytes instead.

Contributor Author

If source is a well-constructed array, then num_bytes shouldn't be greater than int32 max.

But yeah, changing num_bytes to int32_t will be more consistent with the rest of the code. I'll change it.

Contributor Author

Done.

Comment on lines 586 to 587
for (int i = 0; i < num_rows_to_append; ++i) {
  int32_t length = offsets[num_rows_before + i];
Member

For the record, I find it amusing that we're looping over the offsets a second time here. Perhaps Visit for variable-length types isn't useful?

Contributor Author

I'm sorry, I don't quite follow. Could you elaborate on this? Thanks.

Comment on lines 588 to 589
offsets[num_rows_before + i] = static_cast<int32_t>(sum);
if (ARROW_PREDICT_FALSE(sum + length > BinaryBuilder::memory_limit())) {
Member

This is correct, but an alternative would be to use AddWithOverflow instead of a 64-bit addition. Both will probably have the same performance, of course.
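
For reference, a standalone sketch of the AddWithOverflow alternative mentioned above, assuming the helper in arrow/util/int_util_overflow.h (arrow::internal::AddWithOverflow); the values are made up:

```cpp
#include <cstdint>
#include <iostream>

#include "arrow/util/int_util_overflow.h"

int main() {
  // AddWithOverflow returns true when the 32-bit addition would overflow.
  int32_t sum = 2000000000;
  int32_t length = 500000000;
  int32_t new_sum = 0;
  if (arrow::internal::AddWithOverflow(sum, length, &new_sum)) {
    std::cout << "offset overflow detected, raise an explicit error\n";
  } else {
    sum = new_sum;
  }
  return 0;
}
```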

Contributor Author

@zanmato1984 zanmato1984 Jan 17, 2024

Good to know there is such a handy function!

Contributor Author

Done.

@@ -19,7 +19,9 @@

#include <type_traits>

#include "arrow/array/builder_binary.h"
Member

It's a bit overkill to include this just for the binary size limit. You can instead use std::numeric_limits<int32_t>::max().

Contributor Author

Well, there is a pitfall about BaseBinaryBuilder::memory_limit() - it's std::numeric_limits<int32_t>::max() - 1. I only noticed this when my initial test failed. So I thought it might be necessary to keep using BaseBinaryBuilder::memory_limit() to avoid a potential mismatch. Does that make sense?

Member

Hmm, it might be an artifact of how BinaryBuilder is implemented, but it is not being used here.

Contributor Author

OK, I got the point. They are indeed not necessarily related. I'll update.

Contributor Author

Done.


auto num_rows = std::numeric_limits<int32_t>::max() / eight_mb;
StringBuilder values_ref_builder;
ASSERT_OK(values_ref_builder.Reserve(num_rows + 1));
Member

You could also call ReserveData to avoid cascading resizes to grow the data buffer.
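
A minimal sketch of such pre-allocation, assuming the same 8 MB strings as the test; the function name and exact sizes are illustrative, not the merged test code:

```cpp
#include <cstdint>
#include <limits>
#include <memory>
#include <string>

#include "arrow/api.h"

arrow::Result<std::shared_ptr<arrow::Array>> BuildLargeStringColumn() {
  constexpr int64_t eight_mb = 8 * 1024 * 1024;
  const int64_t num_rows = std::numeric_limits<int32_t>::max() / eight_mb;  // 255
  arrow::StringBuilder builder;
  // Reserve the offsets (one slot per row) and the data buffer up front,
  // so appending ~2 GB of strings avoids cascading buffer resizes.
  ARROW_RETURN_NOT_OK(builder.Reserve(num_rows));
  ARROW_RETURN_NOT_OK(builder.ReserveData(num_rows * eight_mb));
  const std::string str_8mb(eight_mb, 'a');
  for (int64_t i = 0; i < num_rows; ++i) {
    ARROW_RETURN_NOT_OK(builder.Append(str_8mb));
  }
  return builder.Finish();
}
```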

Contributor Author

Good to know! Thanks for pointing this out. I'll change it.

Contributor Author

Done.

@@ -407,6 +408,63 @@ TEST(ExecBatchBuilder, AppendValuesBeyondLimit) {
ASSERT_EQ(0, pool->bytes_allocated());
}

TEST(ExecBatchBuilder, AppendVarLengthBeyondLimit) {
Member

Can you add a comment referring to the GH issue?

Contributor Author

Sure, will do.

Contributor Author

Done.

std::string str_8mb(eight_mb, 'a');
std::string str_8mb_minus_1(eight_mb - 1, 'b');
std::string str_8mb_minus_2(eight_mb - 2,
                            'b');  // BaseBinaryBuilder::memory_limit()
Member

I'm not sure what this string has to do with BaseBinaryBuilder::memory_limit()?

Contributor Author

I was trying to make this string the last element of an array that occupies exactly memory_limit() bytes (255 strings of 8 MB plus one of 8 MB - 2 bytes add up to 2^31 - 2 = memory_limit()). Unfortunately the comment doesn't say that, my bad. I'll change it.

Contributor Author

Done.

ExecBatchBuilder builder;
ASSERT_OK(builder.AppendSelected(pool, batch_8mb, num_rows, first_set_row_ids.data(),
                                 /*num_cols=*/1));
ASSERT_RAISES(Invalid, builder.AppendSelected(pool, batch_8mb_minus_1, 1,
Member

You could perhaps use ASSERT_RAISES_WITH_MESSAGE to make sure that the error message is as expected.
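
An illustrative, self-contained use of that macro, assuming arrow/testing/gtest_util.h; the status and message here are made up rather than the actual error raised by AppendSelected:

```cpp
#include <gtest/gtest.h>

#include "arrow/status.h"
#include "arrow/testing/gtest_util.h"

TEST(ExecBatchBuilderSketch, ErrorMessageIsChecked) {
  auto fail = []() -> arrow::Status {
    return arrow::Status::Invalid("offset overflow in ExecBatchBuilder");
  };
  // The expected string includes the status code prefix ("Invalid: ...").
  ASSERT_RAISES_WITH_MESSAGE(Invalid, "Invalid: offset overflow in ExecBatchBuilder",
                             fail());
}
```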

Contributor Author

Good to know! I'll change it.

Contributor Author

Done.

ASSERT_OK(values_ref_builder.Append(str_8mb_minus_2));
ASSERT_OK_AND_ASSIGN(auto values_ref, values_ref_builder.Finish());
auto values_ref_1 = values_ref->Slice(0, num_rows);
ExecBatch batch_ref_1({values_ref_1}, num_rows);
Member

This one isn't used, is it?

Contributor Author

Sorry, my bad. Thanks for pointing this out.

Contributor Author

Done.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 17, 2024
@zanmato1984
Contributor Author

Thank you @pitrou so much for the comprehensive review comments! Learned a lot. I will make the change later.

@@ -480,7 +480,7 @@ void ExecBatchBuilder::Visit(const std::shared_ptr<ArrayData>& column, int num_r
   const uint8_t* field_ptr =
       column->buffers[1]->data() +
       (column->offset + row_id) * static_cast<int64_t>(metadata.fixed_length);
-  process_value_fn(i, field_ptr, metadata.fixed_length);
+  process_value_fn(i, field_ptr, static_cast<int32_t>(metadata.fixed_length));
Contributor Author

I wanted to change the type of metadata.fixed_length to int32_t, but that would bring a large amount of related changes, overwhelming for this small PR. So I tend to leave it as is and do a simple cast here. Changing fixed_length of RowTableMetadata and KeyColumnMetaData to a signed type could be a future enhancement.

Member

> Changing fixed_length of RowTableMetadata and KeyColumnMetaData to a signed type could be a future enhancement.

Cool, thank you!

@pitrou
Member

pitrou commented Jan 18, 2024

A couple compiler errors appeared on CI, can you take a look?

@zanmato1984
Contributor Author

> A couple compiler errors appeared on CI, can you take a look?

A namespace resolving issue. Fixed.

@pitrou
Member

pitrou commented Jan 18, 2024

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

@zanmato1984
Contributor Author

I saw the test fail with an OOM in this run because the test uses about 2~3 GB of RAM. Will this be an issue?

@pitrou
Member

pitrou commented Jan 18, 2024

It's not an OOM, it's a 32-bit build that cannot go past 2GB of memory per process.

@pitrou
Member

pitrou commented Jan 18, 2024

@github-actions crossbow submit -g cpp

@zanmato1984
Contributor Author

> It's not an OOM, it's a 32-bit build that cannot go past 2GB of memory per process.

Got it. And thanks for the modification to the test code.


Revision: 34d2d41

Submitted crossbow builds: ursacomputing/crossbow @ actions-9312816a35

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-38-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@pitrou
Member

pitrou commented Jan 18, 2024

Ok, the remaining CI failure is unrelated, so I'll merge. Thanks again for this @zanmato1984 !

@pitrou pitrou merged commit a2aa1c4 into apache:main Jan 18, 2024
@pitrou pitrou removed the awaiting committer review label Jan 18, 2024
@pitrou
Member

pitrou commented Jan 18, 2024

@raulcd This is another candidate for 15.0.0 if you cut another RC.


After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a2aa1c4.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
raulcd pushed a commit that referenced this pull request Feb 20, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024

Successfully merging this pull request may close these issues.

SIGSEGV during join on large table