feat: Support row group skip for Parquet decimal #11646

Open — wants to merge 1 commit into base: main from wip_row_group
Conversation

@rui-mo (Collaborator) commented Nov 25, 2024

Retrieves the minimum and maximum decimal values from the Parquet column chunk
statistics based on the specified physical type and precision. Supports the three
physical types required by Arrow: INT32, INT64, and FIXED_LEN_BYTE_ARRAY.
Supports row group skipping based on the decimal statistics.
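For context, Parquet stores a decimal's statistics as its unscaled integer in the column's physical type, and FIXED_LEN_BYTE_ARRAY values are big-endian two's complement. A minimal sketch of that decoding (the helper name is illustrative, not the PR's actual function):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative decoder: interpret a big-endian two's-complement byte string
// (Parquet's FIXED_LEN_BYTE_ARRAY decimal encoding) as an int128 unscaled value.
__int128 decodeFixedLenDecimal(const std::string& bytes) {
  // Start from all-ones when the sign bit of the first byte is set,
  // so payloads shorter than 16 bytes are correctly sign-extended.
  __int128 value = (static_cast<unsigned char>(bytes[0]) & 0x80) ? -1 : 0;
  for (char c : bytes) {
    value = (value << 8) | static_cast<unsigned char>(c);
  }
  return value;
}
```

INT32 and INT64 statistics need no such decoding; their min/max can be read directly as the unscaled value.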

@rui-mo rui-mo requested a review from majetideepak as a code owner November 25, 2024 08:40

if (max == std::numeric_limits<T>::max()) {
return std::make_unique<velox::common::AlwaysFalse>();
}
max += 1;
rui-mo (Collaborator, Author):

This change is needed to make 'testInt128Range' return false for [0, max] so the row group can be skipped.
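The guard matters because naively computing `max + 1` overflows when `max` is already the largest representable value. A generic sketch of the pattern (not the PR's exact code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Widen an inclusive upper bound to an exclusive one, refusing when the
// increment would overflow; the caller then falls back to a special case
// (in the PR, returning an AlwaysFalse filter).
template <typename T>
std::optional<T> exclusiveUpperBound(T inclusiveMax) {
  if (inclusiveMax == std::numeric_limits<T>::max()) {
    return std::nullopt;  // max + 1 would wrap around
  }
  return inclusiveMax + 1;
}
```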

@rui-mo rui-mo force-pushed the wip_row_group branch 2 times, most recently from 93b2a1e to 98e1932 Compare November 25, 2024 08:55
@majetideepak (Collaborator) left a comment:

@rui-mo some comments. Thanks.

@@ -202,6 +202,50 @@ bool testIntFilter(
return true;
}

template <typename T>
bool testDecimalFilter(
Collaborator:
Can we combine this with testIntFilter above? We could then define testBigIntFilter = testIntFilter<int64_t> and testHugeIntFilter = testIntFilter<int128_t>, and use them below inside testFilter().

rui-mo (Collaborator, Author):
Thanks for the suggestion. I combined it with testIntFilter. Since testIntFilter<int64_t> and testIntFilter<int128_t> are each used only once, I changed the code to call them directly.
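A sketch of the shape the helper converged on — one template instantiated directly for both widths. The signature is simplified here; the real function takes Velox filter and statistics objects:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for the templated testIntFilter: a row group may
// contain matching rows iff the filter range overlaps the stats range.
template <typename T>
bool testIntFilter(T filterMin, T filterMax, T statsMin, T statsMax) {
  return filterMin <= statsMax && filterMax >= statsMin;
}
```

The bigint path calls testIntFilter<int64_t>(...) and the hugeint path testIntFilter<__int128>(...) directly, since each is used once.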

@@ -18,6 +18,7 @@

#include "velox/dwio/common/Statistics.h"
#include "velox/dwio/common/compression/Compression.h"
#include "velox/dwio/parquet/thrift/ParquetThriftTypes.h"
Collaborator:
The goal of this class is to not expose the thrift types, and thereby the thrift dependency.

rui-mo (Collaborator, Author) commented Nov 28, 2024:
This include was removed. Thanks.

@@ -65,6 +66,9 @@ class ColumnChunkMetaDataPtr {
/// This information is optional and may be 0 if omitted.
int64_t totalUncompressedSize() const;

// The physical type of this column.
thrift::Type::type physicalType() const;
Collaborator:
Do we need this as a class function?

rui-mo (Collaborator, Author):
This API was removed. Thanks.

return testIntFilter(filter, intStats, mayHaveNull);
}
case TypeKind::HUGEINT: {
auto* decimalStats =
Collaborator:
Check that the type is LongDecimal.

@@ -380,6 +380,69 @@ class IntegerColumnStatistics : public virtual ColumnStatistics {
std::optional<int64_t> sum_;
};

// Statistics for decimal columns. T could be int64_t or int128_t.
template <typename T>
class DecimalColumnStatistics : public virtual ColumnStatistics {
Collaborator:
Could we template IntegerColumnStatistics, similar to the filter?

rui-mo (Collaborator, Author):
Thanks for the suggestion. I templated IntegerColumnStatistics to support int128_t.
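A minimal sketch of the templated statistics shape (the field set is reduced for illustration; the real class also carries counts and a sum):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative: one class template covers short (int64_t) and long
// (__int128) decimals instead of a separate DecimalColumnStatistics.
template <typename T>
class IntegerColumnStatistics {
 public:
  IntegerColumnStatistics(std::optional<T> min, std::optional<T> max)
      : min_(min), max_(max) {}

  std::optional<T> getMinimum() const { return min_; }
  std::optional<T> getMaximum() const { return max_; }

 private:
  std::optional<T> min_;
  std::optional<T> max_;
};
```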

@@ -98,7 +98,7 @@ class DataSetBuilder {
if (counter % 100 < repeats) {
numbers->set(row, T(counter % repeats));
} else if (counter % 100 > 90 && row > 0) {
numbers->copy(numbers, row - 1, row, 1);
numbers->copy(numbers, row, row - 1, 1);
Collaborator:
Why do we need this change?

rui-mo (Collaborator, Author):
This is a minor fix for the data builder (#11644) and can be removed after that PR is merged. Thanks.
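The swap follows the argument order of Velox's BaseVector::copy, which takes (source, targetIndex, sourceIndex, count). A toy stand-in with the same order shows why the intended "repeat the previous row" needs copy(numbers, row, row - 1, 1):

```cpp
#include <cassert>
#include <vector>

// Toy vector using the same (source, targetIndex, sourceIndex, count)
// argument order as BaseVector::copy, to illustrate the fix above.
struct ToyVector {
  std::vector<int> data;
  void copy(const ToyVector& source, int targetIndex, int sourceIndex, int count) {
    for (int i = 0; i < count; ++i) {
      data[targetIndex + i] = source.data[sourceIndex + i];
    }
  }
};
```

With data {1, 2, 3} and row = 2, copy(v, row, row - 1, 1) repeats the previous value, yielding {1, 2, 2}; the original copy(v, row - 1, row, 1) instead overwrote the previous row with the current one.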

@@ -380,6 +380,69 @@ class IntegerColumnStatistics : public virtual ColumnStatistics {
std::optional<int64_t> sum_;
};

// Statistics for decimal columns. T could be int64_t or int128_t.
template <typename T>
Contributor:
If T is int64, why not just use IntegerColumnStatistics?

rui-mo (Collaborator, Author) commented Nov 28, 2024:
Updated. IntegerColumnStatistics is now used for both int64_t and int128_t. Thanks.

return true;
}

if (decimalStats->getMinimum().has_value() &&
@Yuhta (Contributor) commented Nov 26, 2024:
Are we sure the values stored in the statistics have the same scale as the schema type? Could one file store values at one scale and another file at a different scale? Where is this scale stored in the file? We should probably rescale the statistics to align with the scale in the schema type before applying the filter.

rui-mo (Collaborator, Author):
In a Parquet file, the schema contains the precision and scale information, and the statistics have the same scale as the schema type. In an ORC file, the stats for a decimal column are stored as a decimal string (link), so a conversion back to the bigint or hugeint of the schema scale is needed. I'm not clear on how decimal stats are stored in DWRF; please kindly point me to more information or resources if any exist.

This PR focuses on Parquet. To avoid affecting other file formats, I added a fileFormat parameter to the testFilter function to ensure the row group skip only takes effect for Parquet. Do you think that makes sense? Thanks.

Contributor:
The problem is that the scale in the Filter objects comes from the table schema type, which is not necessarily the same as the file schema type. So we need to compare the two and rescale if needed.
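The rescaling being asked for is a power-of-ten shift of the unscaled value between the file schema's scale and the table schema's scale. A sketch with a hypothetical helper (not Velox's API), ignoring the overflow and precision-loss handling a real implementation needs:

```cpp
#include <cassert>
#include <cstdint>

// Bring an unscaled decimal value stored at fromScale to toScale.
// Scaling down truncates; real code must detect overflow and lost digits.
__int128 rescale(__int128 unscaled, int fromScale, int toScale) {
  while (fromScale < toScale) {
    unscaled *= 10;  // e.g. 1.23 at scale 2 -> 12300 at scale 4
    ++fromScale;
  }
  while (fromScale > toScale) {
    unscaled /= 10;  // drops digits beyond the target scale
    --fromScale;
  }
  return unscaled;
}
```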

rui-mo (Collaborator, Author) commented Nov 29, 2024:
I understand your point. I tested the schema evolution of the decimal reader in rui-mo@eae60f5 and found the result was incorrect. I suppose we need to support that feature first, and I will try to find the gap. Thanks.

Failed
Expected 20, got 20
20 extra rows, 20 missing rows
10 of extra rows:
0.10001
0.10002
0.10003
0.10004
0.10005
0.10006
0.10007
0.10008
0.10009
0.10010

10 of missing rows:
100.01000
100.02000
100.03000
100.04000
100.05000
100.06000
100.07000
100.08000
100.09000
100.10000

Contributor:

We cannot check in code that would produce incorrect results.

@rui-mo rui-mo force-pushed the wip_row_group branch 3 times, most recently from 63a9cff to 0d472e4 Compare November 28, 2024 09:33