-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support row group skip for Parquet decimal #11646
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
if (max == std::numeric_limits<T>::max()) { | ||
return std::make_unique<velox::common::AlwaysFalse>(); | ||
} | ||
max += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is needed to make 'testInt128Range' return false for [0, max] so the row group can be skipped.
93b2a1e
to
98e1932
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo some comments. Thanks.
velox/dwio/common/ScanSpec.cpp
Outdated
@@ -202,6 +202,50 @@ bool testIntFilter( | |||
return true; | |||
} | |||
|
|||
template <typename T> | |||
bool testDecimalFilter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we combine this with testIntFilter above?
We can then say using testBigIntFilter = testIntFilter<int64_t>, using testHugeIntFilter = testIntFilter<int128_t>
. and use them below inside testFilter(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Combined it with testIntFilter. Since testIntFilter<int64_t>
and testIntFilter<int128_t>
are used only once, changed to call them directly.
velox/dwio/parquet/reader/Metadata.h
Outdated
@@ -18,6 +18,7 @@ | |||
|
|||
#include "velox/dwio/common/Statistics.h" | |||
#include "velox/dwio/common/compression/Compression.h" | |||
#include "velox/dwio/parquet/thrift/ParquetThriftTypes.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal of this class is not to expose the thrift types and there-by the thrift dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This include was removed. Thanks.
velox/dwio/parquet/reader/Metadata.h
Outdated
@@ -65,6 +66,9 @@ class ColumnChunkMetaDataPtr { | |||
/// This information is optional and may be 0 if omitted. | |||
int64_t totalUncompressedSize() const; | |||
|
|||
// The physical type of this column. | |||
thrift::Type::type physicalType() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this as a class function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API was removed. Thanks.
velox/dwio/common/ScanSpec.cpp
Outdated
return testIntFilter(filter, intStats, mayHaveNull); | ||
} | ||
case TypeKind::HUGEINT: { | ||
auto* decimalStats = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check that the type is LongDecimal.
velox/dwio/common/Statistics.h
Outdated
@@ -380,6 +380,69 @@ class IntegerColumnStatistics : public virtual ColumnStatistics { | |||
std::optional<int64_t> sum_; | |||
}; | |||
|
|||
// Statistics for decimal columns. T could be int64_t or int128_t. | |||
template <typename T> | |||
class DecimalColumnStatistics : public virtual ColumnStatistics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Template IntegerColumnStatistics similar to filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Templated IntegerColumnStatistics
to support int128_t.
@@ -98,7 +98,7 @@ class DataSetBuilder { | |||
if (counter % 100 < repeats) { | |||
numbers->set(row, T(counter % repeats)); | |||
} else if (counter % 100 > 90 && row > 0) { | |||
numbers->copy(numbers, row - 1, row, 1); | |||
numbers->copy(numbers, row, row - 1, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a minor fix for the data builder: #11644, and can be removed after the relevant PR got merged. Thanks.
velox/dwio/common/Statistics.h
Outdated
@@ -380,6 +380,69 @@ class IntegerColumnStatistics : public virtual ColumnStatistics { | |||
std::optional<int64_t> sum_; | |||
}; | |||
|
|||
// Statistics for decimal columns. T could be int64_t or int128_t. | |||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If T
is int64, why not just use IntegerColumnStatistics
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. IntegerColumnStatistics
was used instead for both int64_t and int128_t. Thanks.
velox/dwio/common/ScanSpec.cpp
Outdated
return true; | ||
} | ||
|
||
if (decimalStats->getMinimum().has_value() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure the value stored in statistics having the same scale as the schema type? Could one file store values in one scale and another with a different scale? Where is this scale stored in file? We should probably rescale it to align with the scale in schema type before applying the filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In parquet file, the schema contains the information of precision and scale, and the statistics have the same scale as the schema type. In orc file, the stats for decimal column are stored as decimal string (link) so a conversion back to the bigint or hugeint of the schema scale is needed. I'm not clear on how decimal stats are stored in dwrf. Please kindly provide more information or resources if there are.
This PR focuses on parquet and to avoid affecting other file formats, added a fileFormat
parameter in testFilter
function to ensure the row group skip only takes effect for parquet. Do you think it makes sense? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is the scale in Filter
objects are the table schema type, not necessarily the same as file schema type. So we need to do a check between the two and do rescale if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your point. I tested the schema evolution of decimal reader as rui-mo@eae60f5 and found the result was incorrect. I suppose we need to support this feature first and I will attempt to find the gap. Thanks.
Failed
Expected 20, got 20
20 extra rows, 20 missing rows
10 of extra rows:
0.10001
0.10002
0.10003
0.10004
0.10005
0.10006
0.10007
0.10008
0.10009
0.10010
10 of missing rows:
100.01000
100.02000
100.03000
100.04000
100.05000
100.06000
100.07000
100.08000
100.09000
100.10000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot check in code that would produce incorrect result.
63a9cff
to
0d472e4
Compare
Retrieves the minimum and maximum decimal value from Parquet column chunk
statistics based on the specified physical type and precision. Supports three
physical types to meet the requirements of Arrow: INT32, INT64, and
FIXED_LEN_BYTE_ARRAY. Supports row group skip based on the decimal statistics.