feat: Support row group skip for Parquet decimal #11646

Open — wants to merge 1 commit into base: main from wip_row_group
Conversation

@rui-mo (Collaborator) commented Nov 25, 2024

Retrieves the minimum and maximum decimal values from the Parquet column chunk
statistics based on the specified physical type and precision. Supports the three
physical types required by Arrow: INT32, INT64, and FIXED_LEN_BYTE_ARRAY.
Supports row group skipping based on the decimal statistics.
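For context, Parquet stores a decimal's statistics as its unscaled integer in the column's physical type, and FIXED_LEN_BYTE_ARRAY values are big-endian two's complement. A minimal sketch of that decoding (the helper name is illustrative, not the PR's actual function):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative decoder: interpret a big-endian two's-complement byte string
// (Parquet's FIXED_LEN_BYTE_ARRAY decimal encoding) as an int128 unscaled value.
__int128 decodeFixedLenDecimal(const std::string& bytes) {
  // Start from all-ones when the sign bit of the first byte is set,
  // so payloads shorter than 16 bytes are correctly sign-extended.
  __int128 value = (static_cast<unsigned char>(bytes[0]) & 0x80) ? -1 : 0;
  for (char c : bytes) {
    value = (value << 8) | static_cast<unsigned char>(c);
  }
  return value;
}
```

INT32 and INT64 statistics need no such decoding; their min/max can be read directly as the unscaled value.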

@rui-mo rui-mo requested a review from majetideepak as a code owner November 25, 2024 08:40

if (max == std::numeric_limits<T>::max()) {
return std::make_unique<velox::common::AlwaysFalse>();
}
max += 1;
rui-mo (Collaborator, Author):

This change is needed to make 'testInt128Range' return false for [0, max] so the row group can be skipped.
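The guard matters because naively computing `max + 1` overflows when `max` is already the largest representable value. A generic sketch of the pattern (not the PR's exact code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Widen an inclusive upper bound to an exclusive one, refusing when the
// increment would overflow; the caller then falls back to a special case
// (in the PR, returning an AlwaysFalse filter).
template <typename T>
std::optional<T> exclusiveUpperBound(T inclusiveMax) {
  if (inclusiveMax == std::numeric_limits<T>::max()) {
    return std::nullopt;  // max + 1 would wrap around
  }
  return inclusiveMax + 1;
}
```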

@rui-mo rui-mo force-pushed the wip_row_group branch 2 times, most recently from 93b2a1e to 98e1932 Compare November 25, 2024 08:55
@majetideepak (Collaborator) left a comment:

@rui-mo some comments. Thanks.

@@ -202,6 +202,50 @@ bool testIntFilter(
return true;
}

template <typename T>
bool testDecimalFilter(
Collaborator:
Can we combine this with testIntFilter above? We could then define testBigIntFilter = testIntFilter<int64_t> and testHugeIntFilter = testIntFilter<int128_t>, and use them below inside testFilter().

rui-mo (Collaborator, Author):
Thanks for the suggestion. I combined it with testIntFilter. Since testIntFilter<int64_t> and testIntFilter<int128_t> are each used only once, I changed the code to call them directly.
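A sketch of the shape the helper converged on — one template instantiated directly for both widths. The signature is simplified here; the real function takes Velox filter and statistics objects:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for the templated testIntFilter: a row group may
// contain matching rows iff the filter range overlaps the stats range.
template <typename T>
bool testIntFilter(T filterMin, T filterMax, T statsMin, T statsMax) {
  return filterMin <= statsMax && filterMax >= statsMin;
}
```

The bigint path calls testIntFilter<int64_t>(...) and the hugeint path testIntFilter<__int128>(...) directly, since each is used once.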

@@ -18,6 +18,7 @@

#include "velox/dwio/common/Statistics.h"
#include "velox/dwio/common/compression/Compression.h"
#include "velox/dwio/parquet/thrift/ParquetThriftTypes.h"
Collaborator:
The goal of this class is to not expose the thrift types, and thereby the thrift dependency.

rui-mo (Collaborator, Author) commented Nov 28, 2024:
This include was removed. Thanks.

@@ -65,6 +66,9 @@ class ColumnChunkMetaDataPtr {
/// This information is optional and may be 0 if omitted.
int64_t totalUncompressedSize() const;

// The physical type of this column.
thrift::Type::type physicalType() const;
Collaborator:
Do we need this as a class function?

rui-mo (Collaborator, Author):
This API was removed. Thanks.

return testIntFilter(filter, intStats, mayHaveNull);
}
case TypeKind::HUGEINT: {
auto* decimalStats =
Collaborator:
Check that the type is LongDecimal.

@@ -380,6 +380,69 @@ class IntegerColumnStatistics : public virtual ColumnStatistics {
std::optional<int64_t> sum_;
};

// Statistics for decimal columns. T could be int64_t or int128_t.
template <typename T>
class DecimalColumnStatistics : public virtual ColumnStatistics {
Collaborator:
Could we template IntegerColumnStatistics, similar to the filter?

rui-mo (Collaborator, Author):
Thanks for the suggestion. I templated IntegerColumnStatistics to support int128_t.
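A minimal sketch of the templated statistics shape (the field set is reduced for illustration; the real class also carries counts and a sum):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative: one class template covers short (int64_t) and long
// (__int128) decimals instead of a separate DecimalColumnStatistics.
template <typename T>
class IntegerColumnStatistics {
 public:
  IntegerColumnStatistics(std::optional<T> min, std::optional<T> max)
      : min_(min), max_(max) {}

  std::optional<T> getMinimum() const { return min_; }
  std::optional<T> getMaximum() const { return max_; }

 private:
  std::optional<T> min_;
  std::optional<T> max_;
};
```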

@@ -98,7 +98,7 @@ class DataSetBuilder {
if (counter % 100 < repeats) {
numbers->set(row, T(counter % repeats));
} else if (counter % 100 > 90 && row > 0) {
numbers->copy(numbers, row - 1, row, 1);
numbers->copy(numbers, row, row - 1, 1);
Collaborator:
Why do we need this change?

rui-mo (Collaborator, Author):
This is a minor fix for the data builder (#11644) and can be removed after that PR is merged. Thanks.
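The swap follows the argument order of Velox's BaseVector::copy, which takes (source, targetIndex, sourceIndex, count). A toy stand-in with the same order shows why the intended "repeat the previous row" needs copy(numbers, row, row - 1, 1):

```cpp
#include <cassert>
#include <vector>

// Toy vector using the same (source, targetIndex, sourceIndex, count)
// argument order as BaseVector::copy, to illustrate the fix above.
struct ToyVector {
  std::vector<int> data;
  void copy(const ToyVector& source, int targetIndex, int sourceIndex, int count) {
    for (int i = 0; i < count; ++i) {
      data[targetIndex + i] = source.data[sourceIndex + i];
    }
  }
};
```

With data {1, 2, 3} and row = 2, copy(v, row, row - 1, 1) repeats the previous value, yielding {1, 2, 2}; the original copy(v, row - 1, row, 1) instead overwrote the previous row with the current one.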

@@ -380,6 +380,69 @@ class IntegerColumnStatistics : public virtual ColumnStatistics {
std::optional<int64_t> sum_;
};

// Statistics for decimal columns. T could be int64_t or int128_t.
template <typename T>
Contributor:
If T is int64, why not just use IntegerColumnStatistics?

rui-mo (Collaborator, Author) commented Nov 28, 2024:
Updated. IntegerColumnStatistics is now used for both int64_t and int128_t. Thanks.

return true;
}

if (decimalStats->getMinimum().has_value() &&
@Yuhta (Contributor) commented Nov 26, 2024:
Are we sure the values stored in the statistics have the same scale as the schema type? Could one file store values at one scale and another file at a different scale? Where is this scale stored in the file? We should probably rescale the statistics to align with the scale in the schema type before applying the filter.

rui-mo (Collaborator, Author):
In a Parquet file, the schema contains the precision and scale information, and the statistics have the same scale as the schema type. In an ORC file, the stats for a decimal column are stored as a decimal string (link), so a conversion back to the bigint or hugeint of the schema scale is needed. I'm not clear on how decimal stats are stored in DWRF; please kindly point me to more information or resources if any exist.

This PR focuses on Parquet. To avoid affecting other file formats, I added a fileFormat parameter to the testFilter function to ensure the row group skip only takes effect for Parquet. Do you think that makes sense? Thanks.

Contributor:
The problem is that the scale in the Filter objects comes from the table schema type, which is not necessarily the same as the file schema type. So we need to compare the two and rescale if needed.
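The rescaling being asked for is a power-of-ten shift of the unscaled value between the file schema's scale and the table schema's scale. A sketch with a hypothetical helper (not Velox's API), ignoring the overflow and precision-loss handling a real implementation needs:

```cpp
#include <cassert>
#include <cstdint>

// Bring an unscaled decimal value stored at fromScale to toScale.
// Scaling down truncates; real code must detect overflow and lost digits.
__int128 rescale(__int128 unscaled, int fromScale, int toScale) {
  while (fromScale < toScale) {
    unscaled *= 10;  // e.g. 1.23 at scale 2 -> 12300 at scale 4
    ++fromScale;
  }
  while (fromScale > toScale) {
    unscaled /= 10;  // drops digits beyond the target scale
    --fromScale;
  }
  return unscaled;
}
```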

rui-mo (Collaborator, Author) commented Nov 29, 2024:
I understand your point. I tested the schema evolution of the decimal reader in rui-mo@eae60f5 and found the result was incorrect. I suppose we need to support that feature first, and I will try to find the gap. Thanks.

Failed
Expected 20, got 20
20 extra rows, 20 missing rows
10 of extra rows:
0.10001
0.10002
0.10003
0.10004
0.10005
0.10006
0.10007
0.10008
0.10009
0.10010

10 of missing rows:
100.01000
100.02000
100.03000
100.04000
100.05000
100.06000
100.07000
100.08000
100.09000
100.10000

Contributor:

We cannot check in code that would produce incorrect results.

@rui-mo rui-mo force-pushed the wip_row_group branch 3 times, most recently from 63a9cff to 0d472e4 Compare November 28, 2024 09:33