
feat(parquet): Add config for datapage version #11151

Open · wants to merge 1 commit into base: main
Conversation

@svm1 (Collaborator) commented Oct 2, 2024

Add the config and session properties hive.parquet.writer.datapage-version and hive.parquet.writer.datapage_version to determine the Parquet writer datapage version (V1 or V2). Defaults to V1.

@facebook-github-bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 2, 2024
netlify bot commented Oct 2, 2024: Deploy Preview for meta-velox canceled. Latest commit 485b404; latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67a4a15c4ca27d0008963292

@svm1 force-pushed the parq_2 branch 2 times, most recently from 40a6c6c to 4982e1f (November 26, 2024 09:47)
@svm1 (Collaborator, Author) commented Nov 26, 2024

@yingsu00 @majetideepak Thank you for reviewing - I have made all the necessary changes; please take a look!

@svm1 force-pushed the parq_2 branch 2 times, most recently from 2f489a5 to 1807de1 (January 25, 2025 02:54)
@majetideepak changed the title from "Add support for parquet_writer_version session property" to "feat(parquet): Add support for parquet version config" on Jan 29, 2025
@majetideepak (Collaborator) commented Jan 29, 2025

@svm1 Arrow has two versions a user can set: ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6) and the issue here prestodb/presto#22595, I see that the Presto ParquetWriterVersion maps to ParquetVersion.

https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258

@jkhaliqi Can you please evaluate these options with respect to RLE V2?

@svm1 (Collaborator, Author) commented Jan 29, 2025

> @svm1 Arrow has two versions a user can set: ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6) and the issue here prestodb/presto#22595, I see that the Presto ParquetWriterVersion maps to ParquetVersion.
>
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258
>
> @jkhaliqi Can you please evaluate these options with respect to RLE V2?

@majetideepak

I understand the two different versions - ParquetVersion refers to the format version (2.4, 2.6, etc.), while ParquetDataPageVersion refers to the datapage version (V1 or V2).

I think we discussed this exact question when I first began working on this fix, and we determined that the Presto parquet_writer_version property does map to ParquetDataPageVersion - #9700 (comment)

I also investigated the Presto Java source - parquet_writer_version there maps to the org.apache.parquet.column.ParquetProperties.WriterVersion enum, which contains only PARQUET_1_0 and PARQUET_2_0, likely representing the V1/V2 datapage versions - https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.10.1/org/apache/parquet/ParquetProperties.WriterVersion.html#PARQUET_1_0

The Presto docs also seem to point to this mapping:

> Presto now supports Parquet writer versions V1 and V2 for the Hive catalog. It can be toggled using the session property parquet_writer_version and the config property hive.parquet.writer.version. Valid values for these properties are PARQUET_1_0 and PARQUET_2_0. Default is PARQUET_1_0.

The format version also appears far more granular than simply "1 vs 2" (https://github.com/apache/parquet-format/blob/master/CHANGES.md).

Therefore I believe the mapping established in this PR is correct - please correct me if I am mistaken.

@majetideepak (Collaborator) commented Jan 29, 2025

@svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but the ParquetVersion.
Can you verify this?
@svm1 (Collaborator, Author) commented Feb 4, 2025

@majetideepak @czentgr

> @svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but the ParquetVersion. Can you verify this?

After a thorough investigation, I believe there is a misunderstanding regarding the nature of the original issue, and that the issue might not be exactly as described.

The issue is based on the following observation:

> Even when setting `set session hive.parquet_writer_version='PARQUET_1_0';`, Parquet data is written with format_version: 2.6

However, I don't think the state of hive.parquet_writer_version would ever affect the FormatVersion used, because the session property actually appears to refer to the DataPageVersion.

I was able to validate this by stepping through the Presto Java Parquet writer code. I observed that modifying hive.parquet_writer_version resulted in the code path diverging at the following point:

```java
switch (parquetProperties.getWriterVersion()) {
    case PARQUET_1_0:
        return new PrimitiveColumnWriterV1(prestoType,
                columnDescriptor,
                getValueWriter(parquetProperties.newValuesWriter(columnDescriptor), prestoType, columnDescriptor.getPrimitiveType()),
                parquetProperties.newDefinitionLevelWriter(columnDescriptor),
                parquetProperties.newRepetitionLevelWriter(columnDescriptor),
                compressionCodecName,
                parquetProperties.getPageSizeThreshold());
    case PARQUET_2_0:
        return new PrimitiveColumnWriterV2(prestoType,
                columnDescriptor,
                getValueWriter(parquetProperties.newValuesWriter(columnDescriptor), prestoType, columnDescriptor.getPrimitiveType()),
                parquetProperties.newDefinitionLevelEncoder(columnDescriptor),
                parquetProperties.newRepetitionLevelEncoder(columnDescriptor),
                compressionCodecName,
                parquetProperties.getPageSizeThreshold());
}
```
I observed that the WriterVersion field of parquetProperties is populated with the value of the session property hive.parquet_writer_version; this value then dictates whether a V1 or V2 writer is instantiated, as seen above.

@jkhaliqi and I reviewed this and conducted some experiments to understand the current behavior in Java. The following table shows the resulting FormatVersion for each value of the session property:

Presto Java:

| hive.parquet_writer_version value | DataPage observed in code path | FormatVersion observed via output file |
|---|---|---|
| PARQUET_1_0 | V1 | 1.0 |
| PARQUET_2_0 | V2 | 1.0 |

In all cases, the resulting Parquet files (analyzed via parquet-tools) are consistently created with FormatVersion 1.0, regardless of the session property value.

Additionally, we noticed that the encoding for boolean columns aligns with the expected DataPage types for each version:

PARQUET_1_0/V1 file:

```
jacobkhaliqi@Jacobs-MBP-69 ~ % parquet-tools meta Downloads/20250204_005554_00028_8adci_19f082cd-df97-4545-bd07-94f058abd33e
{"NumRowGroups":1,"RowGroups":[{"NumRows":1,"TotalByteSize":44,"Columns":[{"PathInSchema":["Flag"],"Type":"BOOLEAN","Encodings":["BIT_PACKED","RLE","PLAIN"],"CompressedSize":44,"UncompressedSize":24,"NumValues":1,"NullCount":0,"MaxValue":true,"MinValue":true,"CompressionCodec":"GZIP"}]}]}
```

PARQUET_2_0/V2 file:

```
jacobkhaliqi@Jacobs-MBP-69 ~ % parquet-tools meta Downloads/20250204_005648_00031_8adci_ca1a5061-bad4-4b29-b8a5-49b04d9073c4
{"NumRowGroups":1,"RowGroups":[{"NumRows":1,"TotalByteSize":50,"Columns":[{"PathInSchema":["Flag"],"Type":"BOOLEAN","Encodings":["RLE"],"CompressedSize":50,"UncompressedSize":30,"NumValues":1,"NullCount":0,"MaxValue":true,"MinValue":true,"CompressionCodec":"GZIP"}]}]}
```

Therefore, I believe hive.parquet_writer_version refers to the DataPageVersion and not the FormatVersion. The key change here is ensuring that this session property toggles the DataPageVersion in Prestissimo, similar to the existing behavior in Presto Java as implemented here: prestodb/presto#20957.

@svm1 force-pushed the parq_2 branch 2 times, most recently from 54cc8f1 to 0e470ea (February 4, 2025 20:28)
@majetideepak (Collaborator) left a review.
@majetideepak changed the title from "feat(parquet): Add support for parquet version config" to "feat(parquet): Add support for parquet data page version config" on Feb 4, 2025
@majetideepak changed the title from "feat(parquet): Add support for parquet data page version config" to "feat(parquet): Add config for datapage version" on Feb 5, 2025
@svm1 (Collaborator, Author) commented Feb 5, 2025

> Let's also document this at https://facebookincubator.github.io/velox/configs.html#parquet-file-format-configuration

Should we wait until we reconcile the mapping with the Presto session config? As it stands, this new flag will be unusable.

@svm1 force-pushed the parq_2 branch 2 times, most recently from 4b64681 to 1bbfab7 (February 5, 2025 01:26)
@majetideepak (Collaborator) commented

> Should we wait until we reconcile the mapping with the Presto session config? As it stands, this new flag will be unusable.

We should document this in Velox. It can be used by other Velox applications.

@czentgr (Collaborator) commented Feb 5, 2025

The doc for the flag should go here: https://github.com/facebookincubator/velox/blob/main/velox/docs/configs.rst#parquet-file-format-configuration

@svm1 (Collaborator, Author) commented Feb 5, 2025

@majetideepak @czentgr If we're doing away with the Presto naming here, I think it would be less ambiguous and more intuitive for this flag's values to be "V1" or "V2", as opposed to "PARQUET_1_0"/"PARQUET_2_0".

@majetideepak (Collaborator) commented

@svm1 yes, let's use "V1" or "V2" for Velox.

@svm1 force-pushed the parq_2 branch 2 times, most recently from dcb4a9b to a11bd66 (February 5, 2025 21:48)
@svm1 (Collaborator, Author) commented Feb 5, 2025

Documentation change pushed.

@majetideepak (Collaborator) left a review:

Looks great! Thanks for all the revisions and investigation @svm1

```cpp
200 * 1024 * 1024,
dwio::common::FileSink::Options{.pool = leafPool_.get()});
auto sinkPtr = sink.get();
parquet::WriterOptions writerOptions;
```
A collaborator commented:

@svm1 can we use processConfigs to ensure we test the config code path? Thanks.

@svm1 (Collaborator, Author) commented Feb 5, 2025

Great point, thanks. I originally wrote the test before you decoupled the config processing logic from the Hive connector, so manually editing WriterOptions seemed like the best approach at the time. Many positive improvements from your change there :)

Rewrote the test to simulate modifying the datapage flag via both a config file and a session property, to exercise the actual config code path.

@svm1 force-pushed the parq_2 branch 3 times, most recently from cf5fd27 to 0ff765b (February 5, 2025 23:55)
@czentgr (Collaborator) left a review:

Thanks. Let's add more tests.

```cpp
};

// Simulate setting of DataPage Version via Hive config from file.
std::unordered_map<std::string, std::string> configFromFile = {
```
@czentgr (Collaborator) commented Feb 6, 2025

I think one more test is needed to check that the default of V1 is used both without any config and when it is set explicitly.
That would cover all possible cases.

@svm1 (Collaborator, Author) commented

True - expanded the tests to validate the default and to toggle between both versions.

```cpp
{parquet::WriterOptions::kParquetHiveConnectorDataPageVersion, "V2"}};

config::ConfigBase connectorConfigV2(std::move(configFromFile));
writerOptions.processConfigs(
```
A collaborator commented:

processConfigs can go inside testDataPageVersion. This lambda will then take two arguments: the session and config parameters.
