
feat(parquet): Add config for datapage version #11151

Open · wants to merge 1 commit into base: main
Conversation

@svm1 (Collaborator) commented Oct 2, 2024

Add the config and session properties hive.parquet.writer.datapage-version and hive.parquet.writer.datapage_version to determine the Parquet writer datapage version (V1 or V2). Defaults to V1.

@facebook-github-bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 2, 2024
netlify bot commented Oct 2, 2024: Deploy Preview for meta-velox canceled. Latest commit 485b404; latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67a4a15c4ca27d0008963292

@svm1 force-pushed the parq_2 branch 2 times, most recently from 40a6c6c to 4982e1f (November 26, 2024 09:47)
@svm1 (Collaborator, Author) commented Nov 26, 2024

@yingsu00 @majetideepak Thank you for reviewing - I have made all the necessary changes; please take a look!

@svm1 force-pushed the parq_2 branch 2 times, most recently from 2f489a5 to 1807de1 (January 25, 2025 02:54)
@majetideepak changed the title from "Add support for parquet_writer_version session property" to "feat(parquet): Add support for parquet version config" on Jan 29, 2025
@majetideepak (Collaborator) commented Jan 29, 2025

@svm1 Arrow has two versions a user can set: ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6) and the issue here prestodb/presto#22595, I see that the Presto ParquetWriterVersion maps to ParquetVersion.

https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258

@jkhaliqi Can you please evaluate these options with respect to RLE V2?

@svm1 (Collaborator, Author) commented Jan 29, 2025

> @svm1 Arrow has two versions a user can set: ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6) and the issue here prestodb/presto#22595, I see that the Presto ParquetWriterVersion maps to ParquetVersion.
>
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258
>
> @jkhaliqi Can you please evaluate these options with respect to RLE V2?

@majetideepak

I understand the two different versions - ParquetVersion refers to the format version (2.4, 2.6, etc.), while ParquetDataPageVersion refers to the datapage version (V1 or V2).

I think we discussed this exact question when I first began working on this fix, and we determined that the Presto parquet_writer_version property does map to ParquetDataPageVersion - #9700 (comment)

I also investigated the Presto Java source - parquet_writer_version there maps to the org.apache.parquet.column.ParquetProperties.WriterVersion enum, which contains only PARQUET_1_0 and PARQUET_2_0, likely representing the V1/V2 datapage versions - https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.10.1/org/apache/parquet/ParquetProperties.WriterVersion.html#PARQUET_1_0

The Presto docs also seem to point to this mapping:

> Presto now supports Parquet writer versions V1 and V2 for the Hive catalog. It can be toggled using the session property parquet_writer_version and the config property hive.parquet.writer.version. Valid values for these properties are PARQUET_1_0 and PARQUET_2_0. Default is PARQUET_1_0.

The format version also appears far more granular than simply "1 vs 2" (https://github.com/apache/parquet-format/blob/master/CHANGES.md).

Therefore I believe the mapping established in this PR is correct - please correct me if I am mistaken.

@majetideepak (Collaborator) commented Jan 29, 2025

@svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but the ParquetVersion.
Can you verify this?
@svm1 (Collaborator, Author) commented Feb 4, 2025

@majetideepak @czentgr

> @svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but the ParquetVersion. Can you verify this?

After a thorough investigation, I believe there is a misunderstanding regarding the nature of the original issue, and that the issue might not be exactly as described.

The issue is based on the following observation:

> Even when setting `set session hive.parquet_writer_version='PARQUET_1_0';`, Parquet data is written with format_version: 2.6

However, I don't think the state of hive.parquet_writer_version would ever affect the FormatVersion used, because the session property actually appears to refer to the DataPageVersion.

I was able to validate this by stepping through the Presto Java Parquet writer code. I observed that modifying hive.parquet_writer_version resulted in the code path diverging at the following point:

```java
switch (parquetProperties.getWriterVersion()) {
    case PARQUET_1_0:
        return new PrimitiveColumnWriterV1(prestoType,
                columnDescriptor,
                getValueWriter(parquetProperties.newValuesWriter(columnDescriptor), prestoType, columnDescriptor.getPrimitiveType()),
                parquetProperties.newDefinitionLevelWriter(columnDescriptor),
                parquetProperties.newRepetitionLevelWriter(columnDescriptor),
                compressionCodecName,
                parquetProperties.getPageSizeThreshold());
    case PARQUET_2_0:
        return new PrimitiveColumnWriterV2(prestoType,
                columnDescriptor,
                getValueWriter(parquetProperties.newValuesWriter(columnDescriptor), prestoType, columnDescriptor.getPrimitiveType()),
                parquetProperties.newDefinitionLevelEncoder(columnDescriptor),
                parquetProperties.newRepetitionLevelEncoder(columnDescriptor),
                compressionCodecName,
                parquetProperties.getPageSizeThreshold());
}
```
I observed that the WriterVersion field of parquetProperties is populated with the value of the session property hive.parquet_writer_version; this value then dictates whether a V1 or V2 writer is instantiated, as seen above.

@jkhaliqi and I reviewed this and conducted some experiments to understand the current behavior in Java. The following table shows the resulting FormatVersion for each value of the session property:

Presto Java:

| hive.parquet_writer_version value | DataPage observed in code path | FormatVersion observed via output file |
|---|---|---|
| PARQUET_1_0 | V1 | 1.0 |
| PARQUET_2_0 | V2 | 1.0 |

In all cases, the resulting Parquet files (analyzed via parquet-tools) are consistently created with FormatVersion 1.0, regardless of the session property value.

Additionally, we noticed that the encoding for boolean columns aligns with the expected DataPage types for each version:

PARQUET_1_0/V1 file:

```
jacobkhaliqi@Jacobs-MBP-69 ~ % parquet-tools meta Downloads/20250204_005554_00028_8adci_19f082cd-df97-4545-bd07-94f058abd33e
{"NumRowGroups":1,"RowGroups":[{"NumRows":1,"TotalByteSize":44,"Columns":[{"PathInSchema":["Flag"],"Type":"BOOLEAN","Encodings":["BIT_PACKED","RLE","PLAIN"],"CompressedSize":44,"UncompressedSize":24,"NumValues":1,"NullCount":0,"MaxValue":true,"MinValue":true,"CompressionCodec":"GZIP"}]}]}
```

PARQUET_2_0/V2 file:

```
jacobkhaliqi@Jacobs-MBP-69 ~ % parquet-tools meta Downloads/20250204_005648_00031_8adci_ca1a5061-bad4-4b29-b8a5-49b04d9073c4
{"NumRowGroups":1,"RowGroups":[{"NumRows":1,"TotalByteSize":50,"Columns":[{"PathInSchema":["Flag"],"Type":"BOOLEAN","Encodings":["RLE"],"CompressedSize":50,"UncompressedSize":30,"NumValues":1,"NullCount":0,"MaxValue":true,"MinValue":true,"CompressionCodec":"GZIP"}]}]}
```

Therefore, I believe hive.parquet_writer_version refers to the DataPageVersion and not the FormatVersion. The key change here is ensuring that this session property toggles the DataPageVersion in Prestissimo, similar to the existing behavior in Presto Java as implemented here: prestodb/presto#20957.

@svm1 force-pushed the parq_2 branch 2 times, most recently from 54cc8f1 to 0e470ea (February 4, 2025 20:28)
@majetideepak (Collaborator) left a review.
@majetideepak changed the title from "feat(parquet): Add support for parquet version config" to "feat(parquet): Add support for parquet data page version config" on Feb 4, 2025
@majetideepak changed the title from "feat(parquet): Add support for parquet data page version config" to "feat(parquet): Add config for datapage version" on Feb 5, 2025
@svm1 (Collaborator, Author) commented Feb 5, 2025

> Let's also document this at https://facebookincubator.github.io/velox/configs.html#parquet-file-format-configuration

Should we wait until we reconcile the mapping with the Presto session config? As it stands, this new flag will be unusable.

@svm1 force-pushed the parq_2 branch 2 times, most recently from 4b64681 to 1bbfab7 (February 5, 2025 01:26)
@majetideepak (Collaborator) commented

> Should we wait until we reconcile the mapping with the Presto session config? As it stands, this new flag will be unusable.

We should document this in Velox. It can be used by other Velox applications.

@czentgr (Collaborator) commented Feb 5, 2025

The doc for the flag should go here: https://github.com/facebookincubator/velox/blob/main/velox/docs/configs.rst#parquet-file-format-configuration

@svm1 (Collaborator, Author) commented Feb 5, 2025

@majetideepak @czentgr If we're doing away with the Presto naming here, I think it would be less ambiguous and more intuitive for this flag's values to be "V1" or "V2", as opposed to "PARQUET_1_0"/"PARQUET_2_0".

@majetideepak (Collaborator) commented

@svm1 yes, let's use "V1" or "V2" for Velox.

@svm1 force-pushed the parq_2 branch 2 times, most recently from dcb4a9b to a11bd66 (February 5, 2025 21:48)
@svm1 (Collaborator, Author) commented Feb 5, 2025

Documentation change pushed.

@majetideepak (Collaborator) left a review:

Looks great! Thanks for all the revisions and investigation @svm1

```cpp
200 * 1024 * 1024,
dwio::common::FileSink::Options{.pool = leafPool_.get()});
auto sinkPtr = sink.get();
parquet::WriterOptions writerOptions;
```
A collaborator commented:

@svm1 can we use processConfigs to ensure we test the config code path? Thanks.

@svm1 (Collaborator, Author) commented Feb 5, 2025

Great point, thanks. I originally wrote the test before you decoupled the config processing logic from the Hive connector, so manually editing WriterOptions seemed like the best approach at the time. Many positive improvements from your change there :)

Rewrote the test to simulate modifying the datapage flag via both a config file and a session property, to exercise the actual config code path.

@svm1 force-pushed the parq_2 branch 3 times, most recently from cf5fd27 to 0ff765b (February 5, 2025 23:55)
@czentgr (Collaborator) left a review:

Thanks. Let's add more tests.

```cpp
};

// Simulate setting of DataPage Version via Hive config from file.
std::unordered_map<std::string, std::string> configFromFile = {
```
@czentgr (Collaborator) commented Feb 6, 2025

I think one more test is needed to check that the default of V1 is used both without any config and when it is set explicitly.
That would cover all possible cases.

@svm1 (Collaborator, Author) commented

True - expanded the tests to validate the default and to toggle between both versions.

```cpp
{parquet::WriterOptions::kParquetHiveConnectorDataPageVersion, "V2"}};

config::ConfigBase connectorConfigV2(std::move(configFromFile));
writerOptions.processConfigs(
```
A collaborator commented:

processConfigs can go inside testDataPageVersion. This lambda will then take two arguments: the session and config parameters.
