GH-29781: [C++][Parquet] Switch to use compliant nested types by default #35146

wjones127 · 2023-04-14T20:03:20Z

Rationale for this change

This has been a long-standing TODO.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

This PR includes breaking changes to public APIs.

Closes: [C++][Parquet] Default to compliant nested types in Parquet writer #29781

github-actions · 2023-04-14T20:03:45Z

Closes: [C++][Parquet] Default to compliant nested types in Parquet writer #29781

mapleFU · 2023-04-16T15:27:53Z

cpp/src/parquet/properties.h

@@ -819,8 +819,7 @@ class PARQUET_EXPORT ArrowWriterProperties {
          coerce_timestamps_unit_(::arrow::TimeUnit::SECOND),
          truncated_timestamps_allowed_(false),
          store_schema_(false),
-          // TODO: At some point we should flip this.


/// This is disabled by default, but will be enabled by default in future.

Should we also change this at line 880?

Line 320 in the parquet.rst doc should also be changed.

std::shared_ptr<ArrowWriterProperties> arrow_props = ArrowWriterProperties::Builder()
    .enable_deprecated_int96_timestamps() // default False
    ->store_schema() // default False
    ->enable_compliant_nested_types() // default False
    ->build();

mapleFU

public abstract class ConversionPatterns {

  static final String MAP_REPEATED_NAME = "key_value";
  private static final String ELEMENT_NAME = "element";

Going through the Java code, it seems that it also uses "element". LGTM
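To make the naming difference concrete, here is a minimal sketch of how the leaf column path of a list column changes with the flag. It is a toy model, not the Arrow or Parquet API: `ListLeafPath` and its parameters are hypothetical names, and it assumes the repeated middle group is called `list` in both modes and that the Arrow list child carries its default name `item` unless the caller says otherwise.

```cpp
#include <cassert>
#include <string>

// Toy model only: hypothetical helper, not part of Arrow or Parquet.
// Returns the dotted Parquet column path of the leaf of a list column.
//   column:           name of the list column, e.g. "tags"
//   arrow_child_name: name of the Arrow list's child field ("item" by default)
//   compliant:        models the compliant-nested-types writer flag
std::string ListLeafPath(const std::string& column,
                         const std::string& arrow_child_name,
                         bool compliant) {
  assert(!column.empty());
  // With the flag on, the child is written as "element" whatever Arrow called it.
  // With the flag off, the Arrow child name is written as-is.
  const std::string child = compliant ? "element" : arrow_child_name;
  // The repeated middle group is assumed to be named "list" in both cases.
  return column + ".list." + child;
}
```

For a column `tags`, `ListLeafPath("tags", "item", false)` gives `tags.list.item` and `ListLeafPath("tags", "item", true)` gives `tags.list.element`, which is the kind of path change a reader that selects columns by name would notice.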

mapleFU · 2023-04-19T16:08:49Z

In C++, the code forces the name. But when I went through the code in arrow-rs, it doesn't say the name is forced. And although parquet-format uses "element" in its description, it doesn't say that it's required. Maybe we should ask the Parquet mailing list later?

wjones127 · 2023-04-19T18:24:32Z

> In C++, the code forces the name. But when I went through the code in arrow-rs, it doesn't say the name is forced. And although parquet-format uses "element" in its description, it doesn't say that it's required. Maybe we should ask the Parquet mailing list later?

The rules were laid out earlier here:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules

Before this PR I did work to make sure that the Arrow C++ implementation (1) didn't care about the field names in equality comparison (PR) and (2) could cheaply cast between types that differed only in field names (PR).
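The two preparatory changes described above can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyListType`, `ToyListEquals`, `RenameChild`, and the `check_names` parameter are hypothetical and are not Arrow classes or functions. It shows equality that ignores the child field name, and a "cast" between types that differ only in that name being a relabel rather than a data conversion.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a list type. Hypothetical: this is not arrow::ListType.
struct ToyListType {
  std::string child_name;  // "item" (Arrow's default) or "element" (compliant Parquet)
  std::string child_type;  // e.g. "int32"
};

// Equality that ignores the child field name unless check_names is set.
// check_names is an assumed parameter for this sketch.
bool ToyListEquals(const ToyListType& a, const ToyListType& b,
                   bool check_names = false) {
  if (a.child_type != b.child_type) return false;
  return !check_names || a.child_name == b.child_name;
}

// A "cast" between types that differ only in the child name is a relabel:
// only the name changes, the value type is carried over untouched.
ToyListType RenameChild(const ToyListType& type, const std::string& new_name) {
  assert(!new_name.empty());
  return ToyListType{new_name, type.child_type};
}
```

With `ToyListType legacy{"item", "int32"}` and `ToyListType compliant{"element", "int32"}`, `ToyListEquals(legacy, compliant)` is true, while the strict comparison `ToyListEquals(legacy, compliant, true)` is false until `RenameChild(legacy, "element")` is applied.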

mapleFU · 2023-04-20T02:41:04Z

Thanks! So, this flag could make converting and reading a bit cheaper?

wgtmac · 2023-05-04T14:47:49Z

Is it time to revive this PR and check it in? @wjones127

wgtmac

LGTM! cc @pitrou

cpp/src/parquet/properties.h

Co-authored-by: Antoine Pitrou <[email protected]>

ursabot · 2023-05-14T01:43:57Z

Benchmark runs are scheduled for baseline = 1624d5a and contender = e324f9a. e324f9a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.66% ⬆️0.06%] test-mac-arm
[Finished ⬇️1.52% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e324f9ab ec2-t3-xlarge-us-east-2
[Failed] e324f9ab test-mac-arm
[Finished] e324f9ab ursa-i9-9960x
[Finished] e324f9ab ursa-thinkcentre-m75q
[Finished] 1624d5aa ec2-t3-xlarge-us-east-2
[Finished] 1624d5aa test-mac-arm
[Finished] 1624d5aa ursa-i9-9960x
[Finished] 1624d5aa ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-05-14T01:49:12Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

…y default (apache#35146)

### Rationale for this change

This has been a long-standing TODO.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**

* Closes: apache#29781

Lead-authored-by: Will Jones <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

…compliant nested types (#2428)

This solves some of our flaky unit tests that try to read a Ray dataset tensor type under a pyarrow version that does not properly serialize it. Adds a new exception when we detect this issue, and also updates our pyarrow dependencies to 13.0.0.

I also found a separate bug upon upgrading pyarrow: versions >= 13.0.0 use compliant Parquet nested type names by default (see apache/arrow#35146), which then causes a discrepancy between the Parquet and Arrow schemas for extension types. This leads to incorrect conversion from Parquet to Arrow when we read a file. This PR disables that explicitly.

Finally, I also added a `coerce_temporal_nanoseconds` parameter to `to_pandas` to revert it to its pre pyarrow>=13.0.0 behavior, which makes the SQL integration tests pass again.

Confirmed to raise the proper error in local testing.

github-actions bot added the Component: C++, Component: Parquet, and awaiting committer review labels Apr 14, 2023

wjones127 mentioned this pull request Apr 14, 2023

ARROW-14196: [C++][Parquet] Make compliant nested types default #14651

Closed

mapleFU reviewed Apr 16, 2023

wjones127 force-pushed the GH-29781-parquet-compliant-nested-types branch from 4d0b91d to 72b4c67 April 17, 2023 22:06

github-actions bot added the Component: Python label Apr 17, 2023

wjones127 force-pushed the GH-29781-parquet-compliant-nested-types branch from 72b4c67 to 564000c April 17, 2023 22:07

github-actions bot added the Component: Documentation label Apr 17, 2023

wjones127 marked this pull request as ready for review April 18, 2023 20:37

wjones127 requested a review from AlenkaF as a code owner April 18, 2023 20:37

mapleFU approved these changes Apr 19, 2023

wjones127 added 3 commits May 10, 2023 14:11

feat(api!) switch to use compliant nested types by default

9d56092

chore: update Python and docs

2d55596

test: fix another python test

0ea421d

wjones127 force-pushed the GH-29781-parquet-compliant-nested-types branch from 8f28f2a to 0ea421d May 10, 2023 21:11

wgtmac approved these changes May 11, 2023

pitrou reviewed May 11, 2023

cpp/src/parquet/properties.h (outdated)

Update cpp/src/parquet/properties.h

26c0fa2

Co-authored-by: Antoine Pitrou <[email protected]>

github-actions bot added the awaiting changes label and removed the awaiting committer review label May 11, 2023

mapleFU approved these changes May 12, 2023

pitrou approved these changes May 12, 2023

pitrou merged commit e324f9a into apache:main May 12, 2023

wjones127 deleted the GH-29781-parquet-compliant-nested-types branch May 12, 2023 16:10

mapleFU mentioned this pull request Oct 26, 2023

MINOR: [Python][Docs] Fix default for use_compliant_nested_type in parquet write_table docstring #38471

Merged

kevinzwang mentioned this pull request Jun 21, 2024

[BUG] Raise error when Ray Data tensor cannot be pickled and disable compliant nested types Eventual-Inc/Daft#2428

Merged

wgtmac Apr 17, 2023
