
PARQUET-2465: Fall back to HadoopConfig #1339

Merged
merged 1 commit into apache:master from fd-dont-init-config May 3, 2024

Conversation

@Fokko (Contributor) commented May 1, 2024

We see that this change causes 1.14 to be incompatible with previous releases.

    protected WriteSupport<T> getWriteSupport(ParquetConfiguration conf) {
      throw new UnsupportedOperationException(
          "Override ParquetWriter$Builder#getWriteSupport(ParquetConfiguration)");
    }

This new method is not implemented by existing builders and causes an UnsupportedOperationException if you don't supply a config.

  • I think this breaks backward compatibility.
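For illustration, a minimal sketch of what the fallback could look like (a hypothetical fragment, not necessarily the exact change merged here; it assumes HadoopParquetConfiguration exposes the wrapped Hadoop Configuration via getConfiguration()):

```java
protected WriteSupport<T> getWriteSupport(ParquetConfiguration conf) {
  if (conf instanceof HadoopParquetConfiguration) {
    // Fall back to the long-standing Hadoop-based overload, so existing builders
    // that only override getWriteSupport(Configuration) keep working.
    return getWriteSupport(((HadoopParquetConfiguration) conf).getConfiguration());
  }
  throw new UnsupportedOperationException(
      "Override ParquetWriter$Builder#getWriteSupport(ParquetConfiguration)");
}
```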

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@Fokko Fokko force-pushed the fd-dont-init-config branch 2 times, most recently from baa6245 to 17908f6 on May 2, 2024 05:53
@Fokko Fokko changed the title from "Do not initalize the config" to "Fall back to HadoopConfig" May 2, 2024
@Fokko Fokko force-pushed the fd-dont-init-config branch 3 times, most recently from a416cef to 6a1bd96 on May 2, 2024 07:50
@Fokko Fokko changed the title from "Fall back to HadoopConfig" to "PARQUET-2465: Fall back to HadoopConfig" May 2, 2024
@@ -503,8 +503,7 @@ protected Builder(OutputFile path) {
* @return an appropriate WriteSupport for the object model.
*/
protected WriteSupport<T> getWriteSupport(ParquetConfiguration conf) {
throw new UnsupportedOperationException(
@Fokko (Contributor, Author) commented:

When people want to decouple from Hadoop, they can just override the methods; otherwise, the old behavior is preserved.
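For example, a downstream builder could override just the new overload. A minimal sketch, assuming placeholder MyRecord and MyWriteSupport types (depending on the parquet-mr version, the Hadoop-based getWriteSupport(Configuration) may still need an implementation as well):

```java
@Override
protected WriteSupport<MyRecord> getWriteSupport(ParquetConfiguration conf) {
  // Overriding the ParquetConfiguration variant keeps Hadoop's Configuration
  // class off this code path entirely.
  return new MyWriteSupport(conf);
}
```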

A project member commented:

Is it possible to use a flag from ParquetConfiguration to determine whether backward compatibility is required? By default we should enable backward compatibility and disable it in parquet-mr 2.0.

@Fokko (Contributor, Author) replied:

Thinking about it, I believe we should deprecate the methods where you need to pass in a Configuration (from Hadoop), and then we can move to ParquetConfiguration, where you can also pass in a HadoopParquetConfiguration. Thanks for raising this; let me create a commit, and let me know what you think!

Add fallback logic

We see that this causes 1.14 to be incompatible
with previous releases. The config will be created,
and right after that `getWriteSupport(conf)` is called.

But since this method is freshly introduced, existing
builders do not override it and it throws:

```java
    protected WriteSupport<T> getWriteSupport(ParquetConfiguration conf) {
      throw new UnsupportedOperationException(
          "Override ParquetWriter$Builder#getWriteSupport(ParquetConfiguration)");
    }
```
@Fokko Fokko force-pushed the fd-dont-init-config branch from 9610965 to 6e754d9 on May 2, 2024 20:13
@Fokko Fokko marked this pull request as ready for review May 2, 2024 20:13
@Fokko (Contributor, Author) commented May 2, 2024

Good to go. @wgtmac @shangxinli @amousavigourabi @vinooganesh LMKWYT

@vinooganesh (Contributor) commented:
👍 This looks good to me, but do we want to actually mark the Hadoop methods as deprecated if we are going to assume that parquet-mr 1.x will always rely on Hadoop? Or is there actually a plan to drop the Hadoop dependency in future 1.x releases?

@wgtmac (Member) left a comment:
This fix makes sense to me. Thanks!

@Fokko (Contributor, Author) commented May 3, 2024

> 👍 This looks good to me, but do we want to actually mark the Hadoop methods as deprecated if we are going to assume that parquet-mr 1.x will always rely on Hadoop? Or is there actually a plan to drop the Hadoop dependency in future 1.x releases?

You can still use the Hadoop config, but you'll need to wrap it in a HadoopParquetConfiguration: https://github.com/apache/parquet-mr/blob/68609198c4fecaa0e8fb1bcaa2c8a353030de962/parquet-hadoop/src/main/java/org/apache/parquet/conf/HadoopParquetConfiguration.java#L42-L44
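For example, a minimal sketch (the configuration key below is only illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.conf.HadoopParquetConfiguration;
import org.apache.parquet.conf.ParquetConfiguration;

Configuration hadoopConf = new Configuration();
hadoopConf.set("parquet.example.key", "value"); // illustrative setting; existing Hadoop settings carry over

// Wrap the Hadoop Configuration so it can be passed wherever a ParquetConfiguration is expected.
ParquetConfiguration conf = new HadoopParquetConfiguration(hadoopConf);
```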

@Fokko Fokko merged commit 408c18b into apache:master May 3, 2024
9 checks passed
@Fokko Fokko deleted the fd-dont-init-config branch May 3, 2024 09:37
Fokko added a commit to Fokko/parquet-mr that referenced this pull request May 3, 2024
Add fallback logic
Fokko added a commit that referenced this pull request May 3, 2024
Add fallback logic
@amousavigourabi (Contributor) commented:
> 👍 This looks good to me, but do we want to actually mark the Hadoop methods as deprecated if we are going to assume that parquet-mr 1.x will always rely on Hadoop? Or is there actually a plan to drop the Hadoop dependency in future 1.x releases?

I'd like to note that we have other APIs that are deprecated because they will be dropped in 2.0, without any plans to remove them in 1.x releases (see org.apache.parquet.avro.AvroParquetReader#builder(Path) for an example), so this is consistent with usage in the rest of the project.

@vinooganesh (Contributor) commented:
Sounds great, thanks @Fokko and @amousavigourabi!

clairemcginty pushed a commit to clairemcginty/parquet-mr that referenced this pull request May 17, 2024
Add fallback logic
@wgtmac wgtmac added this to the 1.15.0 milestone Sep 30, 2024