GH-3035: ParquetRewriter: Add a column renaming feature #3036
@wgtmac
Thanks @MaxNevermind for the proposal! I just took an initial pass and have the following questions:
- What is your use case? Do you need to reorder fields in the future?
- How does renaming work with other features, including join, mask, and encrypt?
cc @ConeyLiu
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
We have a large base dataset that is split into multiple versions at the end of the flow. Most columns overlap across those final datasets, but some columns are dropped and some are renamed. Something similar can be achieved with multiple HMS views on top of a base HMS table, but that is not always possible: for example, if different users are supposed to have access to only a single version of the dataset, it is hard to achieve that with HMS views without also giving those users access to the underlying base table.
In our case we don't care about column ordering.
mask - not supported, since it seems meaningless to rename a column that is dropped. I want to throw an exception if that is detected, though I'm not entirely sure about it. Does it make sense to allow nullification + renaming? 🤷
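To make the check proposed above concrete, here is a minimal sketch of rejecting a rename that targets a masked column. It uses only the JDK and is not the ParquetRewriter code; the class name RenameMaskCheckSketch, the method checkNoRenameOfMaskedColumns, and the exception type are assumptions made for illustration, and whether this combination should be rejected at all is still an open question in this thread.

```java
import java.util.Map;
import java.util.Set;

public class RenameMaskCheckSketch {

    /**
     * Hypothetical validation: fails if any column that is scheduled for
     * renaming (a key of the rename map) is also scheduled for masking.
     */
    public static void checkNoRenameOfMaskedColumns(
            Map<String, String> renames, Set<String> maskedColumns) {
        for (String oldName : renames.keySet()) {
            if (maskedColumns.contains(oldName)) {
                throw new IllegalArgumentException(
                        "Cannot rename a masked column: " + oldName);
            }
        }
    }

    public static void main(String[] args) {
        // Passes: the renamed column and the masked column are different.
        checkNoRenameOfMaskedColumns(Map.of("zip", "postal_code"), Set.of("ssn"));

        // Fails: "ssn" is both renamed and masked.
        try {
            checkNoRenameOfMaskedColumns(Map.of("ssn", "tax_id"), Set.of("ssn"));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Allowing nullification together with renaming would amount to dropping this check and applying the rename to the masked column's name as well.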
Thanks for the explanation! I asked for
I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?
For column reordering, a new column order would need to be provided, correct? If I were implementing it, I guess I would add a RewriteOptions option like
I've just checked
@wgtmac
Yes, I think this is a fair use case, provided that the code logic does not change a lot.
The implementation looks good. Some nits to fix. Thanks!
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
LGTM. Thanks!
Rationale for this change
This feature extension is based on a real use case in which an input Parquet dataset needs to be transformed into a new one using a set of basic schema transformations. ParquetRewriter already supports some of these transformations: pruning, masking, encrypting, and changing the codec. This PR adds one of the missing transformations: renaming.
What changes are included in this PR?
- Adds renameColumns to the RewriteOptions class, which is the options builder for ParquetRewriter
- Adds renameColumns to ParquetRewriter
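As a rough illustration of what a column-renaming step involves, the sketch below applies an old-name to new-name map to a flat list of column names and rejects renames that target a missing column or collide with another output column. It uses only the JDK and is not the ParquetRewriter code; the class RenameColumnsSketch, the method applyRenames, and both validation rules are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RenameColumnsSketch {

    /** Returns the column names after applying the old-name to new-name map. */
    public static List<String> applyRenames(List<String> columns, Map<String, String> renames) {
        // Assumed rule: every column to rename must exist in the input schema.
        for (String oldName : renames.keySet()) {
            if (!columns.contains(oldName)) {
                throw new IllegalArgumentException("Column to rename not found: " + oldName);
            }
        }
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String column : columns) {
            // Columns without a rename entry keep their original name.
            String renamed = renames.getOrDefault(column, column);
            // Assumed rule: a rename must not collide with another output column.
            if (!seen.add(renamed)) {
                throw new IllegalArgumentException("Duplicate column after rename: " + renamed);
            }
            result.add(renamed);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("id", "name", "zip");
        // Prints [id, name, postal_code]
        System.out.println(applyRenames(columns, Map.of("zip", "postal_code")));
    }
}
```

Column order is preserved here, which matches the use case in this thread where ordering does not matter and no reordering is requested.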
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.