Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-3035: ParquetRewriter: Add a column renaming feature #3036

Conversation

MaxNevermind
Copy link
Contributor

@MaxNevermind MaxNevermind commented Oct 21, 2024

Rationale for this change

This feature extension is based on a real use-case when a input parquet dataset need to transformed to a new one using a set of basic schema transformations. ParquetRewriter already supports some of transformations: pruning, masking, encrypting, changing a codec. This PR add one of missing transformations - renaming.

What changes are included in this PR?

  • Add a new option renameColumns to RewriteOptions class which is options builder for ParquetRewriter
  • Add support of to renameColumns to ParquetRewriter

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@MaxNevermind
Copy link
Contributor Author

MaxNevermind commented Oct 21, 2024

@wgtmac
Can you check this out? I need your opinion - If you see any problems with such a feature conceptually. Then I can polish the implementation. Right now there is a single test for a new feature and the tests for more complex cases with a mix of options are missing. Will work on it after you green light the feature.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MaxNevermind for the proposal! I just took an initial pass and have the following questions:

  • What is your use case? Do you need reordering fields in the future?
  • How does renaming work with other features, including join, mask, and encrypt?

cc @ConeyLiu

@MaxNevermind
Copy link
Contributor Author

MaxNevermind commented Oct 24, 2024

What is your use case?

We have a large base dataset which is split into multiple versions at the end of the flow. Those final datasets have majority of columns overlapping but some columns are dropped & some columns are renamed. Similar thing can be achieved by multiple HMS views on the top of base HMS table but it is not always possible, for example if different users are supposed to have access just to a single version of a dataset, it is hard to achieve that with HMS views without giving access to users to that underlying base table.

Do you need reordering fields in the future?

In our case we don't care about column ordering.

How does renaming work with other features, including join, mask, and encrypt?

mask - it is not supported as it seems meaningless to rename a column that is dropped, I want to throw exception if that is detected, not entirely sure about that though, does it make sense to allow nullification + renaming? 🤷
join - I planned to support it, so the column could be joined and then renamed if needed
encrypt - I planned to support it, so the column can be encrypted and renamed if needed

@wgtmac
Copy link
Member

wgtmac commented Oct 25, 2024

Thanks for the explanation! I asked for column ordering because renaming and reordering may happen together in the case of schema evolution. We can support them in the future when needed.

does it make sense to allow nullification + renaming

I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?

@MaxNevermind
Copy link
Contributor Author

I asked for column ordering because renaming and reordering may happen together in the case of schema evolution. We can support them in the future when needed.

In case of column reordering there is a need to provide a new order of column, correct? If I would be implementing it I guess I would create a RewriteOptions's option like Map<SrcColName, Map.Entry<DistColName, DestColIdx>> columnsReordered which would have merged capabilities of both renaming and reordering. I think current implementation can be extended to something like that in the future, for example if columnsReordered is used we should prohibit renameColumns option usage, wdyt?

I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?

I've just checked MaskMode enum, it has planned hashing feature as I understand. Renaming hashed column would definitely should be beneficial to support. Okay, I think we should try to support renaming of masked columns.

@MaxNevermind
Copy link
Contributor Author

@wgtmac
In general do you think this feature looks worth adding, no hard no-go traits in this feature? If yes, I will work on extending functionality to those overlapping cases with masking and will start polishing it.

@MaxNevermind MaxNevermind requested a review from wgtmac October 28, 2024 21:14
@wgtmac
Copy link
Member

wgtmac commented Oct 29, 2024

Yes, I think this is a fair use case, provided that the code logic does not change a lot.

@MaxNevermind
Copy link
Contributor Author

@wgtmac
I think I'm done with implementation, can you take a look?

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good. Some nits to fix. Thanks!

@MaxNevermind MaxNevermind requested a review from wgtmac November 12, 2024 00:55
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@wgtmac wgtmac changed the title [GH-3035] [WIP] ParquetRewriter: Add a column renaming feature GH-3035: ParquetRewriter: Add a column renaming feature Nov 12, 2024
@wgtmac wgtmac merged commit 686f071 into apache:master Nov 13, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants