Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prefixesToRename config for renaming fields upon ingestion #8273

Merged
merged 8 commits into from
Mar 17, 2022

Conversation

jeffreyliu34
Copy link
Contributor

@jeffreyliu34 jeffreyliu34 commented Mar 2, 2022

Description

This change allows fields to be renamed upon ingestion to avoid duplication of columns/data. More specifically, the proposed change will enabling dropping specified prefixes of fields that are in the source data, so that the resulting field names do not have those prefixes. This addresses issue #8161

More details here in this design doc here - https://docs.google.com/document/d/1U_vQC0BiCCEcx49Tsp499V5F075iJ3fW9IrsD8sNdU0/edit?usp=sharing

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • No (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • No (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

  • Added prefixesToRename config to rename columns upon ingestion based on their prefixes

Documentation

@codecov-commenter
Copy link

codecov-commenter commented Mar 2, 2022

Codecov Report

Merging #8273 (5072901) into master (e8c849e) will decrease coverage by 0.30%.
The diff coverage is 69.04%.

@@             Coverage Diff              @@
##             master    #8273      +/-   ##
============================================
- Coverage     71.09%   70.78%   -0.31%     
+ Complexity     4314     4265      -49     
============================================
  Files          1626     1636      +10     
  Lines         84881    85844     +963     
  Branches      12788    12931     +143     
============================================
+ Hits          60343    60764     +421     
- Misses        20394    20886     +492     
- Partials       4144     4194      +50     
Flag Coverage Δ
integration1 28.90% <0.00%> (+0.09%) ⬆️
integration2 27.66% <0.00%> (+0.13%) ⬆️
unittests1 66.92% <69.04%> (-0.49%) ⬇️
unittests2 14.18% <0.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...he/pinot/segment/local/utils/TableConfigUtils.java 65.34% <25.00%> (-0.73%) ⬇️
...ache/pinot/segment/local/utils/IngestionUtils.java 25.00% <66.66%> (+5.74%) ⬆️
...ocal/recordtransformer/ComplexTypeTransformer.java 58.57% <78.57%> (+2.89%) ⬆️
.../spi/config/table/ingestion/ComplexTypeConfig.java 100.00% <100.00%> (ø)
...data/manager/realtime/DefaultSegmentCommitter.java 0.00% <0.00%> (-80.00%) ⬇️
...ls/nativefst/automaton/MinimizationOperations.java 0.00% <0.00%> (-36.20%) ⬇️
...org/apache/pinot/spi/config/table/QueryConfig.java 53.33% <0.00%> (-26.67%) ⬇️
...a/manager/realtime/RealtimeSegmentDataManager.java 50.00% <0.00%> (-25.00%) ⬇️
...tils/nativefst/automaton/TransitionComparator.java 9.37% <0.00%> (-25.00%) ⬇️
...er/api/resources/LLCSegmentCompletionHandlers.java 43.56% <0.00%> (-18.82%) ⬇️
... and 170 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8c849e...5072901. Read the comment docs.

@icefury71 icefury71 requested a review from yupeng9 March 3, 2022 21:33
@yupeng9
Copy link
Contributor

yupeng9 commented Mar 4, 2022

I feel the config of prefixesToDropFromFields is a bit limited. Perhaps we can consider a config of prefixRename or sth similar to allow more manipulation of the prefix? So prefixesToDropFromFields will become a special case of this which is rename to empty.

Also, how do we handle the collision from multiple nested columns that happen to have the same name, after the prefix drop?

@jeffreyliu34
Copy link
Contributor Author

jeffreyliu34 commented Mar 4, 2022

Yea I like that. How does something like this look? @yupeng9

prefixesToRename: [
    {
         "prefixName": "after."
         "prefixReplace": ""
    },
    ...
]

In terms of avoiding collisions with same column names, perhaps we can add a validation check when the table config is added? I believe there's similar validation checks in place currently for comparing column names in the configs against the schema cc @icefury71

Also since the general idea is to allow for renaming columns from the source data upon ingestion, we could also end up expanding on this via some different approaches. One approach could be to specify the exact full column names we want to rename, and specify what we want to rename it to, like how we have that capability in transform configs. Another approach would be to apply the config in batch, so like batch removing/renaming common prefixes, or batch converting snake_case to camelCase, etc. so that it's not tedious to specify everything like in the first approach. For this PR, I was thinking of just starting with prefixRename but definitely appreciate any thoughts.

@yupeng9
Copy link
Contributor

yupeng9 commented Mar 4, 2022

Yea I like that. How does something like this look? @yupeng9

prefixesToRename: [
    {
         "prefixName": "after."
         "prefixReplace": ""
    },
    ...
]

In terms of avoiding collisions with same column names, perhaps we can add a validation check when the table config is added? I believe there's similar validation checks in place currently for comparing column names in the configs against the schema cc @icefury71

Also since the general idea is to allow for renaming columns from the source data upon ingestion, we could also end up expanding on this via some different approaches. One approach could be to specify the exact full column names we want to rename, and specify what we want to rename it to, like how we have that capability in transform configs. Another approach would be to apply the config in batch, so like batch removing/renaming common prefixes, or batch converting snake_case to camelCase, etc. so that it's not tedious to specify everything like in the first approach. For this PR, I was thinking of just starting with prefixRename but definitely appreciate any thoughts.

I think it could be even simpler:

prefixesToRename:   { "after.": "" },

Yes, validation works, since this is at the schema level.

@jeffreyliu34 jeffreyliu34 changed the title Add prefixesToDropFromFields config for renaming fields upon ingestion Add prefixesToRename config for renaming fields upon ingestion Mar 11, 2022
@jeffreyliu34
Copy link
Contributor Author

Thanks @yupeng9 , I've updated based on your suggestion and added validation

Copy link
Contributor

@yupeng9 yupeng9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding this option

@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Mar 13, 2022
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the release note session, could you please add the actual config key so that user knows how to config it by reading the release note?

if (_prefixesToRename.isEmpty()) {
return;
}
List<String> fields = new ArrayList<>(record.getFieldToValueMap().keySet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid creating this extra list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah actually, we need this list to avoid ConcurrentModificationException

…g/TableConfigSerDeTest.java

Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
@jeffreyliu34
Copy link
Contributor Author

In the release note session, could you please add the actual config key so that user knows how to config it by reading the release note?

Yep updated. Thanks for the review @Jackie-Jiang !

@icefury71 icefury71 merged commit 360a205 into apache:master Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants