Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate the numbers of input and output files in HadoopSegmentCreationJob #8098

Merged
merged 1 commit into from
Feb 1, 2022

Conversation

jackjlli
Copy link
Member

@jackjlli jackjlli commented Jan 31, 2022

Description

If the exclude.sequence.id property is set to true, segment name will not have the sequence id as the suffix. If there are multiple input files for the same day within the batch job, all these new segments will share the same segment name, which leads to the scenario that only 1 pinot segment instead of N stored in the file system.

This PR adds the validation between the number of input files and the number of output files, so that once these two number don't match, an exception will be thrown and then fail the job.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Documentation

@codecov-commenter
Copy link

codecov-commenter commented Jan 31, 2022

Codecov Report

Merging #8098 (af6b6ce) into master (2a36d4c) will decrease coverage by 6.46%.
The diff coverage is n/a.

❗ Current head af6b6ce differs from pull request most recent head 82cd206. Consider uploading reports for the commit 82cd206 to get more accurate results
Impacted file tree graph

@@             Coverage Diff              @@
##             master    #8098      +/-   ##
============================================
- Coverage     71.20%   64.73%   -6.47%     
- Complexity     4303     4304       +1     
============================================
  Files          1617     1572      -45     
  Lines         83800    81919    -1881     
  Branches      12517    12313     -204     
============================================
- Hits          59669    53030    -6639     
- Misses        20079    25125    +5046     
+ Partials       4052     3764     -288     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 67.97% <ø> (+0.01%) ⬆️
unittests2 14.16% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...a/org/apache/pinot/common/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
...ache/pinot/server/access/AccessControlFactory.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/common/messages/SegmentReloadMessage.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/core/data/manager/realtime/TimerService.java 0.00% <0.00%> (-100.00%) ⬇️
...che/pinot/controller/util/TableMetadataReader.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
...not/common/exception/HttpErrorStatusException.java 0.00% <0.00%> (-100.00%) ⬇️
... and 367 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a36d4c...82cd206. Read the comment docs.

@Jackie-Jiang
Copy link
Contributor

What if user wants to replace an existing segment with a new generated one?
We should allow overriding existing segment, but not the newly pushed ones.

@jackjlli jackjlli force-pushed the disable-overwrite-if-sequence-id-is-excluded branch from 82cd206 to 5f1f61a Compare January 31, 2022 23:47
@jackjlli jackjlli changed the title Set overwrite to false if exclude.sequence.id is enabled Validate the numbers of input and output files in HadoopSegmentCreationJob Jan 31, 2022
@jackjlli
Copy link
Member Author

jackjlli commented Feb 1, 2022

What if user wants to replace an existing segment with a new generated one? We should allow overriding existing segment, but not the newly pushed ones.

If user wants to replace an existing segment, he/she can still use the current logic to do that. I've update the logic of the PR to validate the number of input and output files. Since there is 1:1 mapping between the input and output files, if these two number doesn't match, we should fail the job.
The previous logic that sets overwrite flag doesn't work as the destination is just a temp dir for each of the mapper. The actual merge step from mapper temp dir to final output dir is done inside the commitTask method of FileOutputCommitter class, which is out of the scope of our MR job.

Sample log from the mapper:

2022-01-31 21:49:15,627 INFO [main] org.apache.pinot.hadoop.job.mappers.HadoopSegmentCreationMapper: Copying segment tar file from: pinot_hadoop_tmp/segmentTar/table1_2022-01-25_2022-01-25.tar.gz to: hdfs://path1/pinot_segments/3f5420af-6422-4035-9d53-2dd1895c2747/output/_temporary/1/_temporary/attempt_1632281309592_18465840_m_000001_0/segmentTar/table1_2022-01-25_2022-01-25.tar.gz
...
2022-01-31 21:49:17,484 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1632281309592_18465840_m_000001_0' to hdfs://path1/pinot_segments/3f5420af-6422-4035-9d53-2dd1895c2747/output

Copy link
Contributor

@sajjad-moradi sajjad-moradi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The changes look simple enough, but did you happen to test it?

@jackjlli
Copy link
Member Author

jackjlli commented Feb 1, 2022

@sajjad-moradi Yes, that's tested in the Hadoop env.

@jackjlli jackjlli merged commit 88dda0e into master Feb 1, 2022
@jackjlli jackjlli deleted the disable-overwrite-if-sequence-id-is-excluded branch February 1, 2022 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants