Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download offline segments from peers #9710

Merged
merged 10 commits into from
Nov 22, 2022

Conversation

wirybeaver
Copy link
Contributor

@wirybeaver wirybeaver commented Nov 2, 2022

Design doc

Justification

Download offline segments from peers if the remote copies disappeared. #9709

Backward Compatibility

Add a new instance level config pinot.server.peer.download.scheme

IF table type is reatlime, final peer download scheme = table config's peerSegmentDownloadScheme, which is an existing feature.

ELIF table type is offline and streaming download is disabled, final peer download schem = table config's peerSegmentDownloadScheme != null ? table config's peerSegmentDownloadScheme : instancel level pinot.server.peer.download.scheme

ELSE peer downloading is disabled.

Implementation

Get peers' segment URI from ZK. Randomly pick up one and do downloading.

Manual Test

Before this change, delete an offline segment’s HDFS copy and replace an offline server node holding a replica of that segment. We can see that the associated segment replica goes to the error state from Helix external view. After rolling out this feature, setting up either the instance level or table level peerDownloadingScheme brings back the missing replica.

@wirybeaver wirybeaver marked this pull request as draft November 2, 2022 07:51
@wirybeaver wirybeaver changed the title download offline segments from peers when segment streaming download … download offline segments from peers Nov 2, 2022
@jadami10
Copy link
Contributor

jadami10 commented Nov 2, 2022

this is awesome.

How are you thinking about observability here? If you didn't expect the deep store data to be missing, but servers are just silently downloading segments from one another, how would you know? Attempts exceeded when downloading segment seems to be the only indicator. Maybe we need 2 new metrics for "download failed from deep store" vs "download failed from peer" and keep the current "download failed" as is to be used if both fail for backwards compatibility?

@wirybeaver
Copy link
Contributor Author

this is awesome.

How are you thinking about observability here? If you didn't expect the deep store data to be missing, but servers are just silently downloading segments from one another, how would you know? Attempts exceeded when downloading segment seems to be the only indicator. Maybe we need 2 new metrics for "download failed from deep store" vs "download failed from peer" and keep the current "download failed" as is to be used if both fail for backwards compatibility?

Thanks for pointing it out. Will do.

@wirybeaver
Copy link
Contributor Author

this is awesome.

How are you thinking about observability here? If you didn't expect the deep store data to be missing, but servers are just silently downloading segments from one another, how would you know? Attempts exceeded when downloading segment seems to be the only indicator. Maybe we need 2 new metrics for "download failed from deep store" vs "download failed from peer" and keep the current "download failed" as is to be used if both fail for backwards compatibility?

Done

@wirybeaver wirybeaver force-pushed the xuanyili/offlinePeersSecurity branch from 8401a48 to 7685439 Compare November 14, 2022 07:24
@wirybeaver
Copy link
Contributor Author

wirybeaver commented Nov 14, 2022

rebase and solve conflict. mvn test -Dtest=BaseTableDataManagerTest -pl pinot-core

A newbie question, the guide recommend to use git rebase. AFAIK, all commits will be squash to a single commit and thus can guarantee that make the commit history linear when the PR got merged. If so, shall we use git merge when pull the remote master change? Otherwise, I need to force overriding every time push after git rebase. @jadami10

@wirybeaver wirybeaver marked this pull request as ready for review November 14, 2022 07:41
@wirybeaver wirybeaver changed the title download offline segments from peers [WIP] download offline segments from peers Nov 14, 2022
@jadami10
Copy link
Contributor

A newbie question, the guide recommend to use git rebase. AFAIK, all commits will be squash to a single commit and thus can guarantee that make the commit history linear when the PR got merged. If so, shall we use git merge when pull the remote master change? Otherwise, I need to force overriding every time push after git rebase. @jadami10

With github, I generally use git merge when updating PRs because it keeps the change history. When you rebase/force push, the reviewer loses the "see changes since your last review" button. It's not a big deal, just a quality of life change.

If you've ever used something like phabricator, it does a better job of decoupling changes from their actual commits, so even if someone rebases/force pushes, it can show you just the changes that user made.

Copy link
Contributor

@jadami10 jadami10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good. left a few comments. I think you'll need one of the maintainers to approve the tests to run

private void fetchAndDecryptSegmentToLocalInternal(@NonNull List<URI> uris, File dest, String crypterName)
throws Exception {
Preconditions.checkArgument(!uris.isEmpty(), "empty uris passed into the fetchAndDecryptSegmentToLocalInternal");
URI uri = uris.get(RANDOM.nextInt(uris.size()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are all the peer URIs, and you're randomly selecting one? If so, this feels like it should just live in the calling function to this. No need to duplicate the fetch functions here just to get a random selector in

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am considering whether we should introduces the strategy pattern here: user can either choose to pick up a random or go over all peers. The trade of between responsiveness and data reliability can be determined by user. For the sake of time, I hide the random implementation inside this function. In the future, I think the function signature would be

private void fetchAndDecryptSegmentToLocalInternal(@NonNull List<URI> uris, File dest, String crypterName, PeerDownloadderStrategy strategy);

interface PeerDownloaderStrategy {
Response download(List<URI> uris, File dest, Context);
}

class Context {
 String crypterName;
}

Copy link
Contributor Author

@wirybeaver wirybeaver Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, I won't integrate the strategy pattern and directly use the random pick strategy by default. What do you think? @jadami10

@wirybeaver
Copy link
Contributor Author

wirybeaver commented Nov 16, 2022

A newbie question, the guide recommend to use git rebase. AFAIK, all commits will be squash to a single commit and thus can guarantee that make the commit history linear when the PR got merged. If so, shall we use git merge when pull the remote master change? Otherwise, I need to force overriding every time push after git rebase. @jadami10

With github, I generally use git merge when updating PRs because it keeps the change history. When you rebase/force push, the reviewer loses the "see changes since your last review" button. It's not a big deal, just a quality of life change.

If you've ever used something like phabricator, it does a better job of decoupling changes from their actual commits, so even if someone rebases/force pushes, it can show you just the changes that user made.

Thanks. Totally agree with you I use github before joining uber. I am meant to use get merge but the official Pinot OSS developer guide uses git rebase at the bottom when merge the master change into my local feature branch, which made me confused. We need to update the instruction. https://docs.pinot.apache.org/developers/developers-and-contributors/contribution-guidelines#creating-a-pull-request-pr

@codecov-commenter
Copy link

codecov-commenter commented Nov 17, 2022

Codecov Report

Merging #9710 (dc84e8f) into master (3724ba2) will increase coverage by 34.00%.
The diff coverage is 62.74%.

@@              Coverage Diff              @@
##             master    #9710       +/-   ##
=============================================
+ Coverage     34.69%   68.70%   +34.00%     
- Complexity      190     4995     +4805     
=============================================
  Files          1965     1965               
  Lines        105115   105164       +49     
  Branches      15909    15914        +5     
=============================================
+ Hits          36474    72251    +35777     
+ Misses        65542    27793    -37749     
- Partials       3099     5120     +2021     
Flag Coverage Δ
integration1 25.19% <11.76%> (+0.07%) ⬆️
unittests1 67.95% <62.00%> (?)
unittests2 15.82% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ot/common/utils/fetcher/SegmentFetcherFactory.java 86.30% <50.00%> (+40.39%) ⬆️
...ent/local/data/manager/TableDataManagerConfig.java 26.31% <55.55%> (+26.31%) ⬆️
.../pinot/core/data/manager/BaseTableDataManager.java 74.58% <65.38%> (+12.90%) ⬆️
...a/org/apache/pinot/common/metrics/ServerMeter.java 100.00% <100.00%> (ø)
...pache/pinot/core/util/PeerServerSegmentFinder.java 75.00% <100.00%> (+75.00%) ⬆️
.../starter/helix/HelixInstanceDataManagerConfig.java 81.63% <100.00%> (+0.38%) ⬆️
...a/manager/realtime/RealtimeSegmentDataManager.java 75.00% <0.00%> (-25.00%) ⬇️
...lix/core/realtime/PinotRealtimeSegmentManager.java 75.39% <0.00%> (-2.62%) ⬇️
.../helix/core/realtime/SegmentCompletionManager.java 71.95% <0.00%> (-1.02%) ⬇️
.../core/realtime/PinotLLCRealtimeSegmentManager.java 71.01% <0.00%> (-0.61%) ⬇️
... and 1040 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@wirybeaver wirybeaver changed the title [WIP] download offline segments from peers Download offline segments from peers Nov 21, 2022
@wirybeaver
Copy link
Contributor Author

wirybeaver commented Nov 21, 2022

@chenboat I haven't integrated the instance level config into realtime table in this approach. Can I skip it for this PR even though it brings obscurity into the rule of determining the value of final peer download scheme. Check the PR description for details on the rule.

@wirybeaver wirybeaver requested review from jadami10 and chenboat and removed request for jadami10 November 21, 2022 22:14
@chenboat chenboat merged commit 6cfa6dc into apache:master Nov 22, 2022
@wirybeaver wirybeaver deleted the xuanyili/offlinePeersSecurity branch January 8, 2024 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants