Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite FailureDetector interface and implementations to also work with the multi-stage engine #15005

Merged
merged 4 commits into from
Feb 14, 2025

Conversation

yashmayya
Copy link
Collaborator

  • The connection based failure detector was added in Add connection based FailureDetector #8491 and doesn't currently support the multi-stage engine.
  • This patch rewrites the FailureDetector interface and implementations so that it works with both the engines.
  • If a multi-stage query fails to be dispatched to a server due to connectivity issues, the broker should remove the server from its routing table to prevent further query failures.
  • A server will be removed from the broker routing table in case of failures in either QueryRouter (v1) or QueryDispatcher (v2). In the FailureDetector implementation that retries the connection with an exponential backoff, we'll only re-add a server to the routing table if both the connections succeed - the Netty channel used for v1 queries as well as the gRPC channel used for v2 queries.
  • The FailureDetector and listener based interface has been rewritten significantly in order to avoid multiple attempts to modify the broker routing table whenever a server is detected as healthy / unhealthy.
  • This patch also updated the single-stage engine's GrpcBrokerRequestHandler to make use of the FailureDetector.

@yashmayya yashmayya added enhancement multi-stage Related to the multi-stage query engine labels Feb 6, 2025
@codecov-commenter
Copy link

codecov-commenter commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 51 lines in your changes missing coverage. Please review.

Project coverage is 63.31%. Comparing base (59551e4) to head (439cd37).
Report is 1709 commits behind head on master.

Files with missing lines Patch % Lines
...roker/requesthandler/GrpcBrokerRequestHandler.java 0.00% 28 Missing ⚠️
.../pinot/query/service/dispatch/QueryDispatcher.java 57.89% 8 Missing ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java 40.00% 6 Missing ⚠️
...thandler/SingleConnectionBrokerRequestHandler.java 28.57% 5 Missing ⚠️
...e/pinot/broker/broker/helix/BaseBrokerStarter.java 71.42% 2 Missing ⚠️
...che/pinot/broker/routing/BrokerRoutingManager.java 0.00% 1 Missing ⚠️
...pache/pinot/common/utils/grpc/GrpcQueryClient.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15005      +/-   ##
============================================
+ Coverage     61.75%   63.31%   +1.56%     
- Complexity      207     1483    +1276     
============================================
  Files          2436     2731     +295     
  Lines        133233   153622   +20389     
  Branches      20636    23730    +3094     
============================================
+ Hits          82274    97267   +14993     
- Misses        44911    49002    +4091     
- Partials       6048     7353    +1305     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.29% <50.00%> (+1.58%) ⬆️
java-21 63.21% <50.00%> (+1.58%) ⬆️
skip-bytebuffers-false 63.31% <50.00%> (+1.56%) ⬆️
skip-bytebuffers-true 63.18% <50.00%> (+35.45%) ⬆️
temurin 63.31% <50.00%> (+1.56%) ⬆️
unittests 63.31% <50.00%> (+1.56%) ⬆️
unittests1 56.07% <81.63%> (+9.18%) ⬆️
unittests2 33.73% <17.64%> (+6.00%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +27 to +29
* <p>
* This class doesn't currently implement any additional logic over BaseExponentialBackoffRetryFailureDetector and is
* retained for backward compatibility.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought we could probably remove this class altogether since the class name is not exposed via the user configuration directly.

@yashmayya yashmayya force-pushed the failure-detector-msqe branch 2 times, most recently from 67f10b0 to e8c1cfa Compare February 7, 2025 02:50
@yashmayya yashmayya marked this pull request as ready for review February 7, 2025 03:55
@gortiz gortiz self-requested a review February 13, 2025 07:44
@yashmayya yashmayya force-pushed the failure-detector-msqe branch from e8c1cfa to 46616f0 Compare February 13, 2025 08:54
@yashmayya yashmayya force-pushed the failure-detector-msqe branch from 46616f0 to a8aabd3 Compare February 13, 2025 08:57
/**
* Check if a server that was previously detected as unhealthy is now healthy.
*/
public boolean retryUnhealthyServer(String instanceId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why public?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad, left over from a previous iteration with a different rewrite implementation. I've update it to private.

Comment on lines +113 to +115
for (Function<String, Boolean> unhealthyServerRetrier : _unhealthyServerRetriers) {
if (!unhealthyServerRetrier.apply(instanceId)) {
recovered = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just going to comment about this section when it was written the other way around, but it looks like you just changed it!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah with the previous implementation, it was possible for multiple listeners to mark the same server as healthy on a call to retryUnhealthyServer. I've updated the implementation so that the failure detector itself marks the server as healthy if all the listeners / retriers report that the server is healthy again (to be safe).

@yashmayya yashmayya merged commit f65f845 into apache:master Feb 14, 2025
21 checks passed
yashmayya added a commit to yashmayya/pinot that referenced this pull request Feb 18, 2025
zeronerdzerogeekzerocool pushed a commit to zeronerdzerogeekzerocool/pinot that referenced this pull request Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement multi-stage Related to the multi-stage query engine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants