Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown #2736
...rc/main/java/org/apache/helix/rest/clusterMaintenanceService/StoppableInstancesSelector.java
}
// CUSTOM_INSTANCE_CHECK and CUSTOM_PARTITION_CHECK can only be added to the failedReasonsNode
// if continueOnFailure is true and there is no failed Helix_OWN_CHECKS.
if (_continueOnFailure && !failedHelixOwnChecks) {
I feel like we should skip the checks when continueOnFailure is false, as a perf improvement, instead of filtering them out when we consolidate the results?
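To make the suggestion above concrete, here is a minimal sketch of short-circuiting, assuming the checks can be modeled as named boolean suppliers evaluated in order. The class `ShortCircuitChecks` and its `run` method are hypothetical and are not part of Helix; only the check labels come from the code comment under review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; this is not the Helix implementation.
final class ShortCircuitChecks {
  /**
   * Runs the named checks in the map's iteration order and returns the names
   * of the checks that failed. When continueOnFailure is false, evaluation
   * stops at the first failure, so later checks are never executed.
   */
  static List<String> run(Map<String, BooleanSupplier> checks, boolean continueOnFailure) {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, BooleanSupplier> check : checks.entrySet()) {
      if (!check.getValue().getAsBoolean()) {
        failed.add(check.getKey());
        if (!continueOnFailure) {
          break; // skip the remaining checks instead of filtering them later
        }
      }
    }
    return failed;
  }
}
```

With an insertion-ordered map such as `LinkedHashMap`, a failing first check means the custom checks are never invoked when `continueOnFailure` is false.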
Say our maxAllowedOffline = 3 and there are 5 instances. When we call batchGetStoppable for these 5 instances, we have to process all 5 in parallel. But if we process them in parallel, there is no easy way for an individual instance's check to know whether stopping that instance would violate the maxAllowedOffline = 3 constraint, unless we put a lock on everything.
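One lock-free way to picture this, as a sketch under assumptions rather than what the PR actually does: run the per-instance checks in parallel, then apply the limit in a single-threaded consolidation step. `ParallelThenCap`, `select`, and the `isStoppable` predicate are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not the Helix implementation.
final class ParallelThenCap {
  /**
   * Phase 1 evaluates the independent per-instance check in parallel.
   * Phase 2 enforces the limit on one thread, so no individual check needs to
   * know about the others and no shared lock is required.
   */
  static List<String> select(List<String> instances, Predicate<String> isStoppable,
      int maxAllowedOffline) {
    // The source is an ordered List, so the collected result keeps input order.
    List<String> passed =
        instances.parallelStream().filter(isStoppable).collect(Collectors.toList());
    int limit = Math.min(Math.max(maxAllowedOffline, 0), passed.size());
    return new ArrayList<>(passed.subList(0, limit));
  }
}
```

The trade-off is that every instance is still checked, even those that end up dropped by the cap.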
Another way to handle this is to process instances in batches of size maxAllowedOffline. Say our maxAllowedOffline = 3 and there are 10 instances in the same zone. In the first iteration, we process instances 1-3. If the cumulative stoppable instance count doesn't exceed maxAllowedOffline, we move on to the next iteration with instances 4-6, and so on. But I'm worried about the performance of this design, because our API could then only process maxAllowedOffline instances in parallel at a time. If the instance list is very long, the check could take many iterations to finish.
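The batching idea described above could look roughly like the following sketch. It assumes the per-instance check is a plain predicate and stops once the cumulative count reaches the limit; `BatchedStoppableCheck`, `select`, and `isStoppable` are hypothetical names, not Helix APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not the Helix implementation.
final class BatchedStoppableCheck {
  /**
   * Checks instances in batches of at most maxAllowedOffline. Each batch is
   * evaluated in parallel; batches run one after another, and iteration stops
   * as soon as the cumulative stoppable count reaches the limit.
   */
  static List<String> select(List<String> instances, Predicate<String> isStoppable,
      int maxAllowedOffline) {
    List<String> stoppable = new ArrayList<>();
    if (maxAllowedOffline <= 0) {
      return stoppable;
    }
    for (int start = 0;
        start < instances.size() && stoppable.size() < maxAllowedOffline;
        start += maxAllowedOffline) {
      int end = Math.min(start + maxAllowedOffline, instances.size());
      List<String> passed = instances.subList(start, end).parallelStream()
          .filter(isStoppable)
          .collect(Collectors.toList());
      // Never admit more than the remaining budget from this batch.
      int remaining = maxAllowedOffline - stoppable.size();
      stoppable.addAll(passed.subList(0, Math.min(remaining, passed.size())));
    }
    return stoppable;
  }
}
```

This shows the performance concern directly: parallelism is bounded by the batch size, so a long instance list with few stoppable instances needs many sequential iterations.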
My 2 cents: correctness comes first, then performance optimization. I am not completely plugged into your work, but correctness is very important to focus on.
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstancesAccessor.java
// If maxOfflineInstancesAllowed is not set, it means there is no limit on the number of offline instances.
// Therefore, builder sets the maxOfflineInstancesAllowed to the default value, Integer.MAX_VALUE.
if (clusterConfig.getMaxOfflineInstancesAllowed() != -1) {
  builder.setMaxAdditionalOfflineInstances(clusterConfig.getMaxOfflineInstancesAllowed());
Can we keep -1 and add special handling for when maxOfflineInstances < 0?
I think an even more reasonable solution is to not allow users to do a stoppableCheck if they didn't provide maxOfflineInstancesAllowed in their cluster config. What do you think?
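The two options discussed in this thread can be compared side by side in a small sketch. It assumes, as the code under review suggests, that -1 means the limit is unset; `OfflineLimit`, `effectiveLimit`, and `requireConfigured` are hypothetical helpers, not Helix APIs.

```java
// Hypothetical illustration only; this is not the Helix implementation.
final class OfflineLimit {
  /**
   * Option 1: keep the negative sentinel in the config and treat any
   * negative value as "no limit" at the point of use.
   */
  static int effectiveLimit(int configuredMaxOffline) {
    return configuredMaxOffline < 0 ? Integer.MAX_VALUE : configuredMaxOffline;
  }

  /**
   * Option 2: refuse to run the stoppable check at all when the cluster
   * config does not provide a limit.
   */
  static int requireConfigured(int configuredMaxOffline) {
    if (configuredMaxOffline < 0) {
      throw new IllegalArgumentException(
          "maxOfflineInstancesAllowed must be set in the cluster config");
    }
    return configuredMaxOffline;
  }
}
```

Option 1 keeps existing callers working, while option 2 surfaces the missing configuration to the user as an error.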
...rc/main/java/org/apache/helix/rest/clusterMaintenanceService/StoppableInstancesSelector.java
stoppableInstancesSelector.calculateOrderOfZone(instances, random);
Set<String> finalCurrentOfflineInstances = currentOfflineInstances;
Why do we want to create a new list?
Thanks for addressing the comments. I think the max offline instances limit in the cluster config also includes disabled instances.
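If disabled instances do count toward the limit, the remaining budget could be computed along these lines. This is a sketch under that assumption; `OfflineBudget` and `remaining` are hypothetical names, and the inputs are plain sets of instance names rather than Helix data accessors.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this is not the Helix implementation.
final class OfflineBudget {
  /**
   * Returns how many more instances may be stopped before the limit is hit,
   * counting an instance as unavailable if it is offline (not live) or
   * disabled. An instance that is both is counted once.
   */
  static int remaining(Set<String> allInstances, Set<String> liveInstances,
      Set<String> disabledInstances, int maxOfflineInstancesAllowed) {
    Set<String> unavailable = new HashSet<>(allInstances);
    unavailable.removeAll(liveInstances); // offline = in the cluster but not live
    for (String disabled : disabledInstances) {
      if (allInstances.contains(disabled)) {
        unavailable.add(disabled); // live but disabled still uses up budget
      }
    }
    return Math.max(0, maxOfflineInstancesAllowed - unavailable.size());
  }
}
```

Using a set union avoids double counting an instance that is both offline and disabled.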
This PR is approved by @xyuanlu, final commit message:
@xyuanlu the failed test is
Merged commit 56dd5d2 into apache:ApplicationClusterManager
Description
When notExceedMaxOfflineInstances is set to true, the number of stoppable instances won't surpass the limit in the cluster configuration, which ensures the cluster won't enter maintenance mode and trigger an emergency rebalance.
Tests
mvn test -Dtest=TestInstancesAccessor,TestMaintenanceManagementService,TestInstanceValidationUtilInRest,TestPerInstanceAccessor -pl helix-rest && mvn test -Dtest=TestInstanceValidationUtil -pl helix-core