
Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown #2736

Merged

Conversation

MarkGaox
Contributor

@MarkGaox MarkGaox commented Jan 11, 2024


Description

  • This PR adds a new option to the StoppableCheck API that lets users choose whether the stoppable instances may exceed the cluster's limit on offline instances. When notExceedMaxOfflineInstances is set to true, the number of stoppable instances will not surpass the limit in the cluster configuration, which ensures the cluster won't enter maintenance mode and perform an emergency rebalance.
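To make the new option concrete, here is a minimal sketch of how a client request carrying the flag might be assembled. The endpoint path, the JSON field names, and the class/method names are assumptions inferred from this PR's description, not the verified Helix REST contract.

```java
import java.util.List;

// Hypothetical sketch of building a batch stoppable-check request that
// carries the new notExceedMaxOfflineInstances flag. Endpoint path and
// field names are illustrative assumptions, not the actual Helix API.
public class StoppableCheckRequestSketch {
    // Builds the request URL for the batch stoppable check.
    public static String buildUrl(String host, String cluster) {
        return "http://" + host + "/admin/v2/clusters/" + cluster
            + "/instances?command=stoppable";
    }

    // Builds a JSON body listing the candidate instances and the new flag
    // that caps stoppable instances at the cluster's max-offline limit.
    public static String buildBody(List<String> instances,
                                   boolean notExceedMaxOfflineInstances) {
        StringBuilder sb = new StringBuilder("{\"instances\":[");
        for (int i = 0; i < instances.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(instances.get(i)).append('"');
        }
        sb.append("],\"notExceedMaxOfflineInstances\":")
          .append(notExceedMaxOfflineInstances).append('}');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost:8100", "TestCluster"));
        System.out.println(buildBody(List.of("instance0", "instance1"), true));
    }
}
```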

Tests

  • The following tests are written for this issue:
    mvn test -Dtest=TestInstancesAccessor,TestMaintenanceManagementService,TestInstanceValidationUtilInRest,TestPerInstanceAccessor -pl helix-rest && mvn test -Dtest=TestInstanceValidationUtil -pl helix-core


  • The following is the result of the "mvn test" command on the appropriate module:


[INFO] Tests run: 52, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 91.416 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 52, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-rest ---
[INFO] Loading execution data file /Users/xiaxgao/IdeaProjects/helix_ps/helix-rest/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Restful Interface' with 95 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:35 min
[INFO] Finished at: 2024-01-30T14:40:47-08:00
[INFO] ------------------------------------------------------------------------
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.94 s - in org.apache.helix.util.TestInstanceValidationUtil
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/xiaxgao/IdeaProjects/helix_ps/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 947 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.892 s
[INFO] Finished at: 2024-01-11T12:58:10-08:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:


Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:


Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@MarkGaox MarkGaox changed the title Implement 'StoppableCheck' Feature to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown Jan 17, 2024
}
// CUSTOM_INSTANCE_CHECK and CUSTOM_PARTITION_CHECK can only be added to the failedReasonsNode
// if continueOnFailure is true and there is no failed Helix_OWN_CHECKS.
if (_continueOnFailure && !failedHelixOwnChecks) {
Contributor

I feel like we should skip the check when continueOnFailure is false, as a performance improvement, instead of filtering it out when consolidating the results.

Contributor Author

Say our maxAllowedOffline = 3 and there are 5 instances. The problem is that batchGetStoppable has to process all 5 instances in parallel, and when they are processed in parallel there is no easy way for an individual instance to know that stopping it would violate the maxAllowedOffline = 3 constraint, unless we put a lock on everything.
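The coordination problem described above can be reduced to a single shared counter: each parallel check must atomically claim one of the remaining offline "slots" before reporting its instance as stoppable. This is a hypothetical sketch, not Helix code; the class and method names are invented.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the shared-state problem the comment describes:
// parallel checks must atomically claim one of the maxAllowedOffline slots.
public class ParallelStoppableSketch {
    private final AtomicInteger remainingSlots;

    public ParallelStoppableSketch(int maxAllowedOffline, int currentlyOffline) {
        this.remainingSlots = new AtomicInteger(maxAllowedOffline - currentlyOffline);
    }

    // Returns true if this instance may stop without exceeding the limit.
    // The compare-and-set loop is the "lock on everything" the comment
    // alludes to, shrunk down to one shared counter.
    public boolean tryMarkStoppable() {
        while (true) {
            int slots = remainingSlots.get();
            if (slots <= 0) return false;
            if (remainingSlots.compareAndSet(slots, slots - 1)) return true;
        }
    }

    public static void main(String[] args) {
        ParallelStoppableSketch check = new ParallelStoppableSketch(3, 0);
        int stoppable = 0;
        for (int i = 0; i < 5; i++) {
            if (check.tryMarkStoppable()) stoppable++;
        }
        System.out.println(stoppable); // at most 3 of the 5 instances pass
    }
}
```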

Contributor Author

Another way to handle this is to process instances in batches of maxAllowedOffline. Say our maxAllowedOffline = 3 and there are 10 instances in the same zone. In the first iteration, we process instances 1-3. If the cumulative stoppable-instance count doesn't exceed maxAllowedOffline, we do the next iteration with instances 4-6, and so on. But I'm worried about the performance of this design, because the API can then only batch-process maxAllowedOffline instances in parallel. If the instance list is very long, the check could take many iterations to finish.
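The batched alternative described above can be sketched as follows. This is an illustrative sequential reduction of the idea (each window could be checked in parallel); the names and the placeholder predicate are assumptions, not Helix code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the batched design: walk the instance list in
// windows of maxAllowedOffline and stop once the cumulative stoppable
// count reaches the limit.
public class BatchedStoppableSketch {
    public static List<String> selectStoppable(List<String> instances,
                                               int maxAllowedOffline,
                                               Predicate<String> passesOtherChecks) {
        List<String> stoppable = new ArrayList<>();
        for (int start = 0; start < instances.size(); start += maxAllowedOffline) {
            int end = Math.min(start + maxAllowedOffline, instances.size());
            // In the real design, this window would be checked in parallel.
            for (String instance : instances.subList(start, end)) {
                if (stoppable.size() >= maxAllowedOffline) return stoppable;
                if (passesOtherChecks.test(instance)) stoppable.add(instance);
            }
            if (stoppable.size() >= maxAllowedOffline) return stoppable;
        }
        return stoppable;
    }

    public static void main(String[] args) {
        List<String> instances = new ArrayList<>();
        for (int i = 0; i < 10; i++) instances.add("instance" + i);
        // With maxAllowedOffline = 3 and every instance healthy, only the
        // first window is ever needed.
        System.out.println(selectStoppable(instances, 3, s -> true));
    }
}
```

The performance concern in the comment is visible here: each outer iteration is a synchronization point, so a long instance list with few failures degenerates into instanceCount / maxAllowedOffline sequential rounds.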

Contributor

My 2 cents: correctness comes first, then performance optimization. I am not completely plugged into your work, but correctness is very important to focus on.

// If maxOfflineInstancesAllowed is not set, it means there is no limit on the number of offline instances.
// Therefore, builder sets the maxOfflineInstancesAllowed to the default value, Integer.MAX_VALUE.
if (clusterConfig.getMaxOfflineInstancesAllowed() != -1) {
builder.setMaxAdditionalOfflineInstances(clusterConfig.getMaxOfflineInstancesAllowed());
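The default behavior the quoted comment describes can be condensed into one line. This is a minimal illustrative sketch, assuming -1 is the "unset" sentinel as in the snippet above; the method name is invented, not the actual Helix API.

```java
// Sketch of the default described above: a maxOfflineInstancesAllowed of -1
// in ClusterConfig means "unset", and the builder falls back to
// Integer.MAX_VALUE, i.e. no limit on offline instances.
public class MaxOfflineDefaultSketch {
    public static int resolveMaxOffline(int configuredValue) {
        // -1 (unset) maps to "no limit"; any other configured value is honored.
        return configuredValue == -1 ? Integer.MAX_VALUE : configuredValue;
    }

    public static void main(String[] args) {
        System.out.println(resolveMaxOffline(-1)); // 2147483647
        System.out.println(resolveMaxOffline(5));  // 5
    }
}
```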
Contributor

Can we keep -1 and add special handling for when MaxOfflineInstances < 0?

Contributor Author

I think an even more reasonable solution is to not allow users to run the stoppableCheck at all if they didn't provide maxOfflineInstancesAllowed in their cluster config. What do you think?

stoppableInstancesSelector.calculateOrderOfZone(instances, random);
Set<String> finalCurrentOfflineInstances = currentOfflineInstances;
Contributor

Why do we want to create a new reference here?
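One plausible answer (an assumption from Java language rules, not confirmed in this thread): locals captured by a Java lambda or method reference must be final or effectively final, so a reassigned variable like currentOfflineInstances has to be copied into a new final reference before it can be used inside a stream pipeline. A minimal sketch:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of why a copy like finalCurrentOfflineInstances is often needed:
// Java lambdas may only capture final or effectively final local variables.
public class EffectivelyFinalSketch {
    public static long countOffline() {
        Set<String> currentOfflineInstances = new HashSet<>();
        boolean refreshed = true; // imagine this comes from a cache check
        if (refreshed) {
            // Reassignment makes the variable NOT effectively final.
            currentOfflineInstances = new HashSet<>(List.of("instance1"));
        }
        // Capturing 'currentOfflineInstances' directly in the lambda below
        // would be a compile error; the final copy restores capturability.
        final Set<String> finalCurrentOfflineInstances = currentOfflineInstances;
        return List.of("instance1", "instance2").stream()
            .filter(finalCurrentOfflineInstances::contains)
            .count();
    }

    public static void main(String[] args) {
        System.out.println(countOffline()); // 1
    }
}
```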

@xyuanlu
Contributor

xyuanlu commented Jan 30, 2024

Thanks for addressing the comments. I think the max offline instance count in the cluster config also includes disabled instances.

@MarkGaox
Contributor Author

This PR is approved by @xyuanlu, final commit message:
"Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are offline"

@MarkGaox
Contributor Author

@xyuanlu the failed test is testEvacuationWithOfflineInstancesInCluster which is a known flaky test.

@xyuanlu xyuanlu merged commit 56dd5d2 into apache:ApplicationClusterManager Jan 31, 2024
1 of 2 checks passed