
Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown #2736

Merged

Conversation

MarkGaox
Contributor

@MarkGaox MarkGaox commented Jan 11, 2024


Description

  • This PR adds a new option to the StoppableCheck API that lets users choose whether the stoppable instances may exceed the cluster's limit on offline instances. When notExceedMaxOfflineInstances is set to true, the number of stoppable instances will not surpass the limit in the cluster configuration, which ensures the cluster won't enter maintenance mode and perform an emergency rebalance.
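To make the new option concrete, here is a minimal sketch of how a client request carrying the flag might be assembled. The endpoint path, the JSON field names, and the class/method names are assumptions inferred from this PR's description, not the verified Helix REST contract.

```java
import java.util.List;

// Hypothetical sketch of building a batch stoppable-check request that
// carries the new notExceedMaxOfflineInstances flag. Endpoint path and
// field names are illustrative assumptions, not the actual Helix API.
public class StoppableCheckRequestSketch {
    // Builds the request URL for the batch stoppable check.
    public static String buildUrl(String host, String cluster) {
        return "http://" + host + "/admin/v2/clusters/" + cluster
            + "/instances?command=stoppable";
    }

    // Builds a JSON body listing the candidate instances and the new flag
    // that caps stoppable instances at the cluster's max-offline limit.
    public static String buildBody(List<String> instances,
                                   boolean notExceedMaxOfflineInstances) {
        StringBuilder sb = new StringBuilder("{\"instances\":[");
        for (int i = 0; i < instances.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(instances.get(i)).append('"');
        }
        sb.append("],\"notExceedMaxOfflineInstances\":")
          .append(notExceedMaxOfflineInstances).append('}');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost:8100", "TestCluster"));
        System.out.println(buildBody(List.of("instance0", "instance1"), true));
    }
}
```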

Tests

  • The following tests are written for this issue:
    mvn test -Dtest=TestInstancesAccessor,TestMaintenanceManagementService,TestInstanceValidationUtilInRest,TestPerInstanceAccessor -pl helix-rest && mvn test -Dtest=TestInstanceValidationUtil -pl helix-core


  • The following is the result of the "mvn test" command on the appropriate module:


[INFO] Tests run: 52, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 91.416 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 52, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-rest ---
[INFO] Loading execution data file /Users/xiaxgao/IdeaProjects/helix_ps/helix-rest/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Restful Interface' with 95 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:35 min
[INFO] Finished at: 2024-01-30T14:40:47-08:00
[INFO] ------------------------------------------------------------------------
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.94 s - in org.apache.helix.util.TestInstanceValidationUtil
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/xiaxgao/IdeaProjects/helix_ps/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 947 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.892 s
[INFO] Finished at: 2024-01-11T12:58:10-08:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:


Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:


Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@MarkGaox MarkGaox changed the title Implement 'StoppableCheck' Feature to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are Shutdown Jan 17, 2024
}
// CUSTOM_INSTANCE_CHECK and CUSTOM_PARTITION_CHECK can only be added to the failedReasonsNode
// if continueOnFailure is true and there is no failed Helix_OWN_CHECKS.
if (_continueOnFailure && !failedHelixOwnChecks) {
Contributor

I feel like we should skip the check when continueOnFailure is false, as a performance improvement, instead of filtering it out when consolidating the results.

Contributor Author

Say our maxAllowedOffline = 3 and there are 5 instances. The problem is that batchGetStoppable has to process all 5 instances in parallel, and when they are processed in parallel there is no easy way for an individual instance to know that stopping it would violate the maxAllowedOffline = 3 constraint, unless we put a lock on everything.
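The coordination problem described above can be reduced to a single shared counter: each parallel check must atomically claim one of the remaining offline "slots" before reporting its instance as stoppable. This is a hypothetical sketch, not Helix code; the class and method names are invented.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the shared-state problem the comment describes:
// parallel checks must atomically claim one of the maxAllowedOffline slots.
public class ParallelStoppableSketch {
    private final AtomicInteger remainingSlots;

    public ParallelStoppableSketch(int maxAllowedOffline, int currentlyOffline) {
        this.remainingSlots = new AtomicInteger(maxAllowedOffline - currentlyOffline);
    }

    // Returns true if this instance may stop without exceeding the limit.
    // The compare-and-set loop is the "lock on everything" the comment
    // alludes to, shrunk down to one shared counter.
    public boolean tryMarkStoppable() {
        while (true) {
            int slots = remainingSlots.get();
            if (slots <= 0) return false;
            if (remainingSlots.compareAndSet(slots, slots - 1)) return true;
        }
    }

    public static void main(String[] args) {
        ParallelStoppableSketch check = new ParallelStoppableSketch(3, 0);
        int stoppable = 0;
        for (int i = 0; i < 5; i++) {
            if (check.tryMarkStoppable()) stoppable++;
        }
        System.out.println(stoppable); // at most 3 of the 5 instances pass
    }
}
```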

Contributor Author

Another way to handle this is to process instances in batches of maxAllowedOffline. Say our maxAllowedOffline = 3 and there are 10 instances in the same zone. In the first iteration, we process instances 1-3. If the cumulative stoppable-instance count doesn't exceed maxAllowedOffline, we do the next iteration with instances 4-6, and so on. But I'm worried about the performance of this design, because the API can then only batch-process maxAllowedOffline instances in parallel. If the instance list is very long, the check could take many iterations to finish.
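The batched alternative described above can be sketched as follows. This is an illustrative sequential reduction of the idea (each window could be checked in parallel); the names and the placeholder predicate are assumptions, not Helix code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the batched design: walk the instance list in
// windows of maxAllowedOffline and stop once the cumulative stoppable
// count reaches the limit.
public class BatchedStoppableSketch {
    public static List<String> selectStoppable(List<String> instances,
                                               int maxAllowedOffline,
                                               Predicate<String> passesOtherChecks) {
        List<String> stoppable = new ArrayList<>();
        for (int start = 0; start < instances.size(); start += maxAllowedOffline) {
            int end = Math.min(start + maxAllowedOffline, instances.size());
            // In the real design, this window would be checked in parallel.
            for (String instance : instances.subList(start, end)) {
                if (stoppable.size() >= maxAllowedOffline) return stoppable;
                if (passesOtherChecks.test(instance)) stoppable.add(instance);
            }
            if (stoppable.size() >= maxAllowedOffline) return stoppable;
        }
        return stoppable;
    }

    public static void main(String[] args) {
        List<String> instances = new ArrayList<>();
        for (int i = 0; i < 10; i++) instances.add("instance" + i);
        // With maxAllowedOffline = 3 and every instance healthy, only the
        // first window is ever needed.
        System.out.println(selectStoppable(instances, 3, s -> true));
    }
}
```

The performance concern in the comment is visible here: each outer iteration is a synchronization point, so a long instance list with few failures degenerates into instanceCount / maxAllowedOffline sequential rounds.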

Contributor

My 2 cents: correctness comes first, then performance optimization. I am not completely plugged into your work, but correctness is very important to focus on.

// If maxOfflineInstancesAllowed is not set, it means there is no limit on the number of offline instances.
// Therefore, builder sets the maxOfflineInstancesAllowed to the default value, Integer.MAX_VALUE.
if (clusterConfig.getMaxOfflineInstancesAllowed() != -1) {
builder.setMaxAdditionalOfflineInstances(clusterConfig.getMaxOfflineInstancesAllowed());
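The default behavior the quoted comment describes can be condensed into one line. This is a minimal illustrative sketch, assuming -1 is the "unset" sentinel as in the snippet above; the method name is invented, not the actual Helix API.

```java
// Sketch of the default described above: a maxOfflineInstancesAllowed of -1
// in ClusterConfig means "unset", and the builder falls back to
// Integer.MAX_VALUE, i.e. no limit on offline instances.
public class MaxOfflineDefaultSketch {
    public static int resolveMaxOffline(int configuredValue) {
        // -1 (unset) maps to "no limit"; any other configured value is honored.
        return configuredValue == -1 ? Integer.MAX_VALUE : configuredValue;
    }

    public static void main(String[] args) {
        System.out.println(resolveMaxOffline(-1)); // 2147483647
        System.out.println(resolveMaxOffline(5));  // 5
    }
}
```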
Contributor

Can we keep -1 and add special handling for when MaxOfflineInstances < 0?

Contributor Author

I think an even more reasonable solution is to not allow users to run the stoppableCheck at all if they didn't provide maxOfflineInstancesAllowed in their cluster config. What do you think?

stoppableInstancesSelector.calculateOrderOfZone(instances, random);
Set<String> finalCurrentOfflineInstances = currentOfflineInstances;
Contributor

Why do we want to create a new reference here?
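One plausible answer (an assumption from Java language rules, not confirmed in this thread): locals captured by a Java lambda or method reference must be final or effectively final, so a reassigned variable like currentOfflineInstances has to be copied into a new final reference before it can be used inside a stream pipeline. A minimal sketch:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of why a copy like finalCurrentOfflineInstances is often needed:
// Java lambdas may only capture final or effectively final local variables.
public class EffectivelyFinalSketch {
    public static long countOffline() {
        Set<String> currentOfflineInstances = new HashSet<>();
        boolean refreshed = true; // imagine this comes from a cache check
        if (refreshed) {
            // Reassignment makes the variable NOT effectively final.
            currentOfflineInstances = new HashSet<>(List.of("instance1"));
        }
        // Capturing 'currentOfflineInstances' directly in the lambda below
        // would be a compile error; the final copy restores capturability.
        final Set<String> finalCurrentOfflineInstances = currentOfflineInstances;
        return List.of("instance1", "instance2").stream()
            .filter(finalCurrentOfflineInstances::contains)
            .count();
    }

    public static void main(String[] args) {
        System.out.println(countOffline()); // 1
    }
}
```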

@xyuanlu
Contributor

xyuanlu commented Jan 30, 2024

Thanks for addressing the comments. I think the max offline instance count in the cluster config also includes disabled instances.

@MarkGaox
Contributor Author

This PR is approved by @xyuanlu, final commit message:
"Implement 'StoppableCheck' Flag to Ensure Maintenance Mode is Avoided After Stoppable Instances are offline"

@MarkGaox
Contributor Author

@xyuanlu the failed test is testEvacuationWithOfflineInstancesInCluster which is a known flaky test.

@xyuanlu xyuanlu merged commit 56dd5d2 into apache:ApplicationClusterManager Jan 31, 2024
1 of 2 checks passed