WagedInstanceCapacity Null Pointer Exception due to stale _instanceCapacityMap #2891

GrantPSpencer · 2024-08-24T03:54:41Z

Describe the bug

The WAGED pipeline fails with an NPE during BestPossibleStateCalcStage because it calls checkAndReduceInstanceCapacity on an instance that is not in WagedInstanceCapacity's _instanceCapacityMap. This occurs when the WagedInstanceCapacity is calculated at point A, a new instance is added at point B, and then at point C the WagedInstanceCapacity is not refreshed to include this instance during the CurrentStateComputationStage. The specific circumstances are detailed below.

To Reproduce

1. Add at least one WAGED-enabled resource to a cluster and rebalance so that assignments are made.
2. Drop all resources from the cluster.
3. Add a new instance ("new_instance") to the cluster.
4. Add one WAGED-enabled resource to the cluster.
5. The NPE occurs.

This occurs because "new_instance" is an assignable instance and appears in the newly calculated preference list, so checkAndReduceInstanceCapacity is called on it. However, WagedInstanceCapacity's _instanceCapacityMap has not been updated, so it holds a stale view that does not include "new_instance".
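The failure mode can be illustrated with a toy model. This is a minimal sketch, not the Helix code: the class `StaleCapacityMapSketch`, its methods, and the capacity key "CU" are hypothetical stand-ins, and the real checkAndReduceInstanceCapacity may differ in its details. It only shows how a lookup for an instance missing from a capacity map returns null and is then dereferenced.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for WagedInstanceCapacity; all names here are hypothetical.
public class StaleCapacityMapSketch {
  // instance name -> (capacity key -> remaining capacity)
  private final Map<String, Map<String, Integer>> instanceCapacityMap = new HashMap<>();

  public void addInstance(String instance, String capacityKey, int capacity) {
    instanceCapacityMap.computeIfAbsent(instance, k -> new HashMap<>()).put(capacityKey, capacity);
  }

  // Returns true and reduces the remaining capacity if the weight fits.
  public boolean checkAndReduce(String instance, String capacityKey, int weight) {
    // For an instance the map has never seen, get() returns null ...
    Map<String, Integer> remaining = instanceCapacityMap.get(instance);
    // ... and this dereference throws NullPointerException.
    Integer available = remaining.get(capacityKey);
    if (available == null || available < weight) {
      return false;
    }
    remaining.put(capacityKey, available - weight);
    return true;
  }

  public static void main(String[] args) {
    StaleCapacityMapSketch capacity = new StaleCapacityMapSketch();
    capacity.addInstance("old_instance", "CU", 10);
    System.out.println(capacity.checkAndReduce("old_instance", "CU", 4));
    try {
      capacity.checkAndReduce("new_instance", "CU", 4);
    } catch (NullPointerException e) {
      System.out.println("NPE for new_instance");
    }
  }
}
```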

The root cause is the skipCapacityCalculation method (a very effective optimization), which causes the CurrentStateComputationStage to not refresh the cache if there are no resources in the resourceMap. However, the resourceMap is constructed from the idealStates in the cluster, which do not exist at this point. When a resource is added, a ResourceConfigChange event fires first, followed by an IdealStateChange. So when a new resource is added, the CurrentStateComputationStage does not recalculate the WagedInstanceCapacity on the ResourceConfigChange because the resourceMap is still empty, and it does not recalculate on the subsequent IdealStateChange either.
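The event ordering described above can be replayed in a small simulation. This is a sketch under stated assumptions: `SkipCapacitySketch`, its fields, and its simplified skip rule are hypothetical, and the real skipCapacityCalculation considers more event types than the two modeled here. The only behavior taken from this issue is that a null capacity forces a refresh, an empty resourceMap skips it, and an IdealStateChange does not trigger it.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the controller cache; all names here are hypothetical.
public class SkipCapacitySketch {
  public enum EventType { RESOURCE_CONFIG_CHANGE, IDEAL_STATE_CHANGE }

  public final Set<String> instances = new HashSet<>();
  public final Set<String> idealStates = new HashSet<>(); // stands in for the resourceMap source
  public Set<String> capacityView = null;                 // instances known to the capacity map

  // Simplified skip rule, assumed for illustration only.
  public boolean skipCapacityCalculation(EventType event) {
    if (capacityView == null) {
      return false; // nothing cached yet: always calculate
    }
    if (idealStates.isEmpty()) {
      return true;  // empty resourceMap: skip
    }
    return event != EventType.RESOURCE_CONFIG_CHANGE; // IdealStateChange does not refresh
  }

  public void handle(EventType event) {
    if (!skipCapacityCalculation(event)) {
      capacityView = new HashSet<>(instances);
    }
  }

  // Replays the reproduction steps from this issue.
  public static SkipCapacitySketch replayIssue() {
    SkipCapacitySketch cache = new SkipCapacitySketch();
    cache.instances.add("old_instance");
    cache.idealStates.add("resource_1");
    cache.handle(EventType.IDEAL_STATE_CHANGE);     // first calculation (capacity was null)
    cache.idealStates.clear();                      // drop all resources
    cache.instances.add("new_instance");            // add a new instance
    cache.handle(EventType.RESOURCE_CONFIG_CHANGE); // fires before the ideal state exists: skipped
    cache.idealStates.add("resource_2");
    cache.handle(EventType.IDEAL_STATE_CHANGE);     // not a refresh trigger: skipped
    return cache;
  }

  public static void main(String[] args) {
    SkipCapacitySketch cache = replayIssue();
    System.out.println("capacity view knows new_instance: "
        + cache.capacityView.contains("new_instance"));
  }
}
```

In this model the capacity view still contains only "old_instance" after the new resource is added, which is the stale state that leads to the NPE.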

Adding a WAGED resource to a new cluster does not trigger this NPE because there is no WagedInstanceCapacity so

    if (Objects.isNull(cache.getWagedInstanceCapacity())) {
      return false;
    }

will force it to be refreshed.

GrantPSpencer#32
The test case in this draft PR follows the steps outlined above and fails on master.

Expected behavior

WagedInstanceCapacity should be recalculated in the case of a new resource being added prior to the BestPossibleStateCalcStage.
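One way to express this expectation is a staleness check that forces a recalculation whenever an assignable instance is missing from the capacity view. This is a hypothetical sketch, not the change made in the linked fix: `CapacityRefreshGuardSketch` and `isStale` are illustrative names, and the sets stand in for the keys of _instanceCapacityMap and the assignable instances.

```java
import java.util.Set;

// Hypothetical guard; not the Helix implementation.
public class CapacityRefreshGuardSketch {
  // True when the capacity view is missing or lacks any assignable instance.
  public static boolean isStale(Set<String> capacityInstances, Set<String> assignableInstances) {
    return capacityInstances == null || !capacityInstances.containsAll(assignableInstances);
  }

  public static void main(String[] args) {
    // Stale: new_instance is assignable but unknown to the capacity view.
    System.out.println(isStale(Set.of("old_instance"), Set.of("old_instance", "new_instance")));
    // Fresh: every assignable instance is known.
    System.out.println(isStale(Set.of("old_instance", "new_instance"), Set.of("new_instance")));
  }
}
```

A check like this trades a set comparison per pipeline run for not depending on which event type happened to fire, so whether it is acceptable depends on how hot that path is.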

Additional context

10539 [HelixController-pipeline-default-TestWagedNPE_cluster-(45df0f8d_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController [] - Exception while executing DEFAULT pipeline for cluster TestWagedNPE_cluster. Will not continue to next pipeline
java.lang.NullPointerException: null
	at org.apache.helix.controller.rebalancer.waged.WagedInstanceCapacity.checkAndReduceInstanceCapacity(WagedInstanceCapacity.java:206) ~[classes/:?]
	at org.apache.helix.controller.dataproviders.ResourceControllerDataProvider.checkAndReduceCapacity(ResourceControllerDataProvider.java:535) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossibleStateForPartition(DelayedAutoRebalancer.java:377) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:271) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:54) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.lambda$computeNewIdealStates$0(WagedRebalancer.java:281) ~[classes/:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1692) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) ~[?:?]
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:290) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:408) ~[?:?]
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[?:?]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661) ~[?:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:277) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:445) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:289) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:94) ~[classes/:?]
	at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[classes/:?]
	at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
	at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]

GrantPSpencer added the bug (Something isn't working) label on Aug 24, 2024

GrantPSpencer mentioned this issue Nov 25, 2024

Fix waged instance capacity npe on new resource #2969

Merged