Introduce adaptive tasklist scaler #6506

Merged
@Shaddoll merged 3 commits into cadence-workflow:master from the partition branch on Nov 25, 2024

Conversation

@Shaddoll (Member) commented Nov 18, 2024

This PR introduces a new component, AdaptiveScaler, to Matching's Task List Manager. This component runs only in the root partition of a Normal task list and is turned on only if the following 2 dynamic config properties are set to true:

  • MatchingEnableAdaptiveScaler
  • MatchingEnableGetNumberOfPartitionsFromCache

This component monitors the add-task QPS of the root partition of a task list and decides whether the task list needs more partitions or whether the number of partitions should be decreased. It is based on the assumption that the add-task QPS is evenly distributed among all task list partitions.

When the adaptive scaler decides to increase the number of partitions, it increases the number of read partitions and write partitions at the same time. When it decreases the number of partitions, it decreases the number of write partitions first; the number of read partitions is decreased only after all the backlog of those read partitions is drained.
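
As a rough illustration of that asymmetric behavior, here is a minimal Go sketch; the function and helper names (adjustPartitions, backlogDrained) are illustrative and not part of this PR:

```go
// adjustPartitions sketches the scaling behavior described above.
// desired is the partition count suggested by the QPS-based check;
// backlogDrained reports whether the partitions beyond desired have
// no remaining backlog.
func adjustPartitions(readCount, writeCount, desired int, backlogDrained func(keep int) bool) (newRead, newWrite int) {
	if desired > writeCount {
		// Scaling up: read and write partitions increase together.
		return desired, desired
	}
	if desired < writeCount {
		// Scaling down: stop writing to the extra partitions first ...
		writeCount = desired
		// ... and shrink the read range only once their backlog is drained.
		if readCount > desired && backlogDrained(desired) {
			readCount = desired
		}
	}
	return readCount, writeCount
}
```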

The component is configured by the following 5 dynamic config properties:

  • MatchingPartitionUpscaleRPS: defaults to 200
  • MatchingPartitionDownscaleFactor: defaults to 0.75
  • MatchingPartitionUpscaleSustainedDuration: defaults to 1 minute
  • MatchingPartitionDownscaleSustainedDuration: defaults to 2 minutes
  • MatchingAdaptiveScalerUpdateInterval: defaults to 15 seconds

MatchingAdaptiveScalerUpdateInterval configures how often the scaler checks the QPS.

MatchingPartitionUpscaleSustainedDuration determines the minimum duration a high load must be sustained on a matching task list before the number of partitions is increased.

MatchingPartitionDownscaleSustainedDuration determines the minimum duration a low load must be sustained on a matching task list before the number of partitions is decreased.
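
For orientation, a hedged sketch of how these knobs might surface inside the scaler; the function-valued fields mirror the `a.config.PartitionDownscaleSustainedDuration()` call visible in the review snippet further down, but the exact struct layout here is an assumption, not the PR's code:

```go
package tasklist

import "time"

// adaptiveScalerConfig is an illustrative stand-in for the dynamic config
// hooks the scaler consults on every update interval. Each field is a
// function so the latest dynamic config value is picked up at runtime.
type adaptiveScalerConfig struct {
	PartitionUpscaleRPS                 func() int           // MatchingPartitionUpscaleRPS, default 200
	PartitionDownscaleFactor            func() float64       // MatchingPartitionDownscaleFactor, default 0.75
	PartitionUpscaleSustainedDuration   func() time.Duration // default 1 minute
	PartitionDownscaleSustainedDuration func() time.Duration // default 2 minutes
	AdaptiveScalerUpdateInterval        func() time.Duration // default 15 seconds
}
```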

High load definition:

total QPS > MatchingPartitionUpscaleRPS * Number of Write Partitions

Low load definition:

total QPS < MatchingPartitionUpscaleRPS * (Number of Write Partitions - 1) * MatchingPartitionDownscaleFactor
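
Restated as code, and reusing the hypothetical adaptiveScalerConfig sketched above, the two checks look roughly like this (totalQPS is the add-task QPS estimated from the root partition under the even-distribution assumption):

```go
// isHighLoad restates the high-load definition above.
func isHighLoad(cfg adaptiveScalerConfig, totalQPS float64, numWritePartitions int) bool {
	return totalQPS > float64(cfg.PartitionUpscaleRPS()*numWritePartitions)
}

// isLowLoad restates the low-load definition above.
func isLowLoad(cfg adaptiveScalerConfig, totalQPS float64, numWritePartitions int) bool {
	return totalQPS < float64(cfg.PartitionUpscaleRPS())*
		float64(numWritePartitions-1)*
		cfg.PartitionDownscaleFactor()
}
```

The two sustained-duration properties then require the corresponding condition to hold for the configured time, checked every MatchingAdaptiveScalerUpdateInterval, before the partition count is actually changed.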

Other minor changes:

  • Disable manual updates of TaskListPartitionConfig if the adaptive scaler is turned on
  • Do not accept add-task requests if the partition has been removed
  • If the current partition config is nil and we want to update the number of partitions to 1, it should be a no-op

Resolved review threads on service/matching/tasklist/task_list_manager.go and service/matching/tasklist/adaptive_scaler.go. The following review thread discusses this snippet from the adaptive scaler:

a.underLoad = true
a.underLoadStartTime = a.timeSource.Now()
} else if a.timeSource.Now().Sub(a.underLoadStartTime) > a.config.PartitionDownscaleSustainedDuration() {
numWritePartitions = getNumberOfPartitions(partitionConfig.NumWritePartitions, qps, upscaleThreshold) // NOTE: this has to be upscaleThreshold
Member:

I think this approach can make some counterintuitive scaling decisions. Any time (upscaleThreshold - downscaleThreshold) * numWritePartitions > (2 * upscaleThreshold) we're only able to scale down by multiple partitions at once.

For example, if we have thresholds of (500, 1000) with 10,000 global traffic then we would end up with 10 partitions and we would never scale down until global traffic drops below 5,000, at which point we dramatically scale from 10 partitions to 5. If traffic goes back to 5,001 then we'd scale up to 6 partitions, but we'd only scale down again if traffic drops below 3,000.

This approach also can end up in kind of strange scenarios where we're continually underLoad but we never change the number of partitions. If we have thresholds of (500, 600) with 1800 global traffic then we would have 3 partitions. When traffic drops to 1499 we're considered underLoad and would try to update numWritePartitions but it won't actually change the value until traffic drops below 1200.

Member Author (@Shaddoll):

After thinking about this, I think that to avoid fluctuation the thresholds need to satisfy this inequality:

downscale threshold <= upscale threshold * N / (N + 1) (for all N)

Let's take your (500, 600) thresholds as an example: if the global QPS is 1801, then we would have 4 partitions. But with 4 partitions, the load of each partition will be around 450, which is less than 500, so we will consider downscaling. When the downscale operation is triggered, we have to recalculate the number of partitions based on the QPS of the root partition. The per-partition estimate is around 450, so it could be 451 or 449. If the estimate is larger than 450, we still get 4 partitions, so nothing changes. But if the estimate is 449, a downscale will be triggered.
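
Spelling out where that inequality comes from, with U the per-partition upscale threshold, D the downscale threshold, and N the current number of write partitions: upscaling from N to N+1 partitions fires once the total QPS exceeds U * N, so right after the upscale the per-partition load is just above U * N / (N + 1), and to avoid flip-flopping straight back down that load must stay at or above D:

```latex
D \;\le\; \frac{U \cdot N}{N + 1} \qquad \text{for all } N \ge 1
```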

@Shaddoll force-pushed the partition branch 2 times, most recently from f2c4533 to ffbd993 on November 22, 2024 at 03:46
@Shaddoll force-pushed the partition branch 2 times, most recently from 5af33e8 to ef9b3c4 on November 22, 2024 at 18:03
Resolved review threads on common/dynamicconfig/constants.go (2) and service/matching/tasklist/adaptive_scaler.go (4). The next review thread discusses this snippet from the adaptive scaler:

a.overLoad = false
}
} else {
a.overLoad = false
Member:

This implementation requires consecutive overloaded windows to scale up. If QPS drops momentarily for some reason (rate limit, QPS tracker calculation issue, etc.) then we will not be able to scale up.
Have you considered doing this calculation more often (every 1s instead of every 15s) and generating a series of overloaded/not-overloaded results? After a minute you would have 60 data points representing whether it was overloaded for that particular second. Then determine whether scale-up is needed based on whether it was overloaded more than half of the time.
I just thought of this idea so it may not be the ideal solution, but something to address the consecutiveness requirement would be needed IMO.
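
A rough Go sketch of that rolling-window idea (this is only the reviewer's suggestion, not code from this PR; the one-second sampling, 60-sample window, and majority rule are all assumptions):

```go
// overloadWindow keeps one overloaded/not-overloaded sample per tick
// (for example, one per second) and recommends scaling up when more than
// half of the samples in the window were overloaded, instead of requiring
// the overload to be continuous.
type overloadWindow struct {
	samples []bool
	next    int
	filled  bool
}

func newOverloadWindow(size int) *overloadWindow {
	return &overloadWindow{samples: make([]bool, size)}
}

// Record stores the latest sample, overwriting the oldest one.
func (w *overloadWindow) Record(overloaded bool) {
	w.samples[w.next] = overloaded
	w.next = (w.next + 1) % len(w.samples)
	if w.next == 0 {
		w.filled = true
	}
}

// ShouldUpscale reports whether a full window has been collected and the
// majority of its samples were overloaded.
func (w *overloadWindow) ShouldUpscale() bool {
	if !w.filled {
		return false
	}
	overloadedCount := 0
	for _, s := range w.samples {
		if s {
			overloadedCount++
		}
	}
	return overloadedCount > len(w.samples)/2
}
```

In this sketch the scaler would call Record on every sample tick and consult ShouldUpscale on its usual update interval.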

Member Author (@Shaddoll):

we can discuss this offline, but I think with downscaleFactor we can handle fluctuation. The default factor is 0.75, which means the number of partitions won't change unless the traffic drops by 25%.

Member:

I think we need the matching simulator to validate different options for the scaling formula. I assume you will update the simulation next and iterate on this. If so, this looks like a good start. Can you confirm?

Member Author (@Shaddoll):

It doesn't fit the simulation framework, because the output of the simulation tests assumes that the number of partitions doesn't change. I can run bench tests instead.

Member:

The simulation framework can be enhanced to support this. The feedback loop is faster in simulations, and it is also more repeatable than benchmarking in the dev environment, so let's invest in this.

@Shaddoll merged commit 5b2be37 into cadence-workflow:master on Nov 25, 2024
17 checks passed
@Shaddoll deleted the partition branch on November 25, 2024 at 22:17