
GC Parallelism and Adaptive Threading #5829

Closed
RSalman opened this issue Feb 25, 2021 · 5 comments
RSalman commented Feb 25, 2021

Summary

In general, parallelism is most beneficial when performed with the correct number of threads (the optimal thread count); anything more than this number results in unnecessary overhead. Conversely, a thread count below the optimum is sub-optimal, as there is still opportunity to parallelize the task further and gain benefits. The point at which we reach peak performance can be referred to as the equilibrium point. At this point, work is optimally distributed across the utilized threads (or threads are evenly allocated between JVMs in multi-JVM scenarios), overhead is minimal, and benefits are maximized.

By default, all available system threads are reserved for GC. This means a large system will utilize more threads than a smaller system to complete the same task. Intuitively this should be effective, as the large system is more powerful and has more resources (e.g., a 48-core system vs. a 4-core system). However, in cases where the overhead of parallelism is greater than its benefits (discussed below), the larger system will suffer. Additionally, in a multi-JVM scenario, reserving all system threads has further implications, as the same set of threads is reserved for each running JVM. As more threads are shared, the effectiveness of parallelization decreases, since more threads can become unavailable and limit GC progress. Hence, defaulting to all available computing resources (all available threads) is not efficient.

Using overhead data (busy/stall times for managing and synchronizing threads), Adaptive Threading aims to identify sub-optimal/detrimental parallelism and continuously adjusts the GC thread count to reach optimal parallelism.

Background: Parallelism and Overhead

Task parallelization is a key optimization for reducing GC cycle times. Optimal parallelism results from efficiently using the available threads throughout the duration of a collection cycle. However, there may be overhead associated with parallelizing tasks, and this is often overlooked and unaccounted for. There can be various causes for this overhead, all stemming from the additional requirements of multi-threading: synchronizing threads (critical sections and access to global resources) and managing them (dispatch and suspend). Parallelization overhead is any time cost incurred from utilizing multiple threads to perform a task. This overhead can be significant, as it increases proportionally with the number of threads utilized.

In ideal situations, parallelization provides significantly more performance benefit than overhead, and the overhead can be considered negligible. However, this is not always the case; for instance, there may not be enough work to distribute across the utilized threads. This is common with light workloads (with total collection times usually in the range of microseconds to a few milliseconds), as they have a limited amount of work to divide, leaving threads under-utilized while still incurring overhead. Furthermore, depending on the workload and object graph, collection may be effectively parallelized only up to a certain number of threads; as a result, we can see an imbalanced distribution of work when too many threads are utilized. In these cases, threads are left idle/unused, yet they must still participate (be managed, i.e., dispatched/suspended) and go through the steps of GC (e.g., reaching synchronization points). With this type of parallelization, we incur overhead without gaining any benefit. Overall, unless the benefits of parallelization are greater than the overhead incurred, parallelization will be detrimental and cause GC times to increase unnecessarily.
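To make the overhead argument concrete, here is a toy cost model (purely illustrative, not OMR code, with invented numbers): work is divided perfectly across threads, but each dispatched thread adds a fixed synchronization cost. Past a certain thread count the overhead term dominates and the cycle gets slower:

```python
def cycle_time(work_ms, n_threads, overhead_ms_per_thread):
    """Toy projected cycle time: perfectly divided work plus an
    overhead term that grows with the number of threads managed."""
    return work_ms / n_threads + overhead_ms_per_thread * n_threads

# A light workload: 4 ms of parallelizable work, 0.25 ms of
# dispatch/synchronization overhead per thread.
times = {n: cycle_time(4.0, n, 0.25) for n in (1, 2, 4, 8, 16, 48)}
best = min(times, key=times.get)
print(times)  # 48 threads is ~6x slower than 4 threads in this toy model
print("fastest with", best, "threads")
```

In this sketch the minimum sits at 4 threads; more threads only pay the overhead term without enough work to amortize it, mirroring the light-workload scenario described above.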

In addition to the scenarios described, effectiveness of parallelization may be limited by high CPU utilization. This is true for a system running multiple JVMs; in this case, system threads can be shared among different JVMs and thus among different GCs. With multiple JVMs, GC progress will depend on availability of threads.

Need for Adaptive Threading

A potential solution is for the user to force GC to use fewer threads by specifying the count manually when starting the application. However, this is not an effective strategy, as it requires tedious experimentation and data collection to find the correct number of threads for each application. Another potential solution is for GC to take pre-emptive action by defaulting to a fraction of the total available threads. A criterion (e.g., based on system size, heap size, etc.) can be used to determine the fraction. For instance, we may want to significantly decrease the thread count when we anticipate overhead on a large system, or when we know multiple JVMs will be running on the system. However, this is an incomplete solution that relies on many assumptions, and those assumptions lead to inaccuracies that still result in detrimental or sub-optimal performance: there will still be cases where the fraction of threads is larger or smaller than the optimal count. All in all, this approach is unreliable.

Another consideration is that GC workloads and object graphs change over the lifetime of an application (i.e., cycle to cycle), and JVMs can be started or shut down at any time; as a consequence, the ability to parallelize varies from cycle to cycle. Subsequent cycles may reach peak performance with more or fewer threads than previous cycles required. For instance, the available work can increase in later collection cycles, meaning more threads are required to reach peak performance than before. Thus, a static thread count, or trivially adjusted threads (e.g., the pre-emptive approach), can be problematic, and manually tuning/forcing the thread count can be equally ineffective. A systematic approach is needed to identify detrimental/sub-optimal parallelism and react by adjusting the number of working threads for each subsequent collection.

The problem with parallelization overhead is illustrated in table 1 and figure 1 below. In figure 1, the cycle times of GC running with 48 threads can be observed to be much greater (~75%) than those of the same workload restricted to 4 threads. Furthermore, table 1 compares the same workload running with 48, 8 and 4 threads. As GC threads are decreased (decreased parallelism), average scavenge times (cycle times) decrease and, as a result, performance scores increase.

In another example, illustrated in table 2 below, we observe that peak performance is obtained when 8 threads are utilized. With 48 threads we incur too much overhead, while with 4 threads we still have the opportunity to parallelize the task further with more threads and gain benefits. In other words, with 48 threads we have detrimental parallelism and with 4 threads we end up with sub-optimal parallelism.

[Table 1 / Figure 1: scavenge cycle times and performance scores with 48, 8 and 4 GC threads]
[Table 2: performance at varying thread counts, peaking at 8 threads]

From these examples, we can tell there’s a point of equilibrium that results in peak performance. For the workload measured in table 2, this is approximately 8 threads and for the workload measured in table 1 we obtain peak performance around 4 threads.

[Figure: equilibrium point — peak performance vs. GC thread count for the measured workloads]

Adaptive Threading

Using heuristics and a threading model, an optimal thread count can be projected and the thread count adjusted dynamically between cycles to minimize inefficient use of threads. This cannot be done trivially; accurate decisions must be made as to when to adjust and by how much.

Adaptive Threading presents a systematic approach based on:

  1. The number of threads utilized
  2. Overhead data (busy/stall times for managing and synchronizing threads) aggregated from the utilized threads of previous GCs

to do the following:

  • Determine the efficiency of a collection cycle.
    • Identify detrimental or sub-optimal threading and the degree to which it is detrimental or sub-optimal.
  • Project the optimal GC thread count.
    • Determine the equilibrium point, where parallelization results in peak performance.
  • Recommend how to adjust the thread count for the next cycle to reach the equilibrium point.
    • The recommendation is non-invasive; for instance, anomalies should not cause adverse effects.

Adaptive Model

A threading model can be expressed mathematically as a continuous function that takes a set of inputs and projects the optimal number of threads to use. A good model is one that can accurately determine the efficiency of a cycle and predict the optimal GC thread count based on the current number of threads: directly proportional to busy time and inversely proportional to stall time. One such implementation of the model can be derived by finding the minimum of the following GC time function (used to project the total duration of GC for m threads, given the busy and stall times observed while performing GC with n threads):
T(m) = b · (n / m) + s · (m / n)^X
Where,

  • m = number of threads for which total GC time (duration) is projected
  • n = number of utilized threads (number of worker threads started + main thread) for the GC cycle used to observe s and b
  • s = stall time per thread = average observed collection stall time for n threads = (Aggregated Stall Time for n Threads / n)
  • b = busy time per thread = average collection busy time for n threads
  • X = Overhead Sensitivity/Tolerance, a model constant used to help model non-linear dependency of stall times on GC thread count
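Under these definitions, the projected-time function and its minimum can be sketched as follows (a reconstruction from the description above, not the OMR source; minimizing T(m) analytically gives m_opt = n · (b / (X·s))^(1/(X+1)), which reduces to n·√(b/s) for X = 1):

```python
def projected_gc_time(m, n, b, s, X=1.0):
    """Projected total GC time for m threads, from the busy/stall times
    observed while running with n threads: T(m) = b*(n/m) + s*(m/n)**X."""
    return b * (n / m) + s * (m / n) ** X

def optimal_threads(n, b, s, X=1.0):
    """Thread count minimizing T(m): n * (b / (X*s)) ** (1 / (X + 1))."""
    return n * (b / (X * s)) ** (1.0 / (X + 1.0))

# Example (invented numbers): 12 threads observed with 1 ms average busy
# and 9 ms average stall per thread -- heavy stall, so far fewer threads
# are projected to be optimal.
m_opt = optimal_threads(12, b=1.0, s=9.0)  # 12 * sqrt(1/9) = 4.0
```

Note that projected_gc_time(4, 12, 1.0, 9.0) is indeed lower than projected_gc_time(12, 12, 1.0, 9.0), consistent with throttling down under heavy stall.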

Minimizing this function results in the model implementation expressed by expression 1 below. Expression 1, in combination with expression 2, gives us a complete implementation of the model:

(1) m_opt = n · (b / (X · s))^(1 / (X + 1))
(2) m_recommended = (W · n) + (1 − W) · m_opt + H
Where W and H are constants:

  • W = Weighted average factor
  • H = Thread Booster

Expression (1) can be simplified to be written in terms of % stall and working threads as follows:
m_opt = n · √((1 − %stall) / %stall)    (for X = 1, since b / s = (1 − %stall) / %stall)

The model constants (W, H and X) are described in detail below.

The weighted average factor (W) is the importance given to the current utilized thread count when averaging it with the projected optimal thread count. It controls the responsiveness of Adaptive Threading by controlling how we transition from the current thread count to the ideal one. In other words, this factor describes how much to resist or respect the projected count; for instance, a high value translates to resisting the recommended threads, as the currently running threads are given more importance.

The thread booster (H) boosts the calculated thread count, giving a low thread count the opportunity to grow, since low thread counts otherwise tend to stagnate. This factor is not significant with a large thread count (large n), but it dominates when only ~1-5 threads are utilized.

Overhead sensitivity/tolerance (X) determines how aggressive the model is; it is used to configure sensitivity to overhead. A higher value translates to more sensitivity to overhead and thus more aggressive adjustments. From experimentation, a linear dependency of stall time on thread count has yielded good results; this translates to setting X to 1.

RSalman commented Feb 25, 2021

Table 3 shows how the Adaptive Threading model/heuristics approach works; it provides a matrix of inputs and the resulting recommended thread counts. For example, if at the end of a cycle the %stall is 90% and the cycle utilized 12 threads, the model will throttle threads down to 8 for the next cycle. Similarly, we would increase the thread count from 12 to 15 if we determine %stall to be 35%. Note this matrix uses the model with constants W = 50%, X = 1 and H = 0.85.

[Table 3: recommended thread count matrix (utilized threads × %stall)]
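The two rows quoted above can be reproduced with a small sketch of the model (the function name and the floor rounding are assumptions; the expressions and constants follow the model described in the previous comment):

```python
import math

def recommend_threads(n, percent_stall, W=0.5, H=0.85, X=1.0):
    """Recommend a thread count for the next cycle, given the current
    count n and the observed %stall = stall / (stall + busy)."""
    busy_over_stall = (1.0 - percent_stall) / percent_stall  # b / s
    m_opt = n * (busy_over_stall / X) ** (1.0 / (X + 1.0))
    # Weighted average with the current count, plus the thread booster:
    return math.floor(W * n + (1.0 - W) * m_opt + H)

print(recommend_threads(12, 0.90))  # 8  (throttle down)
print(recommend_threads(12, 0.35))  # 15 (scale up)
```

With 90% stall the raw optimum is 12·√(0.1/0.9) = 4 threads, and the weighted average 0.5·12 + 0.5·4 + 0.85 = 8.85 floors to 8, matching the matrix.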

RSalman commented Feb 25, 2021

Figures 3 and 4 below illustrate how Adaptive Threading works to reduce parallelization overhead by reaching the equilibrium point/optimal thread count. Figure 3 shows %stall and the number of threads utilized for each cycle, while figure 4 shows the proportions of stall/busy time for each cycle. The number of threads utilized at any given cycle M is determined using the model with the utilized thread count (n) and %stall from cycle M − 1 as inputs. For instance, in figure 3, cycle two's utilized thread count is determined using the %stall and utilized threads from cycle one: 48 threads with 94.26% stall in cycle one results in the thread count being reduced to 30 for cycle two. Initially the cycle times are high, and the majority of each cycle's time comes from stall (parallelization overhead); as the thread count decreases, we observe a significant drop in cycle times, indicating that the parallelization overhead is reduced.

Overall, GC threads get throttled down from 48 to 6. This results in a ~70% decrease in cycle times and a ~25% increase in performance. Without Adaptive Threading, each cycle would use 48 threads and incur parallelization overhead similar to cycle 1.

[Figures 3 and 4: per-cycle %stall/utilized threads and stall/busy proportions with Adaptive Threading]
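This step-down behaviour can be imitated with a purely synthetic stall model (the per-thread busy/stall figures below are invented for illustration and do not come from the measurements in these figures): suppose per-thread busy time shrinks as work is split (b = 36/n ms) while per-thread stall grows linearly with thread count (s = n ms), so the raw optimum is always 6 threads. Iterating the model from 48 threads:

```python
import math

def recommend_threads(n, busy, stall, W=0.5, H=0.85):
    """One adjustment step of the model (X = 1 case)."""
    m_opt = n * math.sqrt(busy / stall)
    return math.floor(W * n + (1.0 - W) * m_opt + H)

n, history = 48, [48]
for _ in range(10):  # successive GC cycles
    n = recommend_threads(n, busy=36.0 / n, stall=float(n))
    history.append(n)
print(history)  # [48, 27, 17, 12, 9, 8, 7, 7, 7, 7, 7]
```

In this toy run the count settles at 7 rather than exactly 6, because the booster H deliberately biases small thread counts upward; the weighting W makes the descent gradual rather than a one-shot jump, much like the 48 → 30 → ... progression described above.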

RSalman commented Feb 25, 2021

Tables 4 and 5 show results from a multi-JVM scenario with Adaptive Threading enabled (table 5) and disabled (table 4). This example has 6 JVMs running on the same system: JVMs 1 to 4 run the same GC workload, while JVMs 5 and 6 run different workloads. Comparing tables 4 and 5, average collection times are lower with Adaptive Threading. This is most apparent for JVM 5, which shows ~77% lower collection times. Overall, across all JVMs, we observe on average ~10% lower GC collection times with Adaptive Threading.

[Tables 4 and 5: multi-JVM average collection times with Adaptive Threading disabled/enabled]

RSalman commented Feb 25, 2021

Thread distributions (from Adaptive Threading) for this multi-JVM scenario can be observed in figures 6 (VM1-4), 7 (VM5) and 8 (VM6) below. For a given number of threads, these figures show the total number of collection cycles that utilized that thread count. The reduced collection times in table 5 (compared to table 4) are a result of the change in utilized threads illustrated by these figures.

[Figures 6, 7 and 8: thread-count distributions for VM1-4, VM5 and VM6]

RSalman commented Feb 26, 2021

This work is related to #5318
