
jsonnet: set GOMAXPROCS on ingesters #9273

Merged · 5 commits · Sep 12, 2024

Conversation

@dimitarvdimitrov (Contributor) commented Sep 11, 2024:

What this PR does

We've been experimenting with GOMAXPROCS on ingesters at GL for the past few months. The test included all production cells and compared ingester-zone-a and zone-b (with GOMAXPROCS) to zone-c (without GOMAXPROCS).

This is the summary of results:

  • Push & Query outlier pods are significantly lower with GOMAXPROCS. This makes the cell more resilient to bad nodes and to inefficient NUMA node assignment; this was the goal of the experiment.
  • QueryStream performance under sudden load is worse.
  • CPU usage is generally slightly lower; the reduced number of available cores via GOMAXPROCS explains this.
  • Heap might appear slightly increased. However, with fewer cores there is less CPU time available for garbage collection, which explains why peak heap can rise further before GC kicks in. This is generally safe and should not lead to OOMs, for example.

To me, GOMAXPROCS is a tradeoff: on one side, lower steady-state resource usage and resilience to outlier infrastructure; on the other, the ability to handle sudden usage peaks without increased latency. With the help of query-scheduler scheduling improvements (prioritizing queries that wouldn't touch slowed-down components) and autoscaling (more frequent ingester vertical autoscaling, querier autoscaling), I think we can accept the increased QueryStream latency.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.


Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Dimitar Dimitrov <[email protected]>
Comment on lines 62 to 65
// double the requests, but with headroom of at least 3 and at most 6 CPU
// too little headroom can lead to throttling
// too much headroom can lead to setting the limit above the available CPU cores
requests + std.max(3, std.min(6, requests)),
Contributor:

Wouldn't this potentially set it to more than what is available for the host/container? E.g. if a VM has only 4 cores, and there is only one container running on the host (kinda line in AWS fargate), we set maxprocs to 7?

@dimitarvdimitrov (Contributor, Author) commented Sep 12, 2024:

very good point 🙂 And Peter already thought about that

Contributor:

Oh, I didn't know that 👍

Maybe it's worth linking to the clamping PR from the comment, to make sure this doesn't trip anyone up later.

Suggested change
// double the requests, but with headroom of at least 3 and at most 6 CPU
// too little headroom can lead to throttling
// too much headroom can lead to setting the limit above the available CPU cores
requests + std.max(3, std.min(6, requests)),
// double the requests, but with headroom of at least 3 and at most 6 CPU
// too little headroom can lead to throttling
// too much headroom can lead to setting the limit above the available CPU cores
// if the value exceeds the number of available CPU cores on the host, mimir caps the maxprocs on application start (ref grafana/mimir#8201)
requests + std.max(3, std.min(6, requests)),
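The headroom formula in the suggested change, plus the startup clamping discussed in the thread, can be sketched in Go for illustration (function names here are mine, not Mimir's; the actual clamping lives in grafana/mimir#8201):

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuLimit mirrors the jsonnet expression
//   requests + std.max(3, std.min(6, requests))
// i.e. roughly double the CPU request, with the headroom clamped to [3, 6] cores.
func cpuLimit(requests int) int {
	headroom := requests
	if headroom > 6 {
		headroom = 6
	}
	if headroom < 3 {
		headroom = 3
	}
	return requests + headroom
}

// effectiveMaxprocs caps the computed value at the host's core count,
// matching the application-start clamping referenced in the thread.
func effectiveMaxprocs(limit int) int {
	if n := runtime.NumCPU(); limit > n {
		return n
	}
	return limit
}

func main() {
	for _, req := range []int{1, 4, 8, 12} {
		limit := cpuLimit(req)
		fmt.Printf("requests=%d limit=%d maxprocs=%d\n", req, limit, effectiveMaxprocs(limit))
	}
}
```

So a 1-core request gets 3 cores of headroom (limit 4), while anything requesting 6 or more cores gets exactly 6 (e.g. limit 14 for an 8-core request); the clamp then keeps the final value from exceeding the host.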

@narqo (Contributor) left a comment:

🔥 works for me, although I'd expand the "headroom" comment.

Also, CI seems to be unhappy.

Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Dimitar Dimitrov <[email protected]>
@dimitarvdimitrov (Contributor, Author):

🔥 works for me, although I'd expand the "headroom" comment.

I blindly copied this from the internal jsonnet, but I realized it was wrong. Corrected now.

@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) September 12, 2024 11:05
Signed-off-by: Dimitar Dimitrov <[email protected]>
@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) September 12, 2024 11:38
@dimitarvdimitrov dimitarvdimitrov merged commit 85904a2 into main Sep 12, 2024
29 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/jsonnet/set-gomaxprocs branch September 12, 2024 11:46

2 participants