Added playbook for CortexAllocatingTooMuchMemory #345

Merged
merged 2 commits into from
Jul 2, 2021

Collaborator

@pracucci pracucci commented Jul 2, 2021

What this PR does:
Added playbook for CortexAllocatingTooMuchMemory. I've also slightly reworded the CortexAllocatingTooMuchMemory and CortexProvisioningTooManyWrites messages.

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci requested a review from a team as a code owner July 2, 2021 10:20
Member

@pstibrany pstibrany left a comment

LGTM, nice improvements.

@@ -479,7 +479,7 @@
},
annotations: {
message: |||
High QPS for ingesters, add more ingesters.
Ingesters in {{ $labels.namespace }} have an high samples/sec rate.
Member

Alternative: Ingesters in {{ $labels.namespace }} ingest too many samples per second.

Collaborator Author

Better!

- Cortex ingesters are a stateful service
- Having 2+ ingesters `OOMKilled` may cause a cluster outage
- Ingester memory baseline usage is primarily influenced by memory allocated by the process (mostly go heap) and mmap-ed files (used by TSDB)
- Ingester memory short spikes are primarily influenced by queries
Member

Also when cutting new blocks.

Collaborator Author

Right!

- Having 2+ ingesters `OOMKilled` may cause a cluster outage
- Ingester memory baseline usage is primarily influenced by memory allocated by the process (mostly go heap) and mmap-ed files (used by TSDB)
- Ingester memory short spikes are primarily influenced by queries
- A pod gets `OOMKilled` once it's working set memory reaches the configured limit, so it's important to prevent ingesters memory utilization (working set memory) from getting close to the limit (we need to keep at least 30% room for spikes due to queries)
Member

"it's working set" -> "its working set"
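
For reference, the headroom rule in the bullet above (keep at least 30% of the limit free for query spikes) can be sketched as a small check. This is only an illustration: the function names and the `min_headroom` parameter are hypothetical and are not part of the playbook or of Cortex; the inputs are assumed to be the pod's working set memory and its configured memory limit, both in bytes.

```python
def headroom_fraction(working_set_bytes: int, limit_bytes: int) -> float:
    """Return the fraction of the memory limit not used by the working set."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return 1.0 - working_set_bytes / limit_bytes


def has_enough_headroom(
    working_set_bytes: int,
    limit_bytes: int,
    min_headroom: float = 0.30,  # assumption: the 30% figure from the playbook text
) -> bool:
    """Return True if at least min_headroom of the limit is still free."""
    return headroom_fraction(working_set_bytes, limit_bytes) >= min_headroom


GIB = 1024 ** 3

# 16 GiB used out of a 25 GiB limit leaves 36% free: enough room for spikes.
print(has_enough_headroom(16 * GIB, 25 * GIB))

# 20 GiB used out of a 25 GiB limit leaves only 20% free: too close to the limit.
print(has_enough_headroom(20 * GIB, 25 * GIB))
```

The threshold is a parameter rather than a constant so that the same check can be reused if a different safety margin is chosen.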

```
kubectl -n <namespace> delete pod ingester-XXX
```
- Restarting an ingester typically reduces the memory allocated by mmap-ed files. Such memory could be reallocated again, but may let you gain more time while working on a longer term solution
Member

Suggested change
- Restarting an ingester typically reduces the memory allocated by mmap-ed files. Such memory could be reallocated again, but may let you gain more time while working on a longer term solution
- Restarting an ingester typically reduces the memory allocated by mmap-ed files. After the restart, ingester may allocate this memory again over time, but it may give more time while working on a longer term solution

Signed-off-by: Marco Pracucci <[email protected]>
@pracucci
Collaborator Author

pracucci commented Jul 2, 2021

Thanks @pstibrany for your valuable feedback! Applied all changes.

@pracucci pracucci merged commit 3528572 into main Jul 2, 2021
@pracucci pracucci deleted the playbook-for-CortexAllocatingTooMuchMemory branch July 2, 2021 11:35
simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021
…or-CortexAllocatingTooMuchMemory

Added playbook for CortexAllocatingTooMuchMemory