Add CortexRolloutStuck alert
Signed-off-by: Marco Pracucci <[email protected]>
pracucci committed Oct 13, 2021
1 parent ca7cc8a commit eebc529
Showing 2 changed files with 71 additions and 0 deletions.
61 changes: 61 additions & 0 deletions jsonnet/mimir-mixin/alerts/alerts.libsonnet
@@ -412,6 +412,67 @@
    },
  ],
},
{
  name: 'cortex-rollout-alerts',
  rules: [
    {
      alert: 'CortexRolloutStuck',
      expr: |||
        (
          max without (revision) (
            kube_statefulset_status_current_revision
              unless
            kube_statefulset_status_update_revision
          )
            *
          (
            kube_statefulset_replicas
              !=
            kube_statefulset_status_replicas_updated
          )
        ) and (
          changes(kube_statefulset_status_replicas_updated[15m])
            ==
          0
        )
        * on(%s) group_left max by(%s) (cortex_build_info)
      ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
      'for': '15m',
      labels: {
        severity: 'warning',
      },
      annotations: {
        message: |||
          The {{ $labels.statefulset }} rollout is stuck in %(alert_aggregation_variables)s.
        ||| % $._config,
      },
    },
    {
      alert: 'CortexRolloutStuck',
      expr: |||
        (
          kube_deployment_spec_replicas
            !=
          kube_deployment_status_replicas_updated
        ) and (
          changes(kube_deployment_status_replicas_updated[15m])
            ==
          0
        )
        * on(%s) group_left max by(%s) (cortex_build_info)
      ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
      'for': '15m',
      labels: {
        severity: 'warning',
      },
      annotations: {
        message: |||
          The {{ $labels.deployment }} rollout is stuck in %(alert_aggregation_variables)s.
        ||| % $._config,
      },
    },
  ],
},
{
  name: 'cortex-provisioning',
  rules: [
10 changes: 10 additions & 0 deletions jsonnet/mimir-mixin/docs/playbooks.md
@@ -231,6 +231,16 @@ How to **investigate**:
_If the alert `CortexIngesterTSDBHeadCompactionFailed` fired as well, then give priority to it because that could be the cause._
### CortexRolloutStuck
This alert fires when a Cortex service rollout is stuck: the number of updated replicas doesn't match the expected number and the rollout appears to be making no progress. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
How to **investigate**:
- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to get a list of running pods
- Ensure there's no pod in a failing state (e.g. `Error`, `OOMKilled`, `CrashLoopBackOff`)
- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
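
The two pod checks above can be partly automated. The following is a minimal sketch, not part of the playbook itself: `not_ready_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (`NAME READY STATUS RESTARTS AGE`). It prints every pod whose ready count differs from its container count or whose status is not `Running`.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the Cortex playbook).
# Reads `kubectl get pods --no-headers` output on stdin and prints the name
# of each pod that is not fully ready or whose STATUS is not "Running".
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '{
    split($2, ready, "/")   # READY column, e.g. "1/2" -> ready[1]=1, ready[2]=2
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Example usage against a live cluster (fill in the placeholders first):
#   kubectl -n <namespace> get pods -l name=<statefulset|deployment> --no-headers | not_ready_pods
```

An empty result suggests the pods themselves look healthy, in which case the `describe` output from the last step is the more likely place to find the cause.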
#### Ingester hit the disk capacity
If the ingester hit the disk capacity, any attempt to append samples will fail. You should: