Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CortexRolloutStuck alert #405

Merged
merged 2 commits into from
Oct 14, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@
* [ENHANCEMENT] Add support for running Alertmanager in sharding mode. #394
* [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
* [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
* [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
* [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
* [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
* [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
Expand Down
61 changes: 61 additions & 0 deletions cortex-mixin/alerts/alerts.libsonnet
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,67 @@
},
],
},
{
name: 'cortex-rollout-alerts',
rules: [
{
alert: 'CortexRolloutStuck',
expr: |||
(
max without (revision) (
kube_statefulset_status_current_revision
unless
kube_statefulset_status_update_revision
)
Comment on lines +421 to +426
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too sure why stateful sets need that revision check, but I guess it's also upstream: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/6c72589035f4f49674a56cf97a3ec1a02f14671a/alerts/apps_alerts.libsonnet#L128

So it should be ok 🙂

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I learned it from there.

*
(
kube_statefulset_replicas
!=
kube_statefulset_status_replicas_updated
)
) and (
changes(kube_statefulset_status_replicas_updated[15m])
==
0
)
* on(%s) group_left max by(%s) (cortex_build_info)
||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
'for': '15m',
labels: {
severity: 'warning',
},
annotations: {
message: |||
The {{ $labels.statefulset }} rollout is stuck in %(alert_aggregation_variables)s.
||| % $._config,
},
},
{
alert: 'CortexRolloutStuck',
expr: |||
(
kube_deployment_spec_replicas
!=
kube_deployment_status_replicas_updated
) and (
changes(kube_deployment_status_replicas_updated[15m])
==
0
)
* on(%s) group_left max by(%s) (cortex_build_info)
||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
'for': '15m',
labels: {
severity: 'warning',
},
annotations: {
message: |||
The {{ $labels.deployment }} rollout is stuck in %(alert_aggregation_variables)s.
||| % $._config,
},
},
],
},
{
name: 'cortex-provisioning',
rules: [
Expand Down
9 changes: 9 additions & 0 deletions cortex-mixin/docs/playbooks.md
Original file line number Diff line number Diff line change
Expand Up @@ -724,6 +724,15 @@ When an alertmanager cannot read the state for a tenant from storage it gets log
- The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
- The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.

### CortexRolloutStuck

This alert fires when a Cortex service rollout is stuck, which means the number of updated replicas doesn't match the expected one and looks there's no progress in the rollout. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.

How to **investigate**:
- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to get a list of running pods
- Ensure there's no pod in a failing state (eg. `Error`, `OOMKilled`, `CrashLoopBackOff`)
- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information

## Cortex routes by path

Expand Down