Merge pull request #406 from grafana/alert-on-consul-failures
Added CortexFailingToTalkToConsul alert
pracucci authored Oct 14, 2021
2 parents 306c081 + be5af20 commit 567320d
Showing 3 changed files with 36 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -66,6 +66,7 @@
* [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
* [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
* [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
* [ENHANCEMENT] Added `CortexKVStoreFailure` alert. #406
* [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
* [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
* [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
21 changes: 21 additions & 0 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -235,6 +235,27 @@
      |||,
    },
  },
  {
    alert: 'CortexKVStoreFailure',
    expr: |||
      (
        sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
        /
        sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
      )
      # We want to get alerted only in case there's a constant failure.
      == 1
    ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
    'for': '5m',
    labels: {
      severity: 'warning',
    },
    annotations: {
      message: |||
        Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
      ||| % $._config,
    },
  },
  {
    alert: 'CortexMemoryMapAreasTooHigh',
    expr: |||
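
The `%s` and `%(alert_aggregation_variables)s` placeholders in the new alert are filled in by Jsonnet's `%` operator, which follows Python-style string formatting. The sketch below mimics that rendering in Python to show roughly what the final PromQL expression and message could look like. The config values (`cluster, namespace` and the matching template variables) are assumptions for illustration and are not taken from this commit; `render_alert` is a hypothetical helper, not part of the mixin.

```python
# Minimal sketch: mimic how the Jsonnet templates in the alert are rendered.
# ASSUMPTION: these config values are illustrative; the real ones live in
# the mixin's $._config and may differ.
config = {
    "alert_aggregation_labels": "cluster, namespace",
    "alert_aggregation_variables": "{{ $labels.cluster }}/{{ $labels.namespace }}",
}

# Same text as the alert's `expr` block, with two %s placeholders.
EXPR_TEMPLATE = """\
(
  sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
  /
  sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
)
# We want to get alerted only in case there's a constant failure.
== 1
"""

# Same text as the alert's `message` annotation, with one named placeholder.
MESSAGE_TEMPLATE = (
    "Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s "
    "is failing to talk to the KV store {{ $labels.kv_name }}.\n"
)


def render_alert(cfg):
    """Hypothetical helper: return (expr, message) rendered from cfg."""
    labels = cfg["alert_aggregation_labels"]
    # Positional formatting, like `||| % [labels, labels]` in the Jsonnet.
    expr = EXPR_TEMPLATE % (labels, labels)
    # Named formatting, like `||| % $._config` in the Jsonnet.
    message = MESSAGE_TEMPLATE % cfg
    return expr, message


if __name__ == "__main__":
    expr, message = render_alert(config)
    print(expr)
    print(message)
```

Under these assumed values, both `sum by(...)` clauses group by `cluster, namespace, pod, status_code, kv_name`, and the message names the affected pod, its cluster/namespace, and the KV store.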
14 changes: 14 additions & 0 deletions cortex-mixin/docs/playbooks.md
@@ -734,6 +734,20 @@ How to **investigate**:
- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information

### CortexKVStoreFailure

This alert fires if a Cortex instance is failing to run any operation on a KV store (e.g. Consul or etcd).

How it **works**:
- Consul is typically used to store the hash ring state.
- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
- If an instance is failing operations on the **hash ring**, either the instance can't update its heartbeat in the ring or it is failing to receive ring updates.
- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or it is failing to receive updates.

How to **investigate**:
- Ensure Consul/Etcd is up and running.
- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.

## Cortex routes by path
**Write path**: