diff --git a/CHANGELOG.md b/CHANGELOG.md
index 090e857..698d38c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -66,6 +66,7 @@
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
 * [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
+* [ENHANCEMENT] Added `CortexKVStoreFailure` alert. #406
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
diff --git a/cortex-mixin/alerts/alerts.libsonnet b/cortex-mixin/alerts/alerts.libsonnet
index 993323e..59022dd 100644
--- a/cortex-mixin/alerts/alerts.libsonnet
+++ b/cortex-mixin/alerts/alerts.libsonnet
@@ -235,6 +235,27 @@
             |||,
           },
         },
+        {
+          alert: 'CortexKVStoreFailure',
+          expr: |||
+            (
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
+              /
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
+            )
+            # We want to get alerted only in case there's a constant failure.
+            == 1
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '5m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
+            ||| % $._config,
+          },
+        },
         {
           alert: 'CortexMemoryMapAreasTooHigh',
           expr: |||
diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
index e61f24f..180ed50 100644
--- a/cortex-mixin/docs/playbooks.md
+++ b/cortex-mixin/docs/playbooks.md
@@ -734,6 +734,20 @@ How to **investigate**:
 - Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
 - Run `kubectl -n describe statefulset ` or `kubectl -n describe deployment ` and look at "Pod Status" and "Events" to get more information
 
+### CortexKVStoreFailure
+
+This alert fires if a Cortex instance is failing to run any operation on a KV store (eg. consul or etcd).
+
+How it **works**:
+- Consul is typically used to store the hash ring state.
+- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
+- If an instance is failing operations on the **hash ring**, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or is failing to receive updates.
+
+How to **investigate**:
+- Ensure Consul/Etcd is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.
+
 ## Cortex routes by path
 
 **Write path**: