From ae827c26bc1c86ae705b0add76e8520b13dd7842 Mon Sep 17 00:00:00 2001
From: Marco Pracucci
Date: Mon, 21 Jun 2021 15:50:27 +0200
Subject: [PATCH] Fixed and improved runtime config alerts and playbooks

Signed-off-by: Marco Pracucci
---
 CHANGELOG.md                         |  3 +++
 cortex-mixin/alerts/alerts.libsonnet | 21 ++++++---------------
 cortex-mixin/docs/playbooks.md       | 26 +++++++++++++++++++++++---
 3 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1bbab7c7..2f3bed9d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,9 +12,12 @@
 * [CHANGE] Dashboards: added overridable `job_labels` and `cluster_labels` to the configuration object as label lists to uniquely identify jobs and clusters in the metric names and group-by lists in dashboards. #319
 * [CHANGE] Dashboards: `alert_aggregation_labels` has been removed from the configuration and overriding this value has been deprecated. Instead the labels are now defined by the `cluster_labels` list, and should be overridden accordingly through that list. #319
 * [CHANGE] Ingester/Ruler: set `-server.grpc-max-send-msg-size-bytes` and `-server.grpc-max-send-msg-size-bytes` to sensible default values (10MB). #326
+* [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
+* [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
+* [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
 
 ## 1.9.0 / 2021-05-18
 
diff --git a/cortex-mixin/alerts/alerts.libsonnet b/cortex-mixin/alerts/alerts.libsonnet
index 5498fbfd..a93ffe05 100644
--- a/cortex-mixin/alerts/alerts.libsonnet
+++ b/cortex-mixin/alerts/alerts.libsonnet
@@ -92,39 +92,30 @@
             },
           },
           {
-            alert: 'CortexInconsistentConfig',
+            alert: 'CortexInconsistentRuntimeConfig',
             expr: |||
-              count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1
+              count(count by(%s, job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1
             ||| % $._config.alert_aggregation_labels,
             'for': '1h',
             labels: {
-              severity: 'warning',
+              severity: 'critical',
             },
             annotations: {
               message: |||
-                An inconsistent config file hash is used across cluster {{ $labels.job }}.
+                An inconsistent runtime config file is used across cluster {{ $labels.job }}.
               |||,
             },
           },
           {
-            // As of https://github.com/cortexproject/cortex/pull/2092, this metric is
-            // only exposed when it is supposed to be non-zero, so we don't need to do
-            // any special filtering on the job label.
-            // The metric itself was renamed in
-            // https://github.com/cortexproject/cortex/pull/2874
-            //
-            // TODO: Remove deprecated metric name of
-            // cortex_overrides_last_reload_successful in the future
             alert: 'CortexBadRuntimeConfig',
             expr: |||
+              # The metric value is reset to 0 on error while reloading the config at runtime.
               cortex_runtime_config_last_reload_successful == 0
-                or
-              cortex_overrides_last_reload_successful == 0
             |||,
             // Alert quicker for human errors.
             'for': '5m',
             labels: {
-              severity: 'warning',
+              severity: 'critical',
             },
             annotations: {
               message: |||
diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
index cc3a3ad9..288e8d5b 100644
--- a/cortex-mixin/docs/playbooks.md
+++ b/cortex-mixin/docs/playbooks.md
@@ -369,13 +369,33 @@ _TODO: this playbook has not been written yet._
 
 _TODO: this playbook has not been written yet._
 
-### CortexInconsistentConfig
+### CortexInconsistentRuntimeConfig
 
-_TODO: this playbook has not been written yet._
+This alert fires if multiple replicas of the same Cortex service are loading a different runtime config.
+
+The Cortex runtime config is a config file which gets live reloaded by Cortex at runtime. In order for Cortex to work properly, the loaded config is expected to be the exact same across multiple replicas of the same Cortex service (eg. distributors, ingesters, ...). When the config changes, there may be short periods of time during which some replicas have loaded the new config and others are still running on the previous one, but it shouldn't last for more than few minutes.
+
+How to **investigate**:
+- Check how many different config file versions (hashes) are reported
+  ```
+  count by (sha256) (cortex_runtime_config_hash{namespace=""})
+  ```
+- Check which replicas are running a different version
+  ```
+  cortex_runtime_config_hash{namespace="",sha256=""}
+  ```
+- Check if the runtime config has been updated on the affected replicas' filesystem
+- Check the affected replicas logs and look for any error loading the runtime config
 
 ### CortexBadRuntimeConfig
 
-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is unable to reload the runtime config.
+
+This typically means an invalid runtime config was deployed. Cortex keeps running with the previous (valid) version of the runtime config; running Cortex replicas and the system availability shouldn't be affected, but new replicas won't be able to startup until the runtime config is fixed.
+
+How to **investigate**:
+- Check the latest runtime config update (it's likely to be broken)
+- Check Cortex logs to get more details about what's wrong with the config
 
 ### CortexQuerierCapacityFull