From 6b9cedc452c5e288e9d7b6e82e39070d2e0e0712 Mon Sep 17 00:00:00 2001
From: Marco Pracucci
Date: Fri, 2 Jul 2021 14:07:37 +0200
Subject: [PATCH] Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors

Signed-off-by: Marco Pracucci
---
 CHANGELOG.md                         |  2 ++
 cortex-mixin/alerts/alerts.libsonnet | 14 +++++++-------
 cortex-mixin/docs/playbooks.md       | 28 ++++++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 835e779c..8a8a5176 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,10 +17,12 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboard show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
diff --git a/cortex-mixin/alerts/alerts.libsonnet b/cortex-mixin/alerts/alerts.libsonnet
index 71655505..ad24ac8e 100644
--- a/cortex-mixin/alerts/alerts.libsonnet
+++ b/cortex-mixin/alerts/alerts.libsonnet
@@ -180,20 +180,20 @@
           },
         },
         {
-          alert: 'CortexCacheRequestErrors',
+          alert: 'CortexMemcachedRequestErrors',
           expr: |||
-            100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-              /
-            sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-              > 1
+            (
+              sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+              sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+            ) * 100 > 5
           ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-          'for': '15m',
+          'for': '5m',
           labels: {
             severity: 'warning',
           },
           annotations: {
             message: |||
-              Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+              Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
             |||,
           },
         },
diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
index dc505852..5c4cbd43 100644
--- a/cortex-mixin/docs/playbooks.md
+++ b/cortex-mixin/docs/playbooks.md
@@ -414,9 +414,33 @@ _TODO: this playbook has not been written yet._
 
 _TODO: this playbook has not been written yet._
 
-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors
 
-_TODO: this playbook has not been written yet._
+This alert fires if Cortex memcached client is experiencing an high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issue
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details
 
 ### CortexOldChunkInMemory