grafana · pracucci · Jul 2, 2021
@@ -17,10 +17,12 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

@@ -180,20 +180,20 @@
           },
         },
         {
-          alert: 'CortexCacheRequestErrors',
+          alert: 'CortexMemcachedRequestErrors',
           expr: |||
-            100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-              /
-            sum  by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-              > 1
+            (
+              sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+              sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+            ) * 100 > 5
           ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-          'for': '15m',
+          'for': '5m',
           labels: {
             severity: 'warning',
           },
           annotations: {
             message: |||
-              Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+              Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
             |||,
           },
         },
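
Review note: the new expression divides the per-cache, per-operation failure rate by the matching operation rate and fires when the result exceeds 5%. A minimal Python sketch of that arithmetic follows. The rates and the operation names in it are made-up sample values, `firing_series` is a hypothetical helper rather than anything in the mixin, and the `for: 5m` hold period is not modelled.

```python
# Hypothetical per-second rates keyed by (name, operation).
# These numbers are illustrative only, not real measurements.
failure_rates = {
    ("chunks-cache", "getmulti"): 3.0,
    ("index-cache", "set"): 0.2,
}
total_rates = {
    ("chunks-cache", "getmulti"): 50.0,
    ("index-cache", "set"): 40.0,
}

THRESHOLD_PERCENT = 5.0  # mirrors the "> 5" in the alert expression


def firing_series(failures, totals, threshold=THRESHOLD_PERCENT):
    """Return {labels: error_percent} for series above the threshold."""
    result = {}
    for labels, failed in failures.items():
        total = totals.get(labels)
        if not total:
            # No matching series or zero traffic: nothing to compare.
            continue
        percent = failed / total * 100
        if percent > threshold:
            result[labels] = percent
    return result


if __name__ == "__main__":
    for (name, operation), percent in firing_series(failure_rates, total_rates).items():
        print(f"Memcached {name}: {percent:.2f}% errors for {operation} operation")
```

With these sample values only the `chunks-cache` series crosses the threshold. In the real rule the series would additionally have to stay above it for the whole `for` window before the alert fires.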

@@ -414,9 +414,33 @@ _TODO: this playbook has not been written yet._
 
 _TODO: this playbook has not been written yet._
 
-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors
 
-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issues
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details
 
 ### CortexOldChunkInMemory