Merge pull request grafana/cortex-jsonnet#346 from grafana/playbook-f…

…or-CortexCacheRequestErrors Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors
grafana · Jul 2, 2021 · d876f21
2 parents 348a00d + 77d4b45
commit d876f21
Showing 2 changed files with 33 additions and 9 deletions.
diff --git a/jsonnet/mimir-mixin/alerts/alerts.libsonnet b/jsonnet/mimir-mixin/alerts/alerts.libsonnet
@@ -165,20 +165,20 @@
           },
         },
         {
-          alert: 'CortexCacheRequestErrors',
+          alert: 'CortexMemcachedRequestErrors',
           expr: |||
-            100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-              /
-            sum  by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-              > 1
+            (
+              sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+              sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+            ) * 100 > 5
           ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-          'for': '15m',
+          'for': '5m',
           labels: {
             severity: 'warning',
           },
           annotations: {
             message: |||
-              Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+              Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
             |||,
           },
         },
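
The ratio-and-threshold arithmetic in the new expression can be sketched in Python. This is only an illustration of the logic, not how Prometheus evaluates the rule: the function names and sample rates are hypothetical, the per-label grouping is reduced to a single series, and the `for: 5m` hold period is not modeled.

```python
import math

THRESHOLD_PERCENT = 5.0  # mirrors the "> 5" in the new alert expression


def error_percentage(failures_per_sec: float, operations_per_sec: float) -> float:
    """Failures as a percentage of operations for one (name, operation) series."""
    if operations_per_sec == 0:
        # Assumption of this sketch: follow IEEE 754 float division,
        # so 0/0 is NaN and x/0 for x > 0 is +Inf.
        return math.nan if failures_per_sec == 0 else math.inf
    return 100.0 * failures_per_sec / operations_per_sec


def exceeds_threshold(failures_per_sec: float, operations_per_sec: float) -> bool:
    """True when the error percentage is strictly above the threshold."""
    # Any comparison with NaN is False, so an idle cache does not trip the check.
    return error_percentage(failures_per_sec, operations_per_sec) > THRESHOLD_PERCENT


if __name__ == "__main__":
    # Hypothetical rates: 3 failures/s out of 50 operations/s.
    print(exceeds_threshold(3.0, 50.0))
```

The comparison is strict, so a series sitting at exactly 5% does not satisfy the condition.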

diff --git a/jsonnet/mimir-mixin/docs/playbooks.md b/jsonnet/mimir-mixin/docs/playbooks.md
@@ -435,9 +435,33 @@ How to **investigate**:
   - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, consider enabling it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant (the value should be the number of queriers assigned to the tenant).
   - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, consider temporarily increasing the shard size for the affected tenants. Be aware that this could affect other tenants too, by reducing the resources available to run their queries. Alternatively, you may choose to do nothing and let Cortex return errors for that tenant once its per-tenant queue is full.
 
-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors
 
-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issues:
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details
 
 ### CortexOldChunkInMemory
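
The `reason`-based triage in the `CortexMemcachedRequestErrors` playbook added above can be sketched as a small Python lookup. This is a hypothetical helper, not part of the mixin: the names `ACTIONS_BY_REASON`, `investigation_query`, and `next_steps` are invented for illustration, and falling back to the `other` guidance for an unrecognized reason is an assumption of this sketch. The step text and the query are copied from the playbook.

```python
# Suggested first steps per failure reason, copied from the playbook text.
ACTIONS_BY_REASON = {
    "timeout": ["Scale up the memcached replicas"],
    "server-error": ["Check both Cortex and memcached logs to find more details"],
    "network-error": ["Check Cortex logs to find more details"],
    "malformed-key": [
        "The key is too long or contains invalid characters",
        "Check Cortex logs to find the offending key",
        "Fixing this will require changes to the application code",
    ],
    "other": ["Check both Cortex and memcached logs to find more details"],
}


def investigation_query(namespace: str) -> str:
    """Build the playbook's PromQL query for the given Cortex namespace."""
    return (
        "sum by(name, operation, reason) "
        '(rate(thanos_memcached_operation_failures_total{namespace="%s"}[1m])) > 0'
        % namespace
    )


def next_steps(reason: str) -> list:
    """Return the playbook steps for a reason label value."""
    # Assumption: reasons not listed in the playbook use the "other" guidance.
    return ACTIONS_BY_REASON.get(reason, ACTIONS_BY_REASON["other"])


if __name__ == "__main__":
    # "cortex-prod" is a placeholder namespace, not one named in the source.
    print(investigation_query("cortex-prod"))
    for step in next_steps("malformed-key"):
        print("-", step)
```

Keeping the mapping as plain data makes it straightforward to update if the playbook gains or renames a reason.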