Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors #346

Merged: 2 commits, Jul 2, 2021
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -17,10 +17,12 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboard show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
14 changes: 7 additions & 7 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -180,20 +180,20 @@
       },
     },
     {
-      alert: 'CortexCacheRequestErrors',
+      alert: 'CortexMemcachedRequestErrors',
       expr: |||
-        100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-          /
-        sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-          > 1
+        (
+          sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+          sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+        ) * 100 > 5
       ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-      'for': '15m',
+      'for': '5m',
       labels: {
         severity: 'warning',
       },
       annotations: {
         message: |||
-          Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+          Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
         |||,
       },
     },
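
To illustrate the arithmetic in the new expression, here is a minimal Python sketch of the ratio and the 5% threshold. The function names are hypothetical and are not part of Cortex or the mixin. In practice Prometheus evaluates the expression per label set, and the alert only fires once the condition has held for 5m.

```python
def failure_percentage(failures_per_sec: float, operations_per_sec: float) -> float:
    """Return failed operations as a percentage of all operations.

    Mirrors (failures / operations) * 100 from the alert expression.
    Hypothetical helper, for illustration only.
    """
    if operations_per_sec == 0:
        # Simplification: with no operations there is nothing to alert on.
        return 0.0
    return failures_per_sec / operations_per_sec * 100


def exceeds_threshold(
    failures_per_sec: float,
    operations_per_sec: float,
    threshold_percent: float = 5.0,
) -> bool:
    """True when the failure percentage is strictly above the threshold."""
    return failure_percentage(failures_per_sec, operations_per_sec) > threshold_percent


if __name__ == "__main__":
    # 1 failure/s out of 4 operations/s is 25%, well above the 5% threshold.
    print(failure_percentage(1, 4), exceeds_threshold(1, 4))
```

For example, 1 failed operation per second out of 100 stays below the threshold, while 1 out of 10 is above it.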
28 changes: 26 additions & 2 deletions cortex-mixin/docs/playbooks.md
@@ -414,9 +414,33 @@ _TODO: this playbook has not been written yet._

_TODO: this playbook has not been written yet._

-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors

-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing the issue
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details
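
The triage steps in the `CortexMemcachedRequestErrors` playbook could be scripted roughly as follows. This is only a sketch in Python: the helper names and the example base URL are hypothetical. The query text and the per-`reason` guidance are copied from the playbook, and `/api/v1/query` is the instant-query endpoint of the Prometheus HTTP API.

```python
from typing import List
from urllib.parse import urlencode

# Query from the playbook, with the namespace left as a format placeholder.
FAILURES_BY_REASON_QUERY = (
    "sum by(name, operation, reason) "
    '(rate(thanos_memcached_operation_failures_total{{namespace="{namespace}"}}[1m])) > 0'
)

# Suggested next steps per `reason`, copied from the playbook.
NEXT_STEPS = {
    "timeout": ["Scale up the memcached replicas"],
    "server-error": ["Check both Cortex and memcached logs to find more details"],
    "network-error": ["Check Cortex logs to find more details"],
    "malformed-key": [
        "Check Cortex logs to find the offending key",
        "Fixing this will require changes to the application code",
    ],
    "other": ["Check both Cortex and memcached logs to find more details"],
}


def build_query(namespace: str) -> str:
    """Fill the Cortex cluster namespace into the playbook query."""
    return FAILURES_BY_REASON_QUERY.format(namespace=namespace)


def instant_query_url(prometheus_base_url: str, namespace: str) -> str:
    """Build an instant-query URL; the base URL is supplied by the caller."""
    params = urlencode({"query": build_query(namespace)})
    return prometheus_base_url.rstrip("/") + "/api/v1/query?" + params


def next_steps(reason: str) -> List[str]:
    """Look up guidance for a reason; unknown reasons fall back to `other`."""
    return NEXT_STEPS.get(reason, NEXT_STEPS["other"])


if __name__ == "__main__":
    # "cortex" and the base URL below are placeholder values.
    print(instant_query_url("http://prometheus:9090", "cortex"))
    for step in next_steps("malformed-key"):
        print("-", step)
```

Falling back to the `other` guidance for unrecognised reasons is a choice made for this sketch. The playbook itself lists only the five reasons above.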

### CortexOldChunkInMemory
