Add new alerts for alertmanager sharding mode of operation.
stevesg committed Aug 24, 2021
1 parent 1cd8aca commit 0e78e94
Showing 4 changed files with 124 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -44,6 +44,7 @@
* [ENHANCEMENT] Added 256MB memory ballast to querier. #369
* [ENHANCEMENT] Update gsutil command for `not healthy index found` playbook #370
* [ENHANCEMENT] Update `etcd-operator` to latest version (see https://github.com/grafana/jsonnet-libs/pull/480). #263
* [ENHANCEMENT] Added alertmanager alerts covering configuration syncs and sharding operation. #377
* [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
* [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
* [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
1 change: 1 addition & 0 deletions cortex-mixin/alerts.libsonnet
@@ -1,6 +1,7 @@
{
  prometheusAlerts+::
    (import 'alerts/alerts.libsonnet') +
    (import 'alerts/alertmanager.libsonnet') +

    (if std.member($._config.storage_engine, 'blocks')
     then
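The one-line change above works because each imported file declares its rules under `groups+:`, so adding the objects with `+` concatenates their group lists instead of replacing them. A minimal sketch of that behaviour, assuming two stand-in objects in place of the real imports (the group name `cortex_alerts` and the `groupNames` field are placeholders for illustration, not taken from this commit):

```jsonnet
// Stand-ins for the two imported files; the real files contain full rule lists.
local base = { groups+: [{ name: 'cortex_alerts', rules: [] }] };
local alertmanager = { groups+: [{ name: 'alertmanager_alerts', rules: [] }] };

{
  // Same composition pattern as alerts.libsonnet: `+` merges the objects,
  // and `groups+:` appends to the list inherited from the left-hand side.
  prometheusAlerts+:: base + alertmanager,

  // Visible field added only to inspect the result of the merge.
  groupNames: [g.name for g in self.prometheusAlerts.groups],
}
```

Evaluating this should list both group names in `groupNames`, in import order. Had the second object used `groups:` instead of `groups+:`, it would have overwritten the first list.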
98 changes: 98 additions & 0 deletions cortex-mixin/alerts/alertmanager.libsonnet
@@ -0,0 +1,98 @@
{
  groups+: [
    {
      name: 'alertmanager_alerts',
      rules: [
        {
          alert: 'CortexAlertmanagerSyncConfigsFailing',
          expr: |||
            rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0
          |||,
          'for': '30m',
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to read tenant configurations from storage.
            |||,
          },
        },
        {
          alert: 'CortexAlertmanagerRingCheckFailing',
          expr: |||
            rate(cortex_alertmanager_ring_check_errors_total[2m]) > 0
          |||,
          'for': '10m',
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is unable to check tenant ownership via the ring.
            |||,
          },
        },
        {
          alert: 'CortexAlertmanagerPartialStateMergeFailing',
          expr: |||
            rate(cortex_alertmanager_partial_state_merges_failed_total[2m]) > 0
          |||,
          'for': '10m',
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to merge partial state changes received from a replica.
            |||,
          },
        },
        {
          alert: 'CortexAlertmanagerReplicationFailing',
          expr: |||
            rate(cortex_alertmanager_state_replication_failed_total[2m]) > 0
          |||,
          'for': '10m',
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to replicate partial state to its replicas.
            |||,
          },
        },
        {
          alert: 'CortexAlertmanagerPersistStateFailing',
          expr: |||
            rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0
          |||,
          'for': '1h',
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is unable to persist full state snapshots to remote storage.
            |||,
          },
        },
        {
          alert: 'CortexAlertmanagerInitialSyncFailed',
          expr: |||
            increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0
          |||,
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} was unable to obtain some initial state when starting up.
            |||,
          },
        },
      ],
    },
  ],
}
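All six rules ship with `severity: 'critical'`. A consumer who wants a different severity for one of them could patch the rendered groups rather than edit the mixin. The following is a hedged sketch of one way to do that, not something this commit provides: the `alerts` object is a reduced stand-in for the mixin, and the choice of `warning` for `CortexAlertmanagerInitialSyncFailed` is purely illustrative.

```jsonnet
// Reduced stand-in for the mixin: one group holding one rule from this commit.
local alerts = {
  prometheusAlerts:: {
    groups: [
      {
        name: 'alertmanager_alerts',
        rules: [
          {
            alert: 'CortexAlertmanagerInitialSyncFailed',
            expr: 'increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0',
            labels: { severity: 'critical' },
          },
        ],
      },
    ],
  },
};

// Hypothetical override: rebuild every group, changing the labels of one rule.
local patched = alerts {
  prometheusAlerts+:: {
    groups: [
      group {
        rules: [
          if rule.alert == 'CortexAlertmanagerInitialSyncFailed'
          then rule { labels+: { severity: 'warning' } }
          else rule
          for rule in group.rules
        ],
      }
      for group in super.groups
    ],
  },
};

// Emit the patched alert groups.
patched.prometheusAlerts
```

Using `labels+:` keeps any other labels on the rule intact, and rules that do not match the name pass through unchanged.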
24 changes: 24 additions & 0 deletions cortex-mixin/docs/playbooks.md
@@ -636,6 +636,30 @@ This can be triggered if there are too many HA dedupe keys in etcd. We saw this
},
```
### CortexAlertmanagerSyncConfigsFailing

Work in progress.

### CortexAlertmanagerRingCheckFailing

Work in progress.

### CortexAlertmanagerPartialStateMergeFailing

Work in progress.

### CortexAlertmanagerReplicationFailing

Work in progress.

### CortexAlertmanagerPersistStateFailing

Work in progress.

### CortexAlertmanagerInitialSyncFailed

Work in progress.

## Cortex routes by path
**Write path**: