
Addressed review comments
Signed-off-by: Marco Pracucci <[email protected]>
pracucci committed Jul 5, 2021
1 parent 6421751 commit c6b4464
30 changes: 28 additions & 2 deletions cortex-mixin/docs/playbooks.md
@@ -90,7 +90,7 @@ How to **fix**:

This alert fires when a specific Cortex route is experiencing high latency.

The alert message includes both the Cortex service and route experiencing high latency. Establish whether the alert is about the read or write path based on that (see [Cortex routes by path](#cortex-routes-by-path)).
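To confirm which service and route are slow, a query along these lines can help. This is a minimal sketch assuming the standard `cortex_request_duration_seconds` histogram exposed by Cortex services; adjust the label matchers to your deployment:

```
histogram_quantile(0.99,
  sum by (le, job, route) (
    rate(cortex_request_duration_seconds_bucket[5m])
  )
)
```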

#### Write Latency

@@ -106,6 +106,9 @@ How to **investigate**:
  - Typically, distributor p99 latency is in the range 50-100ms. If the distributor latency is higher than this, you may need to scale up the distributors.
- **`ingester`**
  - Typically, ingester p99 latency is in the range 5-50ms. If the ingester latency is higher than this, you should investigate the root cause before scaling up ingesters (a query sketch follows this list).
  - Check out the following alerts and fix them if firing:
    - `CortexProvisioningTooManyActiveSeries`
    - `CortexProvisioningTooManyWrites`
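When these alerts are firing or ingester latency is above the expected range, queries like the following can show per-ingester pressure. A sketch assuming the `cortex_ingester_memory_series` and `cortex_ingester_ingested_samples_total` metrics and a `pod` label (the grouping label depends on your scrape configuration):

```
# Active series per ingester: sustained high values can explain elevated latency.
sum by (pod) (cortex_ingester_memory_series)

# Per-ingester sample ingestion rate.
sum by (pod) (rate(cortex_ingester_ingested_samples_total[5m]))
```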

#### Read Latency

@@ -130,6 +133,7 @@ How to **investigate**:
- High CPU utilization in ingesters
  - Scale up ingesters
- Low cache hit ratio in the store-gateways
  - Check the `Memcached Overview` dashboard
  - If the memcached eviction rate is high, you should scale up the memcached replicas. Check the recommendations of the `Cortex / Scaling` dashboard and make reasonable adjustments as necessary (a query sketch to check the eviction rate follows this list).
  - If the memcached eviction rate is zero or very low, it may be caused by "first time" queries
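To check the eviction rate outside the dashboard, a minimal sketch assuming the standard `memcached_items_evicted_total` counter from `memcached_exporter` (the aggregation label depends on your setup):

```
sum by (instance) (rate(memcached_items_evicted_total[5m]))
```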

@@ -140,7 +144,7 @@ This alert fires when the rate of 5xx errors of a specific route is > 1% for some time.
This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one (but the investigation procedure is typically the same).

How to **investigate**:
- Check for which route the alert fired (see [Cortex routes by path](#cortex-routes-by-path))
  - Write path: open the `Cortex / Writes` dashboard
  - Read path: open the `Cortex / Reads` dashboard
- Looking at the dashboard, you should see in which Cortex service the errors originate (a query sketch follows this list)
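To break the error rate down by route directly in Prometheus, a sketch assuming the `status_code` label on the standard `cortex_request_duration_seconds` histogram:

```
sum by (route) (rate(cortex_request_duration_seconds_count{status_code=~"5.."}[5m]))
  /
sum by (route) (rate(cortex_request_duration_seconds_count[5m]))
```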
@@ -588,6 +592,28 @@ This can be triggered if there are too many HA dedupe keys in etcd. We saw this
},
```
## Cortex routes by path

**Write path**:
- `/distributor.Distributor/Push`
- `/cortex.Ingester/Push`
- `api_v1_push`
- `api_prom_push`
- `api_v1_push_influx_write`

**Read path**:
- `/schedulerpb.SchedulerForFrontend/FrontendLoop`
- `/cortex.Ingester/QueryStream`
- `/cortex.Ingester/QueryExemplars`
- `/gatewaypb.StoreGateway/Series`
- `api_prom_label`
- `api_prom_api_v1_query_exemplars`

**Ruler / rules path**:
- `api_v1_rules`
- `api_v1_rules_namespace`
- `api_prom_rules_namespace`
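These route values can be plugged into the `route` label of the queries above to focus on a single path. For example, a sketch of p99 latency across the write-path push routes (the regex is illustrative; adjust it to the routes present in your cluster):

```
histogram_quantile(0.99,
  sum by (le) (
    rate(cortex_request_duration_seconds_bucket{route=~"/distributor.Distributor/Push|api_(v1|prom)_push"}[5m])
  )
)
```
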
## Cortex blocks storage - What to do when things go wrong
## Recovering from a potential data loss incident
