From 025285e57b6fdd11b9522082c180df868c955767 Mon Sep 17 00:00:00 2001 From: Christian Simon Date: Tue, 19 Oct 2021 17:06:46 +0100 Subject: [PATCH] Import cortex mixin from upstream (#373) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Increased CortexAllocatingTooMuchMemory alert threshold Signed-off-by: Marco Pracucci * Add alert for etcd memory limits close Signed-off-by: Goutham Veeramachaneni * the distributor now supports push via GRPC (https://github.com/grafana/cortex-jsonnet/pull/266) Signed-off-by: Mauro Stettler * Fixed CortexQuerierHighRefetchRate alert Signed-off-by: Marco Pracucci * Fixed label matcher Signed-off-by: Marco Pracucci * Sort legend descending in the CPU/memory panels Signed-off-by: Marco Pracucci * Add slow queries dashboard Signed-off-by: Marco Pracucci * Added tenant ID field to the table Signed-off-by: Marco Pracucci * Add recording rules to calculate Cortex scaling - Update dashboard so it only shows under provisioned services and why - Add sizing rules based on limits. - Add some docs to the dashboard. Signed-off-by: Tom Wilkie * Increased CortexRequestErrors alert severity Signed-off-by: Marco Pracucci * Fixed "Disk Writes" and "Disk Reads" panels Signed-off-by: Marco Pracucci * Pre-compute aggregations to optimize scaling recording rules Signed-off-by: Marco Pracucci * Removed 5m step from subquery Signed-off-by: Marco Pracucci * Add function to customize compactor statefulset Signed-off-by: Marco Pracucci * Use the job name in compactor alerts too Signed-off-by: Marco Pracucci * Fixed CortexCompactorRunFailed threshold Signed-off-by: Marco Pracucci * Added Cortex Rollout progress dashboard Signed-off-by: Marco Pracucci * Fix 'Unhealthy pods' in Cortex Rollout dashboard Signed-off-by: Marco Pracucci * Simplify compactor alerts We should simply alert on things not having run since X. Signed-off-by: Goutham Veeramachaneni * Use the right metric Signed-off-by: Goutham Veeramachaneni * Apply suggestions from code review Co-authored-by: Marco Pracucci Signed-off-by: Goutham Veeramachaneni * Fix CortexCompactorHasNotSuccessfullyRunCompaction to avoid false positives Signed-off-by: Marco Pracucci * Introduce ingester instance limits to configuration, and add alerts. (https://github.com/grafana/cortex-jsonnet/pull/296) * Introduce ingester instance limits to configuration, and add alerts. * CHANGELOG.md * Address (internal) review feedback. * Improve CortexRulerFailedRingCheck alert Signed-off-by: Marco Pracucci * Added example Loki query to CortexTenantHasPartialBlocks playbook Signed-off-by: Marco Pracucci * Default dashboards to Cortex blocks storage only Signed-off-by: Marco Pracucci * Add missing memberlist components to alerts This adds the admin-api, compactor and store-gateway components to the memberlist alert. Signed-off-by: Christian Simon * mixin: Add gateway to valid job names (for GEM) * Only show namespaces from selected cluster. "All" works thanks to using regex matcher. (https://github.com/grafana/cortex-jsonnet/pull/311) * Only show namespaces from selected cluster. "All" works thanks to using regex matcher. 
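Several of the changes above (the gateway job names, the memberlist ring members, the cluster/namespace matchers) are plain `_config` knobs, so downstream users can adapt them without patching the mixin. Below is a minimal sketch of such an override; it assumes the mixin is vendored at the path this patch creates and that `jb install` has been run so its dependencies resolve, and the gateway regex is only an example value, not necessarily the shipped default.

```jsonnet
// Hypothetical downstream override of the mixin's job name matchers.
local mixin = import 'operations/mimir-mixin/mixin.libsonnet';

mixin {
  _config+:: {
    job_names+: {
      // Also match a GEM-style gateway deployment (example regex).
      gateway: '(gateway|cortex-gw|cortex-gw-internal)',
    },
  },
}
```

Rendering this file instead of `mixin.libsonnet` produces the same dashboards and alerts with the overridden matchers baked in.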
* CHANGELOG.md * Fixed CortexIngesterHasNotShippedBlocks alert false positive Signed-off-by: Marco Pracucci * Fixed mixin linter Signed-off-by: Marco Pracucci * Add placeholders to make the linter pass Signed-off-by: Marco Pracucci * cortex-mixin: Use kube_pod_container_resource_{requests,limits} metrics This updates the recording rules to make them compatible with kube-state-metrics v2.0.0 which introduces some breaking changes in some metric names. With kube-state-metrics v2.0.0: - `kube_pod_container_resource_requests_cpu_cores` becomes `kube_pod_container_resource_requests{resource="cpu"}` - `kube_pod_container_resource_requests_memory_bytes` becomes `kube_pod_container_resource_requests{resource="memory"}` * cortex-mixin: Make the recording rules backwards compatible * refactor: functions to reduce code duplication - improve overrideability - making more use of `per_instance_label` from _config - added containerNetworkPanel functions for dashboards to use * fix: lint * refactor: config for job aggregation strings - to make it easier to override, define "cluster_namespace_job" in $._config as `job_aggregation_prefix`. - added some `job_aggregation_labels_*` as well The resulting output does not change (unless config is overridden). * lint * Update cortex-mixin/dashboards/writes.libsonnet simplify mapping by extending $._config Co-authored-by: Marco Pracucci * fix: syntax * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * Lower CortexIngesterRestarts severity Signed-off-by: Marco Pracucci * feature: add some text boxes and descriptions Focussing on the reads and writes dashboards, added some info panels and hover-over descriptions for some of the panels. Some common code used by the compactor also received additional text content. New functions: - addRows - addRowsIf ...to add a list of rows to a dashboard. The `thanosMemcachedCache` function has had some of its query text sprawled out for easier reading and comparison with similar dashboard queries. * fix: text replacements, repair addRows * Changing copy to add 'latency' as well. * Cut down on text from initial PR. Tucked existing text from the compactor dashboard under tooltips, rather than making them text boxes. * Getting rid of a few space/comma errors. 
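The kube-state-metrics v2.0.0 rename called out above is why the recording rules got a backwards-compatible form. One way to express that compatibility is to aggregate each metric name separately and join the results with `or`, so whichever name the cluster exposes is used; the sketch below is illustrative only, and the rule name and grouping labels are not the exact ones in `recording_rules.libsonnet`.

```jsonnet
// Illustrative recording rule that tolerates both kube-state-metrics naming schemes.
{
  record: 'cluster_namespace_pod:kube_pod_container_resource_requests_cpu:sum',
  expr: |||
    sum by (cluster, namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
    or
    sum by (cluster, namespace, pod) (kube_pod_container_resource_requests_cpu_cores)
  |||,
}
```

Because `or` prefers the left-hand side for matching label sets, clusters running kube-state-metrics v2 use the new metric and older clusters fall back to the deprecated one.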
* Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * fix: formatting - limit to 4 panels per row * fmt * fix: remove accidental line * Update cortex-mixin/dashboards/dashboard-utils.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * fix: Requests per second * fix: text * Apply suggestions from code review as per @osg-grafana Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * fix: clarity * Apply suggestions from code review as per @osg-grafana Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> * Add a simple playbook for ingester series limit alert. Signed-off-by: Callum Styan * Add cortex-gw-internal to watched gateway metrics (https://github.com/grafana/cortex-jsonnet/pull/328) * Add cortex-gw-internal to watched gateway metrics * Update CHANGELOG.md Co-authored-by: Marco Pracucci * fix: query formatting to aid in merge * fix: query formatting to aid in merge * fix: consistent labelling * fix: ensure panel titles are consistent - Most existing "per second" panel titles in `main` are written "/ sec", corrected recent commits to match. 
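The `addRows`/`addRowsIf` helpers mentioned in the notes above simply fold a list of rows onto a grafana-builder dashboard. A self-contained sketch of the idea, using a toy stand-in for the dashboard object (the real one comes from grafana-builder and sits behind `dashboards/dashboard-utils.libsonnet`):

```jsonnet
// Toy dashboard object; only an addRow() method is assumed.
local dashboard = {
  rows: [],
  addRow(row):: self + { rows+: [row] },

  // Append a list of rows in order.
  addRows(rows):: std.foldl(function(d, row) d.addRow(row), rows, self),

  // Append the rows only when the condition holds (e.g. blocks-storage-only panels).
  addRowsIf(condition, rows):: if condition then self.addRows(rows) else self,
};

dashboard
  .addRows(['Writes', 'Writes resources'])
  .addRowsIf(false, ['Chunks (chunks storage only)'])
```

Evaluating this with `jsonnet` yields `{ "rows": ["Writes", "Writes resources"] }`, i.e. the conditional rows are dropped.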
* Improved CortexIngesterReachingSeriesLimit playbook and added CortexIngesterReachingTenantsLimit playbook Signed-off-by: Marco Pracucci * Better formatting for ingester_instance_limits+ example Signed-off-by: Marco Pracucci * Clarify which alerts apply to chunks storage only Signed-off-by: Marco Pracucci * Improve compactor alerts and playbooks Signed-off-by: Marco Pracucci * Addressed review comments Signed-off-by: Marco Pracucci * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci Co-authored-by: Peter Štibraný * Fixed and improved runtime config alerts and playbooks Signed-off-by: Marco Pracucci * fix: resolve review feedback * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci Co-authored-by: Peter Štibraný * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci Co-authored-by: Peter Štibraný * Mark CortexTableSyncFailure and CortexOldChunkInMemory alerts as chunks storage only Signed-off-by: Marco Pracucci * Fixed whitespace noise Signed-off-by: Marco Pracucci * refactor: resources dashboard container functions added: - containerDiskWritesPanel - containerDiskReadsPanel - containerDiskSpaceUtilization * revert: matching spacing format of main * lint: white noise * Add playbook for CortexRequestErrors and config option to exclude specific routes Signed-off-by: Marco Pracucci * Change min-step to 15s to show better detail. $__rate_interval will be floored at 4x this quantity, so 15s lets us see faster transients than the previous value of 1m. Signed-off-by: Bryan Boreham * Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck Signed-off-by: Marco Pracucci * Remove CortexQuerierCapacityFull alert Signed-off-by: Marco Pracucci * Added playbook for CortexProvisioningTooManyWrites Signed-off-by: Marco Pracucci * Added playbook for CortexAllocatingTooMuchMemory Signed-off-by: Marco Pracucci * Address review feedback Signed-off-by: Marco Pracucci * Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors Signed-off-by: Marco Pracucci * Replace ruler alerts, and add playbooks. * Addressed review comments Signed-off-by: Marco Pracucci * Fix white space. * Better alert messages. * Improve CortexIngesterReachingSeriesLimit playbook Signed-off-by: Marco Pracucci * Add playbook for CortexProvisioningTooManyActiveSeries Signed-off-by: Marco Pracucci * Improve messaging. * Fixed formatting Signed-off-by: Marco Pracucci * Improved alert messages with Cortex cluster Signed-off-by: Marco Pracucci * Improved CortexRequestLatency playbook Signed-off-by: Marco Pracucci * Added 'Per route p99 latency' to ruler configuration API Signed-off-by: Marco Pracucci * Addressed review comments Signed-off-by: Marco Pracucci * Added object storage metrics for Ruler and Alertmanager Signed-off-by: Marco Pracucci * Add playbook entry for CortexGossipMembersMismatch. * Clarify data loss related to 'not healthy index found' issue Signed-off-by: Marco Pracucci * Review comments. * Improve CortexIngesterReachingSeriesLimit playbook Signed-off-by: Marco Pracucci * Increased CortexIngesterReachingSeriesLimit critical alert threshold from 80% to 85% Signed-off-by: Marco Pracucci * Increase CortexIngesterReachingSeriesLimit warning `for` duration As it turns out, during normal shuffle-sharding operation, the 70% mark is often exceeded, but not by much.
Rather than increasing the threshold to 75%, this commit increases the `for` duration to 3h, following the thought that we want this alert to fire if ingesters are constantly above the threshold even after stale series are flushed (which occurs every 2h, when the TSDB head is compacted). We flush series with a timestamp between [-3h, -1h] after the last compaction, so the worst case scenario is that it takes 3h to flush a stale series. Signed-off-by: beorn7 * Fix scaling dashboard to work on multi-zone ingesters Signed-off-by: Marco Pracucci * Simplified cluster_namespace_deployment:actual_replicas:count recording rule Signed-off-by: Marco Pracucci * Added a comment to explain '.*?' Signed-off-by: Marco Pracucci * Fix rollout dashboard to work with multi-zone deployments Signed-off-by: Marco Pracucci * Fixed legends Signed-off-by: Marco Pracucci * Extend Alertmanager dashboard with currently unused metrics. Metrics for general operation: - Added "Tenants" stat panel using: `cortex_alertmanager_tenants_discovered` - Added "Tenant Configuration Sync" row using: `cortex_alertmanager_sync_configs_failed_total` `cortex_alertmanager_sync_configs_total` `cortex_alertmanager_ring_check_errors_total` Metrics specific to sharding operation: - Added "Sharding Initial State Sync" row using: `cortex_alertmanager_state_initial_sync_completed_total` `cortex_alertmanager_state_initial_sync_completed_total` `cortex_alertmanager_state_initial_sync_duration_seconds` - Added "Sharding State Operations" row using: `cortex_alertmanager_state_fetch_replica_state_total` `cortex_alertmanager_state_fetch_replica_state_failed_total` `cortex_alertmanager_state_replication_total` `cortex_alertmanager_state_replication_failed_total` `cortex_alertmanager_partial_state_merges_total` `cortex_alertmanager_partial_state_merges_failed_total` `cortex_alertmanager_state_persist_total` `cortex_alertmanager_state_persist_failed_total` * Review comments + fix latency panel. * Review comments. * Clarify the gsutil mv command for moving corrupted blocks Signed-off-by: Tyler Reid * Modify log message to fit example command Signed-off-by: Tyler Reid * Update grafana-builder from Mar 2019 to Feb 2021 Brings in the following changes: - Use default as a picker value for datasource variable grafana/jsonnet-libs#204 - allow table link in new tab grafana/jsonnet-libs#238 - allow setting a default datasource grafana/jsonnet-libs#301 - Add textPanel grafana/jsonnet-libs#341 - make status code label name overrideable in qpsPanel grafana/jsonnet-libs#397 - use $__rate_interval over $__interval grafana/jsonnet-libs#401 - Set shared tooltip to false by default grafana/jsonnet-libs#458 - Use custom 'all' value to avoid massive regexes in queries. grafana/jsonnet-libs#469 https://github.com/grafana/jsonnet-libs/commits/master/grafana-builder/ * Match query-frontend/query-scheduler/querier custom deployments by default Signed-off-by: Marco Pracucci * Create playbooks for sharded alertmanager * Add new alerts for alertmanager sharding mode of operation.
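For context on the rows listed above, the panels boil down to per-cluster aggregations of those per-instance metrics. Below is a rough sketch of the kind of queries involved, assuming `cluster, namespace` as the aggregation labels; the real dashboard derives them from `_config` and wraps them in grafana-builder panels, so these are not the exact panel definitions.

```jsonnet
// Illustrative Alertmanager dashboard queries; not the exact panel definitions.
local group = 'cluster, namespace';

{
  // "Tenants" stat panel.
  tenants: 'max by (%s) (cortex_alertmanager_tenants_discovered)' % group,

  // "Tenant Configuration Sync" failure ratio.
  sync_configs_failure_ratio: |||
    sum by (%(group)s) (rate(cortex_alertmanager_sync_configs_failed_total[$__rate_interval]))
      /
    sum by (%(group)s) (rate(cortex_alertmanager_sync_configs_total[$__rate_interval]))
  ||| % { group: group },
}
```

Keeping the grouping labels in one place is exactly what the `_config`/`group_config` refactor described earlier is for.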
* fix(rules): upstream recording rule switched to sum_irate ref: https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/619 * Fix CortexIngesterReachingSeriesLimit playbook Signed-off-by: Arve Knudsen * feat: Allow configuration of ring members in gossip alerts Signed-off-by: Jack Baldry * fix: Add store-gateway and compactor ring_members Also re-order names for readability. Signed-off-by: Jack Baldry * fix: Match all ingester workloads and avoid matching the cortex-gateway Signed-off-by: Jack Baldry * feat: Optionally allow use of array or string to configure ring members Signed-off-by: Jack Baldry * address review feedback Signed-off-by: Jack Baldry * fix: Correct ingester and querier regexps Signed-off-by: Jack Baldry * Fixes to initial state sync panels on alertmanager dashboard. 1) Change minimal interval to 1m for sync duration and fetch state panels. This is in order to show infrequent events at smaller time windows. 2) Change syncs/sec panel to reflect absolute value of metric not rate. The initial sync only occurs once per-tenant so the counter value is essentially 0 or 1. Due to how per-tenant metrics are aggregated, the external facing metric really acts more like a gauge reflecting the number of tenants which achieved each outcome. Also, stack this panel as it becomes easier to visually see when the initial syncs have completed for all tenants (e.g. during a rollout). * Add rate back to Alertmanager dashboard initial syncs panel. The metric in fact does act like a counter due to soft deletion of the per-user registry when the user is unconfigured (e.g. moved to another instance or configuration deleted). * Make the overrides metric name configurable. We (Grafana Labs) are about to put in a new system to control and export data about limits and we'll need to use a different name. This shouldn't affect our OSS users. Signed-off-by: Goutham Veeramachaneni * Improve Cortex / Queries dashboard Signed-off-by: Marco Pracucci * Add recording rules for speeding up Alertmanager dashboard. With large numbers of tenants the queries for some panels on this dashboard can become quite slow as the metrics exposed are per-tenant. * Fixes from testing. * Move rules to their own group. * Split `cortex_api` recording rule group into three groups. This is a workaround for large clusters where this group can become slow to evaluate. * Update gsutil installation playbook Signed-off-by: Marco Pracucci * Use `$._config.job_names.gateway` in resources dashboards. This fixes panels where `cortex-gw` was hardcoded.
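The dashboard-speedup rules mentioned above follow the usual pre-aggregation pattern: collapse a per-tenant metric down to the labels the panels actually group by, and point the panels at the recorded series instead. The shape is roughly the following; the group, rule and metric names here are illustrative rather than the exact ones added by this change.

```jsonnet
// Illustrative pre-aggregation for a per-tenant Alertmanager metric.
{
  groups+: [
    {
      name: 'alertmanager_rules',
      rules: [
        {
          record: 'cluster_job_pod:cortex_alertmanager_alerts:sum',
          expr: 'sum by (cluster, job, pod) (cortex_alertmanager_alerts)',
        },
      ],
    },
  ],
}
```

The same idea motivates splitting the `cortex_api` rule group: smaller groups evaluate independently, so one slow aggregation no longer delays every other rule in the group.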
* Fine tune CortexIngesterReachingSeriesLimit alert Signed-off-by: Marco Pracucci * Add CortexRolloutStuck alert Signed-off-by: Marco Pracucci * Fixed playbook Signed-off-by: Marco Pracucci * Added CortexFailingToTalkToConsul alert Signed-off-by: Marco Pracucci * Fixed alert message Signed-off-by: Marco Pracucci * Update alert to be generic to KV stores Signed-off-by: Marco Pracucci * Add README * Add mimir-mixin CI checks * Update build image * Move to operations folder * Add missing zip to build-image * Run prettifier on playbooks.md * Update build-image Co-authored-by: Marco Pracucci Co-authored-by: Goutham Veeramachaneni Co-authored-by: Mauro Stettler Co-authored-by: Tom Wilkie Co-authored-by: Tom Wilkie Co-authored-by: Goutham Veeramachaneni Co-authored-by: Peter Štibraný Co-authored-by: Alex Martin Co-authored-by: Javier Palomo Co-authored-by: Darren Janeczek Co-authored-by: Darren Janeczek <38694490+darrenjaneczek@users.noreply.github.com> Co-authored-by: Jennifer Villa Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com> Co-authored-by: Callum Styan Co-authored-by: Johanna Ratliff Co-authored-by: Bryan Boreham Co-authored-by: Steve Simpson Co-authored-by: beorn7 Co-authored-by: Tyler Reid Co-authored-by: George Robinson Co-authored-by: Duologic Co-authored-by: Arve Knudsen Co-authored-by: Jack Baldry --- .github/workflows/test-build-deploy.yml | 8 +- Makefile | 37 +- mimir-build-image/Dockerfile | 24 +- operations/mimir-mixin/.gitignore | 3 + operations/mimir-mixin/README.md | 18 + operations/mimir-mixin/alerts.libsonnet | 13 + .../mimir-mixin/alerts/alertmanager.libsonnet | 98 ++ .../mimir-mixin/alerts/alerts.libsonnet | 740 ++++++++++++ .../mimir-mixin/alerts/blocks.libsonnet | 242 ++++ .../mimir-mixin/alerts/compactor.libsonnet | 96 ++ operations/mimir-mixin/config.libsonnet | 72 ++ operations/mimir-mixin/dashboards.libsonnet | 35 + .../alertmanager-resources.libsonnet | 67 ++ .../dashboards/alertmanager.libsonnet | 246 ++++ .../mimir-mixin/dashboards/chunks.libsonnet | 100 ++ .../dashboards/compactor-resources.libsonnet | 49 + .../dashboards/compactor.libsonnet | 120 ++ .../dashboards/comparison.libsonnet | 105 ++ .../mimir-mixin/dashboards/config.libsonnet | 26 + .../dashboards/dashboard-utils.libsonnet | 506 ++++++++ .../dashboards/object-store.libsonnet | 65 ++ .../mimir-mixin/dashboards/queries.libsonnet | 286 +++++ .../dashboards/reads-resources.libsonnet | 124 ++ .../mimir-mixin/dashboards/reads.libsonnet | 404 +++++++ .../dashboards/rollout-progress.libsonnet | 316 +++++ .../mimir-mixin/dashboards/ruler.libsonnet | 255 ++++ .../mimir-mixin/dashboards/scaling.libsonnet | 60 + .../dashboards/slow-queries.libsonnet | 185 +++ .../dashboards/writes-resources.libsonnet | 78 ++ .../mimir-mixin/dashboards/writes.libsonnet | 327 ++++++ operations/mimir-mixin/docs/playbooks.md | 1022 +++++++++++++++++ operations/mimir-mixin/groups.libsonnet | 62 + operations/mimir-mixin/jsonnetfile.json | 24 + operations/mimir-mixin/jsonnetfile.lock.json | 26 + operations/mimir-mixin/mixin.libsonnet | 5 + .../mimir-mixin/recording_rules.libsonnet | 445 +++++++ .../mimir-mixin/scripts/lint-playbooks.sh | 28 + 37 files changed, 6302 insertions(+), 15 deletions(-) create mode 100644 operations/mimir-mixin/.gitignore create mode 100644 operations/mimir-mixin/README.md create mode 100644 operations/mimir-mixin/alerts.libsonnet create mode 100644 operations/mimir-mixin/alerts/alertmanager.libsonnet create mode 100644 operations/mimir-mixin/alerts/alerts.libsonnet create mode 
100644 operations/mimir-mixin/alerts/blocks.libsonnet create mode 100644 operations/mimir-mixin/alerts/compactor.libsonnet create mode 100644 operations/mimir-mixin/config.libsonnet create mode 100644 operations/mimir-mixin/dashboards.libsonnet create mode 100644 operations/mimir-mixin/dashboards/alertmanager-resources.libsonnet create mode 100644 operations/mimir-mixin/dashboards/alertmanager.libsonnet create mode 100644 operations/mimir-mixin/dashboards/chunks.libsonnet create mode 100644 operations/mimir-mixin/dashboards/compactor-resources.libsonnet create mode 100644 operations/mimir-mixin/dashboards/compactor.libsonnet create mode 100644 operations/mimir-mixin/dashboards/comparison.libsonnet create mode 100644 operations/mimir-mixin/dashboards/config.libsonnet create mode 100644 operations/mimir-mixin/dashboards/dashboard-utils.libsonnet create mode 100644 operations/mimir-mixin/dashboards/object-store.libsonnet create mode 100644 operations/mimir-mixin/dashboards/queries.libsonnet create mode 100644 operations/mimir-mixin/dashboards/reads-resources.libsonnet create mode 100644 operations/mimir-mixin/dashboards/reads.libsonnet create mode 100644 operations/mimir-mixin/dashboards/rollout-progress.libsonnet create mode 100644 operations/mimir-mixin/dashboards/ruler.libsonnet create mode 100644 operations/mimir-mixin/dashboards/scaling.libsonnet create mode 100644 operations/mimir-mixin/dashboards/slow-queries.libsonnet create mode 100644 operations/mimir-mixin/dashboards/writes-resources.libsonnet create mode 100644 operations/mimir-mixin/dashboards/writes.libsonnet create mode 100644 operations/mimir-mixin/docs/playbooks.md create mode 100644 operations/mimir-mixin/groups.libsonnet create mode 100644 operations/mimir-mixin/jsonnetfile.json create mode 100644 operations/mimir-mixin/jsonnetfile.lock.json create mode 100644 operations/mimir-mixin/mixin.libsonnet create mode 100644 operations/mimir-mixin/recording_rules.libsonnet create mode 100755 operations/mimir-mixin/scripts/lint-playbooks.sh diff --git a/.github/workflows/test-build-deploy.yml b/.github/workflows/test-build-deploy.yml index 7b2ebe88130..88357824d7e 100644 --- a/.github/workflows/test-build-deploy.yml +++ b/.github/workflows/test-build-deploy.yml @@ -10,7 +10,7 @@ jobs: lint: runs-on: ubuntu-20.04 container: - image: us.gcr.io/kubernetes-dev/mimir-build-image:add-prettier-08d2e2a61 + image: us.gcr.io/kubernetes-dev/mimir-build-image:20211018_import-cortex-mixin-e7b4eab3c credentials: username: _json_key password: ${{ secrets.gcr_json_key }} @@ -36,6 +36,8 @@ jobs: run: make BUILD_IN_CONTAINER=false check-protos - name: Check Generated Documentation run: make BUILD_IN_CONTAINER=false check-doc + - name: Check Mixin + run: make BUILD_IN_CONTAINER=false check-mixin - name: Check White Noise. 
run: make BUILD_IN_CONTAINER=false check-white-noise - name: Check License Header @@ -44,7 +46,7 @@ jobs: test: runs-on: ubuntu-20.04 container: - image: us.gcr.io/kubernetes-dev/mimir-build-image:add-prettier-08d2e2a61 + image: us.gcr.io/kubernetes-dev/mimir-build-image:20211018_import-cortex-mixin-e7b4eab3c credentials: username: _json_key password: ${{ secrets.gcr_json_key }} @@ -68,7 +70,7 @@ jobs: build: runs-on: ubuntu-20.04 container: - image: us.gcr.io/kubernetes-dev/mimir-build-image:add-prettier-08d2e2a61 + image: us.gcr.io/kubernetes-dev/mimir-build-image:20211018_import-cortex-mixin-e7b4eab3c credentials: username: _json_key password: ${{ secrets.gcr_json_key }} diff --git a/Makefile b/Makefile index ae9e8d74a13..a1015430f32 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ # WARNING: do not commit to a repository! -include Makefile.local -.PHONY: all test integration-tests cover clean images protos exes dist doc clean-doc check-doc push-multiarch-build-image license check-license format +.PHONY: all test integration-tests cover clean images protos exes dist doc clean-doc check-doc push-multiarch-build-image license check-license format check-mixin check-mixin-jb check-mixin-mixtool checkin-mixin-playbook build-mixin format-mixin .DEFAULT_GOAL := all # Version number @@ -25,6 +25,12 @@ GIT_REVISION := $(shell git rev-parse --short HEAD) GIT_BRANCH := $(shell git rev-parse --abbrev-ref HEAD) UPTODATE := .uptodate +# path to jsonnetfmt +JSONNET_FMT := jsonnetfmt + +# path to the mimir/mixin +MIXIN_PATH := operations/mimir-mixin + .PHONY: image-tag image-tag: @echo $(IMAGE_TAG) @@ -120,7 +126,7 @@ mimir-build-image/$(UPTODATE): mimir-build-image/* # All the boiler plate for building golang follows: SUDO := $(shell docker info >/dev/null 2>&1 || echo "sudo -E") BUILD_IN_CONTAINER := true -LATEST_BUILD_IMAGE_TAG ?= add-prettier-08d2e2a61 +LATEST_BUILD_IMAGE_TAG ?= 20211018_import-cortex-mixin-e7b4eab3c # TTY is parameterized to allow Google Cloud Builder to run builds, # as it currently disallows TTY devices. 
This value needs to be overridden @@ -314,6 +320,33 @@ clean-white-noise: check-white-noise: clean-white-noise @git diff --exit-code --quiet -- '*.md' || (echo "Please remove trailing whitespaces running 'make clean-white-noise'" && false) +check-mixin: format-mixin check-mixin-jb check-mixin-mixtool check-mixin-playbook + @git diff --exit-code --quiet -- $(MIXIN_PATH) || (echo "Please format mixin by running 'make format-mixin'" && false) + + @cd $(MIXIN_PATH) && \ + jb install && \ + mixtool lint mixin.libsonnet + +check-mixin-jb: + @cd $(MIXIN_PATH) && \ + jb install + +check-mixin-mixtool: check-mixin-jb + @cd $(MIXIN_PATH) && \ + mixtool lint mixin.libsonnet + +check-mixin-playbook: build-mixin + @$(MIXIN_PATH)/scripts/lint-playbooks.sh + +build-mixin: check-mixin-jb + @rm -rf $(MIXIN_PATH)/out && mkdir $(MIXIN_PATH)/out + @cd $(MIXIN_PATH) && \ + mixtool generate all --output-alerts out/alerts.yaml --output-rules out/rules.yaml --directory out/dashboards mixin.libsonnet && \ + zip -q -r mimir-mixin.zip out + +format-mixin: + @find $(MIXIN_PATH) -type f -name '*.libsonnet' -print -o -name '*.jsonnet' -print | xargs jsonnetfmt -i + web-serve: cd website && hugo --config config.toml --minify -v server diff --git a/mimir-build-image/Dockerfile b/mimir-build-image/Dockerfile index d4e8d0d8319..aae51a49dd9 100644 --- a/mimir-build-image/Dockerfile +++ b/mimir-build-image/Dockerfile @@ -6,7 +6,7 @@ FROM golang:1.16.6-buster ARG goproxyValue ENV GOPROXY=${goproxyValue} -RUN apt-get update && apt-get install -y curl python-requests python-yaml file jq unzip protobuf-compiler libprotobuf-dev shellcheck && \ +RUN apt-get update && apt-get install -y curl python-requests python-yaml file jq zip unzip protobuf-compiler libprotobuf-dev shellcheck && \ rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* RUN go get -u golang.org/x/tools/cmd/goimports@3fce476f0a782aeb5034d592c189e63be4ba6c9e RUN curl -sL https://deb.nodesource.com/setup_14.x | bash - @@ -36,15 +36,19 @@ RUN GOARCH=$(go env GOARCH) && \ RUN curl -sfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh| sh -s -- -b /usr/bin v1.27.0 -RUN GO111MODULE=on go get \ - github.com/client9/misspell/cmd/misspell@v0.3.4 \ - github.com/golang/protobuf/protoc-gen-go@v1.3.1 \ - github.com/gogo/protobuf/protoc-gen-gogoslick@v1.3.0 \ - github.com/gogo/protobuf/gogoproto@v1.3.0 \ - github.com/weaveworks/tools/cover@bdd647e92546027e12cdde3ae0714bb495e43013 \ - github.com/fatih/faillint@v1.5.0 \ - github.com/campoy/embedmd@v1.0.0 \ - && rm -rf /go/pkg /go/src /root/.cache +RUN GO111MODULE=on \ + go get github.com/client9/misspell/cmd/misspell@v0.3.4 && \ + go get github.com/golang/protobuf/protoc-gen-go@v1.3.1 && \ + go get github.com/gogo/protobuf/protoc-gen-gogoslick@v1.3.0 && \ + go get github.com/gogo/protobuf/gogoproto@v1.3.0 && \ + go get github.com/weaveworks/tools/cover@bdd647e92546027e12cdde3ae0714bb495e43013 && \ + go get github.com/fatih/faillint@v1.5.0 && \ + go get github.com/campoy/embedmd@v1.0.0 && \ + go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@v0.4.0 && \ + go get github.com/monitoring-mixins/mixtool/cmd/mixtool@bca3066 && \ + go get github.com/mikefarah/yq/v4@v4.13.4 && \ + go get github.com/google/go-jsonnet/cmd/jsonnetfmt@v0.17.0 && \ + rm -rf /go/pkg /go/src /root/.cache ENV NODE_PATH=/usr/lib/node_modules COPY build.sh / diff --git a/operations/mimir-mixin/.gitignore b/operations/mimir-mixin/.gitignore new file mode 100644 index 00000000000..7aac0df8ce9 --- /dev/null +++ 
b/operations/mimir-mixin/.gitignore @@ -0,0 +1,3 @@ +/out/ +/vendor/ +/mimir-mixin.zip diff --git a/operations/mimir-mixin/README.md b/operations/mimir-mixin/README.md new file mode 100644 index 00000000000..768fad9c5c5 --- /dev/null +++ b/operations/mimir-mixin/README.md @@ -0,0 +1,18 @@ +# Monitoring for Mimir + +To generate the Grafana dashboards and Prometheus alerts for Mimir: + +## Usage + +```console +$ GO111MODULE=on go get github.com/monitoring-mixins/mixtool/cmd/mixtool +$ GO111MODULE=on go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb +$ git clone https://github.com/grafana/mimir.git +$ make build-mixin +``` + +This will leave all the alerts and dashboards in operations/mimir-mixin/mimir-mixin.zip (or operations/mimir-mixin/out). + +## Known Problems + +If you get an error like `cannot use cli.StringSliceFlag literal (type cli.StringSliceFlag) as type cli.Flag in slice literal` when installing [mixtool](https://github.com/monitoring-mixins/mixtool/issues/27), make sure you set `GO111MODULE=on` before `go get`. diff --git a/operations/mimir-mixin/alerts.libsonnet b/operations/mimir-mixin/alerts.libsonnet new file mode 100644 index 00000000000..4dc1f85c247 --- /dev/null +++ b/operations/mimir-mixin/alerts.libsonnet @@ -0,0 +1,13 @@ +{ + prometheusAlerts+:: + (import 'alerts/alerts.libsonnet') + + (import 'alerts/alertmanager.libsonnet') + + + (if std.member($._config.storage_engine, 'blocks') + then + (import 'alerts/blocks.libsonnet') + + (import 'alerts/compactor.libsonnet') + else {}) + + + { _config:: $._config + $._group_config }, +} diff --git a/operations/mimir-mixin/alerts/alertmanager.libsonnet b/operations/mimir-mixin/alerts/alertmanager.libsonnet new file mode 100644 index 00000000000..e73d04b3e1a --- /dev/null +++ b/operations/mimir-mixin/alerts/alertmanager.libsonnet @@ -0,0 +1,98 @@ +{ + groups+: [ + { + name: 'alertmanager_alerts', + rules: [ + { + alert: 'CortexAlertmanagerSyncConfigsFailing', + expr: ||| + rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0 + |||, + 'for': '30m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to read tenant configurations from storage. + |||, + }, + }, + { + alert: 'CortexAlertmanagerRingCheckFailing', + expr: ||| + rate(cortex_alertmanager_ring_check_errors_total[2m]) > 0 + |||, + 'for': '10m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is unable to check tenants ownership via the ring. + |||, + }, + }, + { + alert: 'CortexAlertmanagerPartialStateMergeFailing', + expr: ||| + rate(cortex_alertmanager_partial_state_merges_failed_total[2m]) > 0 + |||, + 'for': '10m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to merge partial state changes received from a replica. + |||, + }, + }, + { + alert: 'CortexAlertmanagerReplicationFailing', + expr: ||| + rate(cortex_alertmanager_state_replication_failed_total[2m]) > 0 + |||, + 'for': '10m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to replicate partial state to its replicas.
+ |||, + }, + }, + { + alert: 'CortexAlertmanagerPersistStateFailing', + expr: ||| + rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0 + |||, + 'for': '1h', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is unable to persist full state snapshots to remote storage. + |||, + }, + }, + { + alert: 'CortexAlertmanagerInitialSyncFailed', + expr: ||| + increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} was unable to obtain some initial state when starting up. + |||, + }, + }, + ], + }, + ], +} diff --git a/operations/mimir-mixin/alerts/alerts.libsonnet b/operations/mimir-mixin/alerts/alerts.libsonnet new file mode 100644 index 00000000000..59022dd8afc --- /dev/null +++ b/operations/mimir-mixin/alerts/alerts.libsonnet @@ -0,0 +1,740 @@ +{ + // simpleRegexpOpt produces a simple regexp that matches all strings in the input array. + local simpleRegexpOpt(strings) = + assert std.isArray(strings) : 'simpleRegexpOpt requires that `strings` is an array of strings'; + '(' + std.join('|', strings) + ')', + + groups+: [ + { + name: 'cortex_alerts', + rules: [ + { + alert: 'CortexIngesterUnhealthy', + 'for': '15m', + expr: ||| + min by (%s) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0 + ||| % $._config.alert_aggregation_labels, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex cluster %(alert_aggregation_variables)s has {{ printf "%%f" $value }} unhealthy ingester(s).' % $._config, + }, + }, + { + alert: 'CortexRequestErrors', + // Note if alert_aggregation_labels is "job", this will repeat the label. But + // prometheus seems to tolerate that. + expr: ||| + 100 * sum by (%(group_by)s, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",route!~"%(excluded_routes)s"}[1m])) + / + sum by (%(group_by)s, job, route) (rate(cortex_request_duration_seconds_count{route!~"%(excluded_routes)s"}[1m])) + > 1 + ||| % { + group_by: $._config.alert_aggregation_labels, + excluded_routes: std.join('|', ['ready'] + $._config.alert_excluded_routes), + }, + 'for': '15m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + The route {{ $labels.route }} in %(alert_aggregation_variables)s is experiencing {{ printf "%%.2f" $value }}%% errors. + ||| % $._config, + }, + }, + { + alert: 'CortexRequestLatency', + expr: ||| + %(group_prefix_jobs)s_route:cortex_request_duration_seconds:99quantile{route!~"%(excluded_routes)s"} + > + %(cortex_p99_latency_threshold_seconds)s + ||| % $._config { + excluded_routes: std.join('|', [ + 'metrics', + '/frontend.Frontend/Process', + 'ready', + '/schedulerpb.SchedulerForFrontend/FrontendLoop', + '/schedulerpb.SchedulerForQuerier/QuerierLoop', + ] + $._config.alert_excluded_routes), + }, + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency. + |||, + }, + }, + { + // We're syncing every 10mins, and this means with a 5min rate, we will have a NaN when syncs fail + // and we will never trigger the alert. + // We also have a 3h grace-period for creation of tables which means we can fail for 3h before it's an outage.
+ alert: 'CortexTableSyncFailure', + expr: ||| + 100 * rate(cortex_table_manager_sync_duration_seconds_count{status_code!~"2.."}[15m]) + / + rate(cortex_table_manager_sync_duration_seconds_count[15m]) + > 10 + |||, + 'for': '30m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% errors syncing tables. + |||, + }, + }, + { + alert: 'CortexQueriesIncorrect', + expr: ||| + 100 * sum by (%s) (rate(test_exporter_test_case_result_total{result="fail"}[5m])) + / + sum by (%s) (rate(test_exporter_test_case_result_total[5m])) > 1 + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + The Cortex cluster %(alert_aggregation_variables)s is experiencing {{ printf "%%.2f" $value }}%% incorrect query results. + ||| % $._config, + }, + }, + { + alert: 'CortexInconsistentRuntimeConfig', + expr: ||| + count(count by(%s, job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1 + ||| % $._config.alert_aggregation_labels, + 'for': '1h', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + An inconsistent runtime config file is used across cluster %(alert_aggregation_variables)s. + ||| % $._config, + }, + }, + { + alert: 'CortexBadRuntimeConfig', + expr: ||| + # The metric value is reset to 0 on error while reloading the config at runtime. + cortex_runtime_config_last_reload_successful == 0 + |||, + // Alert quicker for human errors. + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + {{ $labels.job }} failed to reload runtime config. + |||, + }, + }, + { + alert: 'CortexFrontendQueriesStuck', + expr: ||| + sum by (%s) (cortex_query_frontend_queue_length) > 1 + ||| % $._config.alert_aggregation_labels, + 'for': '5m', // We don't want to block for longer. + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + There are {{ $value }} queued up queries in %(alert_aggregation_variables)s query-frontend. + ||| % $._config, + }, + }, + { + alert: 'CortexSchedulerQueriesStuck', + expr: ||| + sum by (%s) (cortex_query_scheduler_queue_length) > 1 + ||| % $._config.alert_aggregation_labels, + 'for': '5m', // We don't want to block for longer. + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + There are {{ $value }} queued up queries in %(alert_aggregation_variables)s query-scheduler. + ||| % $._config, + }, + }, + { + alert: 'CortexMemcachedRequestErrors', + expr: ||| + ( + sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) / + sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m])) + ) * 100 > 5 + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Memcached {{ $labels.name }} used by Cortex %(alert_aggregation_variables)s is experiencing {{ printf "%%.2f" $value }}%% errors for {{ $labels.operation }} operation. + ||| % $._config, + }, + }, + { + alert: 'CortexIngesterRestarts', + expr: ||| + changes(process_start_time_seconds{job=~".+(cortex|ingester.*)"}[30m]) >= 2 + |||, + labels: { + // This alert is on a cause not symptom. A couple of ingesters restarts may be suspicious but + // not necessarily an issue (eg. may happen because of the K8S node autoscaler), so we're + // keeping the alert as warning as a signal in case of an outage. 
+ severity: 'warning', + }, + annotations: { + message: '{{ $labels.job }}/{{ $labels.instance }} has restarted {{ printf "%.2f" $value }} times in the last 30 mins.', + }, + }, + { + alert: 'CortexTransferFailed', + expr: ||| + max_over_time(cortex_shutdown_duration_seconds_count{op="transfer",status!="success"}[15m]) + |||, + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: '{{ $labels.job }}/{{ $labels.instance }} transfer failed.', + }, + }, + { + alert: 'CortexOldChunkInMemory', + // Even though we should flush chunks after 6h, we see that 99p of age of flushed chunks is closer + // to 10 hours. + // Ignore cortex_oldest_unflushed_chunk_timestamp_seconds that are zero (eg. distributors). + expr: ||| + (time() - cortex_oldest_unflushed_chunk_timestamp_seconds > 36000) + and + (cortex_oldest_unflushed_chunk_timestamp_seconds > 0) + |||, + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + {{ $labels.job }}/{{ $labels.instance }} has very old unflushed chunk in memory. + |||, + }, + }, + { + alert: 'CortexKVStoreFailure', + expr: ||| + ( + sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m])) + / + sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m])) + ) + # We want to get alerted only in case there's a constant failure. + == 1 + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}. + ||| % $._config, + }, + }, + { + alert: 'CortexMemoryMapAreasTooHigh', + expr: ||| + process_memory_map_areas{job=~".+(cortex|ingester.*|store-gateway)"} / process_memory_map_areas_limit{job=~".+(cortex|ingester.*|store-gateway)"} > 0.8 + |||, + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: '{{ $labels.job }}/{{ $labels.instance }} has a number of mmap-ed areas close to the limit.', + }, + }, + ], + }, + { + name: 'cortex_ingester_instance_alerts', + rules: [ + { + alert: 'CortexIngesterReachingSeriesLimit', + expr: ||| + ( + (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"}) + and ignoring (limit) + (cortex_ingester_instance_limits{limit="max_series"} > 0) + ) > 0.8 + |||, + 'for': '3h', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Ingester {{ $labels.job }}/{{ $labels.instance }} has reached {{ $value | humanizePercentage }} of its series limit. + |||, + }, + }, + { + alert: 'CortexIngesterReachingSeriesLimit', + expr: ||| + ( + (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"}) + and ignoring (limit) + (cortex_ingester_instance_limits{limit="max_series"} > 0) + ) > 0.9 + |||, + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Ingester {{ $labels.job }}/{{ $labels.instance }} has reached {{ $value | humanizePercentage }} of its series limit. 
+ |||, + }, + }, + { + alert: 'CortexIngesterReachingTenantsLimit', + expr: ||| + ( + (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"}) + and ignoring (limit) + (cortex_ingester_instance_limits{limit="max_tenants"} > 0) + ) > 0.7 + |||, + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Ingester {{ $labels.job }}/{{ $labels.instance }} has reached {{ $value | humanizePercentage }} of its tenant limit. + |||, + }, + }, + { + alert: 'CortexIngesterReachingTenantsLimit', + expr: ||| + ( + (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"}) + and ignoring (limit) + (cortex_ingester_instance_limits{limit="max_tenants"} > 0) + ) > 0.8 + |||, + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Ingester {{ $labels.job }}/{{ $labels.instance }} has reached {{ $value | humanizePercentage }} of its tenant limit. + |||, + }, + }, + ], + }, + { + name: 'cortex_wal_alerts', + rules: [ + { + // Alert immediately if WAL is corrupt. + alert: 'CortexWALCorruption', + expr: ||| + increase(cortex_ingester_wal_corruptions_total[5m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + {{ $labels.job }}/{{ $labels.instance }} has a corrupted WAL or checkpoint. + |||, + }, + }, + { + // One or more failed checkpoint creation is a warning. + alert: 'CortexCheckpointCreationFailed', + expr: ||| + increase(cortex_ingester_checkpoint_creations_failed_total[10m]) > 0 + |||, + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + {{ $labels.job }}/{{ $labels.instance }} failed to create checkpoint. + |||, + }, + }, + { + // Two or more failed checkpoint creation in 1h means something is wrong. + alert: 'CortexCheckpointCreationFailed', + expr: ||| + increase(cortex_ingester_checkpoint_creations_failed_total[1h]) > 1 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + {{ $labels.job }}/{{ $labels.instance }} is failing to create checkpoint. + |||, + }, + }, + { + // One or more failed checkpoint deletion is a warning. + alert: 'CortexCheckpointDeletionFailed', + expr: ||| + increase(cortex_ingester_checkpoint_deletions_failed_total[10m]) > 0 + |||, + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + {{ $labels.job }}/{{ $labels.instance }} failed to delete checkpoint. + |||, + }, + }, + { + // Two or more failed checkpoint deletion in 2h means something is wrong. + // We give this more buffer than creation as this is a less critical operation. + alert: 'CortexCheckpointDeletionFailed', + expr: ||| + increase(cortex_ingester_checkpoint_deletions_failed_total[2h]) > 1 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + {{ $labels.instance }} is failing to delete checkpoint. 
+ |||, + }, + }, + ], + }, + { + name: 'cortex-rollout-alerts', + rules: [ + { + alert: 'CortexRolloutStuck', + expr: ||| + ( + max without (revision) ( + kube_statefulset_status_current_revision + unless + kube_statefulset_status_update_revision + ) + * + ( + kube_statefulset_replicas + != + kube_statefulset_status_replicas_updated + ) + ) and ( + changes(kube_statefulset_status_replicas_updated[15m]) + == + 0 + ) + * on(%s) group_left max by(%s) (cortex_build_info) + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + The {{ $labels.statefulset }} rollout is stuck in %(alert_aggregation_variables)s. + ||| % $._config, + }, + }, + { + alert: 'CortexRolloutStuck', + expr: ||| + ( + kube_deployment_spec_replicas + != + kube_deployment_status_replicas_updated + ) and ( + changes(kube_deployment_status_replicas_updated[15m]) + == + 0 + ) + * on(%s) group_left max by(%s) (cortex_build_info) + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + The {{ $labels.deployment }} rollout is stuck in %(alert_aggregation_variables)s. + ||| % $._config, + }, + }, + ], + }, + { + name: 'cortex-provisioning', + rules: [ + { + alert: 'CortexProvisioningMemcachedTooSmall', + // 4 x in-memory series size = 24hrs of data. + expr: ||| + ( + 4 * + sum by (%s) (cortex_ingester_memory_series * cortex_ingester_chunk_size_bytes_sum / cortex_ingester_chunk_size_bytes_count) + / 1e9 + ) + > + ( + sum by (%s) (memcached_limit_bytes{job=~".+/memcached"}) / 1e9 + ) + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Chunk memcached cluster in %(alert_aggregation_variables)s is too small, should be at least {{ printf "%%.2f" $value }}GB. + ||| % $._config, + }, + }, + { + alert: 'CortexProvisioningTooManyActiveSeries', + // We target each ingester to 1.5M in-memory series. This alert fires if the average + // number of series / ingester in a Cortex cluster is > 1.6M for 2h (we compact + // the TSDB head every 2h). + expr: ||| + avg by (%s) (cortex_ingester_memory_series) > 1.6e6 + ||| % [$._config.alert_aggregation_labels], + 'for': '2h', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + The number of in-memory series per ingester in %(alert_aggregation_variables)s is too high. + ||| % $._config, + }, + }, + { + alert: 'CortexProvisioningTooManyWrites', + // 80k writes / s per ingester max. + expr: ||| + avg by (%s) (rate(cortex_ingester_ingested_samples_total[1m])) > 80e3 + ||| % $._config.alert_aggregation_labels, + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Ingesters in %(alert_aggregation_variables)s ingest too many samples per second. + ||| % $._config, + }, + }, + { + alert: 'CortexAllocatingTooMuchMemory', + expr: ||| + ( + container_memory_working_set_bytes{container="ingester"} + / + container_spec_memory_limit_bytes{container="ingester"} + ) > 0.65 + |||, + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Ingester {{ $labels.pod }} in %(alert_aggregation_variables)s is using too much memory. 
+ ||| % $._config, + }, + }, + { + alert: 'CortexAllocatingTooMuchMemory', + expr: ||| + ( + container_memory_working_set_bytes{container="ingester"} + / + container_spec_memory_limit_bytes{container="ingester"} + ) > 0.8 + |||, + 'for': '15m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Ingester {{ $labels.pod }} in %(alert_aggregation_variables)s is using too much memory. + ||| % $._config, + }, + }, + ], + }, + { + name: 'ruler_alerts', + rules: [ + { + alert: 'CortexRulerTooManyFailedPushes', + expr: ||| + 100 * ( + sum by (%s, instance) (rate(cortex_ruler_write_requests_failed_total[1m])) + / + sum by (%s, instance) (rate(cortex_ruler_write_requests_total[1m])) + ) > 1 + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Ruler {{ $labels.instance }} in %(alert_aggregation_variables)s is experiencing {{ printf "%%.2f" $value }}%% write (push) errors. + ||| % $._config, + }, + }, + { + alert: 'CortexRulerTooManyFailedQueries', + expr: ||| + 100 * ( + sum by (%s, instance) (rate(cortex_ruler_queries_failed_total[1m])) + / + sum by (%s, instance) (rate(cortex_ruler_queries_total[1m])) + ) > 1 + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Cortex Ruler {{ $labels.instance }} in %(alert_aggregation_variables)s is experiencing {{ printf "%%.2f" $value }}%% errors while evaluating rules. + ||| % $._config, + }, + }, + { + alert: 'CortexRulerMissedEvaluations', + expr: ||| + sum by (%s, instance, rule_group) (rate(cortex_prometheus_rule_group_iterations_missed_total[1m])) + / + sum by (%s, instance, rule_group) (rate(cortex_prometheus_rule_group_iterations_total[1m])) + > 0.01 + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels], + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Cortex Ruler {{ $labels.instance }} in %(alert_aggregation_variables)s is experiencing {{ printf "%%.2f" $value }}%% missed iterations for the rule group {{ $labels.rule_group }}. + ||| % $._config, + }, + }, + { + alert: 'CortexRulerFailedRingCheck', + expr: ||| + sum by (%s, job) (rate(cortex_ruler_ring_check_errors_total[1m])) + > 0 + ||| % $._config.alert_aggregation_labels, + 'for': '5m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Cortex Rulers in %(alert_aggregation_variables)s are experiencing errors when checking the ring for rule group ownership. + ||| % $._config, + }, + }, + ], + }, + { + name: 'gossip_alerts', + rules: [ + { + alert: 'CortexGossipMembersMismatch', + expr: + ||| + memberlist_client_cluster_members_count + != on (%s) group_left + sum by (%s) (up{job=~".+/%s"}) + ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels, simpleRegexpOpt($._config.job_names.ring_members)], + 'for': '5m', + labels: { + severity: 'warning', + }, + annotations: { + message: 'Cortex instance {{ $labels.instance }} in %(alert_aggregation_variables)s sees incorrect number of gossip members.' 
% $._config, + }, + }, + ], + }, + { + name: 'etcd_alerts', + rules: [ + { + alert: 'EtcdAllocatingTooMuchMemory', + expr: ||| + ( + container_memory_working_set_bytes{container="etcd"} + / + container_spec_memory_limit_bytes{container="etcd"} + ) > 0.65 + |||, + 'for': '15m', + labels: { + severity: 'warning', + }, + annotations: { + message: ||| + Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - bump memory limit. + |||, + }, + }, + { + alert: 'EtcdAllocatingTooMuchMemory', + expr: ||| + ( + container_memory_working_set_bytes{container="etcd"} + / + container_spec_memory_limit_bytes{container="etcd"} + ) > 0.8 + |||, + 'for': '15m', + labels: { + severity: 'critical', + }, + annotations: { + message: ||| + Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - bump memory limit. + |||, + }, + }, + ], + }, + ], +} diff --git a/operations/mimir-mixin/alerts/blocks.libsonnet b/operations/mimir-mixin/alerts/blocks.libsonnet new file mode 100644 index 00000000000..a60ac2da263 --- /dev/null +++ b/operations/mimir-mixin/alerts/blocks.libsonnet @@ -0,0 +1,242 @@ +{ + groups+: [ + { + name: 'cortex_blocks_alerts', + rules: [ + { + // Alert if the ingester has not shipped any block in the last 4h. It also checks cortex_ingester_ingested_samples_total + // to avoid false positives on ingesters not receiving any traffic yet (eg. a newly created cluster). + alert: 'CortexIngesterHasNotShippedBlocks', + 'for': '15m', + expr: ||| + (min by(%(alert_aggregation_labels)s, instance) (time() - thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) > 60 * 60 * 4) + and + (max by(%(alert_aggregation_labels)s, instance) (thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) > 0) + and + # Only if the ingester has ingested samples over the last 4h. + (max by(%(alert_aggregation_labels)s, instance) (rate(cortex_ingester_ingested_samples_total[4h])) > 0) + and + # Only if the ingester was ingesting samples 4h ago. This protects from the case the ingester instance + # had ingested samples in the past, then no traffic was received for a long period and then it starts + # receiving samples again. Without this check, the alert would fire as soon as it gets back to receiving + # samples, while a block shipping is expected within the next 4h. + (max by(%(alert_aggregation_labels)s, instance) (rate(cortex_ingester_ingested_samples_total[1h] offset 4h)) > 0) + ||| % $._config, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s has not shipped any block in the last 4 hours.' % $._config, + }, + }, + { + // Alert if the ingester has not shipped any block since start. It also checks cortex_ingester_ingested_samples_total + // to avoid false positives on ingesters not receiving any traffic yet (eg. a newly created cluster). + alert: 'CortexIngesterHasNotShippedBlocksSinceStart', + 'for': '4h', + expr: ||| + (max by(%(alert_aggregation_labels)s, instance) (thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) == 0) + and + (max by(%(alert_aggregation_labels)s, instance) (rate(cortex_ingester_ingested_samples_total[4h])) > 0) + ||| % $._config, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s has not shipped any block in the last 4 hours.'
% $._config, + }, + }, + { + // Alert if the ingester has compacted some blocks that haven't been successfully uploaded to the storage yet for + // more than 1 hour. The metric tracks the time of the oldest unshipped block, measured as the time when the + // TSDB head has been compacted to a block. The metric is 0 if all blocks have been shipped. + alert: 'CortexIngesterHasUnshippedBlocks', + 'for': '15m', + expr: ||| + (time() - cortex_ingester_oldest_unshipped_block_timestamp_seconds > 3600) + and + (cortex_ingester_oldest_unshipped_block_timestamp_seconds > 0) + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: "Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s has compacted a block {{ $value | humanizeDuration }} ago but it hasn't been successfully uploaded to the storage yet." % $._config, + }, + }, + { + // Alert if the ingester is failing to compact TSDB head into a block, for any opened TSDB. Once the TSDB head is + // compactable, the ingester will try to compact it every 1 minute. Repeatedly failing it is a critical condition + // that should never happen. + alert: 'CortexIngesterTSDBHeadCompactionFailed', + 'for': '15m', + expr: ||| + rate(cortex_ingester_tsdb_compactions_failed_total[5m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s is failing to compact TSDB head.' % $._config, + }, + }, + { + alert: 'CortexIngesterTSDBHeadTruncationFailed', + expr: ||| + rate(cortex_ingester_tsdb_head_truncations_failed_total[5m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s is failing to truncate TSDB head.' % $._config, + }, + }, + { + alert: 'CortexIngesterTSDBCheckpointCreationFailed', + expr: ||| + rate(cortex_ingester_tsdb_checkpoint_creations_failed_total[5m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s is failing to create TSDB checkpoint.' % $._config, + }, + }, + { + alert: 'CortexIngesterTSDBCheckpointDeletionFailed', + expr: ||| + rate(cortex_ingester_tsdb_checkpoint_deletions_failed_total[5m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s is failing to delete TSDB checkpoint.' % $._config, + }, + }, + { + alert: 'CortexIngesterTSDBWALTruncationFailed', + expr: ||| + rate(cortex_ingester_tsdb_wal_truncations_failed_total[5m]) > 0 + |||, + labels: { + severity: 'warning', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s is failing to truncate TSDB WAL.' % $._config, + }, + }, + { + alert: 'CortexIngesterTSDBWALCorrupted', + expr: ||| + rate(cortex_ingester_tsdb_wal_corruptions_total[5m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s got a corrupted TSDB WAL.' % $._config, + }, + }, + { + alert: 'CortexIngesterTSDBWALWritesFailed', + 'for': '3m', + expr: ||| + rate(cortex_ingester_tsdb_wal_writes_failed_total[1m]) > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Ingester {{ $labels.instance }} in %(alert_aggregation_variables)s is failing to write to TSDB WAL.'
% $._config, + }, + }, + { + // Alert if the querier is not successfully scanning the bucket. + alert: 'CortexQuerierHasNotScanTheBucket', + 'for': '5m', + expr: ||| + (time() - cortex_querier_blocks_last_successful_scan_timestamp_seconds > 60 * 30) + and + cortex_querier_blocks_last_successful_scan_timestamp_seconds > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Querier {{ $labels.instance }} in %(alert_aggregation_variables)s has not successfully scanned the bucket since {{ $value | humanizeDuration }}.' % $._config, + }, + }, + { + // Alert if the number of queries for which we had to refetch series from different store-gateways + // (because of missing blocks) is greater than a given percentage. + alert: 'CortexQuerierHighRefetchRate', + 'for': '10m', + expr: ||| + 100 * ( + ( + sum by(%(alert_aggregation_labels)s) (rate(cortex_querier_storegateway_refetches_per_query_count[5m])) + - + sum by(%(alert_aggregation_labels)s) (rate(cortex_querier_storegateway_refetches_per_query_bucket{le="0.0"}[5m])) + ) + / + sum by(%(alert_aggregation_labels)s) (rate(cortex_querier_storegateway_refetches_per_query_count[5m])) + ) + > 1 + ||| % $._config, + labels: { + severity: 'warning', + }, + annotations: { + message: 'Cortex Queries in %(alert_aggregation_variables)s are refetching series from different store-gateways (because of missing blocks) for {{ printf "%%.0f" $value }}%% of queries.' % $._config, + }, + }, + { + // Alert if the store-gateway is not successfully synching the bucket. + alert: 'CortexStoreGatewayHasNotSyncTheBucket', + 'for': '5m', + expr: ||| + (time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 60 * 30) + and + cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Store Gateway {{ $labels.instance }} in %(alert_aggregation_variables)s has not successfully synched the bucket since {{ $value | humanizeDuration }}.' % $._config, + }, + }, + { + // Alert if the bucket index has not been updated for a given user. + alert: 'CortexBucketIndexNotUpdated', + expr: ||| + min by(%(alert_aggregation_labels)s, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200 + ||| % $._config, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex bucket index for tenant {{ $labels.user }} in %(alert_aggregation_variables)s has not been updated since {{ $value | humanizeDuration }}.' % $._config, + }, + }, + { + // Alert if we consistently find partial blocks for a given tenant over a relatively large time range. + alert: 'CortexTenantHasPartialBlocks', + 'for': '6h', + expr: ||| + max by(%(alert_aggregation_labels)s, user) (cortex_bucket_blocks_partials_count) > 0 + ||| % $._config, + labels: { + severity: 'warning', + }, + annotations: { + message: 'Cortex tenant {{ $labels.user }} in %(alert_aggregation_variables)s has {{ $value }} partial blocks.' % $._config, + }, + }, + ], + }, + ], +} diff --git a/operations/mimir-mixin/alerts/compactor.libsonnet b/operations/mimir-mixin/alerts/compactor.libsonnet new file mode 100644 index 00000000000..5538545e249 --- /dev/null +++ b/operations/mimir-mixin/alerts/compactor.libsonnet @@ -0,0 +1,96 @@ +{ + groups+: [ + { + name: 'cortex_compactor_alerts', + rules: [ + { + // Alert if the compactor has not successfully cleaned up blocks in the last 6h. 
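+          // Like most alerts in this file, the rule below follows a "staleness" pattern: compare time() against a + // *_last_successful_*_timestamp_seconds gauge and fire once the gap exceeds the allowed threshold, + // e.g. (illustrative restatement of the expression below): + //   time() - cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 60 * 60 * 6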
+ alert: 'CortexCompactorHasNotSuccessfullyCleanedUpBlocks', + 'for': '1h', + expr: ||| + (time() - cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 60 * 60 * 6) + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Compactor {{ $labels.instance }} in %(alert_aggregation_variables)s has not successfully cleaned up blocks in the last 6 hours.' % $._config, + }, + }, + { + // Alert if the compactor has not successfully run compaction in the last 24h. + alert: 'CortexCompactorHasNotSuccessfullyRunCompaction', + 'for': '1h', + expr: ||| + (time() - cortex_compactor_last_successful_run_timestamp_seconds > 60 * 60 * 24) + and + (cortex_compactor_last_successful_run_timestamp_seconds > 0) + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Compactor {{ $labels.instance }} in %(alert_aggregation_variables)s has not run compaction in the last 24 hours.' % $._config, + }, + }, + { + // Alert if the compactor has not successfully run compaction in the last 24h since startup. + alert: 'CortexCompactorHasNotSuccessfullyRunCompaction', + 'for': '24h', + expr: ||| + cortex_compactor_last_successful_run_timestamp_seconds == 0 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Compactor {{ $labels.instance }} in %(alert_aggregation_variables)s has not run compaction in the last 24 hours.' % $._config, + }, + }, + { + // Alert if compactor failed to run 2 consecutive compactions. + alert: 'CortexCompactorHasNotSuccessfullyRunCompaction', + expr: ||| + increase(cortex_compactor_runs_failed_total[2h]) >= 2 + |||, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Compactor {{ $labels.instance }} in %(alert_aggregation_variables)s failed to run 2 consecutive compactions.' % $._config, + }, + }, + { + // Alert if the compactor has not uploaded anything in the last 24h. + alert: 'CortexCompactorHasNotUploadedBlocks', + 'for': '15m', + expr: ||| + (time() - thanos_objstore_bucket_last_successful_upload_time{job=~".+/%(compactor)s"} > 60 * 60 * 24) + and + (thanos_objstore_bucket_last_successful_upload_time{job=~".+/%(compactor)s"} > 0) + ||| % $._config.job_names, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Compactor {{ $labels.instance }} in %(alert_aggregation_variables)s has not uploaded any block in the last 24 hours.' % $._config, + }, + }, + { + // Alert if the compactor has not uploaded anything since its start. + alert: 'CortexCompactorHasNotUploadedBlocks', + 'for': '24h', + expr: ||| + thanos_objstore_bucket_last_successful_upload_time{job=~".+/%(compactor)s"} == 0 + ||| % $._config.job_names, + labels: { + severity: 'critical', + }, + annotations: { + message: 'Cortex Compactor {{ $labels.instance }} in %(alert_aggregation_variables)s has not uploaded any block in the last 24 hours.' % $._config, + }, + }, + ], + }, + ], +} diff --git a/operations/mimir-mixin/config.libsonnet b/operations/mimir-mixin/config.libsonnet new file mode 100644 index 00000000000..9b8d81a283c --- /dev/null +++ b/operations/mimir-mixin/config.libsonnet @@ -0,0 +1,72 @@ +{ + grafanaDashboardFolder: 'Cortex', + grafanaDashboardShards: 4, + + _config+:: { + // Switch for overall storage engine. + // May contain 'chunks', 'blocks' or both. + // Enables chunks- or blocks- specific panels and dashboards. + storage_engine: ['blocks'], + + // For chunks backend, switch for chunk index type. + // May contain 'bigtable', 'dynamodb' or 'cassandra'. 
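+    // For example (hypothetical downstream override, not part of this mixin's defaults), a deployment + // whose chunks index lives only in Bigtable could narrow this to: + //   chunk_index_backend: ['bigtable'],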
+ chunk_index_backend: ['bigtable', 'dynamodb', 'cassandra'], + + // For chunks backend, switch for chunk store type. + // May contain 'bigtable', 'dynamodb', 'cassandra', 's3' or 'gcs'. + chunk_store_backend: ['bigtable', 'dynamodb', 'cassandra', 's3', 'gcs'], + + // Tags for dashboards. + tags: ['cortex'], + + // If Cortex is deployed as a single binary, set to true to + // modify the job selectors in the dashboard queries. + singleBinary: false, + + // These are used by the dashboards and allow for the simultaneous display of + // microservice and single binary cortex clusters. + job_names: { + ingester: '(ingester.*|cortex$)', // Match also custom and per-zone ingester deployments. + distributor: '(distributor|cortex$)', + querier: '(querier.*|cortex$)', // Match also custom querier deployments. + ruler: '(ruler|cortex$)', + query_frontend: '(query-frontend.*|cortex$)', // Match also custom query-frontend deployments. + query_scheduler: 'query-scheduler.*', // Not part of single-binary. Match also custom query-scheduler deployments. + table_manager: '(table-manager|cortex$)', + ring_members: ['compactor', 'distributor', 'ingester.*', 'querier.*', 'ruler', 'store-gateway', 'cortex'], + store_gateway: '(store-gateway|cortex$)', + gateway: '(gateway|cortex-gw|cortex-gw-internal)', + compactor: 'compactor.*', // Match also custom compactor deployments. + }, + + // Grouping labels, to uniquely identify and group by {jobs, clusters} + job_labels: ['cluster', 'namespace', 'job'], + cluster_labels: ['cluster', 'namespace'], + + cortex_p99_latency_threshold_seconds: 2.5, + + // Whether resources dashboards are enabled (based on cAdvisor metrics). + resources_dashboards_enabled: false, + + // The label used to differentiate between different application instances (i.e. 'pod' in a kubernetes install). + per_instance_label: 'pod', + + // Name selectors for different application instances, using the "per_instance_label". + instance_names: { + compactor: 'compactor.*', + alertmanager: 'alertmanager.*', + }, + + // The label used to differentiate between different nodes (i.e. servers). + per_node_label: 'instance', + + // Whether certain dashboard description headers should be shown + show_dashboard_descriptions: { + writes: true, + reads: true, + }, + + // The routes to exclude from alerts. 
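+    // For example (hypothetical override; the route name below is illustrative only), a downstream config + // could exclude a specific route from the route-based alerts with: + //   alert_excluded_routes: ['debug_pprof'],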
+ alert_excluded_routes: [], + }, +} diff --git a/operations/mimir-mixin/dashboards.libsonnet b/operations/mimir-mixin/dashboards.libsonnet new file mode 100644 index 00000000000..9e7f71c28d7 --- /dev/null +++ b/operations/mimir-mixin/dashboards.libsonnet @@ -0,0 +1,35 @@ +{ + grafanaDashboards+: + (import 'dashboards/config.libsonnet') + + (import 'dashboards/queries.libsonnet') + + (import 'dashboards/reads.libsonnet') + + (import 'dashboards/ruler.libsonnet') + + (import 'dashboards/alertmanager.libsonnet') + + (import 'dashboards/scaling.libsonnet') + + (import 'dashboards/writes.libsonnet') + + (import 'dashboards/slow-queries.libsonnet') + + (import 'dashboards/rollout-progress.libsonnet') + + + (if std.member($._config.storage_engine, 'blocks') + then + (import 'dashboards/compactor.libsonnet') + + (import 'dashboards/compactor-resources.libsonnet') + + (import 'dashboards/object-store.libsonnet') + else {}) + + + (if std.member($._config.storage_engine, 'chunks') + then import 'dashboards/chunks.libsonnet' + else {}) + + + (if std.member($._config.storage_engine, 'blocks') + && std.member($._config.storage_engine, 'chunks') + then import 'dashboards/comparison.libsonnet' + else {}) + + + (if !$._config.resources_dashboards_enabled then {} else + (import 'dashboards/reads-resources.libsonnet') + + (import 'dashboards/writes-resources.libsonnet') + + (import 'dashboards/alertmanager-resources.libsonnet')) + + + { _config:: $._config + $._group_config }, +} diff --git a/operations/mimir-mixin/dashboards/alertmanager-resources.libsonnet b/operations/mimir-mixin/dashboards/alertmanager-resources.libsonnet new file mode 100644 index 00000000000..415060206f7 --- /dev/null +++ b/operations/mimir-mixin/dashboards/alertmanager-resources.libsonnet @@ -0,0 +1,67 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'alertmanager-resources.json': + ($.dashboard('Cortex / Alertmanager Resources') + { uid: '68b66aed90ccab448009089544a8d6c6' }) + .addClusterSelectorTemplates(false) + .addRow( + $.row('Gateway') + .addPanel( + $.containerCPUUsagePanel('CPU', $._config.job_names.gateway), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', $._config.job_names.gateway), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.gateway), + ) + ) + .addRow( + $.row('Alertmanager') + .addPanel( + $.containerCPUUsagePanel('CPU', 'alertmanager'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'alertmanager'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', 'alertmanager'), + ) + ) + .addRow( + $.row('Instance Mapper') + .addPanel( + $.containerCPUUsagePanel('CPU', 'alertmanager-im'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'alertmanager-im'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', 'alertmanager-im'), + ) + ) + .addRow( + $.row('Network') + .addPanel( + $.containerNetworkReceiveBytesPanel($._config.instance_names.alertmanager), + ) + .addPanel( + $.containerNetworkTransmitBytesPanel($._config.instance_names.alertmanager), + ) + ) + .addRow( + $.row('Disk') + .addPanel( + $.containerDiskWritesPanel('Writes', 'alertmanager'), + ) + .addPanel( + $.containerDiskReadsPanel('Reads', 'alertmanager'), + ) + ) + .addRow( + $.row('') + .addPanel( + $.containerDiskSpaceUtilization('Disk Space Utilization', 'alertmanager'), + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/alertmanager.libsonnet 
b/operations/mimir-mixin/dashboards/alertmanager.libsonnet new file mode 100644 index 00000000000..8897034eea9 --- /dev/null +++ b/operations/mimir-mixin/dashboards/alertmanager.libsonnet @@ -0,0 +1,246 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'alertmanager.json': + ($.dashboard('Cortex / Alertmanager') + { uid: 'a76bee5913c97c918d9e56a3cc88cc28' }) + .addClusterSelectorTemplates() + .addRow( + ($.row('Headlines') + { + height: '100px', + showTitle: false, + }) + .addPanel( + $.panel('Total Alerts') + + $.statPanel('sum(cluster_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $.jobMatcher('alertmanager')], format='short') + ) + .addPanel( + $.panel('Total Silences') + + $.statPanel('sum(cluster_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $.jobMatcher('alertmanager')], format='short') + ) + .addPanel( + $.panel('Tenants') + + $.statPanel('max(cortex_alertmanager_tenants_discovered{%s})' % $.jobMatcher('alertmanager'), format='short') + ) + ) + .addRow( + $.row('Alerts Received') + .addPanel( + $.panel('APS') + + $.queryPanel( + [ + ||| + sum(cluster_job:cortex_alertmanager_alerts_received_total:rate5m{%s}) + - + sum(cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s}) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + ) + ) + .addRow( + $.row('Alert Notifications') + .addPanel( + $.panel('NPS') + + $.queryPanel( + [ + ||| + sum(cluster_job_integration:cortex_alertmanager_notifications_total:rate5m{%s}) + - + sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + ) + .addPanel( + $.panel('NPS by integration') + + $.queryPanel( + [ + ||| + ( + sum(cluster_job_integration:cortex_alertmanager_notifications_total:rate5m{%s}) by(integration) + - + sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration) + ) > 0 + or on () vector(0) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)' % $.jobMatcher('alertmanager'), + ], + ['success - {{ integration }}', 'failed - {{ integration }}'] + ) + ) + .addPanel( + $.panel('Latency') + + $.latencyPanel('cortex_alertmanager_notification_latency_seconds', '{%s}' % $.jobMatcher('alertmanager')) + ) + ) + .addRow( + $.row('Configuration API (gateway) + Alertmanager UI') + .addPanel( + $.panel('QPS') + + $.qpsPanel('cortex_request_duration_seconds_count{%s, route=~"api_v1_alerts|alertmanager"}' % $.jobMatcher($._config.job_names.gateway)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.gateway) + [utils.selector.re('route', 'api_v1_alerts|alertmanager')]) + ) + ) + .addRows( + $.getObjectStoreRows('Alertmanager Configuration Object Store (Alertmanager accesses)', 'alertmanager-storage') + ) + .addRow( + $.row('Replication') + .addPanel( + $.panel('Per %s Tenants' % $._config.per_instance_label) + + $.queryPanel( + 'max by(%s) 
(cortex_alertmanager_tenants_owned{%s})' % [$._config.per_instance_label, $.jobMatcher('alertmanager')], + '{{%s}}' % $._config.per_instance_label + ) + + $.stack + ) + .addPanel( + $.panel('Per %s Alerts' % $._config.per_instance_label) + + $.queryPanel( + 'sum by(%s) (cluster_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $._config.per_instance_label, $.jobMatcher('alertmanager')], + '{{%s}}' % $._config.per_instance_label + ) + + $.stack + ) + .addPanel( + $.panel('Per %s Silences' % $._config.per_instance_label) + + $.queryPanel( + 'sum by(%s) (cluster_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $._config.per_instance_label, $.jobMatcher('alertmanager')], + '{{%s}}' % $._config.per_instance_label + ) + + $.stack + ) + ) + .addRow( + $.row('Tenant Configuration Sync') + .addPanel( + $.panel('Syncs/sec') + + $.queryPanel( + [ + ||| + sum(rate(cortex_alertmanager_sync_configs_total{%s}[$__rate_interval])) + - + sum(rate(cortex_alertmanager_sync_configs_failed_total{%s}[$__rate_interval])) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(rate(cortex_alertmanager_sync_configs_failed_total{%s}[$__rate_interval]))' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + ) + .addPanel( + $.panel('Syncs/sec (By Reason)') + + $.queryPanel( + 'sum by(reason) (rate(cortex_alertmanager_sync_configs_total{%s}[$__rate_interval]))' % $.jobMatcher('alertmanager'), + '{{reason}}' + ) + ) + .addPanel( + $.panel('Ring Check Errors/sec') + + $.queryPanel( + 'sum (rate(cortex_alertmanager_ring_check_errors_total{%s}[$__rate_interval]))' % $.jobMatcher('alertmanager'), + 'errors' + ) + ) + ) + .addRow( + $.row('Sharding Initial State Sync') + .addPanel( + $.panel('Initial syncs /sec') + + $.queryPanel( + 'sum by(outcome) (rate(cortex_alertmanager_state_initial_sync_completed_total{%s}[$__rate_interval]))' % $.jobMatcher('alertmanager'), + '{{outcome}}' + ) + { + targets: [ + target { + interval: '1m', + } + for target in super.targets + ], + } + ) + .addPanel( + $.panel('Initial sync duration') + + $.latencyPanel('cortex_alertmanager_state_initial_sync_duration_seconds', '{%s}' % $.jobMatcher('alertmanager')) + { + targets: [ + target { + interval: '1m', + } + for target in super.targets + ], + } + ) + .addPanel( + $.panel('Fetch state from other alertmanagers /sec') + + $.queryPanel( + [ + ||| + sum(rate(cortex_alertmanager_state_fetch_replica_state_total{%s}[$__rate_interval])) + - + sum(rate(cortex_alertmanager_state_fetch_replica_state_failed_total{%s}[$__rate_interval])) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(rate(cortex_alertmanager_state_fetch_replica_state_failed_total{%s}[$__rate_interval]))' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + { + targets: [ + target { + interval: '1m', + } + for target in super.targets + ], + } + ) + ) + .addRow( + $.row('Sharding Runtime State Sync') + .addPanel( + $.panel('Replicate state to other alertmanagers /sec') + + $.queryPanel( + [ + ||| + sum(cluster_job:cortex_alertmanager_state_replication_total:rate5m{%s}) + - + sum(cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s}) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + ) + .addPanel( + $.panel('Merge state from other alertmanagers /sec') + + $.queryPanel( + [ + 
||| + sum(cluster_job:cortex_alertmanager_partial_state_merges_total:rate5m{%s}) + - + sum(cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s}) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + ) + .addPanel( + $.panel('Persist state to remote storage /sec') + + $.queryPanel( + [ + ||| + sum(rate(cortex_alertmanager_state_persist_total{%s}[$__rate_interval])) + - + sum(rate(cortex_alertmanager_state_persist_failed_total{%s}[$__rate_interval])) + ||| % [$.jobMatcher('alertmanager'), $.jobMatcher('alertmanager')], + 'sum(rate(cortex_alertmanager_state_persist_failed_total{%s}[$__rate_interval]))' % $.jobMatcher('alertmanager'), + ], + ['success', 'failed'] + ) + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/chunks.libsonnet b/operations/mimir-mixin/dashboards/chunks.libsonnet new file mode 100644 index 00000000000..b82c68800db --- /dev/null +++ b/operations/mimir-mixin/dashboards/chunks.libsonnet @@ -0,0 +1,100 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-chunks.json': + ($.dashboard('Cortex / Chunks') + { uid: 'a56a3fa6284064eb392a115f3acbf744' }) + .addClusterSelectorTemplates() + .addRow( + $.row('Active Series / Chunks') + .addPanel( + $.panel('Series') + + $.queryPanel('sum(cortex_ingester_memory_series{%s})' % $.jobMatcher($._config.job_names.ingester), 'series'), + ) + .addPanel( + $.panel('Chunks per series') + + $.queryPanel('sum(cortex_ingester_memory_chunks{%s}) / sum(cortex_ingester_memory_series{%s})' % [$.jobMatcher($._config.job_names.ingester), $.jobMatcher($._config.job_names.ingester)], 'chunks'), + ) + ) + .addRow( + $.row('Flush Stats') + .addPanel( + $.panel('Utilization') + + $.latencyPanel('cortex_ingester_chunk_utilization', '{%s}' % $.jobMatcher($._config.job_names.ingester), multiplier='1') + + { yaxes: $.yaxes('percentunit') }, + ) + .addPanel( + $.panel('Age') + + $.latencyPanel('cortex_ingester_chunk_age_seconds', '{%s}' % $.jobMatcher($._config.job_names.ingester)), + ), + ) + .addRow( + $.row('Flush Stats') + .addPanel( + $.panel('Size') + + $.latencyPanel('cortex_ingester_chunk_length', '{%s}' % $.jobMatcher($._config.job_names.ingester), multiplier='1') + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Entries') + + $.queryPanel('sum(rate(cortex_chunk_store_index_entries_per_chunk_sum{%s}[5m])) / sum(rate(cortex_chunk_store_index_entries_per_chunk_count{%s}[5m]))' % [$.jobMatcher($._config.job_names.ingester), $.jobMatcher($._config.job_names.ingester)], 'entries'), + ), + ) + .addRow( + $.row('Flush Stats') + .addPanel( + $.panel('Queue Length') + + $.queryPanel('cortex_ingester_flush_queue_length{%s}' % $.jobMatcher($._config.job_names.ingester), '{{%s}}' % $._config.per_instance_label), + ) + .addPanel( + $.panel('Flush Rate') + + $.qpsPanel('cortex_ingester_chunk_age_seconds_count{%s}' % $.jobMatcher($._config.job_names.ingester)), + ), + ), + + 'cortex-wal.json': + ($.dashboard('Cortex / WAL') + { uid: 'd4fb924cdc1581cd8e870e3eb0110bda' }) + .addClusterSelectorTemplates() + .addRow( + $.row('') + .addPanel( + $.panel('Bytes Logged (WAL+Checkpoint) / ingester / second') + + $.queryPanel('avg(rate(cortex_ingester_wal_logged_bytes_total{%(m)s}[$__rate_interval])) + avg(rate(cortex_ingester_checkpoint_logged_bytes_total{%(m)s}[$__rate_interval]))' % { m: 
$.jobMatcher($._config.job_names.ingester) }, 'bytes') + + { yaxes: $.yaxes('bytes') }, + ) + ) + .addRow( + $.row('WAL') + .addPanel( + $.panel('Records logged / ingester / second') + + $.queryPanel('avg(rate(cortex_ingester_wal_records_logged_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), 'records'), + ) + .addPanel( + $.panel('Bytes per record') + + $.queryPanel('avg(rate(cortex_ingester_wal_logged_bytes_total{%(m)s}[$__rate_interval]) / rate(cortex_ingester_wal_records_logged_total{%(m)s}[$__rate_interval]))' % { m: $.jobMatcher($._config.job_names.ingester) }, 'bytes') + + { yaxes: $.yaxes('bytes') }, + ) + .addPanel( + $.panel('Bytes per sample') + + $.queryPanel('avg(rate(cortex_ingester_wal_logged_bytes_total{%(m)s}[$__rate_interval]) / rate(cortex_ingester_ingested_samples_total{%(m)s}[$__rate_interval]))' % { m: $.jobMatcher($._config.job_names.ingester) }, 'bytes') + + { yaxes: $.yaxes('bytes') }, + ) + .addPanel( + $.panel('Min(available disk space)') + + $.queryPanel('min(kubelet_volume_stats_available_bytes{cluster=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~"ingester.*"})', 'bytes') + + { yaxes: $.yaxes('bytes') }, + ) + ) + .addRow( + $.row('Checkpoint') + .addPanel( + $.panel('Checkpoint creation/deletion / sec') + + $.queryPanel('rate(cortex_ingester_checkpoint_creations_total{%s}[$__rate_interval])' % $.jobMatcher($._config.job_names.ingester), '{{%s}}-creation' % $._config.per_instance_label) + + $.queryPanel('rate(cortex_ingester_checkpoint_deletions_total{%s}[$__rate_interval])' % $.jobMatcher($._config.job_names.ingester), '{{%s}}-deletion' % $._config.per_instance_label), + ) + .addPanel( + $.panel('Checkpoint creation/deletion failed / sec') + + $.queryPanel('rate(cortex_ingester_checkpoint_creations_failed_total{%s}[$__rate_interval])' % $.jobMatcher($._config.job_names.ingester), '{{%s}}-creation' % $._config.per_instance_label) + + $.queryPanel('rate(cortex_ingester_checkpoint_deletions_failed_total{%s}[$__rate_interval])' % $.jobMatcher($._config.job_names.ingester), '{{%s}}-deletion' % $._config.per_instance_label), + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/compactor-resources.libsonnet b/operations/mimir-mixin/dashboards/compactor-resources.libsonnet new file mode 100644 index 00000000000..82a6bce4f07 --- /dev/null +++ b/operations/mimir-mixin/dashboards/compactor-resources.libsonnet @@ -0,0 +1,49 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-compactor-resources.json': + ($.dashboard('Cortex / Compactor Resources') + { uid: 'df9added6f1f4332f95848cca48ebd99' }) + .addClusterSelectorTemplates() + .addRow( + $.row('CPU and Memory') + .addPanel( + $.containerCPUUsagePanel('CPU', 'compactor'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'compactor'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.compactor), + ) + ) + .addRow( + $.row('Network') + .addPanel( + $.containerNetworkReceiveBytesPanel($._config.instance_names.compactor), + ) + .addPanel( + $.containerNetworkTransmitBytesPanel($._config.instance_names.compactor), + ) + ) + .addRow( + $.row('Disk') + .addPanel( + $.containerDiskWritesPanel('Disk Writes', 'compactor'), + ) + .addPanel( + $.containerDiskReadsPanel('Disk Reads', 'compactor'), + ) + .addPanel( + $.containerDiskSpaceUtilization('Disk Space Utilization', 'compactor'), + ) + ) + { + templating+: { + list: [ + // Do not allow to include all 
clusters/namespaces otherwise this dashboard + // risks to explode because it shows resources per pod. + l + (if (l.name == 'cluster' || l.name == 'namespace') then { includeAll: false } else {}) + for l in super.list + ], + }, + }, +} diff --git a/operations/mimir-mixin/dashboards/compactor.libsonnet b/operations/mimir-mixin/dashboards/compactor.libsonnet new file mode 100644 index 00000000000..aeb644919f3 --- /dev/null +++ b/operations/mimir-mixin/dashboards/compactor.libsonnet @@ -0,0 +1,120 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-compactor.json': + ($.dashboard('Cortex / Compactor') + { uid: '9c408e1d55681ecb8a22c9fab46875cc' }) + .addClusterSelectorTemplates() + .addRow( + $.row('Summary') + .addPanel( + $.startedCompletedFailedPanel( + 'Per-instance runs / sec', + 'sum(rate(cortex_compactor_runs_started_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + 'sum(rate(cortex_compactor_runs_completed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + 'sum(rate(cortex_compactor_runs_failed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor) + ) + + $.bars + + { yaxes: $.yaxes('ops') } + + $.panelDescription( + 'Per-instance runs', + ||| + Number of times a compactor instance triggers a compaction across all tenants that it manages. + ||| + ), + ) + .addPanel( + $.panel('Tenants compaction progress') + + $.queryPanel(||| + ( + cortex_compactor_tenants_processing_succeeded{%s} + + cortex_compactor_tenants_processing_failed{%s} + + cortex_compactor_tenants_skipped{%s} + ) / cortex_compactor_tenants_discovered{%s} + ||| % [$.jobMatcher($._config.job_names.compactor), $.jobMatcher($._config.job_names.compactor), $.jobMatcher($._config.job_names.compactor), $.jobMatcher($._config.job_names.compactor)], '{{%s}}' % $._config.per_instance_label) + + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) } + + $.panelDescription( + 'Tenants compaction progress', + ||| + In a multi-tenant cluster, display the progress of tenants that are compacted while compaction is running. + Reset to 0 after the compaction run is completed for all tenants in the shard. + ||| + ), + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('Compacted blocks / sec') + + $.queryPanel('sum(rate(prometheus_tsdb_compactions_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), 'blocks') + + { yaxes: $.yaxes('ops') } + + $.panelDescription( + 'Compacted blocks / sec', + ||| + Rate of blocks that are generated as a result of a compaction operation. + ||| + ), + ) + .addPanel( + $.panel('Per-block compaction duration') + + $.latencyPanel('prometheus_tsdb_compaction_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.compactor)) + + $.panelDescription( + 'Per-block compaction duration', + ||| + Display the amount of time that it has taken to generate a single compacted block. + ||| + ), + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('Average blocks / tenant') + + $.queryPanel('avg(max by(user) (cortex_bucket_blocks_count{%s}))' % $.jobMatcher($._config.job_names.compactor), 'avg'), + ) + .addPanel( + $.panel('Tenants with largest number of blocks') + + $.queryPanel('topk(10, max by(user) (cortex_bucket_blocks_count{%s}))' % $.jobMatcher($._config.job_names.compactor), '{{user}}') + + $.panelDescription( + 'Tenants with largest number of blocks', + ||| + The 10 tenants with the largest number of blocks. 
+ ||| + ), + ) + ) + .addRow( + $.row('Garbage Collector') + .addPanel( + $.panel('Blocks marked for deletion / sec') + + $.queryPanel('sum(rate(cortex_compactor_blocks_marked_for_deletion_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), 'blocks') + + { yaxes: $.yaxes('ops') }, + ) + .addPanel( + $.successFailurePanel( + 'Blocks deletions / sec', + // The cortex_compactor_blocks_cleaned_total tracks the number of successfully + // deleted blocks. + 'sum(rate(cortex_compactor_blocks_cleaned_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + 'sum(rate(cortex_compactor_block_cleanup_failures_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + ) + { yaxes: $.yaxes('ops') } + ) + ) + .addRow( + $.row('Metadata Sync') + .addPanel( + $.successFailurePanel( + 'Metadata Syncs / sec', + // The cortex_compactor_meta_syncs_total metric is incremented each time a per-tenant + // metadata sync is triggered. + 'sum(rate(cortex_compactor_meta_syncs_total{%s}[$__rate_interval])) - sum(rate(cortex_compactor_meta_sync_failures_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.compactor), $.jobMatcher($._config.job_names.compactor)], + 'sum(rate(cortex_compactor_meta_sync_failures_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + ) + { yaxes: $.yaxes('ops') } + ) + .addPanel( + $.panel('Metadata Sync Duration') + + // This metric tracks the duration of a per-tenant metadata sync. + $.latencyPanel('cortex_compactor_meta_sync_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.compactor)), + ) + ) + .addRows($.getObjectStoreRows('Object Store', 'compactor')), +} diff --git a/operations/mimir-mixin/dashboards/comparison.libsonnet b/operations/mimir-mixin/dashboards/comparison.libsonnet new file mode 100644 index 00000000000..1716f7d4c51 --- /dev/null +++ b/operations/mimir-mixin/dashboards/comparison.libsonnet @@ -0,0 +1,105 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') +{ + 'cortex-blocks-vs-chunks.json': + ($.dashboard('Cortex / Blocks vs Chunks') + { uid: '0e2b4dd23df9921972e3fb554c0fc483' }) + .addMultiTemplate('cluster', 'kube_pod_container_info{image=~".*cortex.*"}', 'cluster') + .addTemplate('blocks_namespace', 'kube_pod_container_info{image=~".*cortex.*"}', 'namespace') + .addTemplate('chunks_namespace', 'kube_pod_container_info{image=~".*cortex.*"}', 'namespace') + .addRow( + $.row('Ingesters') + .addPanel( + $.panel('Samples / sec') + + $.queryPanel('sum(rate(cortex_ingester_ingested_samples_total{cluster=~"$cluster",job=~"($blocks_namespace)/ingester"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum(rate(cortex_ingester_ingested_samples_total{cluster=~"$cluster",job=~"($chunks_namespace)/ingester"}[$__rate_interval]))', 'chunks') + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('Blocks Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', [utils.selector.re('cluster', '$cluster'), utils.selector.re('job', '($blocks_namespace)/ingester'), utils.selector.eq('route', '/cortex.Ingester/Push')]) + ) + .addPanel( + $.panel('Chunks Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', [utils.selector.re('cluster', '$cluster'), utils.selector.re('job', '($chunks_namespace)/ingester'), utils.selector.eq('route', '/cortex.Ingester/Push')]) + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('CPU per sample') + + 
$.queryPanel('sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster",namespace="$blocks_namespace",container="ingester"}[$__rate_interval])) / sum(rate(cortex_ingester_ingested_samples_total{cluster=~"$cluster",job="$blocks_namespace/ingester"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster",namespace="$chunks_namespace",container="ingester"}[$__rate_interval])) / sum(rate(cortex_ingester_ingested_samples_total{cluster=~"$cluster",job="$chunks_namespace/ingester"}[$__rate_interval]))', 'chunks') + ) + .addPanel( + $.panel('Memory per active series') + + $.queryPanel('sum(container_memory_working_set_bytes{cluster=~"$cluster",namespace="$blocks_namespace",container="ingester"}) / sum(cortex_ingester_memory_series{cluster=~"$cluster",job=~"$blocks_namespace/ingester"})', 'blocks - working set') + + $.queryPanel('sum(container_memory_working_set_bytes{cluster=~"$cluster",namespace="$chunks_namespace",container="ingester"}) / sum(cortex_ingester_memory_series{cluster=~"$cluster",job=~"$chunks_namespace/ingester"})', 'chunks - working set') + + $.queryPanel('sum(go_memstats_heap_inuse_bytes{cluster=~"$cluster",job=~"$blocks_namespace/ingester"}) / sum(cortex_ingester_memory_series{cluster=~"$cluster",job=~"$blocks_namespace/ingester"})', 'blocks - heap inuse') + + $.queryPanel('sum(go_memstats_heap_inuse_bytes{cluster=~"$cluster",job=~"$chunks_namespace/ingester"}) / sum(cortex_ingester_memory_series{cluster=~"$cluster",job=~"$chunks_namespace/ingester"})', 'chunks - heap inuse') + + { yaxes: $.yaxes('bytes') } + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('CPU') + + $.queryPanel('sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster",namespace="$blocks_namespace",container="ingester"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster",namespace="$chunks_namespace",container="ingester"}[$__rate_interval]))', 'chunks') + ) + .addPanel( + $.panel('Memory') + + $.queryPanel('sum(container_memory_working_set_bytes{cluster=~"$cluster",namespace="$blocks_namespace",container="ingester"})', 'blocks - working set') + + $.queryPanel('sum(container_memory_working_set_bytes{cluster=~"$cluster",namespace="$chunks_namespace",container="ingester"})', 'chunks - working set') + + $.queryPanel('sum(go_memstats_heap_inuse_bytes{cluster=~"$cluster",job=~"$blocks_namespace/ingester"})', 'blocks - heap inuse') + + $.queryPanel('sum(go_memstats_heap_inuse_bytes{cluster=~"$cluster",job=~"$chunks_namespace/ingester"})', 'chunks - heap inuse') + + { yaxes: $.yaxes('bytes') } + ) + ) + .addRow( + $.row('Queriers') + .addPanel( + $.panel('Queries / sec (query-frontend)') + + $.queryPanel('sum(rate(cortex_request_duration_seconds_count{cluster=~"$cluster",job="$blocks_namespace/query-frontend",route!="metrics"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum(rate(cortex_request_duration_seconds_count{cluster=~"$cluster",job="$chunks_namespace/query-frontend",route!="metrics"}[$__rate_interval]))', 'chunks') + ) + .addPanel( + $.panel('Queries / sec (query-tee)') + + $.queryPanel('sum(rate(cortex_querytee_request_duration_seconds_count{cluster=~"$cluster",backend=~".*\\\\.$blocks_namespace\\\\..*"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum(rate(cortex_querytee_request_duration_seconds_count{cluster=~"$cluster",backend=~".*\\\\.$chunks_namespace\\\\..*"}[$__rate_interval]))', 'chunks') + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('Latency 99th') + + 
$.queryPanel('histogram_quantile(0.99, sum by(backend, le) (rate(cortex_querytee_request_duration_seconds_bucket{cluster=~"$cluster",backend=~".*\\\\.$blocks_namespace\\\\..*"}[$__rate_interval])))', 'blocks') + + $.queryPanel('histogram_quantile(0.99, sum by(backend, le) (rate(cortex_querytee_request_duration_seconds_bucket{cluster=~"$cluster",backend=~".*\\\\.$chunks_namespace\\\\..*"}[$__rate_interval])))', 'chunks') + + { yaxes: $.yaxes('s') } + ) + .addPanel( + $.panel('Latency average') + + $.queryPanel('sum by(backend) (rate(cortex_querytee_request_duration_seconds_sum{cluster=~"$cluster",backend=~".*\\\\.$blocks_namespace\\\\..*"}[$__rate_interval])) / sum by(backend) (rate(cortex_querytee_request_duration_seconds_count{cluster=~"$cluster",backend=~".*\\\\.$blocks_namespace\\\\..*"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum by(backend) (rate(cortex_querytee_request_duration_seconds_sum{cluster=~"$cluster",backend=~".*\\\\.$chunks_namespace\\\\..*"}[$__rate_interval])) / sum by(backend) (rate(cortex_querytee_request_duration_seconds_count{cluster=~"$cluster",backend=~".*\\\\.$chunks_namespace\\\\..*"}[$__rate_interval]))', 'chunks') + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('CPU') + + $.queryPanel('sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster",namespace="$blocks_namespace",container="querier"}[$__rate_interval]))', 'blocks') + + $.queryPanel('sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster",namespace="$chunks_namespace",container="querier"}[$__rate_interval]))', 'chunks') + ) + .addPanel( + $.panel('Memory') + + $.queryPanel('sum(container_memory_working_set_bytes{cluster=~"$cluster",namespace="$blocks_namespace",container="querier"})', 'blocks - working set') + + $.queryPanel('sum(container_memory_working_set_bytes{cluster=~"$cluster",namespace="$chunks_namespace",container="querier"})', 'chunks - working set') + + $.queryPanel('sum(go_memstats_heap_inuse_bytes{cluster=~"$cluster",job=~"$blocks_namespace/querier"})', 'blocks - heap inuse') + + $.queryPanel('sum(go_memstats_heap_inuse_bytes{cluster=~"$cluster",job=~"$chunks_namespace/querier"})', 'chunks - heap inuse') + + { yaxes: $.yaxes('bytes') } + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/config.libsonnet b/operations/mimir-mixin/dashboards/config.libsonnet new file mode 100644 index 00000000000..9240ef89dc7 --- /dev/null +++ b/operations/mimir-mixin/dashboards/config.libsonnet @@ -0,0 +1,26 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + + 'cortex-config.json': + ($.dashboard('Cortex / Config') + { uid: '61bb048ced9817b2d3e07677fb1c6290' }) + .addClusterSelectorTemplates() + .addRow( + $.row('Startup config file') + .addPanel( + $.panel('Startup config file hashes') + + $.queryPanel('count(cortex_config_hash{%s}) by (sha256)' % $.namespaceMatcher(), 'sha256:{{sha256}}') + + $.stack + + { yaxes: $.yaxes('instances') }, + ) + ) + .addRow( + $.row('Runtime config file') + .addPanel( + $.panel('Runtime config file hashes') + + $.queryPanel('count(cortex_runtime_config_hash{%s}) by (sha256)' % $.namespaceMatcher(), 'sha256:{{sha256}}') + + $.stack + + { yaxes: $.yaxes('instances') }, + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/dashboard-utils.libsonnet b/operations/mimir-mixin/dashboards/dashboard-utils.libsonnet new file mode 100644 index 00000000000..981614ac83e --- /dev/null +++ b/operations/mimir-mixin/dashboards/dashboard-utils.libsonnet @@ -0,0 +1,506 @@ +local 
utils = import 'mixin-utils/utils.libsonnet'; + +(import 'grafana-builder/grafana.libsonnet') { + + _config:: error 'must provide _config', + + // Override the dashboard constructor to add: + // - default tags, + // - some links that propagate the selected cluster. + dashboard(title):: + super.dashboard(title) + { + addRowIf(condition, row):: + if condition + then self.addRow(row) + else self, + + addRowsIf(condition, rows):: + if condition + then + local reduceRows(dashboard, remainingRows) = + if (std.length(remainingRows) == 0) + then dashboard + else + reduceRows( + dashboard.addRow(remainingRows[0]), + std.slice(remainingRows, 1, std.length(remainingRows), 1) + ) + ; + reduceRows(self, rows) + else self, + + addRows(rows):: + self.addRowsIf(true, rows), + + addClusterSelectorTemplates(multi=true):: + local d = self { + tags: $._config.tags, + links: [ + { + asDropdown: true, + icon: 'external link', + includeVars: true, + keepTime: true, + tags: $._config.tags, + targetBlank: false, + title: 'Cortex Dashboards', + type: 'dashboards', + }, + ], + }; + + if multi then + if $._config.singleBinary + then d.addMultiTemplate('job', 'cortex_build_info', 'job') + else d + .addMultiTemplate('cluster', 'cortex_build_info', 'cluster') + .addMultiTemplate('namespace', 'cortex_build_info{cluster=~"$cluster"}', 'namespace') + else + if $._config.singleBinary + then d.addTemplate('job', 'cortex_build_info', 'job') + else d + .addTemplate('cluster', 'cortex_build_info', 'cluster') + .addTemplate('namespace', 'cortex_build_info{cluster=~"$cluster"}', 'namespace'), + }, + + // The mixin allows specializing the job selector depending on whether it's a single binary + // deployment or a namespaced one. + jobMatcher(job):: + if $._config.singleBinary + then 'job=~"$job"' + else 'cluster=~"$cluster", job=~"($namespace)/%s"' % job, + + namespaceMatcher():: + if $._config.singleBinary + then 'job=~"$job"' + else 'cluster=~"$cluster", namespace=~"$namespace"', + + jobSelector(job):: + if $._config.singleBinary + then [utils.selector.noop('cluster'), utils.selector.re('job', '$job')] + else [utils.selector.re('cluster', '$cluster'), utils.selector.re('job', '($namespace)/%s' % job)], + + queryPanel(queries, legends, legendLink=null):: + super.queryPanel(queries, legends, legendLink) + { + targets: [ + target { + interval: '15s', + } + for target in super.targets + ], + }, + + // hiddenLegendQueryPanel is a standard query panel designed to handle a large number of series. It hides the legend, doesn't fill the series and + // sorts the tooltip descending. + hiddenLegendQueryPanel(queries, legends, legendLink=null):: + $.queryPanel(queries, legends, legendLink) + + { + legend: { show: false }, + fill: 0, + tooltip: { sort: 2 }, + }, + + qpsPanel(selector):: + super.qpsPanel(selector) + { + targets: [ + target { + interval: '15s', + } + for target in super.targets + ], + }, + + latencyPanel(metricName, selector, multiplier='1e3'):: + super.latencyPanel(metricName, selector, multiplier) + { + targets: [ + target { + interval: '15s', + } + for target in super.targets + ], + }, + + successFailurePanel(title, successMetric, failureMetric):: + $.panel(title) + + $.queryPanel([successMetric, failureMetric], ['successful', 'failed']) + + $.stack + { + aliasColors: { + successful: '#7EB26D', + failed: '#E24D42', + }, + }, + + // Displays started, completed and failed rate. 
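+  // Example usage (as in the compactor dashboard): pass one rate() query per phase and optionally render it as bars, e.g.: + //   $.startedCompletedFailedPanel( + //     'Per-instance runs / sec', + //     'sum(rate(cortex_compactor_runs_started_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + //     'sum(rate(cortex_compactor_runs_completed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor), + //     'sum(rate(cortex_compactor_runs_failed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.compactor) + //   ) + $.bars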
+ startedCompletedFailedPanel(title, startedMetric, completedMetric, failedMetric):: + $.panel(title) + + $.queryPanel([startedMetric, completedMetric, failedMetric], ['started', 'completed', 'failed']) + + $.stack + { + aliasColors: { + started: '#34CCEB', + completed: '#7EB26D', + failed: '#E24D42', + }, + }, + + containerCPUUsagePanel(title, containerName):: + $.panel(title) + + $.queryPanel([ + 'sum by(%s) (rate(container_cpu_usage_seconds_total{%s,container=~"%s"}[$__rate_interval]))' % [$._config.per_instance_label, $.namespaceMatcher(), containerName], + 'min(container_spec_cpu_quota{%s,container=~"%s"} / container_spec_cpu_period{%s,container=~"%s"})' % [$.namespaceMatcher(), containerName, $.namespaceMatcher(), containerName], + ], ['{{%s}}' % $._config.per_instance_label, 'limit']) + + { + seriesOverrides: [ + { + alias: 'limit', + color: '#E02F44', + fill: 0, + }, + ], + tooltip: { sort: 2 }, // Sort descending. + }, + + containerMemoryWorkingSetPanel(title, containerName):: + $.panel(title) + + $.queryPanel([ + // We use "max" instead of "sum" otherwise during a rolling update of a statefulset we will end up + // summing the memory of the old instance/pod (whose metric will be stale for 5m) to the new instance/pod. + 'max by(%s) (container_memory_working_set_bytes{%s,container=~"%s"})' % [$._config.per_instance_label, $.namespaceMatcher(), containerName], + 'min(container_spec_memory_limit_bytes{%s,container=~"%s"} > 0)' % [$.namespaceMatcher(), containerName], + ], ['{{%s}}' % $._config.per_instance_label, 'limit']) + + { + seriesOverrides: [ + { + alias: 'limit', + color: '#E02F44', + fill: 0, + }, + ], + yaxes: $.yaxes('bytes'), + tooltip: { sort: 2 }, // Sort descending. + }, + + containerNetworkPanel(title, metric, instanceName):: + $.panel(title) + + $.queryPanel( + 'sum by(%(instance)s) (rate(%(metric)s{%(namespace)s,%(instance)s=~"%(instanceName)s"}[$__rate_interval]))' % { + namespace: $.namespaceMatcher(), + metric: metric, + instance: $._config.per_instance_label, + instanceName: instanceName, + }, '{{%s}}' % $._config.per_instance_label + ) + + $.stack + + { yaxes: $.yaxes('Bps') }, + + containerNetworkReceiveBytesPanel(instanceName):: + $.containerNetworkPanel('Receive Bandwidth', 'container_network_receive_bytes_total', instanceName), + + containerNetworkTransmitBytesPanel(instanceName):: + $.containerNetworkPanel('Transmit Bandwidth', 'container_network_transmit_bytes_total', instanceName), + + containerDiskWritesPanel(title, containerName):: + $.panel(title) + + $.queryPanel( + ||| + sum by(%s, %s, device) ( + rate( + node_disk_written_bytes_total[$__rate_interval] + ) + ) + + + %s + ||| % [ + $._config.per_node_label, + $._config.per_instance_label, + $.filterNodeDiskContainer(containerName), + ], + '{{%s}} - {{device}}' % $._config.per_instance_label + ) + + $.stack + + { yaxes: $.yaxes('Bps') }, + + containerDiskReadsPanel(title, containerName):: + $.panel(title) + + $.queryPanel( + ||| + sum by(%s, %s, device) ( + rate( + node_disk_read_bytes_total[$__rate_interval] + ) + ) + %s + ||| % [ + $._config.per_node_label, + $._config.per_instance_label, + $.filterNodeDiskContainer(containerName), + ], + '{{%s}} - {{device}}' % $._config.per_instance_label + ) + + $.stack + + { yaxes: $.yaxes('Bps') }, + + containerDiskSpaceUtilization(title, containerName):: + $.panel(title) + + $.queryPanel( + ||| + max by(persistentvolumeclaim) ( + kubelet_volume_stats_used_bytes{%(namespace)s} / + kubelet_volume_stats_capacity_bytes{%(namespace)s} + ) + and + count 
by(persistentvolumeclaim) ( + kube_persistentvolumeclaim_labels{ + %(namespace)s, + %(label)s + } + ) + ||| % { + namespace: $.namespaceMatcher(), + label: $.containerLabelMatcher(containerName), + }, '{{persistentvolumeclaim}}' + ) + + { yaxes: $.yaxes('percentunit') }, + + containerLabelMatcher(containerName):: + if containerName == 'ingester' + then 'label_name=~"ingester.*"' + else 'label_name="%s"' % containerName, + + goHeapInUsePanel(title, jobName):: + $.panel(title) + + $.queryPanel( + 'sum by(%s) (go_memstats_heap_inuse_bytes{%s})' % [$._config.per_instance_label, $.jobMatcher(jobName)], + '{{%s}}' % $._config.per_instance_label + ) + + { + yaxes: $.yaxes('bytes'), + tooltip: { sort: 2 }, // Sort descending. + }, + + newStatPanel(queries, legends='', unit='percentunit', decimals=1, thresholds=[], instant=false, novalue=''):: + super.queryPanel(queries, legends) + { + type: 'stat', + targets: [ + target { + instant: instant, + interval: '', + + // Reset defaults from queryPanel(). + format: null, + intervalFactor: null, + step: null, + } + for target in super.targets + ], + fieldConfig: { + defaults: { + color: { mode: 'thresholds' }, + decimals: decimals, + thresholds: { + mode: 'absolute', + steps: thresholds, + }, + noValue: novalue, + unit: unit, + }, + overrides: [], + }, + }, + + barGauge(queries, legends='', thresholds=[], unit='short', min=null, max=null):: + super.queryPanel(queries, legends) + { + type: 'bargauge', + targets: [ + target { + // Reset defaults from queryPanel(). + format: null, + intervalFactor: null, + step: null, + } + for target in super.targets + ], + fieldConfig: { + defaults: { + color: { mode: 'thresholds' }, + mappings: [], + max: max, + min: min, + thresholds: { + mode: 'absolute', + steps: thresholds, + }, + unit: unit, + }, + }, + options: { + displayMode: 'basic', + orientation: 'horizontal', + reduceOptions: { + calcs: ['lastNotNull'], + fields: '', + values: false, + }, + }, + }, + + // Switches a panel from lines (default) to bars. 
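+  // Example usage (as in the compactor dashboard): append it to a panel definition, e.g.: + //   $.startedCompletedFailedPanel(...) + $.bars + { yaxes: $.yaxes('ops') }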
+ bars:: { + bars: true, + lines: false, + }, + + textPanel(title, content, options={}):: { + content: content, + datasource: null, + description: '', + mode: 'markdown', + title: title, + transparent: true, + type: 'text', + } + options, + + getObjectStoreRows(title, component):: [ + super.row(title) + .addPanel( + $.panel('Operations / sec') + + $.queryPanel('sum by(operation) (rate(thanos_objstore_bucket_operations_total{%s,component="%s"}[$__rate_interval]))' % [$.namespaceMatcher(), component], '{{operation}}') + + $.stack + + { yaxes: $.yaxes('rps') }, + ) + .addPanel( + $.panel('Error rate') + + $.queryPanel('sum by(operation) (rate(thanos_objstore_bucket_operation_failures_total{%s,component="%s"}[$__rate_interval])) / sum by(operation) (rate(thanos_objstore_bucket_operations_total{%s,component="%s"}[$__rate_interval]))' % [$.namespaceMatcher(), component, $.namespaceMatcher(), component], '{{operation}}') + + { yaxes: $.yaxes('percentunit') }, + ) + .addPanel( + $.panel('Latency of Op: Attributes') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="%s",operation="attributes"}' % [$.namespaceMatcher(), component]), + ) + .addPanel( + $.panel('Latency of Op: Exists') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="%s",operation="exists"}' % [$.namespaceMatcher(), component]), + ), + $.row('') + .addPanel( + $.panel('Latency of Op: Get') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="%s",operation="get"}' % [$.namespaceMatcher(), component]), + ) + .addPanel( + $.panel('Latency of Op: GetRange') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="%s",operation="get_range"}' % [$.namespaceMatcher(), component]), + ) + .addPanel( + $.panel('Latency of Op: Upload') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="%s",operation="upload"}' % [$.namespaceMatcher(), component]), + ) + .addPanel( + $.panel('Latency of Op: Delete') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="%s",operation="delete"}' % [$.namespaceMatcher(), component]), + ), + ], + + thanosMemcachedCache(title, jobName, component, cacheName):: + local config = { + jobMatcher: $.jobMatcher(jobName), + component: component, + cacheName: cacheName, + }; + super.row(title) + .addPanel( + $.panel('Requests / sec') + + $.queryPanel( + ||| + sum by(operation) ( + rate( + thanos_memcached_operations_total{ + %(jobMatcher)s, + component="%(component)s", + name="%(cacheName)s" + }[$__rate_interval] + ) + ) + ||| % config, + '{{operation}}' + ) + + $.stack + + { yaxes: $.yaxes('ops') } + ) + .addPanel( + $.panel('Latency (getmulti)') + + $.latencyPanel( + 'thanos_memcached_operation_duration_seconds', + ||| + { + %(jobMatcher)s, + operation="getmulti", + component="%(component)s", + name="%(cacheName)s" + } + ||| % config + ) + ) + .addPanel( + $.panel('Hit ratio') + + $.queryPanel( + ||| + sum( + rate( + thanos_cache_memcached_hits_total{ + %(jobMatcher)s, + component="%(component)s", + name="%(cacheName)s" + }[$__rate_interval] + ) + ) + / + sum( + rate( + thanos_cache_memcached_requests_total{ + %(jobMatcher)s, + component="%(component)s", + name="%(cacheName)s" + }[$__rate_interval] + ) + ) + ||| % config, + 'items' + ) + + { yaxes: $.yaxes('percentunit') } + ), + + filterNodeDiskContainer(containerName):: + ||| + ignoring(%s) group_right() ( + label_replace( + count by( + %s, + %s, + device + ) + ( 
+ container_fs_writes_bytes_total{ + %s, + container="%s", + device!~".*sda.*" + } + ), + "device", + "$1", + "device", + "/dev/(.*)" + ) * 0 + ) + ||| % [ + $._config.per_instance_label, + $._config.per_node_label, + $._config.per_instance_label, + $.namespaceMatcher(), + containerName, + ], + + panelDescription(title, description):: { + description: ||| + ### %s + %s + ||| % [title, description], + }, +} diff --git a/operations/mimir-mixin/dashboards/object-store.libsonnet b/operations/mimir-mixin/dashboards/object-store.libsonnet new file mode 100644 index 00000000000..69e257b60dd --- /dev/null +++ b/operations/mimir-mixin/dashboards/object-store.libsonnet @@ -0,0 +1,65 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-object-store.json': + ($.dashboard('Cortex / Object Store') + { uid: 'd5a3a4489d57c733b5677fb55370a723' }) + .addClusterSelectorTemplates() + .addRow( + $.row('Components') + .addPanel( + $.panel('RPS / component') + + $.queryPanel('sum by(component) (rate(thanos_objstore_bucket_operations_total{%s}[$__rate_interval]))' % $.namespaceMatcher(), '{{component}}') + + $.stack + + { yaxes: $.yaxes('rps') }, + ) + .addPanel( + $.panel('Error rate / component') + + $.queryPanel('sum by(component) (rate(thanos_objstore_bucket_operation_failures_total{%s}[$__rate_interval])) / sum by(component) (rate(thanos_objstore_bucket_operations_total{%s}[$__rate_interval]))' % [$.namespaceMatcher(), $.namespaceMatcher()], '{{component}}') + + { yaxes: $.yaxes('percentunit') }, + ) + ) + .addRow( + $.row('Operations') + .addPanel( + $.panel('RPS / operation') + + $.queryPanel('sum by(operation) (rate(thanos_objstore_bucket_operations_total{%s}[$__rate_interval]))' % $.namespaceMatcher(), '{{operation}}') + + $.stack + + { yaxes: $.yaxes('rps') }, + ) + .addPanel( + $.panel('Error rate / operation') + + $.queryPanel('sum by(operation) (rate(thanos_objstore_bucket_operation_failures_total{%s}[$__rate_interval])) / sum by(operation) (rate(thanos_objstore_bucket_operations_total{%s}[$__rate_interval]))' % [$.namespaceMatcher(), $.namespaceMatcher()], '{{operation}}') + + { yaxes: $.yaxes('percentunit') }, + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('Op: Get') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,operation="get"}' % $.namespaceMatcher()), + ) + .addPanel( + $.panel('Op: GetRange') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,operation="get_range"}' % $.namespaceMatcher()), + ) + .addPanel( + $.panel('Op: Exists') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,operation="exists"}' % $.namespaceMatcher()), + ) + ) + .addRow( + $.row('') + .addPanel( + $.panel('Op: Attributes') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,operation="attributes"}' % $.namespaceMatcher()), + ) + .addPanel( + $.panel('Op: Upload') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,operation="upload"}' % $.namespaceMatcher()), + ) + .addPanel( + $.panel('Op: Delete') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,operation="delete"}' % $.namespaceMatcher()), + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/queries.libsonnet b/operations/mimir-mixin/dashboards/queries.libsonnet new file mode 100644 index 00000000000..259f5dfabd3 --- /dev/null +++ b/operations/mimir-mixin/dashboards/queries.libsonnet @@ -0,0 +1,286 @@ +local utils = import 
'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + + 'cortex-queries.json': + ($.dashboard('Cortex / Queries') + { uid: 'd9931b1054053c8b972d320774bb8f1d' }) + .addClusterSelectorTemplates() + .addRow( + $.row('Query Frontend') + .addPanel( + $.panel('Queue Duration') + + $.latencyPanel('cortex_query_frontend_queue_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.query_frontend)), + ) + .addPanel( + $.panel('Retries') + + $.latencyPanel('cortex_query_frontend_retries', '{%s}' % $.jobMatcher($._config.job_names.query_frontend), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Queue Length') + + $.queryPanel('cortex_query_frontend_queue_length{%s}' % $.jobMatcher($._config.job_names.query_frontend), '{{cluster}} / {{namespace}} / {{%s}}' % $._config.per_instance_label), + ) + ) + .addRow( + $.row('Query Scheduler') + .addPanel( + $.panel('Queue Duration') + + $.latencyPanel('cortex_query_scheduler_queue_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.query_scheduler)), + ) + .addPanel( + $.panel('Queue Length') + + $.queryPanel('cortex_query_scheduler_queue_length{%s}' % $.jobMatcher($._config.job_names.query_scheduler), '{{cluster}} / {{namespace}} / {{%s}}' % $._config.per_instance_label), + ) + ) + .addRow( + $.row('Query Frontend - Query Splitting and Results Cache') + .addPanel( + $.panel('Intervals per Query') + + $.queryPanel('sum(rate(cortex_frontend_split_queries_total{%s}[1m])) / sum(rate(cortex_frontend_query_range_duration_seconds_count{%s, method="split_by_interval"}[1m]))' % [$.jobMatcher($._config.job_names.query_frontend), $.jobMatcher($._config.job_names.query_frontend)], 'splitting rate') + + $.panelDescription( + 'Intervals per Query', + ||| + The average number of split queries (partitioned by time) executed for a single input query. + ||| + ), + ) + .addPanel( + $.panel('Results Cache Hit %') + + $.queryPanel('sum(rate(cortex_cache_hits{name=~"frontend.+", %s}[1m])) / sum(rate(cortex_cache_fetched_keys{name=~"frontend.+", %s}[1m]))' % [$.jobMatcher($._config.job_names.query_frontend), $.jobMatcher($._config.job_names.query_frontend)], 'Hit Rate') + + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) }, + ) + .addPanel( + $.panel('Results Cache misses') + + $.queryPanel('sum(rate(cortex_cache_fetched_keys{name=~"frontend.+", %s}[1m])) - sum(rate(cortex_cache_hits{name=~"frontend.+", %s}[1m]))' % [$.jobMatcher($._config.job_names.query_frontend), $.jobMatcher($._config.job_names.query_frontend)], 'Miss Rate'), + ) + ) + .addRow( + $.row('Query Frontend - Query sharding') + .addPanel( + $.panel('Sharded Queries Ratio') + + $.queryPanel(||| + sum(rate(cortex_frontend_query_sharding_rewrites_succeeded_total{%s}[$__rate_interval])) / + sum(rate(cortex_frontend_query_sharding_rewrites_attempted_total{%s}[$__rate_interval])) + ||| % [$.jobMatcher($._config.job_names.query_frontend), $.jobMatcher($._config.job_names.query_frontend)], 'sharded queries ratio') + + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) } + + $.panelDescription( + 'Sharded Queries Ratio', + ||| + The % of queries that have been successfully rewritten and executed in a shardable way. + This panel takes into account only the types of queries which are supported by query sharding (e.g. range queries).
+ ||| + ), + ) + .addPanel( + $.panel('Number of Sharded Queries per Query') + + $.latencyPanel('cortex_frontend_sharded_queries_per_query', '{%s}' % $.jobMatcher($._config.job_names.query_frontend), multiplier=1) + + { yaxes: $.yaxes('short') } + + $.panelDescription( + 'Number of Sharded Queries per Query', + ||| + How many sharded queries have been executed for a single input query. It tracks only queries which have + been successfully rewritten in a shardable way. + ||| + ), + ) + ) + .addRow( + $.row('Querier') + .addPanel( + $.panel('Stages') + + $.queryPanel('max by (slice) (prometheus_engine_query_duration_seconds{quantile="0.9",%s}) * 1e3' % $.jobMatcher($._config.job_names.querier), '{{slice}}') + + { yaxes: $.yaxes('ms') } + + $.stack, + ) + .addPanel( + $.panel('Chunk cache misses') + + $.queryPanel('sum(rate(cortex_cache_fetched_keys{%s,name="chunksmemcache"}[1m])) - sum(rate(cortex_cache_hits{%s,name="chunksmemcache"}[1m]))' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)], 'Hit rate'), + ) + .addPanel( + $.panel('Chunk cache corruptions') + + $.queryPanel('sum(rate(cortex_cache_corrupt_chunks_total{%s}[1m]))' % $.jobMatcher($._config.job_names.querier), 'Corrupt chunks'), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Querier - Chunks storage - Index Cache') + .addPanel( + $.panel('Total entries') + + $.queryPanel('sum(querier_cache_added_new_total{cache="store.index-cache-read.fifocache",%s}) - sum(querier_cache_evicted_total{cache="store.index-cache-read.fifocache",%s})' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)], 'Entries'), + ) + .addPanel( + $.panel('Cache Hit %') + + $.queryPanel('(sum(rate(querier_cache_gets_total{cache="store.index-cache-read.fifocache",%s}[1m])) - sum(rate(querier_cache_misses_total{cache="store.index-cache-read.fifocache",%s}[1m]))) / sum(rate(querier_cache_gets_total{cache="store.index-cache-read.fifocache",%s}[1m]))' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)], 'hit rate') + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) }, + ) + .addPanel( + $.panel('Churn Rate') + + $.queryPanel('sum(rate(querier_cache_evicted_total{cache="store.index-cache-read.fifocache",%s}[1m]))' % $.jobMatcher($._config.job_names.querier), 'churn rate'), + ) + ) + .addRow( + $.row('Ingester') + .addPanel( + $.panel('Series per Query') + + utils.latencyRecordingRulePanel('cortex_ingester_queried_series', $.jobSelector($._config.job_names.ingester), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Chunks per Query') + + utils.latencyRecordingRulePanel('cortex_ingester_queried_chunks', $.jobSelector($._config.job_names.ingester), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Samples per Query') + + utils.latencyRecordingRulePanel('cortex_ingester_queried_samples', $.jobSelector($._config.job_names.ingester), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Querier - Chunks storage - Store') + .addPanel( + $.panel('Index Lookups per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_index_lookups_per_query', $.jobSelector($._config.job_names.querier), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Series (pre-intersection) per Query') + + 
utils.latencyRecordingRulePanel('cortex_chunk_store_series_pre_intersection_per_query', $.jobSelector($._config.job_names.querier), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Series (post-intersection) per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_series_post_intersection_per_query', $.jobSelector($._config.job_names.querier), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Chunks per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_chunks_per_query', $.jobSelector($._config.job_names.querier), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Querier - Blocks storage') + .addPanel( + $.panel('Number of store-gateways hit per Query') + + $.latencyPanel('cortex_querier_storegateway_instances_hit_per_query', '{%s}' % $.jobMatcher($._config.job_names.querier), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Refetches of missing blocks per Query') + + $.latencyPanel('cortex_querier_storegateway_refetches_per_query', '{%s}' % $.jobMatcher($._config.job_names.querier), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Consistency checks failed') + + $.queryPanel('sum(rate(cortex_querier_blocks_consistency_checks_failed_total{%s}[1m])) / sum(rate(cortex_querier_blocks_consistency_checks_total{%s}[1m]))' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)], 'Failure Rate') + + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) }, + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('') + .addPanel( + $.panel('Bucket indexes loaded (per querier)') + + $.queryPanel([ + 'max(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier), + 'min(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier), + 'avg(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier), + ], ['Max', 'Min', 'Average']) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.successFailurePanel( + 'Bucket indexes load / sec', + 'sum(rate(cortex_bucket_index_loads_total{%s}[$__rate_interval])) - sum(rate(cortex_bucket_index_load_failures_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)], + 'sum(rate(cortex_bucket_index_load_failures_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.querier), + ) + ) + .addPanel( + $.panel('Bucket indexes load latency') + + $.latencyPanel('cortex_bucket_index_load_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.querier)), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Store-gateway - Blocks storage') + .addPanel( + $.panel('Blocks queried / sec') + + $.queryPanel('sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), 'blocks') + + { yaxes: $.yaxes('ops') }, + ) + .addPanel( + $.panel('Data fetched / sec') + + $.queryPanel('sum by(data_type) (rate(cortex_bucket_store_series_data_fetched_sum{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), '{{data_type}}') + + $.stack + + { yaxes: $.yaxes('ops') }, + ) + .addPanel( + $.panel('Data touched / sec') + + $.queryPanel('sum by(data_type) 
(rate(cortex_bucket_store_series_data_touched_sum{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), '{{data_type}}') + + $.stack + + { yaxes: $.yaxes('ops') }, + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('') + .addPanel( + $.panel('Series fetch duration (per request)') + + $.latencyPanel('cortex_bucket_store_series_get_all_duration_seconds', '{component="store-gateway",%s}' % $.jobMatcher($._config.job_names.store_gateway)), + ) + .addPanel( + $.panel('Series merge duration (per request)') + + $.latencyPanel('cortex_bucket_store_series_merge_duration_seconds', '{component="store-gateway",%s}' % $.jobMatcher($._config.job_names.store_gateway)), + ) + .addPanel( + $.panel('Series returned (per request)') + + $.queryPanel('sum(rate(cortex_bucket_store_series_result_series_sum{component="store-gateway",%s}[$__rate_interval])) / sum(rate(cortex_bucket_store_series_result_series_count{component="store-gateway",%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.store_gateway), $.jobMatcher($._config.job_names.store_gateway)], 'avg series returned'), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('') + .addPanel( + $.panel('Blocks currently loaded') + + $.queryPanel('cortex_bucket_store_blocks_loaded{component="store-gateway",%s}' % $.jobMatcher($._config.job_names.store_gateway), '{{%s}}' % $._config.per_instance_label) + ) + .addPanel( + $.successFailurePanel( + 'Blocks loaded / sec', + 'sum(rate(cortex_bucket_store_block_loads_total{component="store-gateway",%s}[$__rate_interval])) - sum(rate(cortex_bucket_store_block_load_failures_total{component="store-gateway",%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.store_gateway), $.jobMatcher($._config.job_names.store_gateway)], + 'sum(rate(cortex_bucket_store_block_load_failures_total{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), + ) + ) + .addPanel( + $.successFailurePanel( + 'Blocks dropped / sec', + 'sum(rate(cortex_bucket_store_block_drops_total{component="store-gateway",%s}[$__rate_interval])) - sum(rate(cortex_bucket_store_block_drop_failures_total{component="store-gateway",%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.store_gateway), $.jobMatcher($._config.job_names.store_gateway)], + 'sum(rate(cortex_bucket_store_block_drop_failures_total{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), + ) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('') + .addPanel( + $.panel('Lazy loaded index-headers') + + $.queryPanel('cortex_bucket_store_indexheader_lazy_load_total{%s} - cortex_bucket_store_indexheader_lazy_unload_total{%s}' % [$.jobMatcher($._config.job_names.store_gateway), $.jobMatcher($._config.job_names.store_gateway)], '{{%s}}' % $._config.per_instance_label) + ) + .addPanel( + $.panel('Index-header lazy load duration') + + $.latencyPanel('cortex_bucket_store_indexheader_lazy_load_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.store_gateway)), + ) + .addPanel( + $.panel('Series hash cache hit ratio') + + $.queryPanel(||| + sum(rate(cortex_bucket_store_series_hash_cache_hits_total{%s}[$__rate_interval])) + / + sum(rate(cortex_bucket_store_series_hash_cache_requests_total{%s}[$__rate_interval])) + ||| % [$.jobMatcher($._config.job_names.store_gateway), $.jobMatcher($._config.job_names.store_gateway)], 'hit ratio') + + { yaxes: $.yaxes({ 
format: 'percentunit', max: 1 }) }, + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/reads-resources.libsonnet b/operations/mimir-mixin/dashboards/reads-resources.libsonnet new file mode 100644 index 00000000000..f0750c885ac --- /dev/null +++ b/operations/mimir-mixin/dashboards/reads-resources.libsonnet @@ -0,0 +1,124 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-reads-resources.json': + ($.dashboard('Cortex / Reads Resources') + { uid: '2fd2cda9eea8d8af9fbc0a5960425120' }) + .addClusterSelectorTemplates(false) + .addRow( + $.row('Gateway') + .addPanel( + $.containerCPUUsagePanel('CPU', $._config.job_names.gateway), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', $._config.job_names.gateway), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.gateway), + ) + ) + .addRow( + $.row('Query Frontend') + .addPanel( + $.containerCPUUsagePanel('CPU', 'query-frontend'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'query-frontend'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.query_frontend), + ) + ) + .addRow( + $.row('Query Scheduler') + .addPanel( + $.containerCPUUsagePanel('CPU', 'query-scheduler'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'query-scheduler'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.query_scheduler), + ) + ) + .addRow( + $.row('Querier') + .addPanel( + $.containerCPUUsagePanel('CPU', 'querier'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'querier'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.querier), + ) + ) + .addRow( + $.row('Ingester') + .addPanel( + $.containerCPUUsagePanel('CPU', 'ingester'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'ingester'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.ingester), + ) + ) + .addRow( + $.row('Ruler') + .addPanel( + $.panel('Rules') + + $.queryPanel( + 'sum by(%s) (cortex_prometheus_rule_group_rules{%s})' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.ruler)], + '{{%s}}' % $._config.per_instance_label + ), + ) + .addPanel( + $.containerCPUUsagePanel('CPU', 'ruler'), + ) + ) + .addRow( + $.row('') + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'ruler'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.ruler), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Store-gateway') + .addPanel( + $.containerCPUUsagePanel('CPU', 'store-gateway'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'store-gateway'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.store_gateway), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('') + .addPanel( + $.containerDiskWritesPanel('Disk Writes', 'store-gateway'), + ) + .addPanel( + $.containerDiskReadsPanel('Disk Reads', 'store-gateway'), + ) + .addPanel( + $.containerDiskSpaceUtilization('Disk Space Utilization', 'store-gateway'), + ) + ) + { + templating+: { + list: [ + // Do not allow to include all clusters/namespaces otherwise this dashboard + // risks to explode because it shows resources per pod. 
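+ // The comprehension below copies every template variable from the parent dashboard
+ // unchanged, except that 'cluster' and 'namespace' get includeAll: false, which removes
+ // the "All" option from those dropdowns. Illustratively, a variable arriving as
+ // { name: 'namespace', includeAll: true, ... } is emitted as { name: 'namespace', includeAll: false, ... }.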
+ l + (if (l.name == 'cluster' || l.name == 'namespace') then { includeAll: false } else {}) + for l in super.list + ], + }, + }, +} diff --git a/operations/mimir-mixin/dashboards/reads.libsonnet b/operations/mimir-mixin/dashboards/reads.libsonnet new file mode 100644 index 00000000000..9bc9b7d6b31 --- /dev/null +++ b/operations/mimir-mixin/dashboards/reads.libsonnet @@ -0,0 +1,404 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-reads.json': + ($.dashboard('Cortex / Reads') + { uid: '8d6ba60eccc4b6eedfa329b24b1bd339' }) + .addClusterSelectorTemplates() + .addRowIf( + $._config.show_dashboard_descriptions.reads, + ($.row('Reads dashboard description') { height: '175px', showTitle: false }) + .addPanel( + $.textPanel('', ||| +

+ This dashboard shows health metrics for the Cortex read path.
+ It is broken into sections for each service on the read path, and organized by the order in which the read request flows.
+ Incoming queries travel from the gateway → query frontend → query scheduler → querier → ingester and/or store-gateway (depending on the time range of the query).
+ For each service, there are 3 panels showing (1) requests per second to that service, (2) average, median, and p99 latency of requests to that service, and (3) p99 latency of requests to each instance of that service.
+
+ The dashboard also shows metrics for the 4 optional caches that can be deployed with Cortex:
+ the query results cache, the metadata cache, the chunks cache, and the index cache.
+ These panels will show “no data” if the caches are not deployed.
+
+ Lastly, it also includes metrics for how the ingester and store-gateway interact with object storage.
+ |||), + ) + ) + .addRow( + ($.row('Headlines') + + { + height: '100px', + showTitle: false, + }) + .addPanel( + $.panel('Instant queries / sec') + + $.statPanel(||| + sum( + rate( + cortex_request_duration_seconds_count{ + %(queryFrontend)s, + route=~"(prometheus|api_prom)_api_v1_query" + }[$__rate_interval] + ) + ) + + sum( + rate( + cortex_prometheus_rule_evaluations_total{ + %(ruler)s + }[$__rate_interval] + ) + ) + ||| % { + queryFrontend: $.jobMatcher($._config.job_names.query_frontend), + ruler: $.jobMatcher($._config.job_names.ruler), + }, format='reqps') + + $.panelDescription( + 'Instant queries per second', + ||| + Rate of instant queries per second being made to the system. + Includes both queries made to the /prometheus API as + well as queries from the ruler. + ||| + ), + ) + .addPanel( + $.panel('Range queries / sec') + + $.statPanel(||| + sum( + rate( + cortex_request_duration_seconds_count{ + %(queryFrontend)s, + route=~"(prometheus|api_prom)_api_v1_query_range" + }[$__rate_interval] + ) + ) + ||| % { + queryFrontend: $.jobMatcher($._config.job_names.query_frontend), + }, format='reqps') + + $.panelDescription( + 'Range queries per second', + ||| + Rate of range queries per second being made to + Cortex via the /prometheus API. + ||| + ), + ) + ) + .addRow( + $.row('Gateway') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s, route=~"(prometheus|api_prom)_api_v1_.+"}' % $.jobMatcher($._config.job_names.gateway)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.gateway) + [utils.selector.re('route', '(prometheus|api_prom)_api_v1_.+')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route=~"(prometheus|api_prom)_api_v1_.+"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.gateway)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Query Frontend') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s, route=~"(prometheus|api_prom)_api_v1_.+"}' % $.jobMatcher($._config.job_names.query_frontend)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.query_frontend) + [utils.selector.re('route', '(prometheus|api_prom)_api_v1_.+')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route=~"(prometheus|api_prom)_api_v1_.+"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.query_frontend)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Query Scheduler') + .addPanel( + $.textPanel( + '', + ||| +

+ The query scheduler is an optional service that moves
+ the internal queue from the query-frontend into a
+ separate component.
+ If this service is not deployed,
+ these panels will show "No data."
+ ||| + ) + ) + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_query_scheduler_queue_duration_seconds_count{%s}' % $.jobMatcher($._config.job_names.query_scheduler)) + ) + .addPanel( + $.panel('Latency (Time in Queue)') + + $.latencyPanel('cortex_query_scheduler_queue_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.query_scheduler)) + ) + ) + .addRow( + $.row('Cache - Query Results') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_cache_request_duration_seconds_count{method=~"frontend.+", %s}' % $.jobMatcher($._config.job_names.query_frontend)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_cache_request_duration_seconds', $.jobSelector($._config.job_names.query_frontend) + [utils.selector.re('method', 'frontend.+')]) + ) + ) + .addRow( + $.row('Querier') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_querier_request_duration_seconds_count{%s, route=~"(prometheus|api_prom)_api_v1_.+"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_querier_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.re('route', '(prometheus|api_prom)_api_v1_.+')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_querier_request_duration_seconds_bucket{%s, route=~"(prometheus|api_prom)_api_v1_.+"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.querier)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Ingester') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s,route=~"/cortex.Ingester/Query(Stream)?|/cortex.Ingester/MetricsForLabelMatchers|/cortex.Ingester/LabelValues|/cortex.Ingester/MetricsMetadata"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.re('route', '/cortex.Ingester/Query(Stream)?|/cortex.Ingester/MetricsForLabelMatchers|/cortex.Ingester/LabelValues|/cortex.Ingester/MetricsMetadata')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route=~"/cortex.Ingester/Query(Stream)?|/cortex.Ingester/MetricsForLabelMatchers|/cortex.Ingester/LabelValues|/cortex.Ingester/MetricsMetadata"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.ingester)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Store-gateway') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s,route=~"/gatewaypb.StoreGateway/.*"}' % $.jobMatcher($._config.job_names.store_gateway)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.store_gateway) + [utils.selector.re('route', '/gatewaypb.StoreGateway/.*')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route=~"/gatewaypb.StoreGateway/.*"}[$__rate_interval])))' % 
[$._config.per_instance_label, $.jobMatcher($._config.job_names.store_gateway)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Memcached - Chunks storage - Index') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_cache_request_duration_seconds_count{%s,method="store.index-cache-read.memcache.fetch"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_cache_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.eq('method', 'store.index-cache-read.memcache.fetch')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Memcached - Chunks storage - Chunks') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_cache_request_duration_seconds_count{%s,method="chunksmemcache.fetch"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_cache_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.eq('method', 'chunksmemcache.fetch')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Memcached – Blocks storage – Block index cache (store-gateway accesses)') // Resembles thanosMemcachedCache + .addPanel( + $.panel('Requests / sec') + + $.queryPanel( + ||| + sum by(operation) ( + rate( + thanos_memcached_operations_total{ + component="store-gateway", + name="index-cache", + %s + }[$__rate_interval] + ) + ) + ||| % $.jobMatcher($._config.job_names.store_gateway), '{{operation}}' + ) + + $.stack + + { yaxes: $.yaxes('ops') }, + ) + .addPanel( + $.panel('Latency (getmulti)') + + $.latencyPanel( + 'thanos_memcached_operation_duration_seconds', + ||| + { + %s, + operation="getmulti", + component="store-gateway", + name="index-cache" + } + ||| % $.jobMatcher($._config.job_names.store_gateway) + ) + ) + .addPanel( + $.panel('Hit ratio') + + $.queryPanel( + ||| + sum by(item_type) ( + rate( + thanos_store_index_cache_hits_total{ + component="store-gateway", + %s + }[$__rate_interval] + ) + ) + / + sum by(item_type) ( + rate( + thanos_store_index_cache_requests_total{ + component="store-gateway", + %s + }[$__rate_interval] + ) + ) + ||| % [ + $.jobMatcher($._config.job_names.store_gateway), + $.jobMatcher($._config.job_names.store_gateway), + ], + '{{item_type}}' + ) + + { yaxes: $.yaxes('percentunit') } + + $.panelDescription( + 'Hit Ratio', + ||| + Even if you do not set up memcached for the blocks index cache, you will still see data in this panel because Cortex by default has an + in-memory blocks index cache. 
+ ||| + ), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.thanosMemcachedCache( + 'Memcached – Blocks storage – Chunks cache (store-gateway accesses)', + $._config.job_names.store_gateway, + 'store-gateway', + 'chunks-cache' + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.thanosMemcachedCache( + 'Memcached – Blocks storage – Metadata cache (store-gateway accesses)', + $._config.job_names.store_gateway, + 'store-gateway', + 'metadata-cache' + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.thanosMemcachedCache( + 'Memcached – Blocks storage – Metadata cache (querier accesses)', + $._config.job_names.querier, + 'querier', + 'metadata-cache' + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_index_backend + $._config.chunk_store_backend, 'cassandra'), + $.row('Cassandra') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_cassandra_request_duration_seconds_count{%s, operation="SELECT"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_cassandra_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.eq('operation', 'SELECT')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_index_backend + $._config.chunk_store_backend, 'bigtable'), + $.row('BigTable') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_bigtable_request_duration_seconds_count{%s, operation="/google.bigtable.v2.Bigtable/ReadRows"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_bigtable_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.eq('operation', '/google.bigtable.v2.Bigtable/ReadRows')]) + ), + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_index_backend + $._config.chunk_store_backend, 'dynamodb'), + $.row('DynamoDB') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_dynamo_request_duration_seconds_count{%s, operation="DynamoDB.QueryPages"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_dynamo_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.eq('operation', 'DynamoDB.QueryPages')]) + ), + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_store_backend, 'gcs'), + $.row('GCS') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_gcs_request_duration_seconds_count{%s, operation="GET"}' % $.jobMatcher($._config.job_names.querier)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_gcs_request_duration_seconds', $.jobSelector($._config.job_names.querier) + [utils.selector.eq('operation', 'GET')]) + ) + ) + // Object store metrics for the store-gateway. + .addRowsIf( + std.member($._config.storage_engine, 'blocks'), + $.getObjectStoreRows('Blocks Object Store (Store-gateway accesses)', 'store-gateway') + ) + // Object store metrics for the querier. 
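+ // getObjectStoreRows() (defined with the dashboard helpers above) returns two rows of
+ // thanos_objstore_bucket_* panels: operations/sec, error rate and per-operation latency,
+ // filtered on the given component label.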
+ .addRowsIf( + std.member($._config.storage_engine, 'blocks'), + $.getObjectStoreRows('Blocks Object Store (Querier accesses)', 'querier') + ), +} diff --git a/operations/mimir-mixin/dashboards/rollout-progress.libsonnet b/operations/mimir-mixin/dashboards/rollout-progress.libsonnet new file mode 100644 index 00000000000..16c54095570 --- /dev/null +++ b/operations/mimir-mixin/dashboards/rollout-progress.libsonnet @@ -0,0 +1,316 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + local config = { + namespace_matcher: $.namespaceMatcher(), + gateway_job_matcher: $.jobMatcher($._config.job_names.gateway), + gateway_write_routes_regex: 'api_(v1|prom)_push', + gateway_read_routes_regex: '(prometheus|api_prom)_api_v1_.+', + all_services_regex: std.join('|', ['cortex-gw', 'distributor', 'ingester.*', 'query-frontend.*', 'query-scheduler.*', 'querier.*', 'compactor', 'store-gateway', 'ruler', 'alertmanager']), + }, + + 'cortex-rollout-progress.json': + ($.dashboard('Cortex / Rollout progress') + { uid: '7544a3a62b1be6ffd919fc990ab8ba8f' }) + .addClusterSelectorTemplates(false) + { + // This dashboard uses the new grid system in order to place panels (using gridPos). + // Because of this we can't use the mixin's addRow() and addPanel(). + schemaVersion: 27, + rows: null, + panels: [ + // + // Rollout progress + // + $.panel('Rollout progress') + + $.barGauge([ + // Multi-zone deployments are grouped together removing the "zone-X" suffix. + // After the grouping, the resulting label is called "cortex_service". + ||| + ( + sum by(cortex_service) ( + label_replace( + kube_statefulset_status_replicas_updated{%(namespace_matcher)s,statefulset=~"%(all_services_regex)s"}, + "cortex_service", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?" + ) + ) + / + sum by(cortex_service) ( + label_replace( + kube_statefulset_replicas{%(namespace_matcher)s}, + "cortex_service", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?" + ) + ) + ) and ( + sum by(cortex_service) ( + label_replace( + kube_statefulset_replicas{%(namespace_matcher)s}, + "cortex_service", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?" + ) + ) + > 0 + ) + ||| % config, + ||| + ( + sum by(cortex_service) ( + label_replace( + kube_deployment_status_replicas_updated{%(namespace_matcher)s,deployment=~"%(all_services_regex)s"}, + "cortex_service", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + / + sum by(cortex_service) ( + label_replace( + kube_deployment_spec_replicas{%(namespace_matcher)s}, + "cortex_service", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + ) and ( + sum by(cortex_service) ( + label_replace( + kube_deployment_spec_replicas{%(namespace_matcher)s}, + "cortex_service", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" 
+ ) + ) + > 0 + ) + ||| % config, + ], legends=[ + '{{cortex_service}}', + '{{cortex_service}}', + ], thresholds=[ + { color: 'yellow', value: null }, + { color: 'yellow', value: 0.999 }, + { color: 'green', value: 1 }, + ], unit='percentunit', min=0, max=1) + { + id: 1, + gridPos: { h: 8, w: 10, x: 0, y: 0 }, + }, + + // + // Writes + // + $.panel('Writes - 2xx') + + $.newStatPanel(||| + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s",status_code=~"2.+"}[$__rate_interval])) / + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s"}[$__rate_interval])) + ||| % config, thresholds=[ + { color: 'green', value: null }, + ]) + { + id: 2, + gridPos: { h: 4, w: 2, x: 10, y: 0 }, + }, + + $.panel('Writes - 4xx') + + $.newStatPanel(||| + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s",status_code=~"4.+"}[$__rate_interval])) / + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s"}[$__rate_interval])) + ||| % config, thresholds=[ + { color: 'green', value: null }, + { color: 'orange', value: 0.2 }, + { color: 'red', value: 0.5 }, + ]) + { + id: 3, + gridPos: { h: 4, w: 2, x: 12, y: 0 }, + }, + + $.panel('Writes - 5xx') + + $.newStatPanel(||| + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s",status_code=~"5.+"}[$__rate_interval])) / + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s"}[$__rate_interval])) + ||| % config, thresholds=[ + { color: 'green', value: null }, + { color: 'red', value: 0.01 }, + ]) + { + id: 4, + gridPos: { h: 4, w: 2, x: 14, y: 0 }, + }, + + $.panel('Writes 99th Latency') + + $.newStatPanel(||| + histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s"})) + ||| % config, unit='s', thresholds=[ + { color: 'green', value: null }, + { color: 'orange', value: 0.2 }, + { color: 'red', value: 0.5 }, + ]) + { + id: 5, + gridPos: { h: 4, w: 8, x: 16, y: 0 }, + }, + + // + // Reads + // + $.panel('Reads - 2xx') + + $.newStatPanel(||| + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s",status_code=~"2.+"}[$__rate_interval])) / + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s"}[$__rate_interval])) + ||| % config, thresholds=[ + { color: 'green', value: null }, + ]) + { + id: 6, + gridPos: { h: 4, w: 2, x: 10, y: 4 }, + }, + + $.panel('Reads - 4xx') + + $.newStatPanel(||| + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s",status_code=~"4.+"}[$__rate_interval])) / + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s"}[$__rate_interval])) + ||| % config, thresholds=[ + { color: 'green', value: null }, + { color: 'orange', value: 0.01 }, + { color: 'red', value: 0.05 }, + ]) + { + id: 7, + gridPos: { h: 4, w: 2, x: 12, y: 4 }, + }, + + $.panel('Reads - 5xx') + + $.newStatPanel(||| + sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s",status_code=~"5.+"}[$__rate_interval])) / + 
sum(rate(cortex_request_duration_seconds_count{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s"}[$__rate_interval])) + ||| % config, thresholds=[ + { color: 'green', value: null }, + { color: 'red', value: 0.01 }, + ]) + { + id: 8, + gridPos: { h: 4, w: 2, x: 14, y: 4 }, + }, + + $.panel('Reads 99th Latency') + + $.newStatPanel(||| + histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s"})) + ||| % config, unit='s', thresholds=[ + { color: 'green', value: null }, + { color: 'orange', value: 1 }, + { color: 'red', value: 2.5 }, + ]) + { + id: 9, + gridPos: { h: 4, w: 8, x: 16, y: 4 }, + }, + + // + // Unhealthy pods + // + $.panel('Unhealthy pods') + + $.newStatPanel([ + ||| + kube_deployment_status_replicas_unavailable{%(namespace_matcher)s, deployment=~"%(all_services_regex)s"} + > 0 + ||| % config, + ||| + kube_statefulset_status_replicas_current{%(namespace_matcher)s, statefulset=~"%(all_services_regex)s"} - + kube_statefulset_status_replicas_ready {%(namespace_matcher)s, statefulset=~"%(all_services_regex)s"} + > 0 + ||| % config, + ], legends=[ + '{{deployment}}', + '{{statefulset}}', + ], thresholds=[ + { color: 'green', value: null }, + { color: 'orange', value: 1 }, + { color: 'red', value: 2 }, + ], instant=true, novalue='All healthy', unit='short', decimals=0) + { + options: { + text: { + // Small font size since we may have many entries during a rollout. + titleSize: 14, + valueSize: 14, + }, + }, + id: 10, + gridPos: { h: 8, w: 10, x: 0, y: 8 }, + }, + + // + // Versions + // + { + title: 'Pods count per Version', + type: 'table', + datasource: '$datasource', + + targets: [ + { + expr: ||| + count by(container, version) ( + label_replace( + kube_pod_container_info{%(namespace_matcher)s,container=~"%(all_services_regex)s"}, + "version", "$1", "image", ".*:(.+)-.*" + ) + ) + ||| % config, + instant: true, + legendFormat: '', + refId: 'A', + }, + ], + + fieldConfig: { + overrides: [ + { + // Center align the version. + matcher: { id: 'byRegexp', options: 'r.*' }, + properties: [{ id: 'custom.align', value: 'center' }], + }, + ], + }, + + transformations: [ + { + // Transform the version label to a field. + id: 'labelsToFields', + options: { valueLabel: 'version' }, + }, + { + // Hide time. + id: 'organize', + options: { excludeByName: { Time: true } }, + }, + { + // Sort by container. 
+ id: 'sortBy', + options: { fields: {}, sort: [{ field: 'container' }] }, + }, + ], + + id: 11, + gridPos: { h: 8, w: 6, x: 10, y: 8 }, + }, + + // + // Performance comparison with 24h ago + // + $.panel('Latency vs 24h ago') + + $.queryPanel([||| + 1 - ( + avg_over_time(histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s"} offset 24h))[1h:]) + / + avg_over_time(histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%(gateway_job_matcher)s, route=~"%(gateway_write_routes_regex)s"}))[1h:]) + ) + ||| % config, ||| + 1 - ( + avg_over_time(histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s"} offset 24h))[1h:]) + / + avg_over_time(histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%(gateway_job_matcher)s, route=~"%(gateway_read_routes_regex)s"}))[1h:]) + ) + ||| % config], ['writes', 'reads']) + { + yaxes: $.yaxes({ + format: 'percentunit', + min: null, // Can be negative. + }), + + id: 12, + gridPos: { h: 8, w: 8, x: 16, y: 8 }, + }, + ], + + templating+: { + list: [ + // Do not allow to include all clusters/namespaces cause this dashboard is designed to show + // 1 cluster at a time. + l + (if (l.name == 'cluster' || l.name == 'namespace') then { includeAll: false } else {}) + for l in super.list + ], + }, + }, +} diff --git a/operations/mimir-mixin/dashboards/ruler.libsonnet b/operations/mimir-mixin/dashboards/ruler.libsonnet new file mode 100644 index 00000000000..bfa231b7c20 --- /dev/null +++ b/operations/mimir-mixin/dashboards/ruler.libsonnet @@ -0,0 +1,255 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + local ruler_config_api_routes_re = 'api_prom_rules.*|api_prom_api_v1_(rules|alerts)', + + rulerQueries+:: { + ruleEvaluations: { + success: + ||| + sum(rate(cortex_prometheus_rule_evaluations_total{%s}[$__rate_interval])) + - + sum(rate(cortex_prometheus_rule_evaluation_failures_total{%s}[$__rate_interval])) + |||, + failure: 'sum(rate(cortex_prometheus_rule_evaluation_failures_total{%s}[$__rate_interval]))', + latency: + ||| + sum (rate(cortex_prometheus_rule_evaluation_duration_seconds_sum{%s}[$__rate_interval])) + / + sum (rate(cortex_prometheus_rule_evaluation_duration_seconds_count{%s}[$__rate_interval])) + |||, + }, + perUserPerGroupEvaluations: { + failure: 'sum by(rule_group) (rate(cortex_prometheus_rule_evaluation_failures_total{%s}[$__rate_interval])) > 0', + latency: + ||| + sum by(user) (rate(cortex_prometheus_rule_evaluation_duration_seconds_sum{%s}[$__rate_interval])) + / + sum by(user) (rate(cortex_prometheus_rule_evaluation_duration_seconds_count{%s}[$__rate_interval])) + |||, + }, + groupEvaluations: { + missedIterations: 'sum by(user) (rate(cortex_prometheus_rule_group_iterations_missed_total{%s}[$__rate_interval])) > 0', + latency: + ||| + rate(cortex_prometheus_rule_group_duration_seconds_sum{%s}[$__rate_interval]) + / + rate(cortex_prometheus_rule_group_duration_seconds_count{%s}[$__rate_interval]) + |||, + }, + notifications: { + failure: + ||| + sum by(user) (rate(cortex_prometheus_notifications_errors_total{%s}[$__rate_interval])) + / + sum by(user) (rate(cortex_prometheus_notifications_sent_total{%s}[$__rate_interval])) + > 0 + |||, + queue: + ||| + sum by(user) 
(rate(cortex_prometheus_notifications_queue_length{%s}[$__rate_interval])) + / + sum by(user) (rate(cortex_prometheus_notifications_queue_capacity{%s}[$__rate_interval])) > 0 + |||, + dropped: + ||| + sum by (user) (increase(cortex_prometheus_notifications_dropped_total{%s}[$__rate_interval])) > 0 + |||, + }, + }, + + 'ruler.json': + ($.dashboard('Cortex / Ruler') + { uid: '44d12bcb1f95661c6ab6bc946dfc3473' }) + .addClusterSelectorTemplates() + .addRow( + ($.row('Headlines') + { + height: '100px', + showTitle: false, + }) + .addPanel( + $.panel('Active Configurations') + + $.statPanel('sum(cortex_ruler_managers_total{%s})' % $.jobMatcher('ruler'), format='short') + ) + .addPanel( + $.panel('Total Rules') + + $.statPanel('sum(cortex_prometheus_rule_group_rules{%s})' % $.jobMatcher('ruler'), format='short') + ) + .addPanel( + $.panel('Read from Ingesters - QPS') + + $.statPanel('sum(rate(cortex_ingester_client_request_duration_seconds_count{%s, operation="/cortex.Ingester/QueryStream"}[5m]))' % $.jobMatcher('ruler'), format='reqps') + ) + .addPanel( + $.panel('Write to Ingesters - QPS') + + $.statPanel('sum(rate(cortex_ingester_client_request_duration_seconds_count{%s, operation="/cortex.Ingester/Push"}[5m]))' % $.jobMatcher('ruler'), format='reqps') + ) + ) + .addRow( + $.row('Rule Evaluations Global') + .addPanel( + $.panel('EPS') + + $.queryPanel( + [ + $.rulerQueries.ruleEvaluations.success % [$.jobMatcher('ruler'), $.jobMatcher('ruler')], + $.rulerQueries.ruleEvaluations.failure % $.jobMatcher('ruler'), + ], + ['success', 'failed'], + ), + ) + .addPanel( + $.panel('Latency') + + $.queryPanel( + $.rulerQueries.ruleEvaluations.latency % [$.jobMatcher('ruler'), $.jobMatcher('ruler')], + 'average' + ), + ) + ) + .addRow( + $.row('Configuration API (gateway)') + .addPanel( + $.panel('QPS') + + $.qpsPanel('cortex_request_duration_seconds_count{%s, route=~"%s"}' % [$.jobMatcher($._config.job_names.gateway), ruler_config_api_routes_re]) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.gateway) + [utils.selector.re('route', ruler_config_api_routes_re)]) + ) + .addPanel( + $.panel('Per route p99 Latency') + + $.queryPanel( + 'histogram_quantile(0.99, sum by (route, le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{%s, route=~"%s"}))' % [$.jobMatcher($._config.job_names.gateway), ruler_config_api_routes_re], + '{{ route }}' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Writes (Ingesters)') + .addPanel( + $.panel('QPS') + + $.qpsPanel('cortex_ingester_client_request_duration_seconds_count{%s, operation="/cortex.Ingester/Push"}' % $.jobMatcher('ruler')) + ) + .addPanel( + $.panel('Latency') + + $.latencyPanel('cortex_ingester_client_request_duration_seconds', '{%s, operation="/cortex.Ingester/Push"}' % $.jobMatcher('ruler')) + ) + ) + .addRow( + $.row('Reads (Ingesters)') + .addPanel( + $.panel('QPS') + + $.qpsPanel('cortex_ingester_client_request_duration_seconds_count{%s, operation="/cortex.Ingester/QueryStream"}' % $.jobMatcher('ruler')) + ) + .addPanel( + $.panel('Latency') + + $.latencyPanel('cortex_ingester_client_request_duration_seconds', '{%s, operation="/cortex.Ingester/QueryStream"}' % $.jobMatcher('ruler')) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Ruler - Chunks storage - Index Cache') + .addPanel( + $.panel('Total entries') + + $.queryPanel('sum(querier_cache_added_new_total{cache="store.index-cache-read.fifocache",%s}) - 
sum(querier_cache_evicted_total{cache="store.index-cache-read.fifocache",%s})' % [$.jobMatcher($._config.job_names.ruler), $.jobMatcher($._config.job_names.ruler)], 'Entries'), + ) + .addPanel( + $.panel('Cache Hit %') + + $.queryPanel('(sum(rate(querier_cache_gets_total{cache="store.index-cache-read.fifocache",%s}[1m])) - sum(rate(querier_cache_misses_total{cache="store.index-cache-read.fifocache",%s}[1m]))) / sum(rate(querier_cache_gets_total{cache="store.index-cache-read.fifocache",%s}[1m]))' % [$.jobMatcher($._config.job_names.ruler), $.jobMatcher($._config.job_names.ruler), $.jobMatcher($._config.job_names.ruler)], 'hit rate') + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) }, + ) + .addPanel( + $.panel('Churn Rate') + + $.queryPanel('sum(rate(querier_cache_evicted_total{cache="store.index-cache-read.fifocache",%s}[1m]))' % $.jobMatcher($._config.job_names.ruler), 'churn rate'), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Ruler - Chunks storage - Store') + .addPanel( + $.panel('Index Lookups per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_index_lookups_per_query', $.jobSelector($._config.job_names.ruler), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Series (pre-intersection) per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_series_pre_intersection_per_query', $.jobSelector($._config.job_names.ruler), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Series (post-intersection) per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_series_post_intersection_per_query', $.jobSelector($._config.job_names.ruler), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Chunks per Query') + + utils.latencyRecordingRulePanel('cortex_chunk_store_chunks_per_query', $.jobSelector($._config.job_names.ruler), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Ruler - Blocks storage') + .addPanel( + $.panel('Number of store-gateways hit per Query') + + $.latencyPanel('cortex_querier_storegateway_instances_hit_per_query', '{%s}' % $.jobMatcher($._config.job_names.ruler), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Refetches of missing blocks per Query') + + $.latencyPanel('cortex_querier_storegateway_refetches_per_query', '{%s}' % $.jobMatcher($._config.job_names.ruler), multiplier=1) + + { yaxes: $.yaxes('short') }, + ) + .addPanel( + $.panel('Consistency checks failed') + + $.queryPanel('sum(rate(cortex_querier_blocks_consistency_checks_failed_total{%s}[1m])) / sum(rate(cortex_querier_blocks_consistency_checks_total{%s}[1m]))' % [$.jobMatcher($._config.job_names.ruler), $.jobMatcher($._config.job_names.ruler)], 'Failure Rate') + + { yaxes: $.yaxes({ format: 'percentunit', max: 1 }) }, + ) + ) + .addRow( + $.row('Notifications') + .addPanel( + $.panel('Delivery Errors') + + $.queryPanel($.rulerQueries.notifications.failure % [$.jobMatcher('ruler'), $.jobMatcher('ruler')], '{{ user }}') + ) + .addPanel( + $.panel('Queue Length') + + $.queryPanel($.rulerQueries.notifications.queue % [$.jobMatcher('ruler'), $.jobMatcher('ruler')], '{{ user }}') + ) + .addPanel( + $.panel('Dropped') + + $.queryPanel($.rulerQueries.notifications.dropped % $.jobMatcher('ruler'), '{{ user }}') + ) + ) + .addRow( + ($.row('Group Evaluations') + { collapse: true }) + .addPanel( + $.panel('Missed Iterations') + + 
$.queryPanel($.rulerQueries.groupEvaluations.missedIterations % $.jobMatcher('ruler'), '{{ user }}'), + ) + .addPanel( + $.panel('Latency') + + $.queryPanel( + $.rulerQueries.groupEvaluations.latency % [$.jobMatcher('ruler'), $.jobMatcher('ruler')], + '{{ user }}' + ), + ) + .addPanel( + $.panel('Failures') + + $.queryPanel( + $.rulerQueries.perUserPerGroupEvaluations.failure % [$.jobMatcher('ruler')], '{{ rule_group }}' + ) + ) + ) + .addRow( + ($.row('Rule Evaluation per User') + { collapse: true }) + .addPanel( + $.panel('Latency') + + $.queryPanel( + $.rulerQueries.perUserPerGroupEvaluations.latency % [$.jobMatcher('ruler'), $.jobMatcher('ruler')], + '{{ user }}' + ) + ) + ) + .addRows( + $.getObjectStoreRows('Ruler Configuration Object Store (Ruler accesses)', 'ruler-storage') + ), +} diff --git a/operations/mimir-mixin/dashboards/scaling.libsonnet b/operations/mimir-mixin/dashboards/scaling.libsonnet new file mode 100644 index 00000000000..a01a7db304e --- /dev/null +++ b/operations/mimir-mixin/dashboards/scaling.libsonnet @@ -0,0 +1,60 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + + 'cortex-scaling.json': + ($.dashboard('Cortex / Scaling') + { uid: '88c041017b96856c9176e07cf557bdcf' }) + .addClusterSelectorTemplates() + .addRow( + ($.row('Cortex Service Scaling') + { height: '200px' }) + .addPanel({ + type: 'text', + title: '', + options: { + content: ||| + This dashboard shows any services which are not scaled correctly. + The table below gives the required number of replicas and the reason why. + We only show services without enough replicas. + + Reasons: + - **sample_rate**: There are not enough replicas to handle the + sample rate. Applies to distributor and ingesters. + - **active_series**: There are not enough replicas + to handle the number of active series. Applies to ingesters. + - **cpu_usage**: There are not enough replicas + based on the CPU usage of the jobs vs the resource requests. + Applies to all jobs. + - **memory_usage**: There are not enough replicas based on the memory + usage vs the resource requests. Applies to all jobs. + - **active_series_limits**: There are not enough replicas to hold 60% of the + sum of all the per tenant series limits. + - **sample_rate_limits**: There are not enough replicas to handle 60% of the + sum of all the per tenant rate limits.
+ |||, + mode: 'markdown', + }, + }) + ) + .addRow( + ($.row('Scaling') + { height: '400px' }) + .addPanel( + $.panel('Workload-based scaling') + { sort: { col: 0, desc: false } } + + $.tablePanel([ + ||| + sort_desc( + cluster_namespace_deployment_reason:required_replicas:count{cluster=~"$cluster", namespace=~"$namespace"} + > ignoring(reason) group_left + cluster_namespace_deployment:actual_replicas:count{cluster=~"$cluster", namespace=~"$namespace"} + ) + |||, + ], { + __name__: { alias: 'Cluster', type: 'hidden' }, + cluster: { alias: 'Cluster' }, + namespace: { alias: 'Namespace' }, + deployment: { alias: 'Service' }, + reason: { alias: 'Reason' }, + Value: { alias: 'Required Replicas', decimals: 0 }, + }) + ) + ), +} diff --git a/operations/mimir-mixin/dashboards/slow-queries.libsonnet b/operations/mimir-mixin/dashboards/slow-queries.libsonnet new file mode 100644 index 00000000000..90916facdc8 --- /dev/null +++ b/operations/mimir-mixin/dashboards/slow-queries.libsonnet @@ -0,0 +1,185 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-slow-queries.json': + ($.dashboard('Cortex / Slow Queries') + { uid: 'e6f3091e29d2636e3b8393447e925668' }) + .addClusterSelectorTemplates(false) + .addRow( + $.row('') + .addPanel( + { + title: 'Slow queries', + type: 'table', + datasource: '${lokidatasource}', + + // Query logs from Loki. + targets: [ + { + // Filter out the remote read endpoint. + expr: '{cluster=~"$cluster",namespace=~"$namespace",name=~"query-frontend.*"} |= "query stats" != "/api/v1/read" | logfmt | org_id=~"${tenant_id}" | response_time > ${min_duration}', + instant: false, + legendFormat: '', + range: true, + refId: 'A', + }, + ], + + // Use Grafana transformations to display fields in a table. + transformations: [ + { + // Convert labels to fields. + id: 'labelsToFields', + options: {}, + }, + { + // Compute the query time range. + id: 'calculateField', + options: { + alias: 'Time range', + mode: 'binary', + binary: { + left: 'param_end', + operator: '-', + reducer: 'sum', + right: 'param_start', + }, + reduce: { reducer: 'sum' }, + replaceFields: false, + }, + }, + { + id: 'organize', + options: { + // Hide fields we don't care. + local hiddenFields = ['caller', 'cluster', 'container', 'host', 'id', 'job', 'level', 'line', 'method', 'msg', 'name', 'namespace', 'param_end', 'param_start', 'param_time', 'path', 'pod', 'pod_template_hash', 'query_wall_time_seconds', 'stream', 'traceID', 'tsNs'], + + excludeByName: { + [field]: true + for field in hiddenFields + }, + + // Order fields. + local orderedFields = ['ts', 'org_id', 'param_query', 'Time range', 'param_step', 'response_time'], + + indexByName: { + [orderedFields[i]]: i + for i in std.range(0, std.length(orderedFields) - 1) + }, + + // Rename fields. + renameByName: { + org_id: 'Tenant ID', + param_query: 'Query', + param_step: 'Step', + response_time: 'Duration', + }, + }, + }, + ], + + fieldConfig: { + // Configure overrides to nicely format field values. + overrides: [ + { + matcher: { id: 'byName', options: 'Time range' }, + properties: [ + { + id: 'mappings', + value: [ + { + from: '', + id: 1, + text: 'Instant query', + to: '', + type: 1, + value: '0', + }, + ], + }, + { id: 'unit', value: 's' }, + ], + }, + { + matcher: { id: 'byName', options: 'Step' }, + properties: [{ id: 'unit', value: 's' }], + }, + ], + }, + }, + ) + ) + + { + templating+: { + list+: [ + // Add the Loki datasource. 
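+ // The Loki query above is driven by three dashboard variables defined below:
+ // ${lokidatasource} selects which Loki datasource to search, while ${min_duration}
+ // and ${tenant_id} narrow the results by response time and org_id respectively.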
+ { + type: 'datasource', + name: 'lokidatasource', + label: 'Logs datasource', + query: 'loki', + hide: 0, + includeAll: false, + multi: false, + }, + // Add a variable to configure the min duration. + { + local defaultValue = '5s', + + type: 'textbox', + name: 'min_duration', + label: 'Min duration', + hide: 0, + options: [ + { + selected: true, + text: defaultValue, + value: defaultValue, + }, + ], + current: { + // Default value. + selected: true, + text: defaultValue, + value: defaultValue, + }, + query: defaultValue, + }, + // Add a variable to configure the tenant to filter on. + { + local defaultValue = '.*', + + type: 'textbox', + name: 'tenant_id', + label: 'Tenant ID', + hide: 0, + options: [ + { + selected: true, + text: defaultValue, + value: defaultValue, + }, + ], + current: { + // Default value. + selected: true, + text: defaultValue, + value: defaultValue, + }, + query: defaultValue, + }, + ], + }, + } + { + templating+: { + list: [ + // Do not allow to include all clusters/namespaces otherwise this dashboard + // risks to explode because it shows resources per pod. + l + (if (l.name == 'cluster' || l.name == 'namespace') then { includeAll: false } else {}) + for l in super.list + ], + }, + } + { + // No auto-refresh by default. + refresh: '', + }, +} diff --git a/operations/mimir-mixin/dashboards/writes-resources.libsonnet b/operations/mimir-mixin/dashboards/writes-resources.libsonnet new file mode 100644 index 00000000000..64f83ef1cca --- /dev/null +++ b/operations/mimir-mixin/dashboards/writes-resources.libsonnet @@ -0,0 +1,78 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-writes-resources.json': + ($.dashboard('Cortex / Writes Resources') + { uid: 'c0464f0d8bd026f776c9006b0591bb0b' }) + .addClusterSelectorTemplates(false) + .addRow( + $.row('Gateway') + .addPanel( + $.containerCPUUsagePanel('CPU', $._config.job_names.gateway), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', $._config.job_names.gateway), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.gateway), + ) + ) + .addRow( + $.row('Distributor') + .addPanel( + $.containerCPUUsagePanel('CPU', 'distributor'), + ) + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'distributor'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.distributor), + ) + ) + .addRow( + $.row('Ingester') + .addPanel( + $.panel('In-memory series') + + $.queryPanel( + 'sum by(%s) (cortex_ingester_memory_series{%s})' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.ingester)], + '{{%s}}' % $._config.per_instance_label + ) + + { + tooltip: { sort: 2 }, // Sort descending. + }, + ) + .addPanel( + $.containerCPUUsagePanel('CPU', 'ingester'), + ) + ) + .addRow( + $.row('') + .addPanel( + $.containerMemoryWorkingSetPanel('Memory (workingset)', 'ingester'), + ) + .addPanel( + $.goHeapInUsePanel('Memory (go heap inuse)', $._config.job_names.ingester), + ) + ) + .addRow( + $.row('') + .addPanel( + $.containerDiskWritesPanel('Disk Writes', 'ingester') + ) + .addPanel( + $.containerDiskReadsPanel('Disk Reads', 'ingester') + ) + .addPanel( + $.containerDiskSpaceUtilization('Disk Space Utilization', 'ingester'), + ) + ) + + { + templating+: { + list: [ + // Do not allow to include all clusters/namespaces otherwise this dashboard + // risks to explode because it shows resources per pod. 
+ l + (if (l.name == 'cluster' || l.name == 'namespace') then { includeAll: false } else {}) + for l in super.list + ], + }, + }, +} diff --git a/operations/mimir-mixin/dashboards/writes.libsonnet b/operations/mimir-mixin/dashboards/writes.libsonnet new file mode 100644 index 00000000000..e99faee4c4e --- /dev/null +++ b/operations/mimir-mixin/dashboards/writes.libsonnet @@ -0,0 +1,327 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +(import 'dashboard-utils.libsonnet') { + 'cortex-writes.json': + ($.dashboard('Cortex / Writes') + { uid: '0156f6d15aa234d452a33a4f13c838e3' }) + .addClusterSelectorTemplates() + .addRowIf( + $._config.show_dashboard_descriptions.writes, + ($.row('Writes dashboard description') { height: '125px', showTitle: false }) + .addPanel( + $.textPanel('', ||| +

+ This dashboard shows various health metrics for the Cortex write path. + It is broken into sections for each service on the write path, + and organized by the order in which the write request flows. +
+ Incoming metrics data travels from the gateway → distributor → ingester. +
+ For each service, there are 3 panels showing + (1) requests per second to that service, + (2) average, median, and p99 latency of requests to that service, and + (3) p99 latency of requests to each instance of that service. +

+

+ It also includes metrics for the key-value (KV) stores used to manage + the high-availability tracker and the ingesters. +

+ |||), + ) + ).addRow( + ($.row('Headlines') + + { + height: '100px', + showTitle: false, + }) + .addPanel( + $.panel('Samples / sec') + + $.statPanel( + 'sum(%(group_prefix_jobs)s:cortex_distributor_received_samples:rate5m{%(job)s})' % ( + $._config { + job: $.jobMatcher($._config.job_names.distributor), + } + ), + format='short' + ) + ) + .addPanel( + $.panel('Active Series') + + $.statPanel(||| + sum(cortex_ingester_memory_series{%(ingester)s} + / on(%(group_by_cluster)s) group_left + max by (%(group_by_cluster)s) (cortex_distributor_replication_factor{%(distributor)s})) + ||| % ($._config) { + ingester: $.jobMatcher($._config.job_names.ingester), + distributor: $.jobMatcher($._config.job_names.distributor), + }, format='short') + ) + .addPanel( + $.panel('Tenants') + + $.statPanel('count(count by(user) (cortex_ingester_active_series{%s}))' % $.jobMatcher($._config.job_names.ingester), format='short') + ) + .addPanel( + $.panel('Requests / sec') + + $.statPanel('sum(rate(cortex_request_duration_seconds_count{%s, route=~"api_(v1|prom)_push"}[5m]))' % $.jobMatcher($._config.job_names.gateway), format='reqps') + ) + ) + .addRow( + $.row('Gateway') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s, route=~"api_(v1|prom)_push"}' % $.jobMatcher($._config.job_names.gateway)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.gateway) + [utils.selector.re('route', 'api_(v1|prom)_push')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route=~"api_(v1|prom)_push"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.gateway)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Distributor') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s, route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push"}' % $.jobMatcher($._config.job_names.distributor)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.distributor) + [utils.selector.re('route', '/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.distributor)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Key-value store for high-availability (HA) deduplication') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_kv_request_duration_seconds_count{%s}' % $.jobMatcher($._config.job_names.distributor)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_kv_request_duration_seconds', $.jobSelector($._config.job_names.distributor)) + ) + ) + .addRow( + $.row('Ingester') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_request_duration_seconds_count{%s,route="/cortex.Ingester/Push"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + 
utils.latencyRecordingRulePanel('cortex_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('route', '/cortex.Ingester/Push')]) + ) + .addPanel( + $.panel('Per %s p99 Latency' % $._config.per_instance_label) + + $.hiddenLegendQueryPanel( + 'histogram_quantile(0.99, sum by(le, %s) (rate(cortex_request_duration_seconds_bucket{%s, route="/cortex.Ingester/Push"}[$__rate_interval])))' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.ingester)], '' + ) + + { yaxes: $.yaxes('s') } + ) + ) + .addRow( + $.row('Key-value store for the ingesters ring') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_kv_request_duration_seconds_count{%s}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_kv_request_duration_seconds', $.jobSelector($._config.job_names.ingester)) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks'), + $.row('Memcached') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_memcache_request_duration_seconds_count{%s,method="Memcache.Put"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_memcache_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('method', 'Memcache.Put')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_index_backend + $._config.chunk_store_backend, 'cassandra'), + $.row('Cassandra') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_cassandra_request_duration_seconds_count{%s, operation="INSERT"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_cassandra_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('operation', 'INSERT')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_index_backend + $._config.chunk_store_backend, 'bigtable'), + $.row('BigTable') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_bigtable_request_duration_seconds_count{%s, operation="/google.bigtable.v2.Bigtable/MutateRows"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_bigtable_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('operation', '/google.bigtable.v2.Bigtable/MutateRows')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_index_backend + $._config.chunk_store_backend, 'dynamodb'), + $.row('DynamoDB') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_dynamo_request_duration_seconds_count{%s, operation="DynamoDB.BatchWriteItem"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + utils.latencyRecordingRulePanel('cortex_dynamo_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('operation', 'DynamoDB.BatchWriteItem')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'chunks') && + std.member($._config.chunk_store_backend, 'gcs'), + $.row('GCS') + .addPanel( + $.panel('Requests / sec') + + $.qpsPanel('cortex_gcs_request_duration_seconds_count{%s, operation="POST"}' % $.jobMatcher($._config.job_names.ingester)) + ) + .addPanel( + $.panel('Latency') + + 
utils.latencyRecordingRulePanel('cortex_gcs_request_duration_seconds', $.jobSelector($._config.job_names.ingester) + [utils.selector.eq('operation', 'POST')]) + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Ingester - Blocks storage - Shipper') + .addPanel( + $.successFailurePanel( + 'Uploaded blocks / sec', + 'sum(rate(cortex_ingester_shipper_uploads_total{%s}[$__rate_interval])) - sum(rate(cortex_ingester_shipper_upload_failures_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.ingester), $.jobMatcher($._config.job_names.ingester)], + 'sum(rate(cortex_ingester_shipper_upload_failures_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), + ) + + $.panelDescription( + 'Uploaded blocks / sec', + ||| + The rate of blocks being uploaded from the ingesters + to object storage. + ||| + ), + ) + .addPanel( + $.panel('Upload latency') + + $.latencyPanel('thanos_objstore_bucket_operation_duration_seconds', '{%s,component="ingester",operation="upload"}' % $.jobMatcher($._config.job_names.ingester)) + + $.panelDescription( + 'Upload latency', + ||| + The average, median (50th percentile), and 99th percentile time + the ingesters take to upload blocks to object storage. + ||| + ), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Ingester - Blocks storage - TSDB Head') + .addPanel( + $.successFailurePanel( + 'Compactions / sec', + 'sum(rate(cortex_ingester_tsdb_compactions_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.ingester)], + 'sum(rate(cortex_ingester_tsdb_compactions_failed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), + ) + + $.panelDescription( + 'Compactions per second', + ||| + Ingesters maintain a local TSDB per-tenant on disk. Each TSDB maintains a head block for each + active time series; these blocks get periodically compacted (by default, every 2h). + This panel shows the rate of compaction operations across all TSDBs on all ingesters. + ||| + ), + ) + .addPanel( + $.panel('Compactions latency') + + $.latencyPanel('cortex_ingester_tsdb_compaction_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.ingester)) + + $.panelDescription( + 'Compaction latency', + ||| + The average, median (50th percentile), and 99th percentile time ingesters take to compact TSDB head blocks + on the local filesystem. + ||| + ), + ) + ) + .addRowIf( + std.member($._config.storage_engine, 'blocks'), + $.row('Ingester - Blocks storage - TSDB write ahead log (WAL)') + .addPanel( + $.successFailurePanel( + 'WAL truncations / sec', + 'sum(rate(cortex_ingester_tsdb_wal_truncations_total{%s}[$__rate_interval])) - sum(rate(cortex_ingester_tsdb_wal_truncations_failed_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.ingester), $.jobMatcher($._config.job_names.ingester)], + 'sum(rate(cortex_ingester_tsdb_wal_truncations_failed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), + ) + + $.panelDescription( + 'WAL truncations per second', + ||| + The WAL is truncated each time a new TSDB block is written. This panel measures the rate of + truncations. 
+ ||| + ), + ) + .addPanel( + $.successFailurePanel( + 'Checkpoints created / sec', + 'sum(rate(cortex_ingester_tsdb_checkpoint_creations_total{%s}[$__rate_interval])) - sum(rate(cortex_ingester_tsdb_checkpoint_creations_failed_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.ingester), $.jobMatcher($._config.job_names.ingester)], + 'sum(rate(cortex_ingester_tsdb_checkpoint_creations_failed_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), + ) + + $.panelDescription( + 'Checkpoints created per second', + ||| + Checkpoints are created as part of the WAL truncation process. + This metric measures the rate of checkpoint creation. + ||| + ), + ) + .addPanel( + $.panel('WAL truncations latency (includes checkpointing)') + + $.queryPanel('sum(rate(cortex_ingester_tsdb_wal_truncate_duration_seconds_sum{%s}[$__rate_interval])) / sum(rate(cortex_ingester_tsdb_wal_truncate_duration_seconds_count{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.ingester), $.jobMatcher($._config.job_names.ingester)], 'avg') + + { yaxes: $.yaxes('s') } + + $.panelDescription( + 'WAL truncations latency (including checkpointing)', + ||| + Average time taken to perform a full WAL truncation, + including the time taken for the checkpointing to complete. + ||| + ), + ) + .addPanel( + $.panel('Corruptions / sec') + + $.queryPanel([ + 'sum(rate(cortex_ingester_wal_corruptions_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), + 'sum(rate(cortex_ingester_tsdb_mmap_chunk_corruptions_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.ingester), + ], [ + 'WAL', + 'mmap-ed chunks', + ]) + + $.stack + { + yaxes: $.yaxes('ops'), + aliasColors: { + WAL: '#E24D42', + 'mmap-ed chunks': '#E28A42', + }, + }, + ) + ), +} diff --git a/operations/mimir-mixin/docs/playbooks.md b/operations/mimir-mixin/docs/playbooks.md new file mode 100644 index 00000000000..8534b0c14bd --- /dev/null +++ b/operations/mimir-mixin/docs/playbooks.md @@ -0,0 +1,1022 @@ +# Playbooks + +This document contains playbooks, or at least a checklist of what to look for, for alerts in the cortex-mixin and logs from Cortex. This document assumes that you are running a Cortex cluster: + +1. Using this mixin config +2. Using GCS as object store (but similar procedures apply to other backends) + +## Alerts + +### CortexIngesterRestarts + +First, check if the alert is for a single ingester or multiple. Even if the alert is only for one ingester, it's best to follow up by checking `kubectl get pods --namespace=` every few minutes, or looking at the query `rate(kube_pod_container_status_restarts_total{container="ingester"}[30m]) > 0` just until you're sure there isn't a larger issue causing multiple restarts. + +Next, check `kubectl get events`, with and without the addition of the `--namespace` flag, to look for node restarts or other related issues. Grep or something similar to filter the output can be useful here. The most common cause of this alert is a single cloud providers node restarting and causing the ingester on that node to be rescheduled somewhere else. 
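+
+For example, a rough way to narrow down the events output (the namespace value and the grep pattern below are only illustrative) is:
+
+```
+kubectl get events --namespace=<namespace> | grep -Ei 'node|evict|delete'
+```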
+ +In events you're looking for things like: + +``` +57m Normal NodeControllerEviction Pod Marking for deletion Pod ingester-01 from Node cloud-provider-node-01 +37m Normal SuccessfulDelete ReplicaSet (combined from similar events): Deleted pod: ingester-01 +32m Normal NodeNotReady Node Node cloud-provider-node-01 status is now: NodeNotReady +28m Normal DeletingAllPods Node Node cloud-provider-node-01 event: Deleting all Pods from Node cloud-provider-node-01. +``` + +If nothing obvious from the above, check for increased load: + +- If there is an increase in the number of active series and the memory provisioned is not enough, scale up the ingesters horizontally to have the same number of series as before per ingester. +- If we had an outage and once Cortex is back up, the incoming traffic increases. (or) The clients have their Prometheus remote-write lagging and starts to send samples at a higher rate (again, an increase in traffic but in terms of number of samples). Scale up the ingester horizontally in this case too. + +### CortexIngesterReachingSeriesLimit + +This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new series, while appending samples to existing ones will continue to succeed. + +In case of **emergency**: + +- If the actual number of series is very close to or already hit the limit, then you can increase the limit via runtime config to gain some time +- Increasing the limit will increase the ingesters' memory utilization. Please monitor the ingesters' memory utilization via the `Cortex / Writes Resources` dashboard + +How the limit is **configured**: + +- The limit can be configured either on CLI (`-ingester.instance-limits.max-series`) or in the runtime config: + ``` + ingester_limits: + max_series: + ``` +- The mixin configures the limit in the runtime config and can be fine-tuned via: + ``` + _config+:: { + ingester_instance_limits+:: { + max_series: + } + } + ``` +- When configured in the runtime config, changes are applied live without requiring an ingester restart +- The configured limit can be queried via `cortex_ingester_instance_limits{limit="max_series"}` + +How to **fix**: + +1. **Temporarily increase the limit**
+   If the actual number of series is very close to the limit or has already hit it, or if you foresee the ingester will hit the limit before the stale series are dropped as an effect of the scale-up, you should also temporarily increase the limit.
+2. **Check if shuffle-sharding shard size is correct**
+
+- When shuffle-sharding is enabled, we target up to 100K series / tenant / ingester, assuming tenants on average use 50% of their max series limit.
+- Run the following **instant query** to find tenants that may cause higher pressure on some ingesters:
+
+  ```
+  (
+    sum by(user) (cortex_ingester_memory_series_created_total{namespace=""}
+    -
+    cortex_ingester_memory_series_removed_total{namespace=""})
+  )
+  >
+  (
+    max by(user) (cortex_overrides{namespace="",limit_name="max_global_series_per_user"})
+    *
+    scalar(max(cortex_distributor_replication_factor{namespace=""}))
+    *
+    0.5
+  )
+  > 200000
+
+  # Uncomment the following to show only tenants belonging to a specific ingester's shard.
+  # and count by(user) (cortex_ingester_active_series{namespace="",pod="ingester-"})
+  ```
+
+- Check the current shard size of each tenant in the output and, if they're not already sharded across all ingesters, you may consider doubling their shard size.
+- The in-memory series in the ingesters will effectively be reduced at the next TSDB head compaction, which happens at least 1h after you increase the shard size for the affected tenants.
+
+3. **Scale up ingesters**
+ Scaling up ingesters will lower the number of series per ingester. However, the effect of this change will take up to 4h, because after the scale up we need to wait until all stale series are dropped from memory as the effect of TSDB head compaction, which could take up to 4h (with the default config, TSDB keeps in-memory series up to 3h old and it gets compacted every 2h). + +### CortexIngesterReachingTenantsLimit + +This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones. + +In case of **emergency**: + +- If the actual number of tenants is very close to or already hit the limit, then you can increase the limit via runtime config to gain some time +- Increasing the limit will increase the ingesters' memory utilization. Please monitor the ingesters' memory utilization via the `Cortex / Writes Resources` dashboard + +How the limit is **configured**: + +- The limit can be configured either on CLI (`-ingester.instance-limits.max-tenants`) or in the runtime config: + ``` + ingester_limits: + max_tenants: + ``` +- The mixin configures the limit in the runtime config and can be fine-tuned via: + ``` + _config+:: { + ingester_instance_limits+:: { + max_tenants: + } + } + ``` +- When configured in the runtime config, changes are applied live without requiring an ingester restart +- The configured limit can be queried via `cortex_ingester_instance_limits{limit="max_tenants"}` + +How to **fix**: + +1. Ensure shuffle-sharding is enabled in the Cortex cluster +1. Assuming shuffle-sharding is enabled, scaling up ingesters will lower the number of tenants per ingester. However, the effect of this change will be visible only after `-blocks-storage.tsdb.close-idle-tsdb-timeout` period so you may have to temporarily increase the limit + +### CortexRequestLatency + +This alert fires when a specific Cortex route is experiencing an high latency. + +The alert message includes both the Cortex service and route experiencing the high latency. Establish if the alert is about the read or write path based on that (see [Cortex routes by path](#cortex-routes-by-path)). + +#### Write Latency + +How to **investigate**: + +- Check the `Cortex / Writes` dashboard + - Looking at the dashboard you should see in which Cortex service the high latency originates + - The panels in the dashboard are vertically sorted by the network path (eg. cortex-gw -> distributor -> ingester) +- Deduce where in the stack the latency is being introduced + - **`cortex-gw`** + - The cortex-gw may need to be scaled up. Use the `Cortex / Scaling` dashboard to check for CPU usage vs requests. + - There could be a problem with authentication (eg. slow to run auth layer) + - **`distributor`** + - Typically, distributor p99 latency is in the range 50-100ms. If the distributor latency is higher than this, you may need to scale up the distributors. + - **`ingester`** + - Typically, ingester p99 latency is in the range 5-50ms. If the ingester latency is higher than this, you should investigate the root cause before scaling up ingesters. + - Check out the following alerts and fix them if firing: + - `CortexProvisioningTooManyActiveSeries` + - `CortexProvisioningTooManyWrites` + +#### Read Latency + +Query performance is a known issue. 
A query may be slow because of high cardinality, large time range and/or because not leveraging on cache (eg. querying series data not cached yet). When investigating this alert, you should check if it's caused by few slow queries or there's an operational / config issue to be fixed. + +How to **investigate**: + +- Check the `Cortex / Reads` dashboard + - Looking at the dashboard you should see in which Cortex service the high latency originates + - The panels in the dashboard are vertically sorted by the network path (eg. cortex-gw -> query-frontend -> query->scheduler -> querier -> store-gateway) +- Check the `Cortex / Slow Queries` dashboard to find out if it's caused by few slow queries +- Deduce where in the stack the latency is being introduced + - **`cortex-gw`** + - The cortex-gw may need to be scaled up. Use the `Cortex / Scaling` dashboard to check for CPU usage vs requests. + - There could be a problem with authentication (eg. slow to run auth layer) + - **`query-frontend`** + - The query-frontend may beed to be scaled up. If the Cortex cluster is running with the query-scheduler, the query-frontend can be scaled up with no side effects, otherwise the maximum number of query-frontend replicas should be the configured `-querier.worker-parallelism`. + - **`querier`** + - Look at slow queries traces to find out where it's slow. + - Typically, slowness either comes from running PromQL engine (`innerEval`) or fetching chunks from ingesters and/or store-gateways. + - If slowness comes from running PromQL engine, typically there's not much we can do. Scaling up queriers may help only if querier nodes are overloaded. + - If slowness comes from fetching chunks from ingesters and/or store-gateways you should investigate deeper on the root cause. Common causes: + - High CPU utilization in ingesters + - Scale up ingesters + - Low cache hit ratio in the store-gateways + - Check `Memcached Overview` dashboard + - If memcached eviction rate is high, then you should scale up memcached replicas. Check the recommendations by `Cortex / Scaling` dashboard and make reasonable adjustments as necessary. + - If memcached eviction rate is zero or very low, then it may be caused by "first time" queries + +### CortexRequestErrors + +This alert fires when the rate of 5xx errors of a specific route is > 1% for some time. + +This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one (but the investigation procedure is typically the same). + +How to **investigate**: + +- Check for which route the alert fired (see [Cortex routes by path](#cortex-routes-by-path)) + - Write path: open the `Cortex / Writes` dashboard + - Read path: open the `Cortex / Reads` dashboard +- Looking at the dashboard you should see in which Cortex service the error originates + - The panels in the dashboard are vertically sorted by the network path (eg. on the write path: cortex-gw -> distributor -> ingester) +- If the failing service is going OOM (`OOMKilled`): scale up or increase the memory +- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there + +### CortexTransferFailed + +This alert goes off when an ingester fails to find another node to transfer its data to when it was shutting down. If there is both a pod stuck terminating and one stuck joining, look at the kubernetes events. 
This may be due to scheduling problems caused by some combination of anti affinity rules/resource utilization. Adding a new node can help in these circumstances. You can see recent events associated with a resource via kubectl describe, ex: `kubectl -n describe pod ` + +### CortexIngesterUnhealthy + +This alert goes off when an ingester is marked as unhealthy. Check the ring web page to see which is marked as unhealthy. You could then check the logs to see if there are any related to that ingester ex: `kubectl logs -f ingester-01 --namespace=prod`. A simple way to resolve this may be to click the "Forgot" button on the ring page, especially if the pod doesn't exist anymore. It might not exist anymore because it was on a node that got shut down, so you could check to see if there are any logs related to the node that pod is/was on, ex: `kubectl get events --namespace=prod | grep cloud-provider-node`. + +### CortexMemoryMapAreasTooHigh + +This alert fires when a Cortex process has a number of memory map areas close to the limit. The limit is a per-process limit imposed by the kernel and this issue is typically caused by a large number of mmap-ed failes. + +How to **fix**: + +- Increase the limit on your system: `sysctl -w vm.max_map_count=` +- If it's caused by a store-gateway, consider enabling `-blocks-storage.bucket-store.index-header-lazy-loading-enabled=true` to lazy mmap index-headers at query time + +More information: + +- [Kernel doc](https://www.kernel.org/doc/Documentation/sysctl/vm.txt) +- [Side effects when increasing `vm.max_map_count`](https://www.suse.com/support/kb/doc/?id=000016692) + +### CortexRulerFailedRingCheck + +This alert occurs when a ruler is unable to validate whether or not it should claim ownership over the evaluation of a rule group. The most likely cause is that one of the rule ring entries is unhealthy. If this is the case proceed to the ring admin http page and forget the unhealth ruler. The other possible cause would be an error returned the ring client. If this is the case look into debugging the ring based on the in-use backend implementation. + +### CortexRulerTooManyFailedPushes + +This alert fires when rulers cannot push new samples (result of rule evaluation) to ingesters. + +In general, pushing samples can fail due to problems with Cortex operations (eg. too many ingesters have crashed, and ruler cannot write samples to them), or due to problems with resulting data (eg. user hitting limit for number of series, out of order samples, etc.). +This alert fires only for first kind of problems, and not for problems caused by limits or invalid rules. + +How to **fix**: + +- Investigate the ruler logs to find out the reason why ruler cannot write samples. Note that ruler logs all push errors, including "user errors", but those are not causing the alert to fire. Focus on problems with ingesters. + +### CortexRulerTooManyFailedQueries + +This alert fires when rulers fail to evaluate rule queries. + +Each rule evaluation may fail due to many reasons, eg. due to invalid PromQL expression, or query hits limits on number of chunks. These are "user errors", and this alert ignores them. + +There is a category of errors that is more important: errors due to failure to read data from store-gateways or ingesters. These errors would result in 500 when run from querier. This alert fires if there is too many of such failures. + +How to **fix**: + +- Investigate the ruler logs to find out the reason why ruler cannot evaluate queries. 
Note that ruler logs rule evaluation errors even for "user errors", but those are not causing the alert to fire. Focus on problems with ingesters or store-gateways. + +### CortexRulerMissedEvaluations + +_TODO: this playbook has not been written yet._ + +### CortexIngesterHasNotShippedBlocks + +This alert fires when a Cortex ingester is not uploading any block to the long-term storage. An ingester is expected to upload a block to the storage every block range period (defaults to 2h) and if a longer time elapse since the last successful upload it means something is not working correctly. + +How to **investigate**: + +- Ensure the ingester is receiving write-path traffic (samples to ingest) +- Look for any upload error in the ingester logs (ie. networking or authentication issues) + +_If the alert `CortexIngesterTSDBHeadCompactionFailed` fired as well, then give priority to it because that could be the cause._ + +#### Ingester hit the disk capacity + +If the ingester hit the disk capacity, any attempt to append samples will fail. You should: + +1. Increase the disk size and restart the ingester. If the ingester is running in Kubernetes with a Persistent Volume, please refers to [Resizing Persistent Volumes using Kubernetes](#resizing-persistent-volumes-using-kubernetes). +2. Investigate why the disk capacity has been hit + +- Was the disk just too small? +- Was there an issue compacting TSDB head and the WAL is increasing indefinitely? + +### CortexIngesterHasNotShippedBlocksSinceStart + +Same as [`CortexIngesterHasNotShippedBlocks`](#CortexIngesterHasNotShippedBlocks). + +### CortexIngesterHasUnshippedBlocks + +This alert fires when a Cortex ingester has compacted some blocks but such blocks haven't been successfully uploaded to the storage yet. + +How to **investigate**: + +- Look for details in the ingester logs + +### CortexIngesterTSDBHeadCompactionFailed + +This alert fires when a Cortex ingester is failing to compact the TSDB head into a block. + +A TSDB instance is opened for each tenant writing at least 1 series to the ingester and its head contains the in-memory series not flushed to a block yet. Once the TSDB head is compactable, the ingester will try to compact it every 1 minute. If the TSDB head compaction repeatedly fails, it means it's failing to compact a block from the in-memory series for at least 1 tenant, and it's a critical condition that should be immediately investigated. + +The cause triggering this alert could **lead to**: + +- Ingesters run out of memory +- Ingesters run out of disk space +- Queries return partial results after `-querier.query-ingesters-within` time since the beginning of the incident + +How to **investigate**: + +- Look for details in the ingester logs + +### CortexIngesterTSDBHeadTruncationFailed + +This alert fires when a Cortex ingester fails to truncate the TSDB head. + +The TSDB head is the in-memory store used to keep series and samples not compacted into a block yet. If head truncation fails for a long time, the ingester disk might get full as it won't continue to the WAL truncation stage and the subsequent ingester restart may take a long time or even go into an OOMKilled crash loop because of the huge WAL to replay. For this reason, it's important to investigate and address the issue as soon as it happen. + +How to **investigate**: + +- Look for details in the ingester logs + +### CortexIngesterTSDBCheckpointCreationFailed + +This alert fires when a Cortex ingester fails to create a TSDB checkpoint. 
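+
+To find out which ingesters are failing to create checkpoints, one option (a sketch; it reuses the failure counter that also backs the `Checkpoints created / sec` panel on the `Cortex / Writes` dashboard, and assumes the usual `pod` label) is:
+
+```
+sum by(pod) (rate(cortex_ingester_tsdb_checkpoint_creations_failed_total{namespace="<namespace>"}[5m])) > 0
+```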
+ +How to **investigate**: + +- Look for details in the ingester logs +- If the checkpoint fails because of a `corruption in segment`, you can restart the ingester because at next startup TSDB will try to "repair" it. After restart, if the issue is repaired and the ingester is running, you should also get paged by `CortexIngesterTSDBWALCorrupted` to signal you the WAL was corrupted and manual investigation is required. + +### CortexIngesterTSDBCheckpointDeletionFailed + +This alert fires when a Cortex ingester fails to delete a TSDB checkpoint. + +Generally, this is not an urgent issue, but manual investigation is required to find the root cause of the issue and fix it. + +How to **investigate**: + +- Look for details in the ingester logs + +### CortexIngesterTSDBWALTruncationFailed + +This alert fires when a Cortex ingester fails to truncate the TSDB WAL. + +How to **investigate**: + +- Look for details in the ingester logs + +### CortexIngesterTSDBWALCorrupted + +This alert fires when a Cortex ingester finds a corrupted TSDB WAL (stored on disk) while replaying it at ingester startup or when creation of a checkpoint comes across a WAL corruption. + +If this alert fires during an **ingester startup**, the WAL should have been auto-repaired, but manual investigation is required. The WAL repair mechanism cause data loss because all WAL records after the corrupted segment are discarded and so their samples lost while replaying the WAL. If this issue happen only on 1 ingester then Cortex doesn't suffer any data loss because of the replication factor, while if it happens on multiple ingesters then some data loss is possible. + +If this alert fires during a **checkpoint creation**, you should have also been paged with `CortexIngesterTSDBCheckpointCreationFailed`, and you can follow the steps under that alert. + +### CortexIngesterTSDBWALWritesFailed + +This alert fires when a Cortex ingester is failing to log records to the TSDB WAL on disk. + +How to **investigate**: + +- Look for details in the ingester logs + +### CortexQuerierHasNotScanTheBucket + +This alert fires when a Cortex querier is not successfully scanning blocks in the storage (bucket). A querier is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m) and if it's not successfully synching the bucket since a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results. + +How to **investigate**: + +- Look for any scan error in the querier logs (ie. networking or rate limiting issues) + +### CortexQuerierHighRefetchRate + +This alert fires when there's an high number of queries for which series have been refetched from a different store-gateway because of missing blocks. This could happen for a short time whenever a store-gateway ring resharding occurs (e.g. during/after an outage or while rolling out store-gateway) but store-gateways should reconcile in a short time. This alert fires if the issue persist for an unexpected long time and thus it should be investigated. + +How to **investigate**: + +- Ensure there are no errors related to blocks scan or sync in the queriers and store-gateways +- Check store-gateway logs to see if all store-gateway have successfully completed a blocks sync + +### CortexStoreGatewayHasNotSyncTheBucket + +This alert fires when a Cortex store-gateway is not successfully scanning blocks in the storage (bucket). 
A store-gateway is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m) and if it's not successfully synching the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results. + +How to **investigate**: + +- Look for any scan error in the store-gateway logs (ie. networking or rate limiting issues) + +### CortexCompactorHasNotSuccessfullyCleanedUpBlocks + +This alert fires when a Cortex compactor is not successfully deleting blocks marked for deletion for a long time. + +How to **investigate**: + +- Ensure the compactor is not crashing during compaction (ie. `OOMKilled`) +- Look for any error in the compactor logs (ie. bucket Delete API errors) + +### CortexCompactorHasNotSuccessfullyCleanedUpBlocksSinceStart + +Same as [`CortexCompactorHasNotSuccessfullyCleanedUpBlocks`](#CortexCompactorHasNotSuccessfullyCleanedUpBlocks). + +### CortexCompactorHasNotUploadedBlocks + +This alert fires when a Cortex compactor is not uploading any compacted blocks to the storage since a long time. + +How to **investigate**: + +- If the alert `CortexCompactorHasNotSuccessfullyRunCompaction` has fired as well, then investigate that issue first +- If the alert `CortexIngesterHasNotShippedBlocks` or `CortexIngesterHasNotShippedBlocksSinceStart` have fired as well, then investigate that issue first +- Ensure ingesters are successfully shipping blocks to the storage +- Look for any error in the compactor logs + +### CortexCompactorHasNotSuccessfullyRunCompaction + +This alert fires if the compactor is not able to successfully compact all discovered compactable blocks (across all tenants). + +When this alert fires, the compactor may still have successfully compacted some blocks but, for some reason, other blocks compaction is consistently failing. A common case is when the compactor is trying to compact a corrupted block for a single tenant: in this case the compaction of blocks for other tenants is still working, but compaction for the affected tenant is blocked by the corrupted block. + +How to **investigate**: + +- Look for any error in the compactor logs + - Corruption: [`not healthy index found`](#compactor-is-failing-because-of-not-healthy-index-found) + +#### Compactor is failing because of `not healthy index found` + +The compactor may fail to compact blocks due a corrupted block index found in one of the source blocks: + +``` +level=error ts=2020-07-12T17:35:05.516823471Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=REDACTED-TENANT err="compaction: group 0@6672437747845546250: block with not healthy index found /data/compact/0@6672437747845546250/REDACTED-BLOCK; Compaction level 1; Labels: map[__org_id__:REDACTED]: 1/1183085 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)" +``` + +When this happen you should: + +1. Rename the block prefixing it with `corrupted-` so that it will be skipped by the compactor and queriers. Keep in mind that doing so the block will become invisible to the queriers too, so its series/samples will not be queried. If the corruption affects only 1 block whose compaction `level` is 1 (the information is stored inside its `meta.json`) then Cortex guarantees no data loss because all the data is replicated across other blocks. In all other cases, there may be some data loss once you rename the block and stop querying it. +2. Ensure the compactor has recovered +3. 
Investigate offline the root cause (eg. download the corrupted block and debug it locally) + +To rename a block stored on GCS you can use the `gsutil` CLI command: + +``` +gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK +``` + +Where: + +- `BUCKET` is the gcs bucket name the compactor is using. The cell's bucket name is specified as the `blocks_storage_bucket_name` in the cell configuration +- `TENANT` is the tenant id reported in the example error message above as `REDACTED-TENANT` +- `BLOCK` is the last part of the file path reported as `REDACTED-BLOCK` in the example error message above + +### CortexBucketIndexNotUpdated + +This alert fires when the bucket index, for a given tenant, is not updated since a long time. The bucket index is expected to be periodically updated by the compactor and is used by queriers and store-gateways to get an almost-updated view over the bucket store. + +How to **investigate**: + +- Ensure the compactor is successfully running +- Look for any error in the compactor logs + +### CortexTenantHasPartialBlocks + +This alert fires when Cortex finds partial blocks for a given tenant. A partial block is a block missing the `meta.json` and this may usually happen in two circumstances: + +1. A block upload has been interrupted and not cleaned up or retried +2. A block deletion has been interrupted and `deletion-mark.json` has been deleted before `meta.json` + +How to **investigate**: + +- Look for the block ID in the logs. Example Loki query: + ``` + {cluster="",namespace="",container="compactor"} |= "skipped partial block" + ``` +- Find out which Cortex component operated on the block at last (eg. uploaded by ingester/compactor, or deleted by compactor) +- Investigate if was a partial upload or partial delete +- Safely manually delete the block from the bucket if was a partial delete or an upload failed by a compactor +- Further investigate if was an upload failed by an ingester but not later retried (ingesters are expected to retry uploads until succeed) + +### CortexWALCorruption + +This alert is only related to the chunks storage. This can happen because of 2 reasons: (1) Non graceful shutdown of ingesters. (2) Faulty storage or NFS. + +WAL corruptions are only detected at startups, so at this point the WAL/Checkpoint would have been repaired automatically. So we can only check what happened and if there was any data loss and take actions to avoid this happening in future. + +1. Check if there was any node restarts that force killed pods. If there is, then the corruption is from the non graceful shutdown of ingesters, which is generally fine. You can: + +- Describe the pod to see the last state. +- Use `kube_pod_info` to check the node for the pod. `node_boot_time_seconds` to see if node just booted (which also indicates restart). +- You can use `eventrouter` logs to double check. +- Check ingester logs to check if the shutdown logs are missing at that time. + +2. To confirm this, in the logs, check the WAL segment on which the corruption happened (let's say `X`) and the last checkpoint attempt number (let's say `Y`, this is the last WAL segment that was present when checkpointing started). +3. If `X > Y`, then it's most likely an abrupt restart of ingester and the corruption would be on the last few records of the last segment. To verify this, check the file timestamps of WAL segment `X` and `X - 1` if they were recent. +4. If `X < Y`, then the corruption was in some WAL segment which was not the last one. 
This indicates faulty disk and some data loss on that ingester. +5. In case of faulty disk corruption, if the number or ingesters that had corruption within the chunk flush age: +6. Less than the quorum number for your replication factor: No data loss, because there is a guarantee that the data is replicated. For example, if replication factor is 3, then it's fine if corruption was on 1 ingester. +7. Equal or more than the quorum number but less than replication factor: There is a good chance that there is no data loss if it was replicated to desired number of ingesters. But it's good to check once for data loss. +8. Equal or more than the replication factor: Then there is definitely some data loss. + +### CortexTableSyncFailure + +_This alert applies to Cortex chunks storage only._ + +### CortexQueriesIncorrect + +_TODO: this playbook has not been written yet._ + +### CortexInconsistentRuntimeConfig + +This alert fires if multiple replicas of the same Cortex service are using a different runtime config for a longer period of time. + +The Cortex runtime config is a config file which gets live reloaded by Cortex at runtime. In order for Cortex to work properly, the loaded config is expected to be the exact same across multiple replicas of the same Cortex service (eg. distributors, ingesters, ...). When the config changes, there may be short periods of time during which some replicas have loaded the new config and others are still running on the previous one, but it shouldn't last for more than few minutes. + +How to **investigate**: + +- Check how many different config file versions (hashes) are reported + ``` + count by (sha256) (cortex_runtime_config_hash{namespace=""}) + ``` +- Check which replicas are running a different version + ``` + cortex_runtime_config_hash{namespace="",sha256=""} + ``` +- Check if the runtime config has been updated on the affected replicas' filesystem. Check `-runtime-config.file` command line argument to find the location of the file. +- Check the affected replicas logs and look for any error loading the runtime config + +### CortexBadRuntimeConfig + +This alert fires if Cortex is unable to reload the runtime config. + +This typically means an invalid runtime config was deployed. Cortex keeps running with the previous (valid) version of the runtime config; running Cortex replicas and the system availability shouldn't be affected, but new replicas won't be able to startup until the runtime config is fixed. + +How to **investigate**: + +- Check the latest runtime config update (it's likely to be broken) +- Check Cortex logs to get more details about what's wrong with the config + +### CortexFrontendQueriesStuck + +This alert fires if Cortex is running without query-scheduler and queries are piling up in the query-frontend queue. + +The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see the other playbook for more details. + +### CortexSchedulerQueriesStuck + +This alert fires if queries are piling up in the query-scheduler. + +How it **works**: + +- A query-frontend API endpoint is called to execute a query +- The query-frontend enqueues the request to the query-scheduler +- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers +- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query + +How to **investigate**: + +- Are queriers in a crash loop (eg. 
OOMKilled)? + - `OOMKilled`: temporarily increase queriers memory request/limit + - `panic`: look for the stack trace in the logs and investigate from there +- Is QPS increased? + - Scale up queriers to satisfy the increased workload +- Is query latency increased? + - An increased latency reduces the number of queries we can run / sec: once all workers are busy, new queries will pile up in the queue + - Temporarily scale up queriers to try to stop the bleed + - Check if a specific tenant is running heavy queries + - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace=""}) > 0` to find tenants with enqueued queries + - Check the `Cortex / Slow Queries` dashboard to find slow queries + - On multi-tenant Cortex cluster with **shuffle-sharing for queriers disabled**, you may consider to enable it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant). + - On multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider to temporarily increase the shard size for affected tenants: be aware that this could affect other tenants too, reducing resources available to run other tenant queries. Alternatively, you may choose to do nothing and let Cortex return errors for that given user once the per-tenant queue is full. + +### CortexMemcachedRequestErrors + +This alert fires if Cortex memcached client is experiencing an high error rate for a specific cache and operation. + +How to **investigate**: + +- The alert reports which cache is experiencing issue + - `metadata-cache`: object store metadata cache + - `index-cache`: TSDB index cache + - `chunks-cache`: TSDB chunks cache +- Check which specific error is occurring + - Run the following query to find out the reason (replace `` with the actual Cortex cluster namespace) + ``` + sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace=""}[1m])) > 0 + ``` +- Based on the **`reason`**: + - `timeout` + - Scale up the memcached replicas + - `server-error` + - Check both Cortex and memcached logs to find more details + - `network-error` + - Check Cortex logs to find more details + - `malformed-key` + - The key is too long or contains invalid characters + - Check Cortex logs to find the offending key + - Fixing this will require changes to the application code + - `other` + - Check both Cortex and memcached logs to find more details + +### CortexOldChunkInMemory + +_This alert applies to Cortex chunks storage only._ + +### CortexCheckpointCreationFailed + +_This alert applies to Cortex chunks storage only._ + +### CortexCheckpointDeletionFailed + +_This alert applies to Cortex chunks storage only._ + +### CortexProvisioningMemcachedTooSmall + +_This alert applies to Cortex chunks storage only._ + +### CortexProvisioningTooManyActiveSeries + +This alert fires if the average number of in-memory series per ingester is above our target (1.5M). 
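+
+As a quick check (a sketch based on the same metric used by the scaling query below), you can see how far each cluster's average per-ingester series count is above the target:
+
+```
+avg by(cluster, namespace) (cortex_ingester_memory_series) > 1.5e6
+```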
+ +How to **fix**: + +- Scale up ingesters + - To find out the Cortex clusters where ingesters should be scaled up and how many minimum replicas are expected: + ``` + ceil(sum by(cluster, namespace) (cortex_ingester_memory_series) / 1.5e6) > + count by(cluster, namespace) (cortex_ingester_memory_series) + ``` +- After the scale up, the in-memory series are expected to be reduced at the next TSDB head compaction (occurring every 2h) + +### CortexProvisioningTooManyWrites + +This alert fires if the average number of samples ingested / sec in ingesters is above our target. + +How to **fix**: + +- Scale up ingesters + - To compute the desired number of ingesters to satisfy the average samples rate you can run the following query, replacing `` with the namespace to analyse and `` with the target number of samples/sec per ingester (check out the alert threshold to see the current target): + ``` + sum(rate(cortex_ingester_ingested_samples_total{namespace=""}[$__rate_interval])) / ( * 0.9) + ``` + +### CortexAllocatingTooMuchMemory + +This alert fires when an ingester memory utilization is getting closer to the limit. + +How it **works**: + +- Cortex ingesters are a stateful service +- Having 2+ ingesters `OOMKilled` may cause a cluster outage +- Ingester memory baseline usage is primarily influenced by memory allocated by the process (mostly go heap) and mmap-ed files (used by TSDB) +- Ingester memory short spikes are primarily influenced by queries and TSDB head compaction into new blocks (occurring every 2h) +- A pod gets `OOMKilled` once its working set memory reaches the configured limit, so it's important to prevent ingesters' memory utilization (working set memory) from getting close to the limit (we need to keep at least 30% room for spikes due to queries) + +How to **fix**: + +- Check if the issue occurs only for few ingesters. If so: + - Restart affected ingesters 1 by 1 (proceed with the next one once the previous pod has restarted and it's Ready) + ``` + kubectl -n delete pod ingester-XXX + ``` + - Restarting an ingester typically reduces the memory allocated by mmap-ed files. After the restart, ingester may allocate this memory again over time, but it may give more time while working on a longer term solution +- Check the `Cortex / Writes Resources` dashboard to see if the number of series per ingester is above the target (1.5M). If so: + - Scale up ingesters + - Memory is expected to be reclaimed at the next TSDB head compaction (occurring every 2h) + +### CortexGossipMembersMismatch + +This alert fires when any instance does not register all other instances as members of the memberlist cluster. + +How it **works**: + +- This alert applies when memberlist is used for the ring backing store. +- All Cortex instances using the ring, regardless of type, join a single memberlist cluster. +- Each instance (=memberlist cluster member) should be able to see all others. +- Therefore the following should be equal for every instance: + - The reported number of cluster members (`memberlist_client_cluster_members_count`) + - The total number of currently responsive instances. + +How to **investigate**: + +- The instance which has the incomplete view of the cluster (too few members) is specified in the alert. +- If the count is zero: + - It is possible that the joining the cluster has yet to succeed. 
+ - The following log message indicates that the _initial_ initial join did not succeed: `failed to join memberlist cluster` + - The following log message indicates that subsequent re-join attempts are failing: `re-joining memberlist cluster failed` + - If it is the case that the initial join failed, take action according to the reason given. +- Verify communication with other members by checking memberlist traffic is being sent and received by the instance using the following metrics: + - `memberlist_tcp_transport_packets_received_total` + - `memberlist_tcp_transport_packets_sent_total` +- If traffic is present, then verify there are no errors sending or receiving packets using the following metrics: + - `memberlist_tcp_transport_packets_sent_errors_total` + - `memberlist_tcp_transport_packets_received_errors_total` + - These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`. +- Logs coming directly from memberlist are also logged by Cortex; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:xyz`. + +### EtcdAllocatingTooMuchMemory + +This can be triggered if there are too many HA dedupe keys in etcd. We saw this when one of our clusters hit 20K tenants that were using HA dedupe config. Raise the etcd limits via: + +``` + etcd+: { + spec+: { + pod+: { + resources+: { + limits: { + memory: '2Gi', + }, + }, + }, + }, + }, +``` + +### CortexAlertmanagerSyncConfigsFailing + +How it **works**: + +This alert is fired when the multi-tenant alertmanager cannot load alertmanager configs from the remote object store for at least 30 minutes. + +Loading the alertmanager configs can happen in the following situations: + +1. When the multi tenant alertmanager is started +2. Each time it polls for config changes in the alertmanager +3. When there is a ring change + +The metric for this alert is cortex_alertmanager_sync_configs_failed_total and is incremented each time one of the above fails. + +When there is a ring change or the interval has elapsed, a failure to load configs from the store is logged as a warning. + +How to **investigate**: + +Look at the error message that is logged and attempt to understand what is causing the failure. I.e. it could be a networking issue, incorrect configuration for the store, etc. + +### CortexAlertmanagerRingCheckFailing + +How it **works**: + +This alert is fired when the multi-tenant alertmanager has been unable to check if one or more tenants should be owned on this shard for at least 10 minutes. + +When the alertmanager loads its configuration on start up, when it polls for config changes or when there is a ring change it must check the ring to see if the tenant is still owned on this shard. To prevent one error from causing the loading of all configurations to fail we assume that on error the tenant is NOT owned for this shard. If checking the ring continues to fail then some tenants might not be assigned an alertmanager and might not be able to receive notifications for their alerts. + +The metric for this alert is cortex_alertmanager_ring_check_errors_total. + +How to **investigate**: + +Look at the error message that is logged and attempt to understand what is causing the failure. In most cases the error will be encountered when attempting to read from the ring, which can fail if there is an issue with in-use backend implementation. 
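+
+To confirm the failures are still occurring (and on which replicas), you could also graph the error counter mentioned above, for example (the `pod` grouping and the 5m window are arbitrary choices):
+
+```
+sum by(pod) (rate(cortex_alertmanager_ring_check_errors_total{namespace="<namespace>"}[5m])) > 0
+```
+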
+
+### CortexAlertmanagerPartialStateMergeFailing
+
+How it **works**:
+
+This alert is fired when the multi-tenant alertmanager attempts to merge a partial state for something that it either does not know about or the partial state cannot be merged with the existing local state. State merges are gRPC messages that are gossiped between a shard and the corresponding alertmanager instance in other shards.
+
+The metric for this alert is `cortex_alertmanager_partial_state_merges_failed_total`.
+
+How to **investigate**:
+
+The error is not currently logged on the receiver side. If this alert is firing, it is likely that CortexAlertmanagerReplicationFailing is also firing, so instead follow the investigation steps for that alert, with the assumption that the issue is not RPC/communication related.
+
+### CortexAlertmanagerReplicationFailing
+
+How it **works**:
+
+This alert is fired when the multi-tenant alertmanager attempts to replicate a state update for a tenant (i.e. a silence or a notification) to another alertmanager instance but fails. This could be due to an RPC/communication error or the other alertmanager being unable to merge the state with its own local state.
+
+The metric for this alert is `cortex_alertmanager_state_replication_failed_total`.
+
+How to **investigate**:
+
+When state replication fails it gets logged as an error in the alertmanager that attempted the state replication. Check the error message in the log to understand the cause of the error (i.e. was it due to an RPC/communication error or was there an error in the receiving alertmanager).
+
+### CortexAlertmanagerPersistStateFailing
+
+How it **works**:
+
+This alert is fired when the multi-tenant alertmanager cannot persist its state to the remote object store. This operation is attempted periodically (every 15m by default).
+
+Each alertmanager writes its state (silences, notification log) to the remote object storage and the `cortex_alertmanager_state_persist_failed_total` metric is incremented each time this fails. The alert fires if this fails for an hour or more.
+
+How to **investigate**:
+
+Each failure to persist state to the remote object storage is logged. Find the reason in the Alertmanager container logs with the text "failed to persist state". Possible reasons:
+
+- The most probable cause is that remote write failed. Try to investigate why based on the message (network issue, storage issue). If the error indicates the issue might be transient, then you can wait until the next periodic attempt and see if it succeeds.
+- It is also possible that encoding the state failed. This does not depend on external factors as it is just pulling state from the Alertmanager internal state. It may indicate a bug in the encoding method.
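+
+To locate the replicas that are failing to persist their state, you can use a query like the following (a minimal sketch; as above, the `pod` label depends on your deployment):
+
+```
+sum by (cluster, namespace, pod) (rate(cortex_alertmanager_state_persist_failed_total[1h])) > 0
+```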
+
+### CortexAlertmanagerInitialSyncFailed
+
+How it **works**:
+
+When a tenant replica becomes owned it is assigned to an alertmanager instance. The alertmanager instance attempts to read the state from other alertmanager instances. If no other alertmanager instances could replicate the full state then it attempts to read the full state from the remote object store. This alert fires when both of these operations fail.
+
+Note that the case where there is no state for this user in remote object storage is not treated as a failure. This is expected when a new tenant becomes active for the first time.
+
+How to **investigate**:
+
+When an alertmanager cannot read the state for a tenant from storage it gets logged as the following error: "failed to read state from storage; continuing anyway". The possible causes of this error could be:
+
+- The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
+- The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.
+
+### CortexRolloutStuck
+
+This alert fires when a Cortex service rollout is stuck, which means the number of updated replicas doesn't match the expected one and it looks like there's no progress in the rollout. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
+
+How to **investigate**:
+
+- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to get a list of running pods
+- Ensure there's no pod in a failing state (eg. `Error`, `OOMKilled`, `CrashLoopBackOff`)
+- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
+- Run `kubectl -n <namespace> describe statefulset <statefulset>` or `kubectl -n <namespace> describe deployment <deployment>` and look at "Pod Status" and "Events" to get more information
+
+### CortexKVStoreFailure
+
+This alert fires if a Cortex instance is failing to run any operation on a KV store (eg. consul or etcd).
+
+How it **works**:
+
+- Consul is typically used to store the hash ring state.
+- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
+- If an instance is failing operations on the **hash ring**, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or is failing to receive updates.
+
+How to **investigate**:
+
+- Ensure Consul/Etcd is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.
+
+## Cortex routes by path
+
+**Write path**:
+
+- `/distributor.Distributor/Push`
+- `/cortex.Ingester/Push`
+- `api_v1_push`
+- `api_prom_push`
+- `api_v1_push_influx_write`
+
+**Read path**:
+
+- `/schedulerpb.SchedulerForFrontend/FrontendLoop`
+- `/cortex.Ingester/QueryStream`
+- `/cortex.Ingester/QueryExemplars`
+- `/gatewaypb.StoreGateway/Series`
+- `api_prom_label`
+- `api_prom_api_v1_query_exemplars`
+
+**Ruler / rules path**:
+
+- `api_v1_rules`
+- `api_v1_rules_namespace`
+- `api_prom_rules_namespace`
+
+## Cortex blocks storage - What to do when things go wrong
+
+## Recovering from a potential data loss incident
+
+The ingested series data that could be lost during an incident can be stored in two places:
+
+1. Ingesters (before blocks are shipped to the bucket)
+2. Bucket
+
+There could be several root causes leading to a potential data loss. In this document we're going to share generic procedures that could be used as a guideline during an incident.
+
+### Halt the compactor
+
+The Cortex cluster continues to successfully operate even if the compactor is not running, except that over a long period (12+ hours) this will lead to query performance degradation. The compactor could potentially be the cause of data loss because:
+
+- It marks blocks for deletion (soft deletion). _This doesn't lead to any immediate deletion, but blocks marked for deletion will be hard deleted once a delay expires._
+- It permanently deletes blocks marked for deletion after `-compactor.deletion-delay` (hard deletion)
+- It could generate corrupted compacted blocks (eg. due to a bug or if a source block is corrupted and the automatic checks can't detect it)
+
+**If you suspect the compactor could be the cause of data loss, halt it** (delete the statefulset or scale down the replicas to 0, as shown in the example after the list below). It can be restarted anytime later.
+
+When the compactor is **halted**:
+
+- No new blocks will be compacted
+- No blocks will be deleted (soft and hard deletion)
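+
+For example, to scale the compactor down to zero replicas (a sketch; it assumes the compactor runs as a `StatefulSet` named `compactor` in the affected namespace):
+
+```
+kubectl -n <namespace> scale statefulset compactor --replicas=0
+```
+
+Scaling it back up later (eg. `--replicas=1`) resumes compactions.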
+
+### Recover source blocks from ingesters
+
+Ingesters keep, on their persistent disk, the blocks compacted from the TSDB head until the `-experimental.tsdb.retention-period` retention expires. The **default retention is 4 days**, in order to give cluster operators enough time to react in case of a data loss incident.
+
+The blocks retained in the ingesters can be used in case the compactor generates corrupted blocks and the source blocks, shipped from ingesters, have already been hard deleted from the bucket.
+
+How to manually upload blocks from ingesters to the bucket:
+
+1. Ensure [`gsutil`](https://cloud.google.com/storage/docs/gsutil) is installed in the Cortex pod. If not, [install it](#install-gsutil-in-the-cortex-pod)
+2. Run `cd /data/tsdb && /path/to/gsutil -m rsync -n -r -x 'thanos.shipper.json|chunks_head|wal' . gs://<bucket>/recovered/`
+   - `-n` enables the **dry run** (remove it once you've verified the output matches your expectations)
+   - `-m` enables parallel mode
+   - `-r` enables recursive rsync
+   - `-x <pattern>` excludes specific patterns from sync (no WAL or shipper metadata file should be uploaded to the bucket)
+   - Don't use `-d` (dangerous) because it will delete from the bucket any block which is not in the local filesystem
+
+### Freeze ingesters persistent disk
+
+The blocks and WAL stored in the ingester persistent disk are the last line of defence in case of an incident involving blocks not shipped to the bucket or corrupted blocks in the bucket. If the data integrity in the ingester's disk is at risk (eg. close to hitting the TSDB retention period or close to reaching max disk utilisation), you should freeze it by taking a **disk snapshot**.
+
+To take a **GCP persistent disk snapshot**:
+
+1. Identify the Kubernetes PVC volume name (`kubectl get pvc -n <namespace>`) of the volumes to snapshot
+2. For each volume, [create a snapshot](https://console.cloud.google.com/compute/snapshotsAdd) from the GCP console ([documentation](https://cloud.google.com/compute/docs/disks/create-snapshots)), or use the `gcloud` CLI as sketched below
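+
+The same snapshot can be taken with the `gcloud` CLI. This is a sketch that assumes in-tree GCE PD volumes (where the underlying disk name is exposed as `spec.gcePersistentDisk.pdName`); the PVC name and zone are illustrative placeholders:
+
+```
+# Find the PV bound to the PVC, then the underlying GCE disk name.
+PV=$(kubectl -n <namespace> get pvc ingester-pvc-ingester-7 -o jsonpath='{.spec.volumeName}')
+DISK=$(kubectl get pv "$PV" -o jsonpath='{.spec.gcePersistentDisk.pdName}')
+
+# Snapshot the disk (the zone must match the disk's zone).
+gcloud compute disks snapshot "$DISK" --zone=<zone> --snapshot-names="${DISK}-$(date +%Y%m%d)"
+```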
+
+### Halt the ingesters
+
+Halting the ingesters should be the **very last resort** because of the side effects. To halt the ingesters, while preserving their disk and without disrupting the cluster write path, you need to:
+
+1. Create a second pool of ingesters
+   - Use the functions `newIngesterStatefulSet()`, `newIngesterPdb()`
+2. Wait until the second pool is up and running
+3. Halt the existing ingesters (scale down to 0 or delete their statefulset)
+
+However, **queries will return partial data**, due to all the ingested samples which have not been compacted to blocks yet.
+
+## Manual procedures
+
+### Resizing Persistent Volumes using Kubernetes
+
+This is the short version of an extensive documentation on [how to resize Kubernetes Persistent Volumes](https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/).
+
+**Pre-requisites**:
+
+- Running Kubernetes v1.11 or above
+- The PV storage class has `allowVolumeExpansion: true`
+- The PV is backed by a supported block storage volume (eg. GCP-PD, AWS-EBS, ...)
+
+**How to increase the volume**:
+
+1. Edit the PVC (persistent volume claim) `spec` for the volume to resize and **increase** `resources` > `requests` > `storage` (see the example after this list)
+2. Restart the pod attached to the PVC for which the storage request has been increased
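+
+For example, to raise the storage request on a PVC with `kubectl patch` (a sketch; the PVC name and new size are illustrative):
+
+```
+kubectl -n <namespace> patch pvc ingester-pvc-ingester-7 \
+  -p '{"spec":{"resources":{"requests":{"storage":"300Gi"}}}}'
+```
+
+Then restart the attached pod as described in step 2, so the filesystem resize takes effect.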
+
+### How to create clone volume (Google Cloud specific)
+
+In some scenarios, it may be useful to preserve current volume status for inspection, but keep using the volume.
+[Google Persistent Disk supports "Clone"](https://cloud.google.com/compute/docs/disks/add-persistent-disk#source-disk) operation that can be used to do that.
+The newly cloned disk is independent from its original, and can be used for further investigation by attaching it to a new Machine / Pod.
+
+When using Kubernetes, here is a YAML file that creates a PV (`clone-ingester-7-pv`) pointing to the new disk clone (`clone-pvc-80cc0efa-4996-11ea-ba79-42010a96008c` in this example),
+a PVC (`clone-ingester-7-pvc`) pointing to the PV, and finally a Pod (`clone-ingester-7-dataaccess`) using the PVC to access the disk.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: clone-ingester-7-pv
+spec:
+  accessModes:
+  - ReadWriteOnce
+  capacity:
+    storage: 150Gi
+  gcePersistentDisk:
+    fsType: ext4
+    pdName: clone-pvc-80cc0efa-4996-11ea-ba79-42010a96008c
+  persistentVolumeReclaimPolicy: Retain
+  storageClassName: fast
+  volumeMode: Filesystem
+---
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: clone-ingester-7-pvc
+spec:
+  accessModes:
+  - ReadWriteOnce
+  resources:
+    requests:
+      storage: 150Gi
+  storageClassName: fast
+  volumeName: clone-ingester-7-pv
+  volumeMode: Filesystem
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  name: clone-ingester-7-dataaccess
+spec:
+  containers:
+  - name: alpine
+    image: alpine:latest
+    command: ["sleep", "infinity"]
+    volumeMounts:
+    - name: mypvc
+      mountPath: /data
+    resources:
+      requests:
+        cpu: 500m
+        memory: 1024Mi
+  volumes:
+  - name: mypvc
+    persistentVolumeClaim:
+      claimName: clone-ingester-7-pvc
+```
+
+After this preparation, one can use `kubectl exec -t -i clone-ingester-7-dataaccess /bin/sh` to inspect the disk mounted under `/data`.
+
+### Install `gsutil` in the Cortex pod
+
+1. Install python
+   ```
+   apk add python3 py3-pip
+   ln -s /usr/bin/python3 /usr/bin/python
+   pip install google-compute-engine
+   ```
+2. Download `gsutil`
+   ```
+   wget https://storage.googleapis.com/pub/gsutil.tar.gz
+   tar -zxvf gsutil.tar.gz
+   ./gsutil/gsutil --help
+   ```
+3. Configure credentials
+
+   ```
+   gsutil config -e
+
+   # Private key path: /var/secrets/google/credentials.json
+   # Project ID: your google project ID
+   ```
+
+### Deleting a StatefulSet with persistent volumes
+
+When you delete a Kubernetes StatefulSet whose pods have persistent volume claims (PVC), the PVCs are not automatically deleted. This means that if the StatefulSet is recreated, the pods for which there was already a PVC will get the previously mounted volume back.
+
+A PVC can be manually deleted by an operator. When a PVC claim is deleted, what happens to the volume depends on its [Reclaim Policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming):
+
+- `Retain`: the volume will not be deleted until the PV resource is manually deleted from Kubernetes
+- `Delete`: the volume will be automatically deleted
+
+## Log lines
+
+### Log line containing 'sample with repeated timestamp but different value'
+
+This means a sample with the same timestamp as the latest one was received with a different value. The number of occurrences is recorded in the `cortex_discarded_samples_total` metric with the label `reason="new-value-for-timestamp"`.
+
+Possible reasons for this are:
+
+- Incorrect relabelling rules can cause a label to be dropped from a series so that multiple series have the same labels. If these series were collected from the same target they will have the same timestamp.
+- The exporter being scraped sets the same timestamp on every scrape. Note that exporters should generally not set timestamps.
diff --git a/operations/mimir-mixin/groups.libsonnet b/operations/mimir-mixin/groups.libsonnet
new file mode 100644
index 00000000000..c2c35f90d21
--- /dev/null
+++ b/operations/mimir-mixin/groups.libsonnet
@@ -0,0 +1,62 @@
+{
+  local makePrefix(groups) = std.join('_', groups),
+  local makeGroupBy(groups) = std.join(', ', groups),
+
+  local group_by_cluster = makeGroupBy($._config.cluster_labels),
+
+  _group_config+:: {
+    // Each group prefix is composed of `_`-separated labels
+    group_prefix_jobs: makePrefix($._config.job_labels),
+    group_prefix_clusters: makePrefix($._config.cluster_labels),
+
+    // Each group-by label list is `, `-separated and uniquely identifies a group
+    group_by_job: makeGroupBy($._config.job_labels),
+    group_by_cluster: group_by_cluster,
+  },
+
+  // The following works around the deprecation of `$._config.alert_aggregation_labels`
+  // - If an override of that value is detected, a warning will be printed
+  // - If no override was detected, it will be set to the `group_by_cluster` value,
+  //   which will replace it altogether in the future.
+  local alert_aggregation_labels_override = (
+    {
+      alert_aggregation_labels: null,
+    } + super._config
+  ).alert_aggregation_labels,
+
+  _config+:: {
+    alert_aggregation_labels:
+      if alert_aggregation_labels_override != null
+      then std.trace(
+        |||
+          Deprecated: _config.alert_aggregation_labels
+          This field has been explicitly overridden to "%s".
+          Instead, express the override in terms of _config.cluster_labels.
+          E.g., cluster_labels: %s will automatically convert to "%s".
+        ||| % [
+          alert_aggregation_labels_override,
+          $._config.cluster_labels,
+          group_by_cluster,
+        ],
+        alert_aggregation_labels_override
+      )
+      else group_by_cluster,
+
+    // This field contains the Prometheus template variables that should
+    // be used to display values of the configured "group_by_cluster" (or the
+    // deprecated "alert_aggregation_labels").
+    alert_aggregation_variables:
+      std.join(
+        '/',
+        // Generate the variable replacement for each label.
+        std.map(
+          function(l) '{{ $labels.%s }}' % l,
+          // Split the configured labels by comma and remove whitespaces.
+ std.map( + function(l) std.strReplace(l, ' ', ''), + std.split($._config.alert_aggregation_labels, ',') + ), + ), + ), + }, +} diff --git a/operations/mimir-mixin/jsonnetfile.json b/operations/mimir-mixin/jsonnetfile.json new file mode 100644 index 00000000000..3f1547aaebd --- /dev/null +++ b/operations/mimir-mixin/jsonnetfile.json @@ -0,0 +1,24 @@ +{ + "version": 1, + "dependencies": [ + { + "source": { + "git": { + "remote": "https://github.com/grafana/jsonnet-libs.git", + "subdir": "grafana-builder" + } + }, + "version": "master" + }, + { + "source": { + "git": { + "remote": "https://github.com/grafana/jsonnet-libs.git", + "subdir": "mixin-utils" + } + }, + "version": "master" + } + ], + "legacyImports": true +} diff --git a/operations/mimir-mixin/jsonnetfile.lock.json b/operations/mimir-mixin/jsonnetfile.lock.json new file mode 100644 index 00000000000..a1b021910f4 --- /dev/null +++ b/operations/mimir-mixin/jsonnetfile.lock.json @@ -0,0 +1,26 @@ +{ + "version": 1, + "dependencies": [ + { + "source": { + "git": { + "remote": "https://github.com/grafana/jsonnet-libs.git", + "subdir": "grafana-builder" + } + }, + "version": "0d13e5ba1b3a4c29015738c203d92ea39f71ebe2", + "sum": "GRf2GvwEU4jhXV+JOonXSZ4wdDv8mnHBPCQ6TUVd+g8=" + }, + { + "source": { + "git": { + "remote": "https://github.com/grafana/jsonnet-libs.git", + "subdir": "mixin-utils" + } + }, + "version": "21b638f4e4922c0b6fde12120ed45d8ef803edc7", + "sum": "Je2SxBKu+1WrKEEG60zjSKaY/6TPX8uRz5bsaw0a8oA=" + } + ], + "legacyImports": false +} diff --git a/operations/mimir-mixin/mixin.libsonnet b/operations/mimir-mixin/mixin.libsonnet new file mode 100644 index 00000000000..bc04944c8da --- /dev/null +++ b/operations/mimir-mixin/mixin.libsonnet @@ -0,0 +1,5 @@ +(import 'config.libsonnet') + +(import 'groups.libsonnet') + +(import 'dashboards.libsonnet') + +(import 'alerts.libsonnet') + +(import 'recording_rules.libsonnet') diff --git a/operations/mimir-mixin/recording_rules.libsonnet b/operations/mimir-mixin/recording_rules.libsonnet new file mode 100644 index 00000000000..0383524787b --- /dev/null +++ b/operations/mimir-mixin/recording_rules.libsonnet @@ -0,0 +1,445 @@ +local utils = import 'mixin-utils/utils.libsonnet'; + +{ + local _config = { + max_series_per_ingester: 1.5e6, + max_samples_per_sec_per_ingester: 80e3, + max_samples_per_sec_per_distributor: 240e3, + limit_utilisation_target: 0.6, + cortex_overrides_metric: 'cortex_overrides', + } + $._config + $._group_config, + prometheusRules+:: { + groups+: [ + { + name: 'cortex_api_1', + rules: + utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'job']), + }, + { + name: 'cortex_api_2', + rules: + utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'job', 'route']), + }, + { + name: 'cortex_api_3', + rules: + utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'namespace', 'job', 'route']), + }, + { + name: 'cortex_querier_api', + rules: + utils.histogramRules('cortex_querier_request_duration_seconds', ['cluster', 'job']) + + utils.histogramRules('cortex_querier_request_duration_seconds', ['cluster', 'job', 'route']) + + utils.histogramRules('cortex_querier_request_duration_seconds', ['cluster', 'namespace', 'job', 'route']), + }, + { + name: 'cortex_cache', + rules: + utils.histogramRules('cortex_memcache_request_duration_seconds', ['cluster', 'job', 'method']) + + utils.histogramRules('cortex_cache_request_duration_seconds', ['cluster', 'job']) + + utils.histogramRules('cortex_cache_request_duration_seconds', ['cluster', 'job', 
'method']),
+    },
+    {
+      name: 'cortex_storage',
+      rules:
+        utils.histogramRules('cortex_bigtable_request_duration_seconds', ['cluster', 'job', 'operation']) +
+        utils.histogramRules('cortex_cassandra_request_duration_seconds', ['cluster', 'job', 'operation']) +
+        utils.histogramRules('cortex_dynamo_request_duration_seconds', ['cluster', 'job', 'operation']) +
+        utils.histogramRules('cortex_chunk_store_index_lookups_per_query', ['cluster', 'job']) +
+        utils.histogramRules('cortex_chunk_store_series_pre_intersection_per_query', ['cluster', 'job']) +
+        utils.histogramRules('cortex_chunk_store_series_post_intersection_per_query', ['cluster', 'job']) +
+        utils.histogramRules('cortex_chunk_store_chunks_per_query', ['cluster', 'job']) +
+        utils.histogramRules('cortex_database_request_duration_seconds', ['cluster', 'job', 'method']) +
+        utils.histogramRules('cortex_gcs_request_duration_seconds', ['cluster', 'job', 'operation']) +
+        utils.histogramRules('cortex_kv_request_duration_seconds', ['cluster', 'job']),
+    },
+    {
+      name: 'cortex_queries',
+      rules:
+        utils.histogramRules('cortex_query_frontend_retries', ['cluster', 'job']) +
+        utils.histogramRules('cortex_query_frontend_queue_duration_seconds', ['cluster', 'job']) +
+        utils.histogramRules('cortex_ingester_queried_series', ['cluster', 'job']) +
+        utils.histogramRules('cortex_ingester_queried_chunks', ['cluster', 'job']) +
+        utils.histogramRules('cortex_ingester_queried_samples', ['cluster', 'job']),
+    },
+    {
+      name: 'cortex_received_samples',
+      rules: [
+        {
+          record: '%(group_prefix_jobs)s:cortex_distributor_received_samples:rate5m' % _config,
+          expr: |||
+            sum by (%(group_by_job)s) (rate(cortex_distributor_received_samples_total[5m]))
+          ||| % _config,
+        },
+      ],
+    },
+    {
+      name: 'cortex_scaling_rules',
+      rules: [
+        {
+          // Convenience rule to get the number of replicas for both a deployment and a statefulset.
+          // Multi-zone deployments are grouped together removing the "zone-X" suffix.
+          record: 'cluster_namespace_deployment:actual_replicas:count',
+          expr: |||
+            sum by (cluster, namespace, deployment) (
+              label_replace(
+                kube_deployment_spec_replicas,
+                # The question mark in "(.*?)" is used to make it non-greedy, otherwise it
+                # always matches everything and the (optional) zone is not removed.
+                "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"
+              )
+            )
+            or
+            sum by (cluster, namespace, deployment) (
+              label_replace(kube_statefulset_replicas, "deployment", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?")
+            )
+          |||,
+        },
+        {
+          // Distributors should be able to deal with 240k samples/s.
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'distributor',
+            reason: 'sample_rate',
+          },
+          expr: |||
+            ceil(
+              quantile_over_time(0.99,
+                sum by (cluster, namespace) (
+                  %(group_prefix_jobs)s:cortex_distributor_received_samples:rate5m
+                )[24h:]
+              )
+              / %(max_samples_per_sec_per_distributor)s
+            )
+          ||| % _config,
+        },
+        {
+          // We should be able to cover 80% of our limits,
+          // and each ingester can have 80k samples/s.
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'distributor',
+            reason: 'sample_rate_limits',
+          },
+          expr: |||
+            ceil(
+              sum by (cluster, namespace) (%(cortex_overrides_metric)s{limit_name="ingestion_rate"})
+              * %(limit_utilisation_target)s / %(max_samples_per_sec_per_distributor)s
+            )
+          ||| % _config,
+        },
+        {
+          // We want each ingester to deal with 80k samples/s.
+          // NB we measure this at the distributors and multiply by RF (3).
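+          // For example, if the 99th percentile of the samples/s received across the
+          // distributors over the last 24h were 1.2M (an illustrative figure), this rule
+          // would ask for ceil(1.2e6 * 3 / 80e3) = 45 ingesters.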
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'ingester',
+            reason: 'sample_rate',
+          },
+          expr: |||
+            ceil(
+              quantile_over_time(0.99,
+                sum by (cluster, namespace) (
+                  %(group_prefix_jobs)s:cortex_distributor_received_samples:rate5m
+                )[24h:]
+              )
+              * 3 / %(max_samples_per_sec_per_ingester)s
+            )
+          ||| % _config,
+        },
+        {
+          // Ingester should have 1.5M series in memory
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'ingester',
+            reason: 'active_series',
+          },
+          expr: |||
+            ceil(
+              quantile_over_time(0.99,
+                sum by(cluster, namespace) (
+                  cortex_ingester_memory_series
+                )[24h:]
+              )
+              / %(max_series_per_ingester)s
+            )
+          ||| % _config,
+        },
+        {
+          // We should be able to cover 60% of our limits,
+          // and each ingester can have 1.5M series in memory
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'ingester',
+            reason: 'active_series_limits',
+          },
+          expr: |||
+            ceil(
+              sum by (cluster, namespace) (%(cortex_overrides_metric)s{limit_name="max_global_series_per_user"})
+              * 3 * %(limit_utilisation_target)s / %(max_series_per_ingester)s
+            )
+          ||| % _config,
+        },
+        {
+          // We should be able to cover 60% of our limits,
+          // and each ingester can have 80k samples/s.
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'ingester',
+            reason: 'sample_rate_limits',
+          },
+          expr: |||
+            ceil(
+              sum by (cluster, namespace) (%(cortex_overrides_metric)s{limit_name="ingestion_rate"})
+              * %(limit_utilisation_target)s / %(max_samples_per_sec_per_ingester)s
+            )
+          ||| % _config,
+        },
+        {
+          // Ingesters store 96h of data on disk - we want memcached to store 1/4 of that.
+          record: 'cluster_namespace_deployment_reason:required_replicas:count',
+          labels: {
+            deployment: 'memcached',
+            reason: 'active_series',
+          },
+          expr: |||
+            ceil(
+              (sum by (cluster, namespace) (
+                cortex_ingester_tsdb_storage_blocks_bytes{job=~".+/ingester.*"}
+              ) / 4)
+                /
+              avg by (cluster, namespace) (
+                memcached_limit_bytes{job=~".+/memcached"}
+              )
+            )
+          |||,
+        },
+        {
+          // Convenience rule to get the CPU utilization for both a deployment and a statefulset.
+          // Multi-zone deployments are grouped together removing the "zone-X" suffix.
+          record: 'cluster_namespace_deployment:container_cpu_usage_seconds_total:sum_rate',
+          expr: |||
+            sum by (cluster, namespace, deployment) (
+              label_replace(
+                label_replace(
+                  node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate,
+                  "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
+                ),
+                # The question mark in "(.*?)" is used to make it non-greedy, otherwise it
+                # always matches everything and the (optional) zone is not removed.
+                "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"
+              )
+            )
+          |||,
+        },
+        {
+          // Convenience rule to get the CPU request for both a deployment and a statefulset.
+          // Multi-zone deployments are grouped together removing the "zone-X" suffix.
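+          // For example, pods "ingester-zone-a-0" and "ingester-zone-b-1" (illustrative
+          // names) are both aggregated under deployment="ingester" by the label_replace()
+          // calls below.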
+ record: 'cluster_namespace_deployment:kube_pod_container_resource_requests_cpu_cores:sum', + expr: ||| + # This recording rule is made compatible with the breaking changes introduced in kube-state-metrics v2 + # that remove resource metrics, ref: + # - https://github.com/kubernetes/kube-state-metrics/blob/master/CHANGELOG.md#v200-alpha--2020-09-16 + # - https://github.com/kubernetes/kube-state-metrics/pull/1004 + # + # This is the old expression, compatible with kube-state-metrics < v2.0.0, + # where kube_pod_container_resource_requests_cpu_cores was removed: + ( + sum by (cluster, namespace, deployment) ( + label_replace( + label_replace( + kube_pod_container_resource_requests_cpu_cores, + "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))" + ), + # The question mark in "(.*?)" is used to make it non-greedy, otherwise it + # always matches everything and the (optional) zone is not removed. + "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + ) + or + # This expression is compatible with kube-state-metrics >= v1.4.0, + # where kube_pod_container_resource_requests was introduced. + ( + sum by (cluster, namespace, deployment) ( + label_replace( + label_replace( + kube_pod_container_resource_requests{resource="cpu"}, + "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))" + ), + # The question mark in "(.*?)" is used to make it non-greedy, otherwise it + # always matches everything and the (optional) zone is not removed. + "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + ) + |||, + }, + { + // Jobs should be sized to their CPU usage. + // We do this by comparing 99th percentile usage over the last 24hrs to + // their current provisioned #replicas and resource requests. + record: 'cluster_namespace_deployment_reason:required_replicas:count', + labels: { + reason: 'cpu_usage', + }, + expr: ||| + ceil( + cluster_namespace_deployment:actual_replicas:count + * + quantile_over_time(0.99, cluster_namespace_deployment:container_cpu_usage_seconds_total:sum_rate[24h]) + / + cluster_namespace_deployment:kube_pod_container_resource_requests_cpu_cores:sum + ) + |||, + }, + { + // Convenience rule to get the Memory utilization for both a deployment and a statefulset. + // Multi-zone deployments are grouped together removing the "zone-X" suffix. + record: 'cluster_namespace_deployment:container_memory_usage_bytes:sum', + expr: ||| + sum by (cluster, namespace, deployment) ( + label_replace( + label_replace( + container_memory_usage_bytes, + "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))" + ), + # The question mark in "(.*?)" is used to make it non-greedy, otherwise it + # always matches everything and the (optional) zone is not removed. + "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + |||, + }, + { + // Convenience rule to get the Memory request for both a deployment and a statefulset. + // Multi-zone deployments are grouped together removing the "zone-X" suffix. 
+ record: 'cluster_namespace_deployment:kube_pod_container_resource_requests_memory_bytes:sum', + expr: ||| + # This recording rule is made compatible with the breaking changes introduced in kube-state-metrics v2 + # that remove resource metrics, ref: + # - https://github.com/kubernetes/kube-state-metrics/blob/master/CHANGELOG.md#v200-alpha--2020-09-16 + # - https://github.com/kubernetes/kube-state-metrics/pull/1004 + # + # This is the old expression, compatible with kube-state-metrics < v2.0.0, + # where kube_pod_container_resource_requests_memory_bytes was removed: + ( + sum by (cluster, namespace, deployment) ( + label_replace( + label_replace( + kube_pod_container_resource_requests_memory_bytes, + "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))" + ), + # The question mark in "(.*?)" is used to make it non-greedy, otherwise it + # always matches everything and the (optional) zone is not removed. + "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + ) + or + # This expression is compatible with kube-state-metrics >= v1.4.0, + # where kube_pod_container_resource_requests was introduced. + ( + sum by (cluster, namespace, deployment) ( + label_replace( + label_replace( + kube_pod_container_resource_requests{resource="memory"}, + "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))" + ), + # The question mark in "(.*?)" is used to make it non-greedy, otherwise it + # always matches everything and the (optional) zone is not removed. + "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?" + ) + ) + ) + |||, + }, + { + // Jobs should be sized to their Memory usage. + // We do this by comparing 99th percentile usage over the last 24hrs to + // their current provisioned #replicas and resource requests. + record: 'cluster_namespace_deployment_reason:required_replicas:count', + labels: { + reason: 'memory_usage', + }, + expr: ||| + ceil( + cluster_namespace_deployment:actual_replicas:count + * + quantile_over_time(0.99, cluster_namespace_deployment:container_memory_usage_bytes:sum[24h]) + / + cluster_namespace_deployment:kube_pod_container_resource_requests_memory_bytes:sum + ) + |||, + }, + ], + }, + { + name: 'cortex_alertmanager_rules', + rules: [ + // Aggregations of per-user Alertmanager metrics used in dashboards. 
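+        // For example, with per_instance_label set to 'pod' (a typical default in this
+        // mixin's config), the first rule below is recorded as
+        // "cluster_job_pod:cortex_alertmanager_alerts:sum".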
+ { + record: 'cluster_job_%s:cortex_alertmanager_alerts:sum' % $._config.per_instance_label, + expr: ||| + sum by (cluster, job, %s) (cortex_alertmanager_alerts) + ||| % $._config.per_instance_label, + }, + { + record: 'cluster_job_%s:cortex_alertmanager_silences:sum' % $._config.per_instance_label, + expr: ||| + sum by (cluster, job, %s) (cortex_alertmanager_silences) + ||| % $._config.per_instance_label, + }, + { + record: 'cluster_job:cortex_alertmanager_alerts_received_total:rate5m', + expr: ||| + sum by (cluster, job) (rate(cortex_alertmanager_alerts_received_total[5m])) + |||, + }, + { + record: 'cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m', + expr: ||| + sum by (cluster, job) (rate(cortex_alertmanager_alerts_invalid_total[5m])) + |||, + }, + { + record: 'cluster_job_integration:cortex_alertmanager_notifications_total:rate5m', + expr: ||| + sum by (cluster, job, integration) (rate(cortex_alertmanager_notifications_total[5m])) + |||, + }, + { + record: 'cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m', + expr: ||| + sum by (cluster, job, integration) (rate(cortex_alertmanager_notifications_failed_total[5m])) + |||, + }, + { + record: 'cluster_job:cortex_alertmanager_state_replication_total:rate5m', + expr: ||| + sum by (cluster, job) (rate(cortex_alertmanager_state_replication_total[5m])) + |||, + }, + { + record: 'cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m', + expr: ||| + sum by (cluster, job) (rate(cortex_alertmanager_state_replication_failed_total[5m])) + |||, + }, + { + record: 'cluster_job:cortex_alertmanager_partial_state_merges_total:rate5m', + expr: ||| + sum by (cluster, job) (rate(cortex_alertmanager_partial_state_merges_total[5m])) + |||, + }, + { + record: 'cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m', + expr: ||| + sum by (cluster, job) (rate(cortex_alertmanager_partial_state_merges_failed_total[5m])) + |||, + }, + ], + }, + ], + }, +} diff --git a/operations/mimir-mixin/scripts/lint-playbooks.sh b/operations/mimir-mixin/scripts/lint-playbooks.sh new file mode 100755 index 00000000000..7aa92122ab4 --- /dev/null +++ b/operations/mimir-mixin/scripts/lint-playbooks.sh @@ -0,0 +1,28 @@ +#!/usr/bin/env bash + +set -eu -o pipefail + +SCRIPT_DIR=$(realpath "$(dirname "${0}")") + +# List all alerts. +ALERTS=$(yq eval '.groups.[].rules.[].alert' "${SCRIPT_DIR}/../out/alerts.yaml" 2> /dev/stdout) +if [ $? -ne 0 ]; then + echo "Unable to list alerts. Got output:" + echo "$ALERTS" + exit 1 +elif [ -z "$ALERTS" ]; then + echo "No alerts found. Something went wrong with the listing." + exit 1 +fi + +# Check if each alert is referenced in the playbooks. +STATUS=0 + +for ALERT in $ALERTS; do + if ! grep -q "${ALERT}$" "${SCRIPT_DIR}/../docs/playbooks.md"; then + echo "Missing playbook for: $ALERT" + STATUS=1 + fi +done + +exit $STATUS