From cf9774dc055b8cd80f8d2c4e6e371eada2bc6bf1 Mon Sep 17 00:00:00 2001
From: Steve Simpson
Date: Wed, 14 Jul 2021 16:07:37 +0200
Subject: [PATCH 1/2] Add playbook entry for CortexGossipMembersMismatch.

---
 cortex-mixin/docs/playbooks.md | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
index 769f9808..cecd400c 100644
--- a/cortex-mixin/docs/playbooks.md
+++ b/cortex-mixin/docs/playbooks.md
@@ -572,7 +572,31 @@ How to **fix**:
 
 ### CortexGossipMembersMismatch
 
-_TODO: this playbook has not been written yet._
+This alert fires when any instance does not register all other instances as members of the memberlist cluster.
+
+How it **works**:
+- This alert applies when memberlist is used for the ring backing store.
+- All Cortex instances, regardless of type, join the a single memberlist cluster.
+- Each instance (=memberlist cluster member) should be able to see all others.
+- Therefore the following should be equal for every instance:
+  - The reported number of cluster members (`memberlist_client_cluster_members_count`)
+  - The total number of currently responsive instances.
+
+How to **investigate**:
+- The instance which has the incomplete view of the cluster (too few members) is specified in the alert.
+- If the count is zero:
+  - It is possible that the joining the cluster has yet to succeed.
+  - The following log message indicates that the _initial_ initial join did not succeed: `failed to join memberlist cluster`
+  - The following log messages indicate that subsequent re-join attempts are failing: `re-joining memberlist cluster failed`
+  - If it is the case that the initial join failed, take action according to the reason given.
+- Verify communication with other members by checking memberlist traffic is being sent and received by the instance using the following metrics:
+  - `memberlist_tcp_transport_packets_received_total`
+  - `memberlist_tcp_transport_packets_sent_total`
+- If traffic is present, then verify there are no errors sending or receiving packets using the following metrics:
+  - `memberlist_tcp_transport_packets_sent_errors_total`
+  - `memberlist_tcp_transport_packets_received_errors_total`
+  - These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
+- Logs coming directly from memberlist are also logged by Cortex; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:xyz`.
 
 ### EtcdAllocatingTooMuchMemory
 

From 16e5d3bebbda0c517ab992163cc9178ba1b30c24 Mon Sep 17 00:00:00 2001
From: Steve Simpson
Date: Thu, 15 Jul 2021 08:56:38 +0200
Subject: [PATCH 2/2] Review comments.

---
 cortex-mixin/docs/playbooks.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
index cecd400c..78373901 100644
--- a/cortex-mixin/docs/playbooks.md
+++ b/cortex-mixin/docs/playbooks.md
@@ -576,7 +576,7 @@ This alert fires when any instance does not register all other instances as memb
 
 How it **works**:
 - This alert applies when memberlist is used for the ring backing store.
-- All Cortex instances, regardless of type, join the a single memberlist cluster.
+- All Cortex instances using the ring, regardless of type, join a single memberlist cluster.
 - Each instance (=memberlist cluster member) should be able to see all others.
 - Therefore the following should be equal for every instance:
   - The reported number of cluster members (`memberlist_client_cluster_members_count`)
@@ -587,7 +587,7 @@ How to **investigate**:
 - The instance which has the incomplete view of the cluster (too few members) is specified in the alert.
 - If the count is zero:
   - It is possible that the joining the cluster has yet to succeed.
   - The following log message indicates that the _initial_ initial join did not succeed: `failed to join memberlist cluster`
-  - The following log messages indicate that subsequent re-join attempts are failing: `re-joining memberlist cluster failed`
+  - The following log message indicates that subsequent re-join attempts are failing: `re-joining memberlist cluster failed`
   - If it is the case that the initial join failed, take action according to the reason given.
 - Verify communication with other members by checking memberlist traffic is being sent and received by the instance using the following metrics:
   - `memberlist_tcp_transport_packets_received_total`