Add playbook entry for CortexGossipMembersMismatch. #356

Merged on Jul 15, 2021 (2 commits)
26 changes: 25 additions & 1 deletion cortex-mixin/docs/playbooks.md
@@ -572,7 +572,31 @@ How to **fix**:

### CortexGossipMembersMismatch

_TODO: this playbook has not been written yet._
This alert fires when any instance does not register all other instances as members of the memberlist cluster.

How it **works**:
- This alert applies when memberlist is used for the ring backing store.
- All Cortex instances, regardless of type, join a single memberlist cluster.
- Each instance (that is, each memberlist cluster member) should be able to see all others.
- Therefore the following should be equal for every instance:
  - The reported number of cluster members (`memberlist_client_cluster_members_count`)
  - The total number of currently responsive instances.
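
The equality described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the alert's actual rule expression; the function name, the input mapping, and the sample instance names are all hypothetical.

```python
def instances_with_mismatch(members_seen, responsive_instances):
    """Return the instances whose reported member count differs from the expected total.

    members_seen: mapping of instance name -> the value that instance reports for
        memberlist_client_cluster_members_count.
    responsive_instances: number of instances currently responding (an input
        assumed to be known for this sketch).
    """
    return sorted(
        name
        for name, count in members_seen.items()
        if count != responsive_instances
    )


# Hypothetical sample: three instances, one of which only sees two members.
sample = {"ingester-0": 3, "ingester-1": 3, "querier-0": 2}
print(instances_with_mismatch(sample, responsive_instances=len(sample)))
```

An empty result means every instance sees the whole cluster; any name returned corresponds to an instance the alert would point at.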

How to **investigate**:
- The instance which has the incomplete view of the cluster (too few members) is specified in the alert.
- If the count is zero:
  - It is possible that joining the cluster has yet to succeed.
  - The following log message indicates that the _initial_ join did not succeed: `failed to join memberlist cluster`
  - The following log message indicates that subsequent re-join attempts are failing: `re-joining memberlist cluster failed`
  - If the initial join failed, take action according to the reason given in the log message.
- Verify communication with other members by checking that memberlist traffic is being sent and received by the instance using the following metrics:
  - `memberlist_tcp_transport_packets_received_total`
  - `memberlist_tcp_transport_packets_sent_total`
- If traffic is present, then verify there are no errors sending or receiving packets using the following metrics:
  - `memberlist_tcp_transport_packets_sent_errors_total`
  - `memberlist_tcp_transport_packets_received_errors_total`
  - These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
- Logs coming directly from memberlist are also logged by Cortex; they may indicate where to investigate further. These can be identified by the tag `caller=memberlist_logger.go:xyz`.
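
The investigation steps above could be turned into a small triage helper along the following lines. This is a rough sketch: the log substrings and metric names come from the playbook, while the function names, the labels, and the idea of passing counter increases over a time window as plain numbers are assumptions made for illustration.

```python
# Log substrings quoted in the playbook, mapped to labels invented for this sketch.
LOG_HINTS = [
    ("failed to join memberlist cluster", "initial join failed"),
    ("re-joining memberlist cluster failed", "re-join failed"),
    ("TCPTransport:", "transport error"),
    ("caller=memberlist_logger.go", "memberlist library log"),
]


def classify_log_line(line):
    """Return the label of the first hint found in the line, or None."""
    for needle, label in LOG_HINTS:
        if needle in line:
            return label
    return None


def traffic_status(sent, received, sent_errors, received_errors):
    """Summarise increases of the memberlist_tcp_transport_* counters over a window.

    sent / received: increase of ..._packets_sent_total / ..._packets_received_total.
    sent_errors / received_errors: increase of the matching ..._errors_total counters.
    """
    if sent == 0 or received == 0:
        return "traffic missing"
    if sent_errors > 0 or received_errors > 0:
        return "traffic with errors"
    return "traffic ok"


# Hypothetical inputs; real log lines will carry more fields.
print(classify_log_line('level=warn msg="re-joining memberlist cluster failed"'))
print(traffic_status(sent=120, received=0, sent_errors=0, received_errors=0))
```

A status of "traffic missing" suggests checking connectivity to other members first, while "traffic with errors" suggests searching the logs for the `TCPTransport:` prefix.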

### EtcdAllocatingTooMuchMemory
