Come up with a better scheme for dealing with consul outages #157
Comments
I was thinking about this - in a Kubernetes install we could just store the ring as an object in the api-server, and forget Consul.
I imagine the benefit of Consul (or equivalent) is that all callers have a consistent view of the world. If callers have an inconsistent view then queries can go to ingesters that don't have all the samples. So what does "heartbeating the ingesters ourselves" mean? "Use a last-known-consistent copy of the ring, and check heartbeats from ingesters are still consistent with that"?
Consul is the SPOF for a Cortex cluster, and most of the time we don't actively need its coordination: the ring only changes when we add and remove nodes. Except we also use Consul for heartbeats, to stop sending reads/writes to dead ingesters. So this issue is about replacing that heartbeat mechanism with an alternative, peer-to-peer one. Then Cortex should be able to survive long Consul outages without problems.
OK, so "Use a last-known-consistent copy of the ring, make heartbeat calls directly to ingesters in that ring"? |
+1 to seeing if a Kubernetes CRD would be sufficient. Much lower administration cost when already running in k8s.
+1 for a Kubernetes CRD; as an alternative we could use DynamoDB as well.
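To make the api-server idea above concrete, here is a minimal sketch of storing a serialized ring in a ConfigMap via client-go. This is purely illustrative: the namespace, ConfigMap name, and "ring" key are made up, and it is not how Cortex actually implements its KV stores.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readRing fetches the serialized ring from a ConfigMap (names are placeholders).
func readRing(ctx context.Context, client kubernetes.Interface) (string, error) {
	cm, err := client.CoreV1().ConfigMaps("cortex").Get(ctx, "ingester-ring", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return cm.Data["ring"], nil
}

// writeRing does a read-modify-write of the ring. The api-server rejects the
// Update if the ConfigMap changed since we read it (stale resourceVersion),
// which gives roughly the compare-and-swap semantics ring updates need.
func writeRing(ctx context.Context, client kubernetes.Interface, serialized string) error {
	cms := client.CoreV1().ConfigMaps("cortex")
	cm, err := cms.Get(ctx, "ingester-ring", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["ring"] = serialized
	_, err = cms.Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ring, err := readRing(context.Background(), client)
	if err != nil {
		panic(err)
	}
	fmt.Println("current ring:", ring)
}
```

The attraction is that the api-server is already highly available in most Kubernetes installs, so the ring store adds no new single point of failure.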
One solution: we already have health checks / heartbeats from the distributors to the ingesters. If Consul is down, use that health-check info to coast until it comes back up.
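A hypothetical sketch of that coasting behaviour in a distributor, using simplified stand-ins for the ring, KV client, and health check (none of these names are Cortex's actual API):

```go
package ring

import (
	"context"
	"sync"
	"time"
)

// IngesterDesc and Ring are simplified stand-ins for Cortex's ring types.
type IngesterDesc struct {
	Addr string
}

type Ring struct {
	Ingesters map[string]IngesterDesc
}

// KVClient abstracts the Consul client (hypothetical interface).
type KVClient interface {
	Get(ctx context.Context, key string) (*Ring, error)
}

// HealthChecker pings an ingester directly, e.g. over gRPC (hypothetical).
type HealthChecker func(ctx context.Context, addr string) error

// CachedRing serves the last ring successfully read from Consul and, while
// Consul is unreachable, coasts on that copy filtered by direct health checks.
type CachedRing struct {
	mu       sync.RWMutex
	last     *Ring
	kv       KVClient
	check    HealthChecker
	interval time.Duration
}

func (c *CachedRing) loop(ctx context.Context) {
	ticker := time.NewTicker(c.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if ring, err := c.kv.Get(ctx, "ring"); err == nil {
				// Consul is healthy: trust its view of the world.
				c.mu.Lock()
				c.last = ring
				c.mu.Unlock()
				continue
			}
			// Consul is unreachable: keep using the cached ring, but drop
			// ingesters that fail a direct health check.
			c.mu.RLock()
			last := c.last
			c.mu.RUnlock()
			if last == nil {
				continue
			}
			healthy := make(map[string]IngesterDesc, len(last.Ingesters))
			for id, ing := range last.Ingesters {
				if c.check(ctx, ing.Addr) == nil {
					healthy[id] = ing
				}
			}
			c.mu.Lock()
			c.last = &Ring{Ingesters: healthy}
			c.mu.Unlock()
		}
	}
}
```

The trade-off is that during an outage the ring cannot change (no scale-up or scale-down is observed), but reads and writes keep flowing to ingesters that are demonstrably alive.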
With memberlist and DynamoDB support we can close this as done. There is no longer any need to run a single Consul.
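For reference, the memberlist route replaces the central heartbeat path with gossip between the Cortex processes themselves. Below is a minimal standalone sketch using hashicorp/memberlist; the peer addresses are placeholders, and Cortex integrates memberlist as a KV store implementation rather than calling the library directly like this.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/memberlist"
)

func main() {
	// DefaultLANConfig sets sane gossip intervals and uses the hostname as
	// the node name; each Cortex process would run one of these.
	cfg := memberlist.DefaultLANConfig()

	list, err := memberlist.Create(cfg)
	if err != nil {
		panic(err)
	}

	// Join the gossip cluster via any known peers (placeholder addresses).
	if _, err := list.Join([]string{"ingester-1:7946", "ingester-2:7946"}); err != nil {
		panic(err)
	}

	// Every member learns about joins, leaves, and failures via gossip,
	// so there is no central store in the read/write critical path.
	for _, m := range list.Members() {
		fmt.Printf("alive: %s (%s)\n", m.Name, m.Addr)
	}
}
```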
Original issue description: We should be able to operate if Consul goes away. This probably means heartbeating the ingesters ourselves, or something like that.