Come up with a better scheme for dealing with consul outages #157

Closed
tomwilkie opened this issue Nov 24, 2016 · 8 comments
Labels
help wanted · type/production (Issues related to the production use of Cortex, inc. configuration, alerting and operating.)

Comments

@tomwilkie
Contributor

We should be able to operate if consul goes away. This probably means heartbeating the ingesters ourselves, or something like that.

@bboreham
Contributor

I was thinking about this - in a Kubernetes install we could just store the ring as an object in the api-server, and forget Consul.
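
For illustration, a minimal sketch of that idea (hypothetical, not Cortex code), assuming client-go, an in-cluster config, and a `cortex` namespace / `ingester-ring` ConfigMap name that are made up here; the api-server's resourceVersion would provide the compare-and-swap semantics Consul gives us today:

```go
// Hypothetical sketch: store the serialized ring in a ConfigMap and rely on
// the api-server's optimistic concurrency (resourceVersion) for the
// consistent view that Consul's CAS provides today.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	cms := client.CoreV1().ConfigMaps("cortex") // namespace name made up

	// Read the current ring. The resourceVersion carried on the object makes
	// the Update below fail if someone else modified the ring in between,
	// i.e. a compare-and-swap, so callers keep a consistent view.
	cm, err := cms.Get(ctx, "ingester-ring", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["ring"] = `{"ingesters":{}}` // the ring, serialized however we like

	if _, err := cms.Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err) // a conflict here means: re-Get and retry
	}
}
```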

@bboreham
Contributor

I imagine the benefit of Consul (or equivalent) is that all callers have a consistent view of the world. If callers have an inconsistent view then queries can go to ingesters that don't have all the samples.

So what does "heartbeating the ingesters ourselves" mean? "Use a last-known-consistent copy of the ring, and check that heartbeats from ingesters are still consistent with that"?

@tomwilkie
Contributor Author

Consul is the SPOF for a Cortex cluster, and most of the time we don't actively need its coordination - the ring only changes when we add and remove nodes.

Except we use Consul for heartbeats, to stop sending reads/writes to dead ingesters. So this issue is about replacing that heartbeat mechanism with an alternative, p2p one. Then Cortex should be able to survive long Consul outages without problems.

@bboreham
Contributor

OK, so "Use a last-known-consistent copy of the ring, make heartbeat calls directly to ingesters in that ring"?

@kfox1111

+1 to seeing if a Kubernetes CRD would be sufficient. Much lower administration cost when already running in k8s.

@rverma-nikiai

rverma-nikiai commented Jun 9, 2019

+1 for a Kubernetes CRD; as an alternative we could use DynamoDB as well.

@gouthamve added the type/production label Sep 18, 2019
@gouthamve
Contributor

One solution: we already have healthchecks / heartbeats from the distributors to the ingesters. If Consul is down, use that healthcheck info to coast until it is back up.
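
As a rough illustration of the coasting idea (a sketch only, not how Cortex actually implements it): a distributor could keep the last ring it successfully read from Consul and probe the ingesters in it directly with gRPC health checks, so reads/writes can keep flowing during a Consul outage. The addresses and data structure below are hypothetical:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
)

// cachedRing is the last-known-consistent copy of the ring, plus liveness
// learned from direct heartbeats rather than from Consul.
type cachedRing struct {
	mu        sync.RWMutex
	ingesters []string             // ingester addresses from the cached ring
	lastSeen  map[string]time.Time // last successful direct health check
}

// heartbeat probes every ingester in the cached ring. It needs no KV store,
// so it keeps working through a Consul outage.
func (r *cachedRing) heartbeat(ctx context.Context) {
	r.mu.RLock()
	addrs := append([]string(nil), r.ingesters...)
	r.mu.RUnlock()

	for _, addr := range addrs {
		conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			log.Printf("dial %s: %v", addr, err)
			continue
		}
		resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		conn.Close()
		if err == nil && resp.Status == grpc_health_v1.HealthCheckResponse_SERVING {
			r.mu.Lock()
			r.lastSeen[addr] = time.Now()
			r.mu.Unlock()
		}
	}
}

// healthyIngesters returns the subset of the cached ring heartbeated within
// the timeout; during a Consul outage this is what writes/queries would use.
func (r *cachedRing) healthyIngesters(timeout time.Duration) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []string
	for _, addr := range r.ingesters {
		if time.Since(r.lastSeen[addr]) < timeout {
			out = append(out, addr)
		}
	}
	return out
}

func main() {
	ring := &cachedRing{
		ingesters: []string{"ingester-0:9095", "ingester-1:9095"}, // hypothetical addresses
		lastSeen:  map[string]time.Time{},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	ring.heartbeat(ctx)
	log.Println("healthy:", ring.healthyIngesters(time.Minute))
}
```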

@friedrichg
Member

With memberlist and DynamoDB support we can close this as done. It is no longer necessary to run a single Consul.
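
For reference, this is roughly what pointing the ingester ring at the gossip-based memberlist KV store instead of Consul looks like in the Cortex YAML config (a sketch; the gossip service DNS name is hypothetical):

```yaml
# Sketch: store the hash ring via memberlist (gossip) instead of Consul.
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist

memberlist:
  join_members:
    # Hypothetical headless service resolving to all pods in the gossip ring.
    - gossip-ring.cortex.svc.cluster.local:7946
```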
