Come up with a better scheme for dealing with consul outages #157

Closed
tomwilkie opened this issue Nov 24, 2016 · 8 comments
Labels
help wanted · type/production (Issues related to the production use of Cortex, inc. configuration, alerting and operating.)

Comments

@tomwilkie
Contributor

We should be able to operate if consul goes away. This probably means heartbeating the ingesters ourselves, or something like that.

@bboreham
Contributor

I was thinking about this - in a Kubernetes install we could just store the ring as an object in the api-server, and forget Consul.
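
For illustration, a minimal sketch of that idea (hypothetical, not Cortex code), assuming client-go, an in-cluster config, and a `cortex` namespace / `ingester-ring` ConfigMap name that are made up here; the api-server's resourceVersion would provide the compare-and-swap semantics Consul gives us today:

```go
// Hypothetical sketch: store the serialized ring in a ConfigMap and rely on
// the api-server's optimistic concurrency (resourceVersion) for the
// consistent view that Consul's CAS provides today.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	cms := client.CoreV1().ConfigMaps("cortex") // namespace name made up

	// Read the current ring. The resourceVersion carried on the object makes
	// the Update below fail if someone else modified the ring in between,
	// i.e. a compare-and-swap, so callers keep a consistent view.
	cm, err := cms.Get(ctx, "ingester-ring", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["ring"] = `{"ingesters":{}}` // the ring, serialized however we like

	if _, err := cms.Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err) // a conflict here means: re-Get and retry
	}
}
```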

@bboreham
Contributor

I imagine the benefit of Consul (or equivalent) is that all callers have a consistent view of the world. If callers have an inconsistent view then queries can go to ingesters that don't have all the samples.

So what does "heartbeating the ingesters ourselves" mean? "Use a last-known-consistent copy of the ring, and check that heartbeats from ingesters are still consistent with that"?

@tomwilkie
Contributor Author

Consul is the SPOF for a Cortex cluster, and most of the time we don't actively need its coordination - the ring only changes when we add and remove nodes.

Except we use Consul for heartbeats, to stop sending reads/writes to dead ingesters. So this issue is about replacing that heartbeat mechanism with an alternative, p2p one. Then Cortex should be able to survive long Consul outages without problems.

@bboreham
Contributor

OK, so "Use a last-known-consistent copy of the ring, make heartbeat calls directly to ingesters in that ring"?

@kfox1111

+1 to seeing if a Kubernetes CRD would be sufficient. Much lower administration cost when already running in k8s.

@rverma-nikiai

rverma-nikiai commented Jun 9, 2019

+1 for a Kubernetes CRD; as an alternative we could use DynamoDB as well.

@gouthamve added the type/production label Sep 18, 2019
@gouthamve
Contributor

One solution: we already have healthchecks / heartbeats from the distributors to the ingesters. If Consul is down, use that healthcheck info to coast until it is back up.
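
As a rough illustration of the coasting idea (a sketch only, not how Cortex actually implements it): a distributor could keep the last ring it successfully read from Consul and probe the ingesters in it directly with gRPC health checks, so reads/writes can keep flowing during a Consul outage. The addresses and data structure below are hypothetical:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
)

// cachedRing is the last-known-consistent copy of the ring, plus liveness
// learned from direct heartbeats rather than from Consul.
type cachedRing struct {
	mu        sync.RWMutex
	ingesters []string             // ingester addresses from the cached ring
	lastSeen  map[string]time.Time // last successful direct health check
}

// heartbeat probes every ingester in the cached ring. It needs no KV store,
// so it keeps working through a Consul outage.
func (r *cachedRing) heartbeat(ctx context.Context) {
	r.mu.RLock()
	addrs := append([]string(nil), r.ingesters...)
	r.mu.RUnlock()

	for _, addr := range addrs {
		conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			log.Printf("dial %s: %v", addr, err)
			continue
		}
		resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		conn.Close()
		if err == nil && resp.Status == grpc_health_v1.HealthCheckResponse_SERVING {
			r.mu.Lock()
			r.lastSeen[addr] = time.Now()
			r.mu.Unlock()
		}
	}
}

// healthyIngesters returns the subset of the cached ring heartbeated within
// the timeout; during a Consul outage this is what writes/queries would use.
func (r *cachedRing) healthyIngesters(timeout time.Duration) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []string
	for _, addr := range r.ingesters {
		if time.Since(r.lastSeen[addr]) < timeout {
			out = append(out, addr)
		}
	}
	return out
}

func main() {
	ring := &cachedRing{
		ingesters: []string{"ingester-0:9095", "ingester-1:9095"}, // hypothetical addresses
		lastSeen:  map[string]time.Time{},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	ring.heartbeat(ctx)
	log.Println("healthy:", ring.healthyIngesters(time.Minute))
}
```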

@friedrichg
Member

With memberlist and DynamoDB support we can close this as done. It is no longer necessary to run a single Consul.
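
For reference, this is roughly what pointing the ingester ring at the gossip-based memberlist KV store instead of Consul looks like in the Cortex YAML config (a sketch; the gossip service DNS name is hypothetical):

```yaml
# Sketch: store the hash ring via memberlist (gossip) instead of Consul.
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist

memberlist:
  join_members:
    # Hypothetical headless service resolving to all pods in the gossip ring.
    - gossip-ring.cortex.svc.cluster.local:7946
```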
