frequent msg="error CASing, trying again" error messages #897

Closed
BertHartm opened this issue Jul 23, 2018 · 4 comments
Comments

@BertHartm
Contributor

I'm seeing the ingesters complaining a lot about not being able to CAS, which I think may be the cause of some other state transition problems, but I'm not certain.

I think based on the ring description (https://github.com/weaveworks/cortex/blob/master/pkg/ring/model.go#L24) it should be possible to give each ingester its own key instead of having one key that they all try to share, which should mostly (entirely?) eliminate the issue.

I'm curious about feedback before I attempt it.

I'm also curious whether, given the discussion in #157, there might be a simpler approach to attempt instead of altering the consul interface.
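
For reference, the whole ring currently lives under a single Consul key whose value looks roughly like this (a simplified sketch of the types behind pkg/ring/model.go; field names are approximate, not the exact protobuf-generated code):

```go
// Simplified sketch of the single shared ring value; the real types are
// protobuf-generated in pkg/ring (ring.proto / model.go).
type Desc struct {
	Ingesters map[string]IngesterDesc // every ingester, keyed by ID
	Tokens    []TokenDesc             // every token in the ring, across all ingesters
}

type IngesterDesc struct {
	Addr      string // host:port of the ingester
	Timestamp int64  // last heartbeat, refreshed by CASing this whole value
	State     int32  // e.g. ACTIVE, LEAVING
}

type TokenDesc struct {
	Token    uint32 // position on the hash ring
	Ingester string // ID of the owning ingester
}
```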

@tomwilkie
Contributor

> I think based on the ring description (https://github.com/weaveworks/cortex/blob/master/pkg/ring/model.go#L24) it should be possible to give each ingester its own key instead of having one key that they all try to share,

We use a single value to store the entire ring to ensure we can pick unique tokens when we bootstrap new ingesters. With multiple values (one per ingester) there would be no way to ensure we can pick tokens atomically AFAIK.

Are you seeing high CPU usage from consul? How many consuls are you running? Given the data in consul is ephemeral, you can actually get away with running a single one - most of the problems we've had are to do with consul clustering anyway.
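
To make that concrete, here's a minimal self-contained sketch (not the actual Cortex code) of why picking tokens needs one consistent snapshot of every ingester's tokens; with the ring split across per-ingester keys there would be no single value to check collisions against:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickUniqueTokens chooses n tokens that don't collide with any token
// already in the ring. It can only guarantee uniqueness because it sees
// the entire ring (every ingester's tokens) in one consistent snapshot.
func pickUniqueTokens(n int, ring map[string][]uint32) []uint32 {
	taken := map[uint32]bool{}
	for _, tokens := range ring {
		for _, t := range tokens {
			taken[t] = true
		}
	}
	out := make([]uint32, 0, n)
	for len(out) < n {
		t := rand.Uint32()
		if !taken[t] {
			taken[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Toy ring: ingester ID -> its tokens.
	ring := map[string][]uint32{
		"ingester-1": {10, 2000, 300000},
		"ingester-2": {42, 77777},
	}
	fmt.Println(pickUniqueTokens(4, ring))
}
```

The CAS on the single ring key is what makes the read-pick-write sequence atomic: if anything else wrote the key in between, the write fails and the whole operation is retried, which is where the "error CASing, trying again" message comes from.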

@BertHartm
Contributor Author

I'm using an existing consul cluster, so ~3 dedicated instances. CPU runs fairly high, and it's not really feasible to distinguish how much of that comes from cortex and how much doesn't.

I can try running a single dedicated consul instance and see if that helps.

I see a fair number of failures that seem to be related to heartbeats rather than to any change that should alter the token distribution. Perhaps the heartbeat logic could be pulled out so that it doesn't conflict with token distribution and state changes? I'm not sure where the heartbeat is consumed, but moving it to a consul health check, or reporting it directly to the distributor (or whichever other component needs it), may make sense.
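
To illustrate the contention, here's a small self-contained sketch using an in-memory stand-in for Consul's check-and-set (not Cortex code): each simulated ingester only heartbeats its own entry, but because everything lives under one key the writes still race and force retries:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// kv is an in-memory stand-in for Consul's check-and-set semantics:
// a write only succeeds if the caller saw the current modify index.
type kv struct {
	mu    sync.Mutex
	value map[string]int64 // ingester ID -> heartbeat, standing in for the ring value
	index uint64
}

func (s *kv) get() (map[string]int64, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	copied := make(map[string]int64, len(s.value))
	for k, v := range s.value {
		copied[k] = v
	}
	return copied, s.index
}

func (s *kv) cas(val map[string]int64, index uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if index != s.index {
		return false // someone else wrote the key first
	}
	s.value = val
	s.index++
	return true
}

func main() {
	store := &kv{value: map[string]int64{}}
	var retries int64
	var wg sync.WaitGroup

	// Three "ingesters" heartbeat concurrently; each only touches its own
	// entry, but because everything lives under one key they still conflict.
	for _, id := range []string{"ingester-1", "ingester-2", "ingester-3"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			for beat := 0; beat < 200; beat++ {
				for {
					ring, idx := store.get()
					ring[id] = int64(beat)
					if store.cas(ring, idx) {
						break
					}
					atomic.AddInt64(&retries, 1) // the "error CASing, trying again" case
				}
			}
		}(id)
	}
	wg.Wait()
	fmt.Println("CAS retries caused purely by heartbeats:", retries)
}
```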

@bboreham
Contributor

Has this gone any further? Otherwise can we close?

@BertHartm
Contributor Author

Happy to close
