frequent msg="error CASing, trying again" error messages #897

Closed
BertHartm opened this issue Jul 23, 2018 · 4 comments
Comments

@BertHartm
Contributor

I'm seeing the ingesters complaining a lot about not being able to CAS, which I think may be the cause of some other state transition problems, but I'm not certain.

I think based on the ring description (https://github.com/weaveworks/cortex/blob/master/pkg/ring/model.go#L24) it should be possible to give each ingester its own key instead of having one key that they all try to share, which should mostly (entirely?) eliminate the issue.

I'm curious about feedback before I attempt it.

I'm also curious whether, given the discussion in #157, there might be a simpler approach to attempt instead of altering the consul interface.
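
For reference, the whole ring currently lives under a single Consul key whose value looks roughly like this (a simplified sketch of the types behind pkg/ring/model.go; field names are approximate, not the exact protobuf-generated code):

```go
// Simplified sketch of the single shared ring value; the real types are
// protobuf-generated in pkg/ring (ring.proto / model.go).
type Desc struct {
	Ingesters map[string]IngesterDesc // every ingester, keyed by ID
	Tokens    []TokenDesc             // every token in the ring, across all ingesters
}

type IngesterDesc struct {
	Addr      string // host:port of the ingester
	Timestamp int64  // last heartbeat, refreshed by CASing this whole value
	State     int32  // e.g. ACTIVE, LEAVING
}

type TokenDesc struct {
	Token    uint32 // position on the hash ring
	Ingester string // ID of the owning ingester
}
```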

@tomwilkie
Contributor

> I think based on the ring description (https://github.com/weaveworks/cortex/blob/master/pkg/ring/model.go#L24) it should be possible to give each ingester its own key instead of having one key that they all try to share,

We use a single value to store the entire ring to ensure we can pick unique tokens when we bootstrap new ingesters. With multiple values (one per ingester) there would be no way to ensure we can pick tokens atomically AFAIK.

Are you seeing high CPU usage from consul? How many consuls are you running? Given the data in consul is ephemeral, you can actually get away with running a single one - most of the problems we've had are to do with consul clustering anyway.
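
To make that concrete, here's a minimal self-contained sketch (not the actual Cortex code) of why picking tokens needs one consistent snapshot of every ingester's tokens; with the ring split across per-ingester keys there would be no single value to check collisions against:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickUniqueTokens chooses n tokens that don't collide with any token
// already in the ring. It can only guarantee uniqueness because it sees
// the entire ring (every ingester's tokens) in one consistent snapshot.
func pickUniqueTokens(n int, ring map[string][]uint32) []uint32 {
	taken := map[uint32]bool{}
	for _, tokens := range ring {
		for _, t := range tokens {
			taken[t] = true
		}
	}
	out := make([]uint32, 0, n)
	for len(out) < n {
		t := rand.Uint32()
		if !taken[t] {
			taken[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Toy ring: ingester ID -> its tokens.
	ring := map[string][]uint32{
		"ingester-1": {10, 2000, 300000},
		"ingester-2": {42, 77777},
	}
	fmt.Println(pickUniqueTokens(4, ring))
}
```

The CAS on the single ring key is what makes the read-pick-write sequence atomic: if anything else wrote the key in between, the write fails and the whole operation is retried, which is where the "error CASing, trying again" message comes from.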

@BertHartm
Contributor Author

I'm using an existing consul cluster, so ~3 dedicated instances. CPU runs fairly high, and it's not really feasible to distinguish how much of that comes from cortex and how much doesn't.

I can try running a single dedicated consul instance and see if that helps.

I see a fair number of failures that seem to be related to heartbeats rather than to any change that should alter the token distribution. Perhaps the heartbeat logic could be pulled out so that it doesn't conflict with token distribution and state changes? I'm not sure where the heartbeat is consumed, but moving it to a consul health check, or reporting it directly to the distributor (or whichever other component needs it), may make sense.
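
To illustrate the contention, here's a small self-contained sketch using an in-memory stand-in for Consul's check-and-set (not Cortex code): each simulated ingester only heartbeats its own entry, but because everything lives under one key the writes still race and force retries:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// kv is an in-memory stand-in for Consul's check-and-set semantics:
// a write only succeeds if the caller saw the current modify index.
type kv struct {
	mu    sync.Mutex
	value map[string]int64 // ingester ID -> heartbeat, standing in for the ring value
	index uint64
}

func (s *kv) get() (map[string]int64, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	copied := make(map[string]int64, len(s.value))
	for k, v := range s.value {
		copied[k] = v
	}
	return copied, s.index
}

func (s *kv) cas(val map[string]int64, index uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if index != s.index {
		return false // someone else wrote the key first
	}
	s.value = val
	s.index++
	return true
}

func main() {
	store := &kv{value: map[string]int64{}}
	var retries int64
	var wg sync.WaitGroup

	// Three "ingesters" heartbeat concurrently; each only touches its own
	// entry, but because everything lives under one key they still conflict.
	for _, id := range []string{"ingester-1", "ingester-2", "ingester-3"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			for beat := 0; beat < 200; beat++ {
				for {
					ring, idx := store.get()
					ring[id] = int64(beat)
					if store.cas(ring, idx) {
						break
					}
					atomic.AddInt64(&retries, 1) // the "error CASing, trying again" case
				}
			}
		}(id)
	}
	wg.Wait()
	fmt.Println("CAS retries caused purely by heartbeats:", retries)
}
```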

@bboreham
Contributor

Has this gone any further? Otherwise can we close?

@BertHartm
Contributor Author

Happy to close
