frequent msg="error CASing, trying again" error messages #897
Comments
We use a single value to store the entire ring so that we can pick unique tokens when we bootstrap new ingesters. With multiple values (one per ingester), there would be no way to pick tokens atomically, AFAIK. Are you seeing high CPU usage from Consul? How many Consul instances are you running? Given that the data in Consul is ephemeral, you can actually get away with running a single one; most of the problems we've had are to do with Consul clustering anyway.
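To illustrate the point about atomic token selection, here is a minimal sketch of a compare-and-swap loop over a single shared ring value. It is not the Cortex or Consul client code: `kv`, `newKV`, `get`, `cas`, and `claimToken` are hypothetical names, and the store is an in-memory stand-in for one Consul key with a modify index. Every failed `cas` call corresponds to one "error CASing, trying again" retry.

```go
package main

import (
	"fmt"
	"sync"
)

// kv is a hypothetical in-memory stand-in for a single Consul key:
// one value (the whole ring) plus an index that changes on every write.
type kv struct {
	mu    sync.Mutex
	index uint64
	ring  map[uint32]string // token -> ingester ID
}

func newKV() *kv { return &kv{ring: map[uint32]string{}} }

// get returns a copy of the ring and the index it was read at.
func (s *kv) get() (map[uint32]string, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[uint32]string, len(s.ring))
	for t, id := range s.ring {
		out[t] = id
	}
	return out, s.index
}

// cas stores ring only if nothing was written since index was read.
func (s *kv) cas(ring map[uint32]string, index uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.index != index {
		return false
	}
	s.ring = ring
	s.index++
	return true
}

// claimToken picks the first free candidate token and retries until the
// write succeeds. It reports how many CAS attempts failed along the way.
func claimToken(s *kv, id string, candidates []uint32) (token uint32, retries int, ok bool) {
	for {
		ring, index := s.get()
		free := false
		for _, t := range candidates {
			if _, taken := ring[t]; !taken {
				token, free = t, true
				break
			}
		}
		if !free {
			return 0, retries, false
		}
		ring[token] = id
		if s.cas(ring, index) {
			return token, retries, true
		}
		retries++ // another writer got in first; read again and retry
	}
}

func main() {
	s := newKV()
	candidates := []uint32{100, 200, 300, 400, 500, 600, 700, 800}
	var wg sync.WaitGroup
	for i := 0; i < len(candidates); i++ {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			claimToken(s, id, candidates)
		}(fmt.Sprintf("ingester-%d", i))
	}
	wg.Wait()
	ring, _ := s.get()
	fmt.Printf("%d ingesters hold %d distinct tokens\n", len(candidates), len(ring))
}
```

Because the uniqueness check and the write are made against the same versioned value, two ingesters cannot both end up owning the same token. The cost is that any unrelated write to that value, such as a heartbeat, also forces a retry.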
I'm using an existing Consul cluster, so ~3 dedicated instances. CPU runs fairly high, and separating Cortex's share of the load from everything else doesn't seem feasible. I can try running a single instance dedicated to Cortex and see if that helps. I see a fair number of failures that seem to be related to heartbeats rather than to any change that should alter the token distribution. Perhaps the heartbeat logic could be pulled out so that it doesn't conflict with token distribution and state changes? I'm not sure where it's used, but it might make sense to move it to a Consul health check, or directly into the distributor or whichever other component needs it.
Has this gone any further? Otherwise can we close?
Happy to close
I'm seeing the ingesters complaining a lot about not being able to CAS, which I think may be the cause of some other state transition problems, but I'm not certain.
I think, based on the ring description (https://github.com/weaveworks/cortex/blob/master/pkg/ring/model.go#L24), it should be possible to give each ingester its own key instead of having one key that they all try to share, which should mostly (entirely?) eliminate the issue.
I'm curious about feedback before I attempt.
I'm also curious whether, given the discussion in #157, there might be a simpler approach to attempt instead of altering the Consul interface.
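For context on the trade-off behind the per-ingester-key idea, here is a rough sketch under stated assumptions. `perIngesterStore`, `ingesterDesc`, `put`, and `duplicateTokens` are hypothetical names, not types from `pkg/ring`, and a plain map stands in for independent Consul keys. It shows that heartbeat writes to separate keys never contend with each other, but that nothing prevents two ingesters from registering the same token.

```go
package main

import (
	"fmt"
	"sort"
)

// ingesterDesc is a hypothetical per-ingester record.
type ingesterDesc struct {
	Tokens    []uint32
	Heartbeat int64
}

// perIngesterStore models one independent key per ingester ID.
type perIngesterStore map[string]ingesterDesc

// put overwrites a single ingester's key without touching any other key,
// so no cross-ingester compare-and-swap is involved.
func (s perIngesterStore) put(id string, d ingesterDesc) { s[id] = d }

// duplicateTokens returns every token registered by more than one ingester,
// with the owners sorted by ID.
func duplicateTokens(s perIngesterStore) map[uint32][]string {
	owners := map[uint32][]string{}
	for id, d := range s {
		for _, t := range d.Tokens {
			owners[t] = append(owners[t], id)
		}
	}
	dups := map[uint32][]string{}
	for t, ids := range owners {
		if len(ids) > 1 {
			sort.Strings(ids)
			dups[t] = ids
		}
	}
	return dups
}

func main() {
	s := perIngesterStore{}
	// Both ingesters pick token 20 without seeing each other's write.
	s.put("ingester-1", ingesterDesc{Tokens: []uint32{10, 20}})
	s.put("ingester-2", ingesterDesc{Tokens: []uint32{20, 30}})
	// A heartbeat rewrites only the writer's own key.
	s.put("ingester-1", ingesterDesc{Tokens: []uint32{10, 20}, Heartbeat: 1})
	fmt.Println("tokens with more than one owner:", duplicateTokens(s))
}
```

Under this layout, duplicates can only be detected after the fact, which is the atomicity concern raised in the comments.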