Use BasicLifecycler for Compactor and autoforget unhealthy instances #3771
Change the compactor to use BasicLifecycler instead of Lifecycler so that we can make use of autoforget functionality. This works around an issue where ownership of tenants is retained by unhealthy instances.
The behavior of compactors while starting changes slightly because of the use of BasicLifecycler instead of Lifecycler.
* Lifecycler didn't join the ring when starting, only after running, so the `wait_active_instance_timeout` was only needed while waiting for the instance to become active.
* BasicLifecycler joins the ring while starting, thus we need to apply the `wait_active_instance_timeout` when starting the lifecycler, _not_ just when waiting for the instance to become active.
See #3708
Fixes #1588
Signed-off-by: Nick Pillitteri <[email protected]>
c8ec299 to 82b7b57
Signed-off-by: Nick Pillitteri <[email protected]>
Thanks for working on this! Overall LGTM. I just left a couple of comments.
Do you mind testing it in a dev cluster and running a rollout to ensure everything works as expected? Unfortunately, both unit tests and integration tests are not very good at catching issues with ring initialization and state changes.
pkg/compactor/compactor.go
Outdated
		return nil, nil, errors.Wrap(err, "failed to initialize compactors' lifecycler")
	}

	compactorsRing, err := ring.NewWithStoreClientAndStrategy(cfg.ToRingConfig(), "compactor", ringKey, kvStore, ring.NewDefaultReplicationStrategy(), prometheus.WrapRegistererWithPrefix("cortex_", reg), logger)
We're reusing `kvStore` for the ring client too. `kvStore` is created with name `compactor-lifecycler`, which doesn't apply if we reuse it for the ring client too. In the old implementation, the KV store used for the client was named `compactor-ring`. I see two options:
- Use two different KV stores, to keep previous behaviour
- Rename `kvStore` from `compactor-lifecycler` to just `compactor`, then please double check if there's any dashboard or alert to update
I will take a look at this.
I'm going to go back to using two different KV stores and maintain the previous behavior, rather than have compactors use a different setup than other components.
This also exposes a problem with the way the kvstore is created for compactors (via this PR) and distributors: the `prometheus.Registerer` used for the kvstore isn't prefixed with `cortex_`, so it emits metrics like `kv_request_duration_seconds_count`, compared to ingesters, which use `cortex_kv_request_duration_seconds_count`.
Signed-off-by: Nick Pillitteri <[email protected]>
LGTM! I left a couple of minor comments.
pkg/distributor/distributor.go
Outdated
@@ -445,6 +445,7 @@ func New(cfg Config, clientConfig ingester_client.Config, limits *validation.Ove

// newRingAndLifecycler creates a new distributor ring and lifecycler with all required lifecycler delegates
func newRingAndLifecycler(cfg RingConfig, instanceCount *atomic.Uint32, logger log.Logger, reg prometheus.Registerer) (*ring.Ring, *ring.BasicLifecycler, error) {
	reg = prometheus.WrapRegistererWithPrefix("cortex_", reg)
This changes the metric names for the KV client used by the distributor. Two comments:
- Can you double check what's the impact on dashboards and alerts?
- This should be mentioned in the CHANGELOG entry too
> Can you double check what's the impact on dashboards and alerts?
I checked it, and we always query `cortex_kv_request_duration_seconds_*` in dashboards and alerts. This means that this is a fix, not a regression, which is good.
However, querying the metric name `kv_request_duration_seconds` (without the `cortex_` prefix), I've found it's also exposed by the query-scheduler.
Suggestion: revert this change from this PR (keeping the "broken" version) and open a separate PR to fix it both here in the distributor and in the query-scheduler, so that the change is better tracked in the commit log.
Reverted. I'll open a separate PR to fix for distributors and query-scheduler.
Signed-off-by: Nick Pillitteri <[email protected]>
* Make sure lifecycler metrics have "cortex_" prefix
Fixes an issue with distributors and query-schedulers where they were emitting metrics without a `cortex_` prefix which is expected in all our dashboards.
See #3771 (comment)
Signed-off-by: Nick Pillitteri <[email protected]>
Signed-off-by: Nick Pillitteri [email protected]
What this PR does
Change the compactor to use BasicLifecycler instead of Lifecycler so that we can make use of autoforget functionality. This works around an issue where ownership of tenants is retained by unhealthy instances.
The behavior of compactors while starting changes slightly because of the use of BasicLifecycler instead of Lifecycler.
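* Lifecycler didn't join the ring when starting, only after running, so the `wait_active_instance_timeout` was only needed while waiting for the instance to become active.
* BasicLifecycler joins the ring while starting, thus we need to apply the `wait_active_instance_timeout` when starting the lifecycler and waiting for it to become active in the ring, not just when waiting for the instance to become active.
Which issue(s) this PR fixes or relates to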
See #3708
Fixes #1588
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`