
Add a healthcheck endpoint on the ingesters that distributors can use #741

Merged

Conversation


@csmarchbanks csmarchbanks commented Mar 12, 2018

This adds a Check endpoint to the ingesters so that our ingester clients can run a healthcheck loop to make sure their connection is still OK. As part of the maintenance loop that removes stale ingesters, we will now also (in the background) healthcheck each remaining ingester and, if it is not healthy, delete its entry in distributor.clients. The client will then be recreated the next time getClientFor needs it.

A couple questions:

  1. As it stands, the ingesters need to be upgraded before the ruler/distributors/querier, or else the healthchecks will constantly fail and we will be recreating a lot of gRPC clients. Everything will still work, but the logs would be noisy and extra work would be done. Is that a concern for those with CI systems? If so, how should it be addressed?
  2. The Check endpoint is based on https://github.com/grpc/grpc/blob/master/doc/health-checking.md. Is that what we want? Or should we make an endpoint that is more ingester-specific?

Another solution would be to just periodically close the ingester clients, perhaps put an age field on them and only use them until then.

This PR fixes: #702

Current state of this PR:

  • Ingesters have a new Check grpc endpoint, used to health check them
  • Refactored ingester clients in a distributor to be their own struct, IngesterPool
  • The IngesterPool has a function, CleanUnhealthy(), that will clean up unhealthy ingester clients
  • CleanUnhealthy() is run as part of the distributor periodic cleanup when the distributor.health-check-ingesters flag is turned on
  • util/compat.go was moved to ingester/client/compat.go in order to avoid cyclic dependencies
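
To illustrate the flow, here is a minimal sketch of the health-check sweep, assembled from the snippets quoted in the review below (helper names such as RemoveClientFor are assumptions, and the merged code wraps this logic up as CleanUnhealthy() on the pool):

	func (d *Distributor) healthCheckAndRemoveIngesters() {
		for _, addr := range d.clientCache.RegisteredAddresses() {
			client, err := d.clientCache.GetClientFor(addr)
			if err != nil {
				continue // the entry was removed concurrently; nothing to check
			}

			ctx, cancel := context.WithTimeout(context.Background(), d.cfg.RemoteTimeout)
			ctx = user.InjectOrgID(ctx, "0")
			resp, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
			cancel()

			if err != nil || resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
				// Drop the unhealthy client; getClientFor will recreate it on next use.
				d.clientCache.RemoveClientFor(addr)
			}
		}
	}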

}

message HealthCheckResponse {
enum ServingStatus {


Do UNKNOWN or NOT_SERVING ever get sent?

Contributor Author

They do not right now; that is the format suggested for health checking by the gRPC docs. We could make a more Cortex-specific format if we want.

@csmarchbanks csmarchbanks Mar 12, 2018

It will also be compatible with: https://godoc.org/google.golang.org/grpc/health

I could also try to spend some time importing the proto definitions from grpc/health rather than duplicating them, if we decide to stick with the general format instead of a Cortex-specific one.
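
For context, a sketch of what using that package on the ingester's gRPC server looks like (standard google.golang.org/grpc/health usage, not code from this PR; the function name is illustrative):

	import (
		"google.golang.org/grpc"
		"google.golang.org/grpc/health"
		healthpb "google.golang.org/grpc/health/grpc_health_v1"
	)

	func newServerWithHealth() *grpc.Server {
		server := grpc.NewServer()
		// Register the standard gRPC health service and mark the server as serving.
		hs := health.NewServer()
		healthpb.RegisterHealthServer(server, hs)
		hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
		return server
	}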

Contributor

Yeah, I'd say either use those protos or make our response an empty response.

@csmarchbanks

csmarchbanks commented Mar 12, 2018

This has been running for more than a day in staging, and is now running in our prod cluster.

From staging, started restarting gRPC clients ~17:00 on 3/10
[screenshot attached: 2018-03-12 at 11:09 am]

@tomwilkie

As it stands the ingesters need to be upgraded before the ruler/distributors/querier

The general approach to this is to hide the behaviour behind a flag that defaults to off; the new code gets rolled out with the behaviour off, then a flag change can be rolled out to enable it.

ctx, cancel := context.WithTimeout(context.Background(), d.cfg.RemoteTimeout)
ctx = user.InjectOrgID(ctx, "0")
resp, err := client.Check(ctx, &ingester_client.HealthCheckRequest{})
cancel()
Contributor

Nit: can we move that cancel() up to the ctx, cancel := ... line and make it a defer, please? It just causes minor confusion as it doesn't match the common pattern.
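
For reference, the conventional arrangement being asked for is just the following (assuming the snippet lives in its own small function, so the deferred cancel runs promptly):

	ctx, cancel := context.WithTimeout(context.Background(), d.cfg.RemoteTimeout)
	defer cancel()
	ctx = user.InjectOrgID(ctx, "0")
	resp, err := client.Check(ctx, &ingester_client.HealthCheckRequest{})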

@@ -184,29 +184,55 @@ func (d *Distributor) Stop() {
}

func (d *Distributor) removeStaleIngesterClients() {
d.clientsMtx.Lock()
defer d.clientsMtx.Unlock()
Contributor

:-( I prefer the defer way, mainly because it makes the lock more robust to future modifications.

As the remote timeout is set to 2s, and the removeStaleIngesterClients function only runs every 15s, do we really need the extra goroutines and waitgroup?

@csmarchbanks csmarchbanks Mar 12, 2018

I also prefer the defer way. Since we hold the mutex, even a single 2s timeout would block any other rules from being evaluated during that time, which concerned me.

I could get rid of the wait group but keep the goroutines. That should be OK, since all the healthchecks should be done after at most 2s and the cleanup period is 15s by default. What do you think of that?

Contributor

Of course (d'oh).

How about making it all nice and inline, synchronous code that builds a new clients dict without holding the lock, and then replaces the old one under the lock?
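
A sketch of that suggestion (field and helper names are assumptions, not the merged code): copy the map under the lock, do the slow health checks without it, then swap the result back in under the lock.

	func (d *Distributor) removeStaleIngesterClients() {
		// Snapshot the current clients under the lock so we can iterate safely.
		d.clientsMtx.Lock()
		old := make(map[string]ingester_client.IngesterClient, len(d.clients))
		for addr, cli := range d.clients {
			old[addr] = cli
		}
		d.clientsMtx.Unlock()

		// Health-check without holding the lock, so a slow check (up to the
		// 2s remote timeout) cannot block pushes, queries, or rule evaluation.
		healthy := map[string]ingester_client.IngesterClient{}
		for addr, cli := range old {
			if healthCheck(cli) == nil { // healthCheck is a hypothetical helper
				healthy[addr] = cli
			}
		}

		// Replace the old map under the lock; clients added in the meantime are
		// simply recreated on demand by getClientFor.
		d.clientsMtx.Lock()
		d.clients = healthy
		d.clientsMtx.Unlock()
	}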

Contributor Author

I split the logic for removing stale ingester clients and healthchecking the ingester clients. I think everything is properly deferred now. Let me know what you think!

@tomwilkie

From staging, started restarting gRPC clients ~17:00 on 3/10

Oh dear - I'm fine with fixes like this going in to work around other issues, but do you have any idea why gRPC is getting so unwell?

@bboreham

Does this help with #157 - use this endpoint as the heartbeat instead of the consul timestamp?

@tomwilkie

Does this help with #157 - use this endpoint as the heartbeat instead of the consul timestamp?

Not as it stands (ingester clients will be created on demand), but it could be made to do so...

@csmarchbanks

csmarchbanks commented Mar 12, 2018

@tomwilkie Thanks for the comments!

I wish I knew why gRPC is getting so unwell; I have been banging my head against a wall about this for a while. If you or someone else has any ideas, I would be happy to test them out.

Also, do you or @bboreham think there is anything we could do in this PR to make it easier to heartbeat with respect to #157? I don't want to do all the work for that here, but would be happy to make it easier in the future.

@tomwilkie

If you or someone else have any ideas

No ideas, but we had something similar with Bigtable; I just added a timeout and it unstuck it. Are the ingester timeouts working okay?

Is there anything we could do in this PR to make it easier to heartbeat with respect to #157?

The two systems are quite decoupled right now; let's get #279 merged and see what shakes out.

@csmarchbanks

Sounds good on waiting for #279. The ingester timeouts seem to be working ok, but every request will start timing out after 2s (push) or 10s (query), and that ingester connection never manages to correct itself.

@tomwilkie

The ingester timeouts seem to be working ok, but every request will start timing out after 2s (push) or 10s (query), and that ingester connection never manages to correct itself.

Is it worth destroying clients that experience a timeout then?

@csmarchbanks

Are you asking: instead of doing a healthcheck, just delete the client any time it experiences a timeout in regular use?

@tomwilkie

Yeah, just as a thought...

@csmarchbanks

csmarchbanks commented Mar 12, 2018

It would work, and it was something I was considering, but I liked the healthcheck a bit more. I am happy to rework it based on what everyone thinks, though.

A couple of pros/cons off the top of my head:
Pros:

  • No extra endpoint
  • Would find an error the first time it occurs, rather than waiting for up to 15s

Cons:

  • Any future endpoints on the ingester would need the same handling; this could be abstracted, but some entry function would still be needed.
  • Putting the code in the maintenance loop keeps the maintenance code together
  • Long queries/pushes could cause a false positive (seems unlikely though)

@csmarchbanks csmarchbanks force-pushed the healthcheck-ingesters branch from 70f618e to db98e77 Compare March 12, 2018 21:42
@tomwilkie

You had me at "Any future endpoints on the ingester..."

@@ -28,6 +28,7 @@ import (
ingester_client "github.com/weaveworks/cortex/pkg/ingester/client"
"github.com/weaveworks/cortex/pkg/ring"
"github.com/weaveworks/cortex/pkg/util"
"google.golang.org/grpc/health/grpc_health_v1"
Contributor

Can you move this import up to the block above, please? The pattern should be [std lib imports, 3rd party imports, repo local imports]. I thought this was standard (as described here: https://github.com/golang/go/wiki/CodeReviewComments#imports) but it turns out it might just be me being picky.
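
For example, the grouping being described looks like this (the standard-library entries are just representative):

	import (
		"flag"
		"time"

		"google.golang.org/grpc/health/grpc_health_v1"

		ingester_client "github.com/weaveworks/cortex/pkg/ingester/client"
		"github.com/weaveworks/cortex/pkg/ring"
		"github.com/weaveworks/cortex/pkg/util"
	)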

client, err := d.getClientFor(ingester)
if err != nil {
d.removeClientFor(ingester, err)
}
Contributor

If err != nil, then client will be nil and the rest of this function will panic.

Contributor

Yeah, I think if err != nil you don't need to remove the client either; you should just log and return (or return an error).
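
A sketch of the suggested handling (assuming the surrounding function can return an error):

	client, err := d.getClientFor(ingester)
	if err != nil {
		// We never obtained a client, so there is nothing to remove; just surface the error.
		return err
	}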

ctx = user.InjectOrgID(ctx, "0")

resp, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
if err != nil || resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
Contributor

Please either return the error (and log it in the calling function) or log it here. Slight preference for returning it and logging it in the caller.
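
That could look roughly like this (the function split and error text are illustrative, not the merged code):

	func healthCheck(client ingester_client.IngesterClient, timeout time.Duration) error {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		ctx = user.InjectOrgID(ctx, "0")

		resp, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		if err != nil {
			return err
		}
		if resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
			return fmt.Errorf("failing healthcheck, status: %s", resp.Status)
		}
		return nil
	}

The caller can then log the returned error alongside the ingester address before removing the client.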

@tomwilkie

I think it's looking much easier to follow wrt locking and concurrency. A couple of comments, then we should be good to go.

It might also be worth extracting all this to a separate file (and even package), as there is starting to be enough logic to justify it.

@tomwilkie

Yeah, the more I think about it the more I like the idea of putting it in pkg/ingester/client as a "client cache".

@tomwilkie

Also, would you mind enabling CircleCI on your fork so we get test runs? It should be free.

@bboreham

I've done a pull/push to weaveworks/cortex so we get one test run now.

@csmarchbanks

I did the refactor, and I definitely think it cleans up the distributor quite a bit. I also moved util/compat to the ingester/client package, since otherwise ingester/client could not import util due to a circular dependency.

Right now the loops for cleaning up stale clients and health checking each client remain in distributor.go, in order not to complicate what should just be a cache. If you want me to move those into the client package I can do that.

@csmarchbanks csmarchbanks force-pushed the healthcheck-ingesters branch from 4903a94 to 2a2d543 Compare March 13, 2018 22:02
@csmarchbanks

csmarchbanks commented Mar 13, 2018

Not sure what's wrong with my CircleCI yet... hopefully I will get some time to look at it tonight/tomorrow morning. EDIT - got it working; I needed to specify an IMAGE_PREFIX.

@csmarchbanks csmarchbanks force-pushed the healthcheck-ingesters branch from 73c3a95 to 41475c6 Compare March 14, 2018 15:26
@csmarchbanks

@bboreham or @tomwilkie Another review would be great when you get a chance!


@bboreham bboreham left a comment


Some thoughts attached to specific lines.

Can I ask that the PR description be modified to match the result of all changes to date?

@@ -93,6 +93,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
flag.DurationVar(&cfg.ClientCleanupPeriod, "distributor.client-cleanup-period", 15*time.Second, "How frequently to clean up clients for ingesters that have gone away.")
flag.Float64Var(&cfg.IngestionRateLimit, "distributor.ingestion-rate-limit", 25000, "Per-user ingestion rate limit in samples per second.")
flag.IntVar(&cfg.IngestionBurstSize, "distributor.ingestion-burst-size", 50000, "Per-user allowed ingestion burst size (in number of samples).")
flag.BoolVar(&cfg.HealthCheckIngesters, "distributor.health-check-ingesters", false, "Run a health check on each ingester client during the cleanup period.")
Contributor

Can we say "during periodic cleanup"? At first I thought "the cleanup period" was maybe a period during shutdown when we clean up.

type Factory func(addr string, cfg Config) (IngesterClient, error)

// IngesterClientCache holds a cache of ingester clients
type IngesterClientCache struct {
Contributor

Is "cache" the right word? Elsewhere I've seen this called a "connection pool", except we have at most one connection per endpoint. Maybe just some more explanation of the intended uses would help.

Contributor Author

I like IngesterPool; I will change it to that and add some more explanation.
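
A sketch of how the renamed type might look, extending the snippet above (the fields beyond the factory are assumptions):

	// IngesterPool holds at most one gRPC client per ingester address and
	// creates clients on demand via the factory.
	type IngesterPool struct {
		sync.RWMutex
		clients map[string]IngesterClient
		factory Factory
		cfg     Config
	}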

ingesters := map[string]struct{}{}
for _, ing := range d.ring.GetAll() {
ingesters[ing.Addr] = struct{}{}
}

for addr, client := range d.clients {
for _, addr := range d.clientCache.RegisteredAddresses() {
Contributor

Seems we are doing two kinds of cleanup now: removing cache entries which are "stale" because the ring no longer references an ingester at that address, and removing entries which fail healthcheck. Do we still need the first one?

Contributor Author

We could get away with only the healthcheck loop (after everyone has upgraded their ingesters and turned the flag on). However, I like the difference in expected behavior/logging from removing stale ingesters (expected) vs. failing ingesters (unexpected).
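
For contrast with the health-check sweep, a sketch of the ring-based (expected) cleanup, following the snippet above (RemoveClientFor is an assumed helper name):

	// Remove clients for ingesters that the ring no longer references.
	ingesters := map[string]struct{}{}
	for _, ing := range d.ring.GetAll() {
		ingesters[ing.Addr] = struct{}{}
	}
	for _, addr := range d.clientCache.RegisteredAddresses() {
		if _, ok := ingesters[addr]; !ok {
			d.clientCache.RemoveClientFor(addr)
		}
	}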

func (d *Distributor) healthCheckAndRemoveIngesters() {
for _, addr := range d.clientCache.RegisteredAddresses() {
client, err := d.clientCache.GetClientFor(addr)
if err != nil {
Contributor

This can only happen due to some race between RegisteredAddresses() and GetClientFor() - maybe change the cache API to remove the possibility? Or move this whole loop into the cache?

Contributor Author

I refactored this to the pool, but it still has an if around it in case someone deletes the entry from the pool while a previous healthcheck is happening. Right now that shouldn't happen, but that would be an annoying bug to run into if things change.

@csmarchbanks

PR description updated


pool.GetClientFor("2")
if pool.Count() != 2 {
t.Errorf("Expected Count() = 2, got %d", pool.Count())
Contributor

This is a bit wordy - we use testify/assert elsewhere to reduce to assert.Equal(t, 2, pool.Count())
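
I.e. a sketch of the suggested change:

	// Before:
	if pool.Count() != 2 {
		t.Errorf("Expected Count() = 2, got %d", pool.Count())
	}

	// After, using github.com/stretchr/testify/assert:
	assert.Equal(t, 2, pool.Count())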

@bboreham bboreham merged commit fee02a5 into cortexproject:master Mar 19, 2018
Successfully merging this pull request may close these issues.

Ruler performance frequently degrades