move ringpop start call to service start #4510

alfred-landrum · 2023-06-16T18:10:50Z

What changed?
This moves the membership monitor Start call from an fx lifecycle hook to each individual service's start method.

Note: this is on top of #4506, which is needed so that services can still get the hostinfo (server address) to use when they, e.g., initialize their service-tagged loggers.

Why?
In short, because a service instance shouldn't join ringpop, and hence begin to have requests forwarded to it, until it is ready to receive them. In internal testing, focusing on the impact of history service restarts/upgrades, we've seen requests suffer unnecessary delay because they were forwarded to a new history pod before it began accepting on its announced service address.

How did you test it?
This has been tested in a staging environment by scaling pod counts up and down, including to and from zero, for each service (listener, matching, history, worker).

Note that with this change, the history service currently generates errors at startup as it attempts lookups during its fx start sequence (inside controller_impl.go), but it will acquire shards as expected. I'll address the error logs in a quick follow-up PR. (Update: I addressed this in this PR.)

Potential risks
If there's a condition that we haven't seen in testing that causes services to look up ringpop information during fx start time, they'll see errors in places that haven't produced them before. That could manifest as services failing to start up, or failing to take on work as quickly as they had.

Is hotfix candidate?
No.

alfred-landrum · 2023-06-20T17:45:53Z

I couldn't initially tell due to the generated log volume, but the failing integration tests look relevant (they're in TestAcquireShard_OwnershipLostErrorSuite and some other shard tests), so I'll dig in shortly.

common/membership/ringpop/test_cluster.go

paulnpdev · 2023-06-26T19:03:40Z

aha, I'm already living in the world where that's been fixed. But I agree we cannot rely on this yet

…

On Mon, Jun 26, 2023 at 10:45 AM Alfred Landrum ***@***.***> wrote:

> In common/membership/ringpop/test_cluster.go:
>
> @@ -146,8 +146,9 @@ func newTestCluster(
>  	}).AnyTimes()
>  	for i := 0; i < size; i++ {
> +		node := i
>
> This is the standard protection against accidental captures of the loop variable for closures and goroutines:
> https://go.dev/doc/faq#closures_and_goroutines
> https://github.com/golang/go/wiki/LoopvarExperiment
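For readers unfamiliar with the idiom discussed in the quoted comment, here is a minimal sketch of the per-iteration copy. The helper name `makeNodeFuncs` is made up for illustration and is not taken from test_cluster.go. Before Go 1.22, a `for` loop variable was shared across iterations, so closures capturing `i` directly could all observe its final value. Copying it into `node` gives each closure its own variable. From Go 1.22 on, loop variables are per-iteration, so the copy is harmless but no longer required.

```go
package main

// makeNodeFuncs returns one closure per node index. It is a hypothetical
// helper for illustration only, not code from test_cluster.go.
func makeNodeFuncs(size int) []func() int {
	funcs := make([]func() int, 0, size)
	for i := 0; i < size; i++ {
		node := i // per-iteration copy: each closure captures its own variable
		funcs = append(funcs, func() int { return node })
	}
	return funcs
}
```

Calling each returned closure yields its own index rather than the loop's final value.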

dnr · 2023-06-27T20:12:44Z

service/frontend/service.go

-	if err := s.server.Serve(listener); err != nil {
-		logger.Fatal("Failed to serve on frontend listener", tag.Error(err))
-	}
+	go func() {


This function is supposed to block, and this will make it not block anymore. I think this will affect shutdown: effectively, it'll make the process exit before properly shutting things down. I think... it's confusing.

See https://github.com/temporalio/temporal/blob/master/service/frontend/fx.go#L602-L624 and https://github.com/temporalio/temporal/blob/master/temporal/server_impl.go#L106-L124 and https://github.com/temporalio/temporal/blob/master/temporal/fx.go#L312-L326

Unless you want to refactor the whole stopChan thing (not a bad idea but probably out of scope) I think you should add a channel or waitgroup here to block

@dnr: thank you for pointing this out. I discussed it with @MichaelSnowden, and we thought we could address this by doing the membership start in a goroutine instead of the Serve call, so I've swapped those two in the latest change, PTAL.
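The arrangement described in this reply can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual frontend code: `stubServer`, `stubMonitor`, and `stubService` are hypothetical stand-ins for the gRPC server, the membership monitor, and the service. The sketch does not model the listener. It only shows that the membership start runs in a goroutine while Start keeps blocking in Serve.

```go
package main

import "sync"

// stubServer is a hypothetical stand-in for a network server.
// Serve blocks until Stop is called.
type stubServer struct {
	stopOnce sync.Once
	stopped  chan struct{}
}

func newStubServer() *stubServer {
	return &stubServer{stopped: make(chan struct{})}
}

func (s *stubServer) Serve() error {
	<-s.stopped
	return nil
}

func (s *stubServer) Stop() {
	s.stopOnce.Do(func() { close(s.stopped) })
}

// stubMonitor is a hypothetical stand-in for the membership monitor.
type stubMonitor struct {
	joined chan struct{}
}

func newStubMonitor() *stubMonitor {
	return &stubMonitor{joined: make(chan struct{})}
}

// Start signals that the instance has joined membership. Call it once.
func (m *stubMonitor) Start() { close(m.joined) }

// stubService wires the two together.
type stubService struct {
	server  *stubServer
	monitor *stubMonitor
}

// Start joins membership in a goroutine and then blocks in Serve until
// Stop is called, so callers that rely on Start blocking still work.
func (s *stubService) Start() error {
	go s.monitor.Start()
	return s.server.Serve()
}

func (s *stubService) Stop() { s.server.Stop() }
```

Because Serve remains the blocking call, code that waits for Start to return still observes the service stopping only after Stop is called.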

MichaelSnowden · 2023-07-05T23:48:51Z

service/history/shard/controller_impl.go

@@ -117,17 +117,16 @@ func (c *ControllerImpl) Start() {
 	c.contextTaggedLogger = log.With(c.logger, tag.ComponentShardController, tag.Address(hostIdentity))
 	c.throttledLogger = log.With(c.throttledLogger, tag.ComponentShardController, tag.Address(hostIdentity))

-	c.acquireShards()


What happened to this call?

It's intentionally removed. Calling acquireShards requires asking membership what shards this history instance should own. With this change, the instance hasn't joined membership when this shard controller Start method is called, so it can't call acquireShards here anymore. The shard controller will get a membership update immediately after the instance joins, though, and that will call acquireShards.
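A rough sketch of the behavior described in this reply, assuming a much-simplified controller: `shardController`, its channels, and `joinMembership` are hypothetical names used only for illustration and do not mirror the real ControllerImpl. Start no longer acquires shards itself; acquisition happens when the first membership update arrives.

```go
package main

// shardController is a hypothetical, simplified stand-in for the history
// shard controller.
type shardController struct {
	updates  chan struct{} // membership change notifications
	acquires chan struct{} // receives one value per acquireShards call
}

func newShardController() *shardController {
	return &shardController{
		updates:  make(chan struct{}),
		acquires: make(chan struct{}, 8),
	}
}

// Start only begins listening for membership updates. It does not call
// acquireShards directly, because the instance has not joined yet.
func (c *shardController) Start() {
	go func() {
		for range c.updates {
			c.acquireShards()
		}
	}()
}

// acquireShards records that an acquisition pass ran. A real controller
// would ask membership which shards this instance should own.
func (c *shardController) acquireShards() {
	c.acquires <- struct{}{}
}

// joinMembership simulates the instance joining, which delivers the
// first membership update to the controller.
func (c *shardController) joinMembership() {
	c.updates <- struct{}{}
}
```

In this sketch nothing is acquired between Start and joinMembership, which matches the ordering the reply describes.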

dnr · 2023-07-06T05:04:26Z

Ironically I was prompted to "refactor the whole stopChan thing" by another report of startup/shutdown bugs: #4584

Basically I moved most of Service.Start into the fx hook synchronously, but then run the Serve call explicitly asynchronously. So I think that actually works well with this PR: you can do the membership Start after the Serve (both async but you can add a small delay if you want). Do you want to wait for that one, or you could merge this first and I could resolve?

alfred-landrum · 2023-07-06T13:51:37Z

> Basically I moved most of Service.Start into the fx hook synchronously, but then run the Serve call explicitly asynchronously. So I think that actually works well with this PR: you can do the membership Start after the Serve (both async but you can add a small delay if you want). Do you want to wait for that one, or you could merge this first and I could resolve?

@dnr: thank you for the heads up. I'll merge this now since it's been up for a while and I've got green tests.

alfred-landrum requested a review from MichaelSnowden June 16, 2023 18:10

alfred-landrum requested a review from a team as a code owner June 16, 2023 18:10

alfred-landrum force-pushed the alfred/membership-whoami-without-start branch 3 times, most recently from 3ee059e to 6a2ec50 Compare June 19, 2023 18:41

alfred-landrum force-pushed the alfred/ringpop-at-service-start branch 2 times, most recently from 5ce2a45 to 1109ca4 Compare June 19, 2023 20:37

Base automatically changed from alfred/membership-whoami-without-start to master June 21, 2023 13:51

paulnpdev reviewed Jun 26, 2023

dnr reviewed Jun 27, 2023

dnr mentioned this pull request Jul 3, 2023

Make ringpop.monitor depend on our service listener #4567

Closed

alfred-landrum added 4 commits July 5, 2023 12:33

move ringpop start call to service start

82bbe5a

wait for first membership update to acquire shards

6b98c3e

ensure Service Start is blocking

593ed05

make simple resolver match new start behavior

9a9a499

alfred-landrum force-pushed the alfred/ringpop-at-service-start branch from a6cf3f7 to 9a9a499 Compare July 5, 2023 22:38

alfred-landrum assigned MichaelSnowden Jul 5, 2023

MichaelSnowden reviewed Jul 5, 2023

alfred-landrum mentioned this pull request Jul 6, 2023

support delay before history joins membership #4582

Merged

dnr approved these changes Jul 6, 2023

MichaelSnowden approved these changes Jul 6, 2023

alfred-landrum merged commit 6d175c2 into master Jul 6, 2023

alfred-landrum deleted the alfred/ringpop-at-service-start branch July 6, 2023 13:51

alfred-landrum added the release/1.21.5 label Aug 4, 2023

dnr mentioned this pull request Oct 25, 2023

gRPC health check may say the server is unhealthy even if it's responding successfully to GetSystemInfo #5015

Closed
