Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move ringpop start call to service start #4510

Merged
merged 4 commits into from
Jul 6, 2023

Conversation

alfred-landrum
Copy link
Contributor

@alfred-landrum alfred-landrum commented Jun 16, 2023

What changed?
This moves the membership monitor Start call from a fx lifecycle hook to each individual service's start method.

Note: this is on top of #4506 , which is needed so that services can still get the hostinfo (server address) to use when they, eg, initialize their service tagged loggers.

Why?
In short, because a service instance shouldn't join ringpop, and hence begin to have requests forwarded to it, until it is ready to receive them. In internal testing, focusing on the impact of history service restarts/upgrades, we've seen requests suffer unnecessary delay because they were forwarded to a new history pod before it began accepting on its announced service address.

How did you test it?
This has been tested in a staging environment, and for each service (listener, matching, history, worker), scaling up & down pod counts, including to/from zero.

Note that with this change, the history service currently generates errors at startup as it attempts lookups during its fx start sequence (inside controller_impl.go), but it will acquire shards as expected. I'll address the error logs in a quick followup PR. (I addressed this in this PR.)

Potential risks
If there's a condition that we haven't seen in testing that causes services to lookup ringpop information during fx start time, they'll see errors in places that haven't before. That could manifest as services failing to start up, or failing to take on work as quickly as they had.

Is hotfix candidate?
No.

@alfred-landrum alfred-landrum requested a review from a team as a code owner June 16, 2023 18:10
@alfred-landrum alfred-landrum force-pushed the alfred/membership-whoami-without-start branch 3 times, most recently from 3ee059e to 6a2ec50 Compare June 19, 2023 18:41
@alfred-landrum alfred-landrum force-pushed the alfred/ringpop-at-service-start branch 2 times, most recently from 5ce2a45 to 1109ca4 Compare June 19, 2023 20:37
@alfred-landrum
Copy link
Contributor Author

I couldn't tell initially tell due to the generated log volume, but the integration tests that are failing look relevant (they're in TestAcquireShard_OwnershipLostErrorSuite and some other shard tests), so I'll dig in shortly.

Base automatically changed from alfred/membership-whoami-without-start to master June 21, 2023 13:51
@paulnpdev
Copy link
Member

paulnpdev commented Jun 26, 2023 via email

if err := s.server.Serve(listener); err != nil {
logger.Fatal("Failed to serve on frontend listener", tag.Error(err))
}
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is supposed to block and this'll make it not block anymore.. I think this will have effects on shutdown: effectively it'll make the process exit before properly shutting things down. I think... it's confusing

See https://github.com/temporalio/temporal/blob/master/service/frontend/fx.go#L602-L624 and https://github.com/temporalio/temporal/blob/master/temporal/server_impl.go#L106-L124 and https://github.com/temporalio/temporal/blob/master/temporal/fx.go#L312-L326

Unless you want to refactor the whole stopChan thing (not a bad idea but probably out of scope) I think you should add a channel or waitgroup here to block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dnr : thank you for pointing this out. I discussed with @MichaelSnowden , and thought we could address this via doing the membership start in a goroutine vs the Serve call, so I've swapped those two in the latest change, PTAL.

@@ -117,17 +117,16 @@ func (c *ControllerImpl) Start() {
c.contextTaggedLogger = log.With(c.logger, tag.ComponentShardController, tag.Address(hostIdentity))
c.throttledLogger = log.With(c.throttledLogger, tag.ComponentShardController, tag.Address(hostIdentity))

c.acquireShards()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened to this call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intentionally removed. Calling acquireShards requires asking membership what shards this history instance should own. With this change, the instance hasn't joined membership when this shard controller Start method is called, so it can't call acquireShards here anymore. The shard controller will get a membership update immediately after the instance joins, though, and that will call acquireShards.

@dnr
Copy link
Member

dnr commented Jul 6, 2023

Ironically I was prompted to "refactor the whole stopChan thing" by another report of startup/shutdown bugs: #4584

Basically I moved most of Service.Start into the fx hook synchronously, but then run the Serve call explicitly asynchronously. So I think that actually works well with this PR: you can do the membership Start after the Serve (both async but you can add a small delay if you want). Do you want to wait for that one, or you could merge this first and I could resolve?

@alfred-landrum
Copy link
Contributor Author

Basically I moved most of Service.Start into the fx hook synchronously, but then run the Serve call explicitly asynchronously. So I think that actually works well with this PR: you can do the membership Start after the Serve (both async but you can add a small delay if you want). Do you want to wait for that one, or you could merge this first and I could resolve?

@dnr: thank you for the heads up - I'll merge this now since its been up for a while & I've got green tests.

@alfred-landrum alfred-landrum merged commit 6d175c2 into master Jul 6, 2023
@alfred-landrum alfred-landrum deleted the alfred/ringpop-at-service-start branch July 6, 2023 13:51
alfred-landrum added a commit that referenced this pull request Jul 28, 2023
<!-- Describe what has changed in this PR -->
**What changed?**
This moves the membership monitor Start call from a fx lifecycle hook to
each individual service's start method.

Note: this is on top of #4506
, which is needed so that services can still get the hostinfo (server
address) to use when they, eg, initialize their service tagged loggers.

<!-- Tell your future self why have you made these changes -->
**Why?**
In short, because a service instance shouldn't join ringpop, and hence
begin to have requests forwarded to it, until it is ready to receive
them. In internal testing, focusing on the impact of history service
restarts/upgrades, we've seen requests suffer unnecessary delay because
they were forwarded to a new history pod before it began accepting on
its announced service address.

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
This has been tested in a staging environment, and for each service
(listener, matching, history, worker), scaling up & down pod counts,
including to/from zero.

~~Note that with this change, the history service currently generates
errors at startup as it attempts lookups during its fx start sequence
(inside controller_impl.go), but it will acquire shards as expected.
I'll address the error logs in a quick followup PR.~~ (I addressed this
in this PR.)

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
If there's a condition that we haven't seen in testing that causes
services to lookup ringpop information during fx start time, they'll see
errors in places that haven't before. That could manifest as services
failing to start up, or failing to take on work as quickly as they had.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
No.
alfred-landrum added a commit that referenced this pull request Aug 1, 2023
<!-- Describe what has changed in this PR -->
**What changed?**
This moves the membership monitor Start call from a fx lifecycle hook to
each individual service's start method.

Note: this is on top of #4506
, which is needed so that services can still get the hostinfo (server
address) to use when they, eg, initialize their service tagged loggers.

<!-- Tell your future self why have you made these changes -->
**Why?**
In short, because a service instance shouldn't join ringpop, and hence
begin to have requests forwarded to it, until it is ready to receive
them. In internal testing, focusing on the impact of history service
restarts/upgrades, we've seen requests suffer unnecessary delay because
they were forwarded to a new history pod before it began accepting on
its announced service address.

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
This has been tested in a staging environment, and for each service
(listener, matching, history, worker), scaling up & down pod counts,
including to/from zero.

~~Note that with this change, the history service currently generates
errors at startup as it attempts lookups during its fx start sequence
(inside controller_impl.go), but it will acquire shards as expected.
I'll address the error logs in a quick followup PR.~~ (I addressed this
in this PR.)

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
If there's a condition that we haven't seen in testing that causes
services to lookup ringpop information during fx start time, they'll see
errors in places that haven't before. That could manifest as services
failing to start up, or failing to take on work as quickly as they had.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
No.
rodrigozhou pushed a commit that referenced this pull request Aug 7, 2023
<!-- Describe what has changed in this PR -->
**What changed?**
This moves the membership monitor Start call from a fx lifecycle hook to
each individual service's start method.

Note: this is on top of #4506
, which is needed so that services can still get the hostinfo (server
address) to use when they, eg, initialize their service tagged loggers.

<!-- Tell your future self why have you made these changes -->
**Why?**
In short, because a service instance shouldn't join ringpop, and hence
begin to have requests forwarded to it, until it is ready to receive
them. In internal testing, focusing on the impact of history service
restarts/upgrades, we've seen requests suffer unnecessary delay because
they were forwarded to a new history pod before it began accepting on
its announced service address.

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
This has been tested in a staging environment, and for each service
(listener, matching, history, worker), scaling up & down pod counts,
including to/from zero.

~~Note that with this change, the history service currently generates
errors at startup as it attempts lookups during its fx start sequence
(inside controller_impl.go), but it will acquire shards as expected.
I'll address the error logs in a quick followup PR.~~ (I addressed this
in this PR.)

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
If there's a condition that we haven't seen in testing that causes
services to lookup ringpop information during fx start time, they'll see
errors in places that haven't before. That could manifest as services
failing to start up, or failing to take on work as quickly as they had.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
No.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants