move ringpop start call to service start #4510
Conversation
I couldn't initially tell due to the generated log volume, but the failing integration tests look relevant (they're in TestAcquireShard_OwnershipLostErrorSuite and some other shard tests), so I'll dig in shortly.
aha, I'm already living in the world where that's been fixed. But I agree we cannot rely on this yet.

On Mon, Jun 26, 2023, Alfred Landrum commented on this pull request, in common/membership/ringpop/test_cluster.go (#4510 (comment)):
@@ -146,8 +146,9 @@ func newTestCluster(
 	}).AnyTimes()
 	for i := 0; i < size; i++ {
+		node := i
This is the standard protection against accidentally capturing the loop variable in closures and goroutines:
https://go.dev/doc/faq#closures_and_goroutines
https://github.com/golang/go/wiki/LoopvarExperiment
service/frontend/service.go (outdated)
	if err := s.server.Serve(listener); err != nil {
		logger.Fatal("Failed to serve on frontend listener", tag.Error(err))
	}
go func() {
This function is supposed to block, and this will make it not block anymore. I think this will have effects on shutdown: effectively it'll make the process exit before properly shutting things down. I think... it's confusing.
See https://github.com/temporalio/temporal/blob/master/service/frontend/fx.go#L602-L624 and https://github.com/temporalio/temporal/blob/master/temporal/server_impl.go#L106-L124 and https://github.com/temporalio/temporal/blob/master/temporal/fx.go#L312-L326
Unless you want to refactor the whole stopChan thing (not a bad idea, but probably out of scope), I think you should add a channel or waitgroup here to block.
@dnr: thank you for pointing this out. I discussed this with @MichaelSnowden, and we thought we could address it by doing the membership start in a goroutine instead of the Serve call, so I've swapped those two in the latest change. PTAL.
@@ -117,17 +117,16 @@ func (c *ControllerImpl) Start() {
 	c.contextTaggedLogger = log.With(c.logger, tag.ComponentShardController, tag.Address(hostIdentity))
 	c.throttledLogger = log.With(c.throttledLogger, tag.ComponentShardController, tag.Address(hostIdentity))

-	c.acquireShards()
What happened to this call?
It's intentionally removed. Calling acquireShards requires asking membership which shards this history instance should own. With this change, the instance hasn't joined membership yet when the shard controller's Start method is called, so it can't call acquireShards here anymore. The shard controller will get a membership update immediately after the instance joins, though, and that will call acquireShards.
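A toy sketch of that flow (the types and channels here are illustrative stand-ins, not Temporal's real shard-controller API): Start no longer acquires shards directly; instead, a goroutine waits for the membership update that arrives once the instance has joined the ring.

```go
package main

import "fmt"

// controller is a stand-in for the shard controller: it defers shard
// acquisition until the first membership update, which is delivered
// right after this instance joins the ring.
type controller struct {
	membershipCh chan struct{} // membership updates (hypothetical wiring)
	acquiredCh   chan bool     // signals that acquireShards ran
}

func newController() *controller {
	return &controller{
		membershipCh: make(chan struct{}, 1),
		acquiredCh:   make(chan bool, 1),
	}
}

func (c *controller) Start() {
	// Note: no direct acquireShards() call here; membership isn't joined yet.
	go func() {
		<-c.membershipCh // first update fires once the instance joins the ring
		c.acquireShards()
	}()
}

func (c *controller) acquireShards() { c.acquiredCh <- true }

func main() {
	c := newController()
	c.Start()
	c.membershipCh <- struct{}{} // simulate the post-join membership update
	fmt.Println(<-c.acquiredCh)  // true
}
```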
Ironically, I was prompted to "refactor the whole stopChan thing" by another report of startup/shutdown bugs: #4584. Basically I moved most of Service.Start into the fx hook synchronously, but then run the Serve call explicitly asynchronously. So I think that actually works well with this PR: you can do the membership Start after the Serve (both async, but you can add a small delay if you want). Do you want to wait for that one, or you could merge this first and I could resolve?
@dnr: thank you for the heads up - I'll merge this now since it's been up for a while and I've got green tests.
What changed?
This moves the membership monitor Start call from an fx lifecycle hook to each individual service's start method.
Note: this is on top of #4506, which is needed so that services can still get the hostinfo (server address) to use when they, e.g., initialize their service-tagged loggers.
Why?
In short, because a service instance shouldn't join ringpop, and hence begin to have requests forwarded to it, until it is ready to receive them. In internal testing, focusing on the impact of history service restarts/upgrades, we've seen requests suffer unnecessary delay because they were forwarded to a new history pod before it began accepting on its announced service address.
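The intended ordering can be sketched as follows (a minimal stand-alone example, not Temporal's actual startup code; the "join" step is simulated by announcing the address only after Serve has the listener):

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// serveThenJoin starts serving first, and only then "announces" the
// address (a stand-in for the ringpop join), so any request forwarded
// to the announced address finds a live listener.
func serveThenJoin() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	srv := &http.Server{Handler: http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, "ok")
		})}
	go srv.Serve(ln) // listener is accepting before we announce ourselves
	defer srv.Close()

	addr := ln.Addr().String() // "join membership" happens only now

	// A request forwarded to the announced address succeeds immediately.
	resp, err := http.Get("http://" + addr)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := serveThenJoin()
	if err != nil {
		panic(err)
	}
	fmt.Println(code) // 200
}
```

Joining in the reverse order is what produced the delays described above: the address is in the ring, requests get forwarded to it, but nothing is accepting there yet.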
How did you test it?
This has been tested in a staging environment, and for each service (listener, matching, history, worker), scaling up & down pod counts, including to/from zero.
~~Note that with this change, the history service currently generates errors at startup as it attempts lookups during its fx start sequence (inside controller_impl.go), but it will acquire shards as expected. I'll address the error logs in a quick followup PR.~~ (I addressed this in this PR.)
Potential risks
If there's a condition that we haven't seen in testing that causes services to look up ringpop information during fx start time, they'll see errors in places that didn't see them before. That could manifest as services failing to start up, or failing to take on work as quickly as they had.
Is hotfix candidate?
No.