Propagate task queue user data with long-poll requests #4334
Conversation
@@ -147,7 +151,7 @@ func NewConfig(dc *dynamicconfig.Collection) *Config {
 	ShutdownDrainDuration: dc.GetDurationProperty(dynamicconfig.MatchingShutdownDrainDuration, 0*time.Second),
 	VersionCompatibleSetLimitPerQueue: dc.GetIntProperty(dynamicconfig.VersionCompatibleSetLimitPerQueue, 10),
 	VersionBuildIdLimitPerQueue: dc.GetIntProperty(dynamicconfig.VersionBuildIdLimitPerQueue, 1000),
-	UserDataPollFrequency: dc.GetDurationProperty(dynamicconfig.MatchingUserDataPollFrequency, 5*time.Minute),
+	GetUserDataLongPollTimeout: dc.GetDurationProperty(dynamicconfig.MatchingGetUserDataLongPollTimeout, 5*time.Minute),
I hope this is a good default; it should be long enough if the cluster is active and the nodes are communicating regularly.
Otherwise, it should also be fine AFAICT, because by the time this request times out we'll have unloaded the task queue.
In any case, connection loss will eventually be detected and the caller will retry.
Yeah, I'm not totally sure, but here was my reasoning:
If nodes go up and down "normally", the clients should get connection errors and retry. The case where a server goes away and the client doesn't realize it until the timeout should be rare (network partition, dropped packets, etc.). Of course it can happen, but 5m is probably fine in that case.
@@ -181,6 +185,8 @@ func newTaskQueueConfig(id *taskQueueID, config *Config, namespace namespace.Nam
 	MaxTaskDeleteBatchSize: func() int {
 		return config.MaxTaskDeleteBatchSize(namespace.String(), taskQueueName, taskType)
 	},
+	GetUserDataLongPollTimeout: config.GetUserDataLongPollTimeout,
+	GetUserDataMinWaitTime: 1 * time.Second,
What's the reason this is hardcoded?
It doesn't seem worth making it configurable. It's just to protect against logic bugs, basically.
db.userData = userData
close(db.userDataChanged)
db.userDataChanged = make(chan struct{})
Wondering why you chose to use the channel as a signaling mechanism instead of using it for delivering the updated data.
There may be multiple listeners. A channel can't broadcast data; it can only "broadcast" close events.
Yeah, that makes sense, seems like the best way to do this using built-in Go constructs.
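As an aside for readers following along: the close-and-replace idiom discussed here is the usual way to get a broadcast out of Go channels. Below is a minimal, self-contained sketch of the pattern; `userDataStore` and its fields are hypothetical stand-ins, not the actual types in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// userDataStore is a hypothetical stand-in for the db struct in this PR.
// Closing userDataChanged "broadcasts" to every goroutine currently waiting
// on it; the channel is then replaced so future waiters get a fresh signal.
type userDataStore struct {
	mu              sync.Mutex
	userData        string
	userDataChanged chan struct{}
}

func newUserDataStore() *userDataStore {
	return &userDataStore{userDataChanged: make(chan struct{})}
}

// set updates the data and wakes all current waiters.
func (s *userDataStore) set(data string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.userData = data
	close(s.userDataChanged)                // broadcast: all waiters unblock
	s.userDataChanged = make(chan struct{}) // fresh channel for the next change
}

// get returns the current data plus the channel to wait on for the next change.
func (s *userDataStore) get() (string, <-chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.userData, s.userDataChanged
}

func main() {
	s := newUserDataStore()
	_, changed := s.get()
	go func() {
		time.Sleep(100 * time.Millisecond)
		s.set("v2")
	}()
	<-changed // unblocks when set() closes the channel
	data, _ := s.get()
	fmt.Println("saw update:", data)
}
```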
if req.WaitNewData {
	var cancel context.CancelFunc
	ctx, cancel = newChildContext(ctx, e.config.GetUserDataLongPollTimeout(), returnEmptyTaskTimeBudget)
Seems like this should be a caller concern; why set the deadline in the handler?
Look at what newChildContext does: it's trimming off a second at the end to leave time to return a result.
Got it.
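For context, here is a rough sketch of what a helper like newChildContext might do under the behavior described above (cap the wait at the configured long-poll timeout while reserving a small budget to return a response). The real helper in the codebase may differ in details, and the names and durations in main are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// newChildContext (sketch) caps the child deadline at maxTimeout and, when the
// parent already has a deadline, trims off a small budget so the handler still
// has time to build and return a response before the caller's deadline fires.
func newChildContext(parent context.Context, maxTimeout, budget time.Duration) (context.Context, context.CancelFunc) {
	if deadline, ok := parent.Deadline(); ok {
		remaining := time.Until(deadline) - budget
		if remaining < maxTimeout {
			return context.WithTimeout(parent, remaining)
		}
	}
	return context.WithTimeout(parent, maxTimeout)
}

func main() {
	parent, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Long poll bounded by the configured timeout, minus a 1s budget to reply.
	ctx, cancelChild := newChildContext(parent, 5*time.Minute, time.Second)
	defer cancelChild()

	deadline, _ := ctx.Deadline()
	fmt.Println("child deadline in ~", time.Until(deadline).Round(time.Second))
}
```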
// We don't really care if the initial fetch worked or not, anything that *requires* a bit of metadata should fail
// that operation if it's never fetched OK. If the initial fetch errored, the metadataPoller will have been started.
-	_, _ = c.userDataInitialFetch.Get(ctx)
+	_, err = c.userDataInitialFetch.Get(ctx)
I'm wondering if we should be waiting for the initial fetch here at all; AFAICT, we'll be serving concurrent requests on this versioned task queue partition, each with separate deadlines.
Shouldn't we block indefinitely here?
What are the cases where we'd want to return this error to the caller?
This is called by RPC handlers. We can't block indefinitely; we need to return an error so the RPC can complete (with an error).
Okay, I see, this will time out based on the ctx deadline of the current request. Makes sense.
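A small sketch of the idea being discussed: RPC handlers wait on a one-shot "initial fetch" future, but only for as long as their own request context allows, so a fetch that never completes surfaces as a deadline error rather than a hung RPC. `initialFetch` here is a hypothetical simplification, not the actual `userDataInitialFetch` type.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// initialFetch is a hypothetical one-shot future: done is closed once the
// first user-data fetch completes (successfully or not).
type initialFetch struct {
	done chan struct{}
	err  error
}

func newInitialFetch() *initialFetch { return &initialFetch{done: make(chan struct{})} }

func (f *initialFetch) complete(err error) {
	f.err = err
	close(f.done)
}

// Get blocks until the initial fetch completes or the caller's context
// expires. RPC handlers pass their request context here, so a slow or broken
// fetch becomes a deadline error instead of blocking the RPC forever.
func (f *initialFetch) Get(ctx context.Context) error {
	select {
	case <-f.done:
		return f.err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	f := newInitialFetch()
	go func() {
		time.Sleep(2 * time.Second) // simulate a fetch that outlives the RPC deadline
		f.complete(nil)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	if err := f.Get(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("initial fetch not ready before RPC deadline:", err)
	}
}
```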
_ = backoff.ThrottleRetryContext(ctx, op, getUserDataRetryPolicy, nil)
elapsed := time.Since(start)

// In general we want to start a new call immediately on completion of the previous
Can you explain why this protection is in place?
Why would the remote return success immediately?
But if the remote is broken
That's all. I just don't want a bug that causes the server to return success immediately to lead to this spinning. But nor do I want to delay the follow-up call after a successful call if the call took 30s. It shouldn't happen (more than once) if things are working properly.
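To illustrate the GetUserDataMinWaitTime guard under discussion, here is a hedged sketch of a long-poll loop that enforces a minimum time per iteration: a healthy remote that holds the call open never hits the guard, while a buggy remote that returns immediately gets throttled instead of spinning. `longPollLoop` and `poll` are illustrative names, not the PR's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// longPollLoop (sketch) issues back-to-back long polls, but enforces a minimum
// time per iteration. If a call returns sooner than minWait (e.g. a broken
// remote answering success immediately), the loop sleeps out the remainder of
// the interval before calling again, so it cannot spin.
func longPollLoop(ctx context.Context, minWait time.Duration, poll func(context.Context) error) {
	for ctx.Err() == nil {
		start := time.Now()
		_ = poll(ctx) // errors are simply retried on the next iteration in this sketch

		if elapsed := time.Since(start); elapsed < minWait {
			select {
			case <-time.After(minWait - elapsed):
			case <-ctx.Done():
				return
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	calls := 0
	longPollLoop(ctx, time.Second, func(context.Context) error {
		calls++ // a "broken" remote: returns immediately every time
		return nil
	})
	fmt.Println("calls in 3s, throttled to roughly one per second:", calls)
}
```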
Overall LGTM.
Had a few questions and some stuff to discuss.
service/matching/matchingEngine.go (outdated)
resp.UserData = userData
} else if userData.Version < version {
	// This is highly unlikely but may happen due to an edge case during ownership transfer.
	// We rely on periodic refresh and client retries in this case to let the system eventually self-heal.
Is there still a periodic refresh?
Oh, I guess I should reword that. But a long poll with a timeout is basically the same as a periodic refresh.
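To make the version check above concrete, here is a small illustrative sketch (hypothetical types, not the PR's code) of applying fetched user data only when its version is newer, so a stale response seen during ownership transfer is dropped and a later long poll delivers the newer version.

```go
package main

import "fmt"

// versionedUserData is an illustrative stand-in for the persisted user data blob.
type versionedUserData struct {
	Version int64
	Data    string
}

// applyIfNewer keeps the higher-versioned copy. During an ownership transfer a
// node might briefly return data older than what we already hold; dropping it
// here and letting the next long poll (or a client retry) deliver the newer
// version keeps the system eventually consistent.
func applyIfNewer(current *versionedUserData, fetched versionedUserData) *versionedUserData {
	if current != nil && fetched.Version <= current.Version {
		return current // stale or duplicate response: ignore
	}
	return &fetched
}

func main() {
	current := &versionedUserData{Version: 7, Data: "build sets v7"}
	stale := versionedUserData{Version: 5, Data: "build sets v5"}
	fresh := versionedUserData{Version: 9, Data: "build sets v9"}

	current = applyIfNewer(current, stale)
	fmt.Println("after stale response:", current.Version) // still 7
	current = applyIfNewer(current, fresh)
	fmt.Println("after fresh response:", current.Version) // now 9
}
```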
namespaceID := namespace.ID(uuid.New())
-	tq := "tupac"
+	tq := "makeToast"
I'm insulted
Hah, I can put it back. That test got deleted, and I wrote more by copying and pasting bits of other tests, so it's kind of an artifact of the diff.
actTq.Stop()
actTqPart.Stop()
require.Equal(t, data1, userData)
tq.Stop()
}
Do we have any test for the case where user data is fetched OK, but then we miss some long-poll update that we expected to see, and things still become eventually consistent?
I'm not sure I follow... you mean like our long polls fail for a while and then eventually we get one that's several versions ahead? That could certainly happen, but I'm not sure it's worth a test; there's nothing special in the code about v+1 vs. greater.
Yes, that's what I mean. I'm fine with not having one if it's not meaningfully different.
Note: This commit came from a feature branch and is not expected to build.
What changed?
Change how user data is propagated: instead of a push/refresh, each node does continuous long polls against the node above it in the tree.
Why?
More robust to intermittent failures, and somewhat simpler code. (Introducing the changed channel is a bunch of complexity, but we need it anyway to interrupt blocked spooled-task dispatch.)
How did you test it?
New unit tests (future integration tests will exercise all this too).
Potential risks
Is hotfix candidate?