Propagate task queue user data with long-poll requests #4334

Merged
merged 19 commits into temporalio:worker-versioning from ver12 on May 17, 2023

Conversation

@dnr dnr (Member) commented May 13, 2023

What changed?
Change how user data is propagated: instead of a push/refresh model, each node does continuous long polls to the node above it in the tree.

Why?
More robust to intermittent failures, and somewhat simpler code. (Introducing the "changed" channel adds a bunch of complexity, but we need it anyway to interrupt blocked spooled task dispatch.)
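To make the flow concrete, each node now runs a loop like the sketch below against its parent. This is a minimal illustration, not the PR's actual code: the names (fetchUserDataFromParent, child, UserData) are stand-ins, and the real loop also retries with backoff and enforces a minimum wait between calls.

import (
	"context"
)

// Illustrative sketch of the child-side propagation loop (not the PR's actual code).
// UserData and fetchUserDataFromParent stand in for the real proto type and RPC.
type UserData struct{ Version int64 }

type child struct {
	lastKnownVersion        int64
	fetchUserDataFromParent func(ctx context.Context, lastVersion int64) (*UserData, error)
	setUserData             func(*UserData)
}

func (c *child) userDataFetchLoop(ctx context.Context) {
	for ctx.Err() == nil {
		// Long-poll the parent with the last version we've seen: it returns early only
		// when its version advances past lastKnownVersion, or when the long-poll
		// timeout (GetUserDataLongPollTimeout) expires.
		data, err := c.fetchUserDataFromParent(ctx, c.lastKnownVersion)
		if err != nil || data == nil {
			continue // the real loop retries with backoff and a minimum wait between calls
		}
		if data.Version > c.lastKnownVersion {
			c.setUserData(data) // also wakes local waiters via a "changed" channel
			c.lastKnownVersion = data.Version
		}
	}
}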

How did you test it?
New unit tests (future integration tests will exercise all of this too).

Potential risks

Is hotfix candidate?

@dnr dnr requested review from bergundy and Sushisource May 13, 2023 00:13
@dnr dnr requested a review from a team as a code owner May 13, 2023 00:13
@@ -147,7 +151,7 @@ func NewConfig(dc *dynamicconfig.Collection) *Config {
ShutdownDrainDuration: dc.GetDurationProperty(dynamicconfig.MatchingShutdownDrainDuration, 0*time.Second),
VersionCompatibleSetLimitPerQueue: dc.GetIntProperty(dynamicconfig.VersionCompatibleSetLimitPerQueue, 10),
VersionBuildIdLimitPerQueue: dc.GetIntProperty(dynamicconfig.VersionBuildIdLimitPerQueue, 1000),
-UserDataPollFrequency: dc.GetDurationProperty(dynamicconfig.MatchingUserDataPollFrequency, 5*time.Minute),
+GetUserDataLongPollTimeout: dc.GetDurationProperty(dynamicconfig.MatchingGetUserDataLongPollTimeout, 5*time.Minute),
@bergundy bergundy (Member) May 15, 2023

I hope this is a good default; it should be long enough if the cluster is active and the nodes are communicating regularly.
Otherwise, it should also be fine AFAICT, because by the time this request times out we'll have unloaded the task queue.
In any case, connection loss will eventually be detected and the caller will retry.

Member Author

Yeah, I'm not totally sure, but here was my reasoning:
If nodes go up and down "normally", the clients should get connection errors and retry. The case where a server goes away and the client doesn't realize it until the timeout should be rare (network partition, dropped packets, etc.). Of course it can happen, but 5m is probably fine in that case.

@@ -181,6 +185,8 @@ func newTaskQueueConfig(id *taskQueueID, config *Config, namespace namespace.Nam
MaxTaskDeleteBatchSize: func() int {
return config.MaxTaskDeleteBatchSize(namespace.String(), taskQueueName, taskType)
},
+GetUserDataLongPollTimeout: config.GetUserDataLongPollTimeout,
+GetUserDataMinWaitTime: 1 * time.Second,
Member

What's the reason this is hardcoded?

Member Author

It doesn't seem worth making it configurable. It's just to protect against logic bugs, basically.

Comment on lines +304 to +306
db.userData = userData
close(db.userDataChanged)
db.userDataChanged = make(chan struct{})
Member

Wondering why you chose to use the channel as a signaling mechanism instead of using it for delivering the updated data.

Member Author

There may be multiple listeners. A channel can't broadcast data; it can only "broadcast" a close event.

Member

Yeah, that makes sense; seems like the best way to do this using built-in Go constructs.
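For reference, the close-to-broadcast idiom in question looks roughly like this. It's a generic sketch, not the PR's exact code; UserData is the same stand-in type as in the sketch in the description.

import (
	"sync"
)

// Generic sketch of the close-to-broadcast idiom (not the PR's exact code).
type userDataStore struct {
	mu              sync.Mutex
	userData        *UserData
	userDataChanged chan struct{}
}

// Waiters snapshot the current data plus the channel that will be closed on the next change.
func (s *userDataStore) getAndSubscribe() (*UserData, <-chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.userData, s.userDataChanged
}

func (s *userDataStore) set(data *UserData) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.userData = data
	close(s.userDataChanged)                // wakes every current waiter at once
	s.userDataChanged = make(chan struct{}) // future waiters subscribe to the fresh channel
}

Each waiter then selects on the channel it snapshotted (plus its own ctx.Done()), which is exactly the shape of the long-poll handler shown below.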


if req.WaitNewData {
var cancel context.CancelFunc
ctx, cancel = newChildContext(ctx, e.config.GetUserDataLongPollTimeout(), returnEmptyTaskTimeBudget)
Member

Seems like this should be a caller concern, why set the deadline in the handler?

Member Author

Look at what newChildContext does: it trims off a second at the end to leave time to return a result.
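In other words, the handler caps its own wait slightly below the caller's deadline and the configured long-poll timeout, so it can still return an empty-but-successful response in time. Conceptually something like the following; this is a sketch of the idea, not the actual newChildContext implementation.

import (
	"context"
	"time"
)

// Sketch of the idea behind newChildContext (not the actual implementation).
// The child deadline is the sooner of (now + timeout) and the parent's deadline,
// with a small tail budget reserved for returning a response to the caller.
func newChildContextSketch(parent context.Context, timeout, tailBudget time.Duration) (context.Context, context.CancelFunc) {
	deadline := time.Now().Add(timeout)
	if d, ok := parent.Deadline(); ok && d.Before(deadline) {
		deadline = d
	}
	return context.WithDeadline(parent, deadline.Add(-tailBudget))
}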

Member

Got it.

// We don't really care if the initial fetch worked or not, anything that *requires* a bit of metadata should fail
// that operation if it's never fetched OK. If the initial fetch errored, the metadataPoller will have been started.
-_, _ = c.userDataInitialFetch.Get(ctx)
+_, err = c.userDataInitialFetch.Get(ctx)
Member

I'm wondering if we should be waiting for the initial fetch here at all. AFAICT, we'll be serving concurrent requests on this versioned task queue partition, each with separate deadlines.
Shouldn't we block indefinitely here?
What are the cases where we'd want to return this error to the caller?

Member Author

This is called by RPC handlers. We can't block indefinitely; we need to return an error so the RPC can complete (with an error).

Member

Okay, I see, this will time out based on the ctx deadline of the current request. Makes sense.
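That is, the wait is effectively a future-style Get bounded by the request's own context, roughly as below. This is a minimal sketch with an assumed ready-channel future, not the server's actual future type.

import (
	"context"
)

// Minimal sketch of a deadline-bounded wait on the initial fetch
// (an assumed ready-channel future, not the server's actual future type).
type initialFetch struct {
	ready chan struct{} // closed once the first fetch attempt completes
	err   error         // set before ready is closed; nil on success
}

func (f *initialFetch) Get(ctx context.Context) error {
	select {
	case <-f.ready:
		return f.err
	case <-ctx.Done():
		// Bounded by the RPC handler's context, so the handler returns an
		// error to the caller instead of blocking indefinitely.
		return ctx.Err()
	}
}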

_ = backoff.ThrottleRetryContext(ctx, op, getUserDataRetryPolicy, nil)
elapsed := time.Since(start)

// In general we want to start a new call immediately on completion of the previous
Member

Can you explain why this protection is in place?
Why would the remote return success immediately?

Member Author

But if the remote is broken

That's all. I just don't want a bug that causes the server to return success immediately to lead to this spinning. But nor do I want to delay the follow-up call after a successful call if that call took 30s. It shouldn't happen (more than once) if things are working properly.
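Concretely, the guard only does anything when a call comes back faster than the minimum wait. A sketch of the logic (not the exact code; fetchOnce stands in for the retried long-poll call, which the real code wraps in backoff.ThrottleRetryContext):

import (
	"context"
	"time"
)

// Sketch of the spin-loop guard around each long-poll attempt (not the exact code).
func fetchWithMinWait(ctx context.Context, fetchOnce func(context.Context) error, minWait time.Duration) {
	start := time.Now()
	_ = fetchOnce(ctx)

	// A healthy call holds the connection for up to the long-poll timeout, so elapsed
	// is normally large and the caller loops again immediately. If a broken remote
	// returns "success" instantly, sleep out the rest of the minimum interval so the
	// loop can't spin.
	if elapsed := time.Since(start); elapsed < minWait {
		select {
		case <-time.After(minWait - elapsed):
		case <-ctx.Done():
		}
	}
}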

@bergundy bergundy (Member) left a comment

Overall LGTM.
Had a few questions and some stuff to discuss.

resp.UserData = userData
} else if userData.Version < version {
// This is highly unlikely but may happen due to an edge case during ownership transfer.
// We rely on periodic refresh and client retries in this case to let the system eventually self-heal.
Member

Is there still a periodic refresh?

Member Author

Oh, I guess I should reword that. But a long poll with a timeout is basically the same as a periodic refresh.
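Reworded or not, the parent-side branch being discussed behaves roughly like this. It's a condensed, illustrative sketch that reuses the userDataStore and UserData stand-ins from the earlier sketches; the parameter names are assumed, not the PR's exact handler.

import (
	"context"
)

// Condensed, illustrative sketch of the parent-side long-poll branch (not the exact handler).
func getUserDataSketch(ctx context.Context, db *userDataStore, waitNewData bool, callerVersion int64) *UserData {
	userData, changed := db.getAndSubscribe()

	if waitNewData && userData != nil && userData.Version == callerVersion {
		// Caller is up to date: hold the request until the data changes or the
		// long-poll timeout (already applied to ctx via newChildContext) expires.
		select {
		case <-changed:
			userData, _ = db.getAndSubscribe()
		case <-ctx.Done():
		}
	}

	switch {
	case userData == nil || userData.Version == callerVersion:
		return nil // nothing newer to report; the caller keeps its copy and polls again
	case userData.Version < callerVersion:
		// Caller is somehow ahead (the ownership-transfer edge case above): return
		// nothing and let its continuous long polls converge eventually.
		return nil
	default:
		return userData // caller is behind: hand it the newer data
	}
}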

namespaceID := namespace.ID(uuid.New())
tq := "tupac"
tq := "makeToast"
Member

I'm insulted

Member Author

Hah, I can put it back. That test got deleted, and I wrote more by copying and pasting bits of other tests, so it's kind of an artifact of the diff.

actTq.Stop()
actTqPart.Stop()
require.Equal(t, data1, userData)
tq.Stop()
}
Member

Do we have any test for the case where user data is fetched OK, then we miss a long-poll update we expected to see, but things still become eventually consistent?

Member Author

I'm not sure I follow... you mean like our long polls fail for a while and then we eventually get one that's several versions ahead? That could certainly happen, but I'm not sure it's worth a test; there's nothing special in the code about v+1 vs. greater.

Member

Yes, that's what I mean. I'm fine with not having one if it's not meaningfully different.

@dnr dnr merged commit e7f2fed into temporalio:worker-versioning May 17, 2023
@dnr dnr deleted the ver12 branch May 17, 2023 22:36
dnr added four commits referencing this pull request on May 26, 2023 (three to this repo, one to dnr/temporal).
Note: these commits came from a feature branch and are not expected to build.