Propagate task queue user data with long-poll requests #4334
Conversation
@@ -147,7 +151,7 @@ func NewConfig(dc *dynamicconfig.Collection) *Config {
 	ShutdownDrainDuration: dc.GetDurationProperty(dynamicconfig.MatchingShutdownDrainDuration, 0*time.Second),
 	VersionCompatibleSetLimitPerQueue: dc.GetIntProperty(dynamicconfig.VersionCompatibleSetLimitPerQueue, 10),
 	VersionBuildIdLimitPerQueue: dc.GetIntProperty(dynamicconfig.VersionBuildIdLimitPerQueue, 1000),
-	UserDataPollFrequency: dc.GetDurationProperty(dynamicconfig.MatchingUserDataPollFrequency, 5*time.Minute),
+	GetUserDataLongPollTimeout: dc.GetDurationProperty(dynamicconfig.MatchingGetUserDataLongPollTimeout, 5*time.Minute),
I hope this is a good default; it should be long enough if the cluster is active and the nodes are communicating regularly.
Otherwise, it should also be fine AFAICT, because by the time this request times out we'll have unloaded the task queue.
In any case, connection loss will eventually be detected and the caller will retry.
Yeah, I'm not totally sure, but here was my reasoning:
If nodes go up and down "normally", the clients should get connection errors and retry. The case where a server goes away and the client doesn't realize it until the timeout should be rare (network partition, dropped packets, etc.). Of course it can happen, but 5m is probably fine in that case.
@@ -181,6 +185,8 @@ func newTaskQueueConfig(id *taskQueueID, config *Config, namespace namespace.Nam
 	MaxTaskDeleteBatchSize: func() int {
 		return config.MaxTaskDeleteBatchSize(namespace.String(), taskQueueName, taskType)
 	},
+	GetUserDataLongPollTimeout: config.GetUserDataLongPollTimeout,
+	GetUserDataMinWaitTime: 1 * time.Second,
What's the reason this is hardcoded?
It doesn't seem worth making it configurable. It's just to protect against logic bugs, basically.
db.userData = userData
close(db.userDataChanged)
db.userDataChanged = make(chan struct{})
Wondering why you chose to use the channel as a signaling mechanism instead of using it for delivering the updated data.
There may be multiple listeners. A channel can't broadcast data; it can only "broadcast" close events.
Yeah, that makes sense, seems like the best way to do this using built-in Go constructs.
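As an aside for readers following along: the close-and-replace idiom discussed here is the usual way to get a broadcast out of Go channels. Below is a minimal, self-contained sketch of the pattern; `userDataStore` and its fields are hypothetical stand-ins, not the actual types in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// userDataStore is a hypothetical stand-in for the db struct in this PR.
// Closing userDataChanged "broadcasts" to every goroutine currently waiting
// on it; the channel is then replaced so future waiters get a fresh signal.
type userDataStore struct {
	mu              sync.Mutex
	userData        string
	userDataChanged chan struct{}
}

func newUserDataStore() *userDataStore {
	return &userDataStore{userDataChanged: make(chan struct{})}
}

// set updates the data and wakes all current waiters.
func (s *userDataStore) set(data string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.userData = data
	close(s.userDataChanged)                // broadcast: all waiters unblock
	s.userDataChanged = make(chan struct{}) // fresh channel for the next change
}

// get returns the current data plus the channel to wait on for the next change.
func (s *userDataStore) get() (string, <-chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.userData, s.userDataChanged
}

func main() {
	s := newUserDataStore()
	_, changed := s.get()
	go func() {
		time.Sleep(100 * time.Millisecond)
		s.set("v2")
	}()
	<-changed // unblocks when set() closes the channel
	data, _ := s.get()
	fmt.Println("saw update:", data)
}
```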
if req.WaitNewData {
	var cancel context.CancelFunc
	ctx, cancel = newChildContext(ctx, e.config.GetUserDataLongPollTimeout(), returnEmptyTaskTimeBudget)
Seems like this should be a caller concern; why set the deadline in the handler?
Look at what newChildContext does: it's trimming off a second at the end to leave time to return a result.
Got it.
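For context, here is a rough sketch of what a helper like newChildContext might do under the behavior described above (cap the wait at the configured long-poll timeout while reserving a small budget to return a response). The real helper in the codebase may differ in details, and the names and durations in main are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// newChildContext (sketch) caps the child deadline at maxTimeout and, when the
// parent already has a deadline, trims off a small budget so the handler still
// has time to build and return a response before the caller's deadline fires.
func newChildContext(parent context.Context, maxTimeout, budget time.Duration) (context.Context, context.CancelFunc) {
	if deadline, ok := parent.Deadline(); ok {
		remaining := time.Until(deadline) - budget
		if remaining < maxTimeout {
			return context.WithTimeout(parent, remaining)
		}
	}
	return context.WithTimeout(parent, maxTimeout)
}

func main() {
	parent, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Long poll bounded by the configured timeout, minus a 1s budget to reply.
	ctx, cancelChild := newChildContext(parent, 5*time.Minute, time.Second)
	defer cancelChild()

	deadline, _ := ctx.Deadline()
	fmt.Println("child deadline in ~", time.Until(deadline).Round(time.Second))
}
```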
// We don't really care if the initial fetch worked or not, anything that *requires* a bit of metadata should fail
// that operation if it's never fetched OK. If the initial fetch errored, the metadataPoller will have been started.
-	_, _ = c.userDataInitialFetch.Get(ctx)
+	_, err = c.userDataInitialFetch.Get(ctx)
I'm wondering if we should be waiting for the initial fetch here at all; AFAICT, we'll be serving concurrent requests on this versioned task queue partition, each with separate deadlines.
Shouldn't we block indefinitely here?
What are the cases where we'd want to return this error to the caller?
This is called by RPC handlers. We can't block indefinitely; we need to return an error so the RPC can complete (with an error).
Okay, I see, this will time out based on the ctx deadline of the current request. Makes sense.
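A small sketch of the idea being discussed: RPC handlers wait on a one-shot "initial fetch" future, but only for as long as their own request context allows, so a fetch that never completes surfaces as a deadline error rather than a hung RPC. `initialFetch` here is a hypothetical simplification, not the actual `userDataInitialFetch` type.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// initialFetch is a hypothetical one-shot future: done is closed once the
// first user-data fetch completes (successfully or not).
type initialFetch struct {
	done chan struct{}
	err  error
}

func newInitialFetch() *initialFetch { return &initialFetch{done: make(chan struct{})} }

func (f *initialFetch) complete(err error) {
	f.err = err
	close(f.done)
}

// Get blocks until the initial fetch completes or the caller's context
// expires. RPC handlers pass their request context here, so a slow or broken
// fetch becomes a deadline error instead of blocking the RPC forever.
func (f *initialFetch) Get(ctx context.Context) error {
	select {
	case <-f.done:
		return f.err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	f := newInitialFetch()
	go func() {
		time.Sleep(2 * time.Second) // simulate a fetch that outlives the RPC deadline
		f.complete(nil)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	if err := f.Get(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("initial fetch not ready before RPC deadline:", err)
	}
}
```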
_ = backoff.ThrottleRetryContext(ctx, op, getUserDataRetryPolicy, nil)
elapsed := time.Since(start)

// In general we want to start a new call immediately on completion of the previous
Can you explain why this protection is in place?
Why would the remote return success immediately?
But if the remote is broken
That's all. I just don't want a bug that causes the server to return success immediately to lead to this spinning. But nor do I want to delay the follow-up call after a successful call if the call took 30s. It shouldn't happen (more than once) if things are working properly.
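To illustrate the GetUserDataMinWaitTime guard under discussion, here is a hedged sketch of a long-poll loop that enforces a minimum time per iteration: a healthy remote that holds the call open never hits the guard, while a buggy remote that returns immediately gets throttled instead of spinning. `longPollLoop` and `poll` are illustrative names, not the PR's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// longPollLoop (sketch) issues back-to-back long polls, but enforces a minimum
// time per iteration. If a call returns sooner than minWait (e.g. a broken
// remote answering success immediately), the loop sleeps out the remainder of
// the interval before calling again, so it cannot spin.
func longPollLoop(ctx context.Context, minWait time.Duration, poll func(context.Context) error) {
	for ctx.Err() == nil {
		start := time.Now()
		_ = poll(ctx) // errors are simply retried on the next iteration in this sketch

		if elapsed := time.Since(start); elapsed < minWait {
			select {
			case <-time.After(minWait - elapsed):
			case <-ctx.Done():
				return
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	calls := 0
	longPollLoop(ctx, time.Second, func(context.Context) error {
		calls++ // a "broken" remote: returns immediately every time
		return nil
	})
	fmt.Println("calls in 3s, throttled to roughly one per second:", calls)
}
```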
Overall LGTM.
Had a few questions and some stuff to discuss.
service/matching/matchingEngine.go (outdated)
resp.UserData = userData
} else if userData.Version < version {
	// This is highly unlikely but may happen due to an edge case during ownership transfer.
	// We rely on periodic refresh and client retries in this case to let the system eventually self-heal.
Is there still a periodic refresh?
Oh, I guess I should reword that. But a long poll with a timeout is basically the same as a periodic refresh.
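To make the version check above concrete, here is a small illustrative sketch (hypothetical types, not the PR's code) of applying fetched user data only when its version is newer, so a stale response seen during ownership transfer is dropped and a later long poll delivers the newer version.

```go
package main

import "fmt"

// versionedUserData is an illustrative stand-in for the persisted user data blob.
type versionedUserData struct {
	Version int64
	Data    string
}

// applyIfNewer keeps the higher-versioned copy. During an ownership transfer a
// node might briefly return data older than what we already hold; dropping it
// here and letting the next long poll (or a client retry) deliver the newer
// version keeps the system eventually consistent.
func applyIfNewer(current *versionedUserData, fetched versionedUserData) *versionedUserData {
	if current != nil && fetched.Version <= current.Version {
		return current // stale or duplicate response: ignore
	}
	return &fetched
}

func main() {
	current := &versionedUserData{Version: 7, Data: "build sets v7"}
	stale := versionedUserData{Version: 5, Data: "build sets v5"}
	fresh := versionedUserData{Version: 9, Data: "build sets v9"}

	current = applyIfNewer(current, stale)
	fmt.Println("after stale response:", current.Version) // still 7
	current = applyIfNewer(current, fresh)
	fmt.Println("after fresh response:", current.Version) // now 9
}
```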
namespaceID := namespace.ID(uuid.New())
-	tq := "tupac"
+	tq := "makeToast"
I'm insulted
Hah, I can put it back. That test got deleted, and I wrote more by copying and pasting bits of other tests, so it's kind of an artifact of the diff.
actTq.Stop()
actTqPart.Stop()
require.Equal(t, data1, userData)
tq.Stop()
}
Do we have any test for the case where user data is fetched OK, but then we miss some long-poll update that we expected to see, and things still become eventually consistent?
I'm not sure I follow... you mean like our long polls fail for a while and then eventually we get one that's several versions ahead? That could certainly happen, but I'm not sure it's worth a test; there's nothing special in the code about v+1 vs. greater.
Yes, that's what I mean. I'm fine with not having one if it's not meaningfully different.
Note: This commit came from a feature branch and is not expected to build.
What changed?
Change how user data is propagated: instead of a push/refresh, each node does continuous long polls against the node above it in the tree.
Why?
More robust to intermittent failures, and somewhat simpler code. (Introducing the changed channel is a bunch of complexity, but we need it anyway to interrupt blocked spooled-task dispatch.)
How did you test it?
New unit tests (future integration tests will exercise all this too).
Potential risks
Is hotfix candidate?