Refactor versioning API #237

Sushisource · 2022-09-14T21:44:44Z

What changed?

Simpler representation of version "graph"
Support id for each compatible set
Better support for future WASM bundles

Why?
We need the ID to support separate physical task queues, which is required to avoid a backup in one version starving tasks on another version.

The graph representation was more complex than necessary.

Better support for future WASM bundles, now that we know we will be doing a Rust SDK release soon-ish.

Breaking changes
Haha, yes - to an unreleased API.

bergundy

LGTM, had a question and a comment that I pretty much resolved myself.


dnr · 2022-09-20T03:28:43Z

temporal/api/workflowservice/v1/request_response.proto

-    int32 max_depth = 3;
+    // Limits how many compatible sets will be returned. Specify 1 to only return the current
+    // default major version set. 0 returns all sets.
+    int32 max_sets = 3;


Limit on max versions per set, too?

I dunno if it's worth it. The most we would return now is 10k versions, if every set is maxed out, and that's still not a gigantor response.

This mostly exists just to have a "get only the current default" toggle.
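A minimal sketch of the `max_sets` semantics from the diff comment above (0 returns all sets, 1 returns only the current default major-version set). It assumes the sets are ordered oldest to newest with the current default set last; that ordering and the `limit_sets` helper are illustrative assumptions, not server code.

```python
from typing import List, TypeVar

T = TypeVar("T")


def limit_sets(sets: List[T], max_sets: int) -> List[T]:
    """Return at most `max_sets` compatible sets.

    Assumes (hypothetically) that `sets` is ordered oldest to newest and that
    the current default major-version set is the last element.
    0 means "return every set"; 1 means "only the current default set".
    """
    if max_sets < 0:
        raise ValueError("max_sets must be >= 0")
    if max_sets == 0 or max_sets >= len(sets):
        return list(sets)
    # Keep the newest `max_sets` sets, which always include the default.
    return sets[-max_sets:]
```

With three sets, `limit_sets(sets, 1)` yields only the last (default) one, which is exactly the "get only the current default" toggle.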


bergundy · 2022-09-28T00:08:54Z

temporal/api/taskqueue/v1/message.proto

+message CompatibleVersionSet {
+    // A unique identifier for this version set
+    string id = 1;
+    // All the compatible versions, ordered from oldest to newest


I find "oldest to newest" confusing, since I can use the API to mark a previous version as the default, which would not make it the "newest".

It will do that

If that's what you consider "newest", it doesn't read that way to me.
I'd just add that the "default" will always be the last version in the list.

Well, within a set, the order never changes. You only mark overall sets as defaults, and the comment for that is clear about it. There's no reason to have some kind of "inner" default, since all the versions are compatible, we always pick the latest one.
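To make the ordering rule concrete, here is a small sketch of a set that keeps its build IDs in insertion order and treats the last one as the version new tasks use. `CompatibleVersionSet` mirrors the proto message name, but this Python class and its methods are hypothetical illustrations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompatibleVersionSet:
    # Opaque, server-assigned identifier (the `id` field in the diff).
    id: str
    # Build IDs in insertion order, oldest to newest; the order never changes.
    build_ids: List[str] = field(default_factory=list)

    def add_compatible(self, build_id: str) -> None:
        # Every build ID in the set is compatible with the others, so a new
        # one is simply appended and becomes the one used for new tasks.
        if build_id in self.build_ids:
            raise ValueError(f"{build_id!r} is already in set {self.id!r}")
        self.build_ids.append(build_id)

    def latest(self) -> str:
        # "We always pick the latest one": the last element of the list.
        return self.build_ids[-1]
```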

Oh I see, so there's no way to revert marking a build ID as "latest" (or default) in the set?

What if I mark a build as compatible in the set and later change my mind? What if the deployment for that build fails?
How do I tell the server to route requests in that set to a previous build?

I'll add a promote-within-set op
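A sketch of what such a promote-within-set operation might do, assuming it reorders the set so the promoted build ID becomes the last (and therefore the used) one. This would let a user route a set back to `foo.1` after a failed `foo.2` deployment. `promote_within_set` is a hypothetical helper, not the PR's implementation.

```python
from typing import List


def promote_within_set(build_ids: List[str], build_id: str) -> List[str]:
    """Return a new ordering in which `build_id` is the set's default.

    Assumes (hypothetically) that promotion moves an existing build ID to the
    end of the list, since the last element is the one new tasks use.
    """
    if build_id not in build_ids:
        raise ValueError(f"{build_id!r} is not in this set")
    return [b for b in build_ids if b != build_id] + [build_id]
```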

I still think that we should have a default attribute on this message to mark the default in the set so we can leave the versions array sorted by insertion time.
Another thing we might want to consider is storing more information, such as when each version was added to the set and by whom (using the identity in the request).
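For comparison, the alternative proposed here (an explicit per-set default plus per-version metadata, with the list left in insertion order) could look roughly like this. All names are hypothetical; this sketches the proposal, not what the PR merged.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class VersionEntry:
    build_id: str
    added_at: datetime
    added_by: str  # e.g. the `identity` sent with the request


@dataclass
class VersionSetWithDefault:
    id: str
    # Always kept in insertion order.
    versions: List[VersionEntry] = field(default_factory=list)
    # Explicit default for this set, independent of list position.
    default_build_id: Optional[str] = None

    def add(self, build_id: str, identity: str, make_default: bool = True) -> None:
        self.versions.append(
            VersionEntry(build_id, datetime.now(timezone.utc), identity)
        )
        if make_default or self.default_build_id is None:
            self.default_build_id = build_id

    def set_default(self, build_id: str) -> None:
        # Changing the default never reorders `versions`.
        if all(v.build_id != build_id for v in self.versions):
            raise ValueError(f"{build_id!r} is not in this set")
        self.default_build_id = build_id
```

The trade-off is one extra field to keep consistent, in exchange for an audit-friendly history that a "move to end" promotion would lose.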

bergundy · 2022-11-10T05:51:45Z

temporal/api/workflowservice/v1/request_response.proto

@@ -774,6 +773,9 @@ message QueryWorkflowResponse {
 message DescribeWorkflowExecutionRequest {
    string namespace = 1;
    temporal.api.common.v1.WorkflowExecution execution = 2;
+    // If set, the response will include which build id shall be used to process the pending
+    // workflow task, if any.
+    bool include_pending_workflow_task_build_id = 3;


Why not include this in every response?

It's more expensive than a normal describe op.

Was this to populate will_use_build_id? If so, this can be removed.

Thanks, fixed.

bergundy · 2022-11-10T06:06:07Z

temporal/api/workflowservice/v1/request_response.proto

-    bool become_default = 5;
-}
-message UpdateWorkerBuildIdOrderingResponse {}
+    oneof operation {


All of these are exactly what we need, but I feel like I'd have to look at the docs every time to know what each of these operations means.

This is better IMHO:

    message AddNewVersion {
        string version = 1;
        string compatible_with_version = 2;
        bool default_for_queue = 3;
    }
    message PromoteExistingVersion {
        string version = 1;
        bool default_for_queue = 2;
    }

There's one case that we might not want to allow: AddNewVersion(version, default_for_queue=false) but I think the readability makes up for that.

I don't think we should be optimizing our gRPC API for readability over correctness. We make this easy to use / readable in our clients and tctl. The API should optimize for correctness, so that when we implement those things mistakes are easily prevented. Looking at the docstring is quick and easy in every editing environment.

I think we should be optimizing for readability wherever possible TBH.
Notice the consistency in my proposed message names, they start with verbs as one might expect for an operation, they follow the same format, etc.
I'd be down to see an alternative proposal that disallows AddNewVersion(version, default_for_queue=false) and is easier to parse.

I'm talking about relative values, and I think correctness over readability is pretty clear.

But regardless, I'm really not following what's hard to read about this. I could just make the field names slightly longer, and that would seem to do it. I think your version is clearer to you largely because you wrote it.

I'd say there are 3 issues with your proposed API:

These two variants sound the same: existing_version_id_in_set_to_promote and promote_version_id_within_set.

Some variant names sound like operations while others don't (the ones that don't start with a verb).

The term "default" is overloaded: there's a set default and a queue default, so anywhere we mention default we should explicitly disambiguate.

I have made the field names a bit more clear to address this
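Both positions in this thread can be illustrated with a hypothetical model of the update request: each `oneof` case becomes its own type (correctness: a brand-new, incompatible version can only be added as a new queue-default set, so the questionable `AddNewVersion(version, default_for_queue=false)` cannot be expressed), and every variant name starts with a verb and says which "default" it means (readability). These class names are illustrative only and are not the field names the PR merged.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class AddNewVersionInNewDefaultSet:
    # An incompatible version always starts a new set that becomes the queue
    # default, so there is no `default_for_queue=False` to misuse.
    version: str


@dataclass(frozen=True)
class AddNewCompatibleVersion:
    version: str
    existing_compatible_version: str
    make_set_queue_default: bool = False


@dataclass(frozen=True)
class PromoteSetByVersion:
    # Make the set containing `version` the queue default.
    version: str


@dataclass(frozen=True)
class PromoteVersionWithinSet:
    # Make `version` the default within its own set.
    version: str


Operation = Union[AddNewVersionInNewDefaultSet, AddNewCompatibleVersion,
                  PromoteSetByVersion, PromoteVersionWithinSet]


def apply_update(sets: List[List[str]], op: Operation) -> None:
    """Mutate `sets` in place. Assumed layout: sets ordered oldest to newest,
    the last set is the queue default, and the last build ID in each set is
    that set's default."""
    def find(version: str) -> int:
        for i, s in enumerate(sets):
            if version in s:
                return i
        raise ValueError(f"unknown version {version!r}")

    if isinstance(op, AddNewVersionInNewDefaultSet):
        sets.append([op.version])
    elif isinstance(op, AddNewCompatibleVersion):
        i = find(op.existing_compatible_version)
        sets[i].append(op.version)
        if op.make_set_queue_default:
            sets.append(sets.pop(i))
    elif isinstance(op, PromoteSetByVersion):
        sets.append(sets.pop(find(op.version)))
    elif isinstance(op, PromoteVersionWithinSet):
        s = sets[find(op.version)]
        s.remove(op.version)
        s.append(op.version)
    else:
        raise TypeError(f"unsupported operation {op!r}")
```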


dnr · 2023-02-25T00:59:57Z

temporal/api/history/v1/message.proto

+    // Version info of the worker who processed this workflow task, or missing if worker is not
+    // using versioning. If present, the `build_id` field within supersedes `binary_checksum`, which
+    // may be populated with the same value to preserve compatability.
+    temporal.api.common.v1.WorkerVersionStamp worker_version = 5;


I think this is more clear:

Suggested change

-    // Version info of the worker who processed this workflow task, or missing if worker is not
-    // using versioning. If present, the `build_id` field within supersedes `binary_checksum`, which
-    // may be populated with the same value to preserve compatability.
-    temporal.api.common.v1.WorkerVersionStamp worker_version = 5;
+    // Version info of the worker who processed this workflow task, or missing if worker is not
+    // using versioning. If present, the `build_id` field within is also used as `binary_checksum`,
+    // which may be omitted in that case. (It also may be populated for backwards compatibility.)
+    temporal.api.common.v1.WorkerVersionStamp worker_version = 5;

But I'm still a little confused: are we going to populate both indefinitely? How does the server know when it can stop writing binary_checksum into history?

I think we'll have to have it around for some deprecation period before we just eliminate it entirely?
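On the reading side, the suggested comment implies a simple rule during any deprecation period: a present `worker_version` wins, and `binary_checksum` is only a fallback. A minimal sketch, where `WorkerVersionStamp`, `build_id`, `bundle_id`, `binary_checksum`, and `worker_version` come from the diff, while `WorkflowTaskCompletedAttrs` (a trimmed-down stand-in for the history event attributes) and `effective_binary_checksum` are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerVersionStamp:
    build_id: str
    bundle_id: str = ""


@dataclass
class WorkflowTaskCompletedAttrs:
    # Hypothetical stand-in for the event attributes touched by the diff.
    binary_checksum: str = ""
    worker_version: Optional[WorkerVersionStamp] = None


def effective_binary_checksum(attrs: WorkflowTaskCompletedAttrs) -> str:
    # A versioned worker's build ID doubles as the binary checksum; the
    # legacy field may then be empty or a duplicate of the build ID.
    if attrs.worker_version is not None:
        return attrs.worker_version.build_id
    return attrs.binary_checksum
```

Readers written this way keep working whether or not the server still writes `binary_checksum`, which is what would allow it to stop eventually.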


dnr · 2023-02-25T01:04:19Z

temporal/api/taskqueue/v1/message.proto

+message CompatibleVersionSet {
+    // A unique identifier for this version set. Users don't need to understand or care about this
+    // value, but it has value for debugging purposes.
+    string id = 1;


Do we want to document anything about this value? I can say that it may not contain `/` or `:`, but maybe it's just distracting to document that, since it's assigned by the server and clients should treat it as an opaque string.

I think it'd probably just be distracting


dnr · 2023-02-25T01:10:22Z

temporal/api/workflow/v1/message.proto

@@ -100,6 +100,9 @@ message PendingWorkflowTaskInfo {
    google.protobuf.Timestamp original_scheduled_time = 3 [(gogoproto.stdtime) = true];
    google.protobuf.Timestamp started_time = 4 [(gogoproto.stdtime) = true];
    int32 attempt = 5;
+    // If set, this pending workflow task will need to be (or is currently being) handled by a
+    // a worker using this build id.
+    string will_use_build_id = 6;


I don't remember talking about this. What's the planned use case?

What's the plan to implement it? I don't think it's obvious.

Even if this is needed, what do you think about adding the WorkerVersionStamp for the previous WFT in WorkflowExecutionInfo? That's going to be in mutable state, so it's free to copy in there.

IIRC, the reason it exists is to make it easier in the UI/CLI to show what the next task will use, and to easily show that "no compatible pollers" warning... but as I write next, that seems to be redundant.

You tell me 🤷. If this is hard to do, we can eliminate it, since this info is available via the active_versions_and_pollers field in GetWorkerBuildIdCompatabilityResponse. This is technically slightly different, since it shows the last version and the current compatible pollers rather than the "next version"... but it seems like we should probably just get rid of this unless we remember we need it for some reason, in which case we can add it then.

Yes, I think this makes sense to add regardless
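As a rough illustration of deriving the "no compatible pollers" warning from data the compatibility API already exposes (the build ID that processed a workflow's previous task plus the build IDs that currently have pollers), here is a hypothetical check. It assumes "compatible" means "in the same compatible set" and does not reflect the actual shape of `active_versions_and_pollers`.

```python
from typing import Iterable, List


def has_compatible_poller(sets: List[List[str]], last_build_id: str,
                          poller_build_ids: Iterable[str]) -> bool:
    """True if some poller's build ID is compatible with `last_build_id`,
    the build ID that processed the workflow's previous task."""
    # A build ID that is not in any known set is only compatible with itself.
    compatible = {last_build_id}
    for s in sets:
        if last_build_id in s:
            compatible = set(s)
            break
    return any(b in compatible for b in poller_build_ids)
```

A UI or CLI would show the warning when this returns False for a running workflow.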


dnr · 2023-02-25T01:21:13Z

temporal/api/workflowservice/v1/request_response.proto

    // set, that value should also be considered as the `binary_checksum`.
-    temporal.api.taskqueue.v1.VersionId worker_versioning_id = 5;
+    temporal.api.common.v1.WorkerVersionCapabilities worker_version_capabilities = 5;


Is there going to be a capability flag for this? How should a versioned worker behave if it connects to a server that doesn't support versioning?

There is:

    bool build_id_based_versioning = 6;

(api/temporal/api/workflowservice/v1/request_response.proto, line 875 in 44e4397)

Good question. We hadn't defined that. IMO the right thing to do is for the worker to throw errors on startup (we get capabilities on startup, so if you configured your worker to use versioning but the server can't do it, we can throw then).

In a discussion this week we decided we want to be able to turn this on for individual namespaces, so we might need more than capabilities. We could figure that out in another PR.
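A minimal sketch of the fail-fast startup check described here, assuming the worker has already fetched the server's capabilities. Only the `build_id_based_versioning` flag name comes from the proto; `ServerCapabilities`, `VersioningNotSupportedError`, and `check_versioning_support` are hypothetical, and per-namespace enablement (which may need more than a capability flag) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ServerCapabilities:
    # Mirrors the `build_id_based_versioning` capability flag.
    build_id_based_versioning: bool = False


class VersioningNotSupportedError(RuntimeError):
    pass


def check_versioning_support(use_versioning: bool, caps: ServerCapabilities) -> None:
    """Fail fast at worker startup if versioning is configured but the server
    did not advertise support for it."""
    if use_versioning and not caps.build_id_based_versioning:
        raise VersioningNotSupportedError(
            "worker is configured for build-ID-based versioning, "
            "but the server does not support it"
        )
```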

dnr · 2023-02-25T01:27:41Z

temporal/api/common/v1/message.proto

+    string build_id = 1;
+    // Set if the worker used a dynamically loadable bundle to process
+    // the task. The bundle could be a WASM blob, JS bundle, etc.
+    string bundle_id = 2;


Is bundle_id included in build_id, i.e., does build_id definitely change if bundle_id changes? The compatible-set stuff is all about build IDs, not bundle IDs at all, so if one build_id can use multiple bundles, it seems like this won't work.

My intention was that any worker with a given build_id should be able to load up all bundle_ids that any other compatible worker (as defined by build_id compatibility) would also be able to load. Hence by construction it should slot into the idea of build_id compatibility.

To directly answer your question: no, build_id doesn't necessarily change if bundle_id changes. However, from a compatibility perspective it should be consistent. For example, if workers A and B have the same (or compatible) build_ids, this scenario can happen and work:

A takes task one and replies with build ID foo and bundle_id b_1

B takes the next task, with build ID foo.1 and bundle b_2

A takes the next task. It still has build ID foo and may not have b_2, but it can download it, and that is expected to work because foo is compatible with foo.1

How does A know that it should use b_2 at that point? I know this feature isn't fully designed yet, just trying to make sure we won't have to make incompatible changes here later, e.g. needing to have bundle ids also (i.e. the whole WorkerVersionStamp) where this change just has build ids

Ah, because it simply looks at the history which includes the version stamp on the last WFT complete, and downloads that bundle.

Oh duh, I'm thinking too server-centric
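The worker-side flow in this thread (A, B, foo, foo.1, b_1, b_2) can be sketched as follows. It assumes the worker sees one optional `WorkerVersionStamp` per completed workflow task in history (None when that task's worker wasn't versioned) and has some way to download a bundle by ID; the helper names and the download callback are hypothetical, since bundle support was not yet designed at the time of this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class WorkerVersionStamp:
    build_id: str
    bundle_id: str = ""  # empty when no dynamically loaded bundle was used


def bundle_for_next_task(history_stamps: List[Optional[WorkerVersionStamp]]) -> Optional[str]:
    """Return the bundle ID recorded on the most recent versioned completed
    workflow task, or None if there is none."""
    for stamp in reversed(history_stamps):
        if stamp is not None:
            return stamp.bundle_id or None
    return None


def load_bundle(bundle_id: str, cache: Dict[str, bytes],
                download: Callable[[str], bytes]) -> bytes:
    # A compatible worker may not have this bundle yet; fetch it on demand.
    if bundle_id not in cache:
        cache[bundle_id] = download(bundle_id)
    return cache[bundle_id]
```

In the scenario above, worker A would find b_2 on B's stamp, miss it in its cache, download it once, and continue.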


Sushisource requested review from a team as code owners September 14, 2022 21:44

bergundy approved these changes Sep 15, 2022

temporal/api/taskqueue/v1/message.proto (outdated)

temporal/api/common/v1/message.proto (outdated)

bergundy reviewed Sep 15, 2022

temporal/api/common/v1/message.proto (outdated)

cretz reviewed Sep 16, 2022

temporal/api/common/v1/message.proto (outdated)

temporal/api/taskqueue/v1/message.proto (outdated)

dnr reviewed Sep 20, 2022

Sushisource force-pushed the versioning-api-refactor branch from 87faccb to cb9b18c September 26, 2022 17:46

Sushisource mentioned this pull request Sep 27, 2022

Use new API's simpler representation of versioning data temporalio/temporal#3432

Merged

Sushisource force-pushed the versioning-api-refactor branch 3 times, most recently from e84fb09 to e219a57 September 27, 2022 23:49

bergundy reviewed Sep 27, 2022

temporal/api/history/v1/message.proto (outdated)

bergundy reviewed Sep 27, 2022

temporal/api/taskqueue/v1/message.proto (outdated)

bergundy reviewed Sep 27, 2022

temporal/api/workflowservice/v1/request_response.proto (outdated)

bergundy reviewed Sep 28, 2022

temporal/api/workflowservice/v1/request_response.proto (outdated)

bergundy reviewed Sep 28, 2022

Sushisource force-pushed the versioning-api-refactor branch from 0cced3e to 14e34f2 October 26, 2022 17:16

Sushisource force-pushed the versioning-api-refactor branch from 2f4d2c0 to aa7bd92 November 2, 2022 19:44

bergundy reviewed Nov 10, 2022

temporal/api/workflowservice/v1/request_response.proto (outdated)

bergundy reviewed Nov 10, 2022

temporal/api/workflowservice/v1/request_response.proto (outdated)

bergundy reviewed Nov 10, 2022

temporal/api/workflowservice/v1/request_response.proto

Sushisource added 6 commits February 22, 2023 14:15

Refactor versioning API

239d94c

* Simpler representation of version "graph" * Support id for each compatible set * Support for future WASM bundles

Oneof-ify the update message for more clarity

08a130d

Minor renames/comment adjust

98dd416

Add promote-within-set operation

b7ebed8

Add clarifying comments about reasons for set ids to exist in API

fb952ae

Add flags and response info for questions about build ids

9f3d295

Sushisource force-pushed the versioning-api-refactor branch from aa7bd92 to 9f3d295 February 22, 2023 22:16

Final bits of review feedback now that we're coming back to this

059335f

dnr reviewed Feb 25, 2023

Sushisource added 2 commits February 27, 2023 15:33

Address David's recent feedback

c541c57

Fix extraneous field

32e0a1d

dnr approved these changes Mar 10, 2023

Merge branch 'master' into versioning-api-refactor

2ba5287

Sushisource merged commit 9206f15 into master Mar 13, 2023

Sushisource deleted the versioning-api-refactor branch March 13, 2023 19:29
