Expose history size to workflows #3055
Conversation
@@ -221,9 +233,6 @@ func (m *workflowTaskStateMachine) AddWorkflowTaskScheduleToStartTimeoutEvent(
 		return nil, m.ms.createInternalServerError(opTag)
 	}
 
-	// clear stickiness whenever workflow task fails
-	m.ms.ClearStickyness()
this isn't needed here because ReplicateWorkflowTaskTimedOutEvent(enumspb.TIMEOUT_TYPE_SCHEDULE_TO_START) below will always call ClearStickyness itself.
It could affect the incrementTimeout calculation in FailWorkflowTask, but because the timeout type is enumspb.TIMEOUT_TYPE_SCHEDULE_TO_START, it is false anyway, so I think it is safe to remove it from here.
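A minimal, self-contained sketch of the call path being described, using simplified stand-in types; the body of ReplicateWorkflowTaskTimedOutEvent is assumed from the comment above, not copied from the implementation:

```go
// Sketch only: illustrates why the explicit ClearStickyness call removed above is
// redundant, assuming (per the review comment) that ReplicateWorkflowTaskTimedOutEvent
// always clears stickiness itself. Types are simplified stand-ins.
package sketch

type timeoutType int

const timeoutTypeScheduleToStart timeoutType = 1

type mutableState struct{ stickyTaskQueue string }

// ClearStickyness keeps the real method's name (including its spelling).
func (ms *mutableState) ClearStickyness() { ms.stickyTaskQueue = "" }

type workflowTaskStateMachine struct{ ms *mutableState }

// AddWorkflowTaskScheduleToStartTimeoutEvent no longer clears stickiness directly;
// the replicate step below always does it.
func (m *workflowTaskStateMachine) AddWorkflowTaskScheduleToStartTimeoutEvent() {
	m.ReplicateWorkflowTaskTimedOutEvent(timeoutTypeScheduleToStart)
}

func (m *workflowTaskStateMachine) ReplicateWorkflowTaskTimedOutEvent(tt timeoutType) {
	// Other timed-out bookkeeping elided; stickiness is cleared for every timeout type.
	m.ms.ClearStickyness()
}
```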
I think we only need the values in WorkflowTaskStartedEventAttributes.
Could you explain the purpose of the values in ExecutionInfo, and the purpose of the ones in WorkflowTaskInfo?
    bool workflow_task_suggest_continue_as_new = 67;
    int64 workflow_task_history_size_bytes = 68;

    bool cancel_requested = 29;
shall we keep the field numbers in order?
no, I'd much rather see fields grouped by function
service/history/configs/config.go
HistorySizeSuggestContinueAsNew:  dc.GetIntPropertyFilteredByNamespace(dynamicconfig.HistorySizeSuggestContinueAsNew, 2*1024*1024),
HistoryCountSuggestContinueAsNew: dc.GetIntPropertyFilteredByNamespace(dynamicconfig.HistoryCountSuggestContinueAsNew, 2*1024),
Feels a bit aggressive. Maybe double it to 4MB and 4K, which is still arbitrary. But the gRPC default size limit is also 4MB. :)
I don't have a great feel for it, so I'll trust your judgement here. I'm curious what the SDK team would say.
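For concreteness, the proposed bump would just change the defaults in the registrations shown above; a sketch in the same style as the diff, keeping the key names from it:

```go
// Same registrations as in the hunk above, with the defaults doubled to 4MB / 4K events.
HistorySizeSuggestContinueAsNew:  dc.GetIntPropertyFilteredByNamespace(dynamicconfig.HistorySizeSuggestContinueAsNew, 4*1024*1024),
HistoryCountSuggestContinueAsNew: dc.GetIntPropertyFilteredByNamespace(dynamicconfig.HistoryCountSuggestContinueAsNew, 4*1024),
```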
	if stats == nil {
		return false, 0
	}
	// QUESTION: in some cases we might have history events in memory that we haven't written
do you mean buffered events? Those are not visible to the workflow yet, so I think they should not be counted.
Not exactly... look at where this is called. Consider the case where we had buffered events, and then we retry the wft. So on line 400, we do AddWorkflowTaskScheduledEvent and then reset Attempt to 1, so we end up here. That WorkflowTaskScheduled event will be visible to the workflow, but it won't be counted in HistorySize here, since that only gets updated when the transaction is closed.
If I get it right, stats is updated on write. We get here when some events have been added to history but not persisted yet, and stats doesn't reflect them, while m.ms.GetNextEventID() is the current in-memory last event. I think they are not consistent here, but this is probably not a big deal.
Not consistent (in the sense I described) would be a big deal. But I think they are consistent but just not accurate, which isn't a big deal
Sorry, by "consistent" I mean consistency between "size" and "count". "Accurate" is a better word, yes.
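To make the shape of this check concrete, here is a rough, self-contained sketch of a getHistorySizeInfo-style helper; the types and exact bookkeeping are simplified assumptions, not the PR's implementation:

```go
// Simplified stand-ins for the real mutable-state types; only the control flow from the
// hunk above (the nil-stats guard) is taken verbatim.
package sketch

type executionStats struct {
	HistorySize int64 // bytes written so far; only updated when a transaction is closed
}

type suggestConfig struct {
	HistorySizeSuggestContinueAsNew  int64 // e.g. 2*1024*1024, per the config hunk above
	HistoryCountSuggestContinueAsNew int64 // e.g. 2*1024
}

// getHistorySizeInfo returns (suggestContinueAsNew, historySizeBytes). As discussed in
// the thread, stats can lag the in-memory events slightly; that is tolerable as long as
// size and count are used consistently.
func getHistorySizeInfo(stats *executionStats, nextEventID int64, cfg suggestConfig) (bool, int64) {
	if stats == nil {
		return false, 0
	}
	historySizeBytes := stats.HistorySize
	historyCount := nextEventID - 1 // approximate event count from the next event ID
	suggest := historySizeBytes >= cfg.HistorySizeSuggestContinueAsNew ||
		historyCount >= cfg.HistoryCountSuggestContinueAsNew
	return suggest, historySizeBytes
}
```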
	// QUESTION: should we preserve these here? this is used by mutable state rebuilder. it
	// seems like the same logic as case 1 above applies: if a failover happens right after
	// this, then AddWorkflowTaskStartedEvent will rewrite these anyway. is that correct?
I'm not sure about this. cc @yycptt
Yeah, I think the value here doesn't matter, as it will get overwritten anyway, either when starting the workflow task (if a failover happens) or when replicating the started event (no failover).
Same here. Let's not set them until the WT is started.
It's an obscure corner case, but I thought in the discussion we decided we needed to handle it: what if you send the worker a WorkflowTaskStarted event with SuggestContinueAsNew == false, and it fails/times out? Then you send a second attempt, which is now a transient WFT, with SuggestContinueAsNew == false. Then you change dynamic config so that the same history size now makes SuggestContinueAsNew == true. Now the worker responds to the WFT successfully, and you have to write out the transient events to history. If you re-evaluate SuggestContinueAsNew at that point and write a WorkflowTaskStarted event with it as true, you'll get a determinism error on replay (assuming the workflow follows the suggestion). If we didn't use dynamic config, I agree we wouldn't have to keep it in mutable state.
@@ -1209,32 +1209,6 @@ func (e *MutableStateImpl) DeleteUserTimer(
 	return nil
 }
 
-// nolint:unused
This is already gone.
host/transient_task_test.go
}

// workflow logic
stage := 0
I call it wtHandlerCalls.
@@ -87,16 +87,21 @@ message WorkflowExecutionInfo {
     int64 last_workflow_task_started_event_id = 19;
     google.protobuf.Timestamp start_time = 20 [(gogoproto.stdtime) = true];
     google.protobuf.Timestamp last_update_time = 21 [(gogoproto.stdtime) = true];
 
+    // This group of fields contains info about the current in-flight workflow task
Suggested change:
-    // This group of fields contains info about the current in-flight workflow task
+    // This group of fields contains info about the current workflow task

"in-flight" means running in other places. I reordered these already too.
SuggestContinueAsNew: suggestContinueAsNew,
HistorySizeBytes:     historySizeBytes,
Oh, the attributes have already been there for half a year.
	// These two fields are sent to workers in the WorkflowTaskStarted event. We need to save a
	// copy here to ensure that we send the same values with every transient WorkflowTaskStarted
	// event, otherwise a dynamic config change of the suggestion threshold could cause the
	// event that the worker used to not match the event we saved in history.
	SuggestContinueAsNew bool
	HistorySizeBytes     int64
Are you trying to make history deterministic? I don't think it is necessary. First of all, SDKs can ignore these fields in the non-determinism detector, the same way they do for activity arguments. But even if they don't, the SDK will just replay history from the beginning, which is OK for workflows with continuously failing WTs. I think the SDK already does this (maybe not).
But you still need these fields just to pass this data around. All WT related fields from executions.proto must be here.
as we discussed: they can change across attempts, but we do need to keep them and can't just recompute them because of determinism. updated comment
@@ -614,7 +633,10 @@ func (m *workflowTaskStateMachine) DeleteWorkflowTask() {
 
 		TaskQueue: nil,
 		// Keep the last original scheduled Timestamp, so that AddWorkflowTaskScheduledEventAsHeartbeat can continue with it.
-		OriginalScheduledTime: m.getWorkflowTaskInfo().OriginalScheduledTime,
+		OriginalScheduledTime: m.ms.executionInfo.WorkflowTaskOriginalScheduledTime,
Please leave getWorkflowTaskInfo(). It will help me refactor the WT state machine in the future.
workflowTask.SuggestContinueAsNew, workflowTask.HistorySizeBytes = m.getHistorySizeInfo()
Why not compute it for every attempt?
If we compute it for every attempt, but only actually write the event to history on the first attempt, then the value used by the workflow may be different from the value written to history, so replay would cause a determinism error. Like 90% of the complexity of this PR is just for that
For the record, the answer here is that we write a new started event when the retried WFT completes, so as long as that new started event has the right values, we're good.
rebased!
	workflowTask.SuggestContinueAsNew,
	workflowTask.HistorySizeBytes,
I was a little unsure about these. I think it probably should recompute them here? (It seems logical that it should recompute every time hBuilder.AddWorkflowTaskStartedEvent is called, and not at any other time.) But I'm not sure how speculative workflow tasks work...
this is a failure, so it doesn't matter
	if stats == nil {
		return false, 0
	}
	// QUESTION: in some cases we might have history events in memory that we haven't written
simplified a little based on discussion
	// events. That's okay, it doesn't have to be 100% accurate. It just has to be kept
	// consistent between the started event in history and the event that was sent to the SDK
	// that resulted in the successful completion.
	suggestContinueAsNew, historySizeBytes := m.getHistorySizeInfo()
as discussed, just compute every time we get here
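A minimal sketch of the approach settled on here: recompute both values at the point the real WorkflowTaskStarted event is appended (including the one written when a retried, transient workflow task finally completes). The builder and attribute types below are simplified stand-ins, not the real history builder API:

```go
package sketch

type workflowTaskStartedAttributes struct {
	SuggestContinueAsNew bool
	HistorySizeBytes     int64
}

type historyBuilder struct {
	events []workflowTaskStartedAttributes
}

func (b *historyBuilder) AddWorkflowTaskStartedEvent(attrs workflowTaskStartedAttributes) {
	b.events = append(b.events, attrs)
}

// addStartedEvent computes suggestContinueAsNew/historySizeBytes right before the event
// is built, so the persisted event reflects the state at the time it is actually written.
func addStartedEvent(b *historyBuilder, getHistorySizeInfo func() (bool, int64)) {
	suggestContinueAsNew, historySizeBytes := getHistorySizeInfo()
	b.AddWorkflowTaskStartedEvent(workflowTaskStartedAttributes{
		SuggestContinueAsNew: suggestContinueAsNew,
		HistorySizeBytes:     historySizeBytes,
	})
}
```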
	// These two fields are sent to workers in the WorkflowTaskStarted event. We need to save a
	// copy in mutable state to know the last values we sent (which might have been in a
	// transient event), otherwise a dynamic config change of the suggestion threshold could
	// cause the WorkflowTaskStarted event that the worker used to not match the event we saved
	// in history.
I guess this comment also needs to be updated.
what part is wrong? I just updated this one
What changed?
This fills in HistorySizeBytes and SuggestContinueAsNew on WorkflowTaskStartedEventAttributes, added in temporalio/api#178.

Why?
So workflows can decide whether to continue-as-new with less guessing. Fixes #2726 and #1114.
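For context on the intended use, here is a hedged sketch of the workflow-side consumption via the Go SDK. The GetContinueAsNewSuggested accessor is the SDK surface that eventually exposes this event field; treat the exact method names as assumptions, since the SDK changes are outside this PR:

```go
// Illustrative only: a workflow loop that continues-as-new when the server suggests it.
package sample

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func LongRunningWorkflow(ctx workflow.Context, processed int) error {
	for {
		// Stand-in for one unit of work (normally an activity call).
		if err := workflow.Sleep(ctx, time.Minute); err != nil {
			return err
		}
		processed++

		// Assumed SDK accessor backed by the SuggestContinueAsNew field added here.
		if workflow.GetInfo(ctx).GetContinueAsNewSuggested() {
			// Carry state into a fresh run instead of letting history grow unbounded.
			return workflow.NewContinueAsNewError(ctx, LongRunningWorkflow, processed)
		}
	}
}
```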
How did you test it?
New integration test
Potential risks
Bugs in this logic could lead to inconsistency between the values sent in transient workflow tasks and values actually recorded in history, which could lead to determinism errors on replay.
Is hotfix candidate?
no