Implement server side retry for workflow/child workflow #915

yiminc-zz · 2018-06-29T22:51:51Z

This PR adds server-side retry for workflows and child workflows.
The change originally landed in #885 but was reverted because of a regression when replicating the ContinueAsNew event.
That regression is now fixed.

wxing1292 · 2018-06-30T21:55:02Z

service/history/stateBuilder.go

 			newRunExecutionInfo.NextEventID = nextEventID
 			newRunExecutionInfo.LastFirstEventID = startedEvent.GetEventId()
 			// Set the history from replication task on the newStateBuilder
 			newRunStateBuilder.SetHistoryBuilder(newHistoryBuilderFromEvents(newRunHistory.Events, b.logger))
 			sourceClusterName := b.clusterMetadata.ClusterNameForFailoverVersion(startedEvent.GetVersion())
 			newRunStateBuilder.UpdateReplicationStateLastEventID(sourceClusterName, startedEvent.GetVersion(), nextEventID-1)

-			b.newRunTransferTasks = append(b.newRunTransferTasks, b.scheduleDecisionTransferTask(domainID,
-				b.getTaskList(newRunStateBuilder), di.ScheduleID))
+			if startedAttributes.GetAttempt() == 0 {


if attempt != 0, then no decision is scheduled?
this can be fine while the workflow is being replicated. however, after a failover, when the new-attempt workflow is picked up by the previous standby cluster (now active), no worker can pick this workflow up.
the workflow will be invisible until the decision task scheduled event is replicated from the previously active cluster (now standby after the failover). however, there is no guarantee that this decision task scheduled event will be replicated.

the first scheduled event will be created by a backoff timer. the timer should fire, and because the cluster has now become active, it would create the scheduled event locally.

I think the standby timer queue processor will need to be updated to handle that case. Will update it.

@yiminc the standby processor only does verification, as long as the timer task is created, the failover will handle the task processing

the failover processing logic is the same as the active processing logic
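
The thread above can be summarized in a small model. This is only a sketch under stated assumptions: `taskKind`, `run`, `tasksForNewRun`, and `processRetryTimer` are hypothetical simplifications, not the types in service/history. The only name borrowed from the PR is `WorkflowRetryTimerTask`, which appears in the commit message.

```go
package main

import "fmt"

// All types below are hypothetical simplifications for illustration.
type taskKind string

const (
	decisionTransferTask taskKind = "DecisionTransferTask"
	retryBackoffTimer    taskKind = "WorkflowRetryTimerTask"
)

// tasksForNewRun mirrors the branch under review: a first attempt gets a
// decision scheduled right away, while a retry attempt only gets a backoff
// timer that schedules the first decision when it fires.
func tasksForNewRun(attempt int32) []taskKind {
	if attempt == 0 {
		return []taskKind{decisionTransferTask}
	}
	return []taskKind{retryBackoffTimer}
}

// run is a stand-in for the mutable state of one workflow run.
type run struct {
	decisionScheduled bool
}

// processRetryTimer models the fired backoff timer. It returns true when the
// task is complete. An active cluster (including one that became active
// through failover) creates the decision scheduled event locally. A standby
// cluster only verifies that the event has already been replicated.
func processRetryTimer(r *run, clusterIsActive bool) bool {
	if clusterIsActive {
		r.decisionScheduled = true
		return true
	}
	return r.decisionScheduled // standby: verify only, never mutate
}

func main() {
	fmt.Println(tasksForNewRun(0), tasksForNewRun(1))

	r := &run{}
	fmt.Println("standby done:", processRetryTimer(r, false))
	// Failover: the same timer task is now processed with active logic.
	fmt.Println("active done:", processRetryTimer(r, true), "scheduled:", r.decisionScheduled)
}
```

In this model the retried run is never stranded after a failover, because the timer task already exists and active processing creates the decision without waiting for replication.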

wxing1292 · 2018-06-30T22:40:23Z

service/history/historyEngine.go

@@ -1392,6 +1423,9 @@ Update_History_Loop:
 		// the history and try the operation again.
 		var updateErr error
 		if continueAsNewBuilder != nil {
+			if msBuilder.GetContinueAsNew() != nil {


i think this will always be non-nil once continueAsNewBuilder != nil has been checked
ref: https://github.com/uber/cadence/pull/915/files#diff-01045b0895719962d1f13440e8795d1bR2416
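
A sketch of the invariant the reviewer is pointing at, assuming the continue-as-new info and the new-run builder are produced by the same call. The types and the `addContinueAsNew` helper are hypothetical and much smaller than the real mutable state builder.

```go
package main

import "fmt"

// Hypothetical stand-ins for illustration only.
type continueAsNewInfo struct{ newRunID string }

type mutableState struct{ continueAsNew *continueAsNewInfo }

func (m *mutableState) GetContinueAsNew() *continueAsNewInfo { return m.continueAsNew }

// addContinueAsNew records the continue-as-new info on the current run and
// returns the builder for the new run in one step, so a caller holding a
// non-nil builder can rely on GetContinueAsNew() being non-nil as well.
func (m *mutableState) addContinueAsNew(newRunID string) *mutableState {
	m.continueAsNew = &continueAsNewInfo{newRunID: newRunID}
	return &mutableState{}
}

func main() {
	ms := &mutableState{}
	continueAsNewBuilder := ms.addContinueAsNew("run-2")
	if continueAsNewBuilder != nil {
		// Under the assumption above, a second nil check here is redundant.
		fmt.Println(ms.GetContinueAsNew() != nil)
	}
}
```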

samarabbas

Looks good. I'm ok landing this after cutting the release for XDC.

…cadence-workflow#910)"
This reverts commit 492faf0.
Fix continueAsNew replication issue
update doc for retry policy in idl
remove unnecessary nil check
handle WorkflowRetryTimerTask correctly on standby side
update test for cassandra tools

yiminc-zz requested review from samarabbas and wxing1292 June 29, 2018 22:51

wxing1292 reviewed Jun 30, 2018

samarabbas approved these changes Aug 3, 2018

yiminc-zz force-pushed the continue_as_new branch from 083f300 to fb6aae4 Compare August 22, 2018 20:39

yiminc-zz changed the title ~~Fix ContinueAsNew event replication~~ Implement server side retry for child workflow Aug 22, 2018

yiminc-zz changed the title ~~Implement server side retry for child workflow~~ Implement server side retry for workflow/child workflow Aug 22, 2018

yiminc-zz force-pushed the continue_as_new branch from 5ea2bc2 to 84e65fd Compare August 22, 2018 23:34

yiminc-zz merged commit c183384 into cadence-workflow:master Aug 23, 2018

wxing1292 Jun 30, 2018

yiminc-zz Jul 2, 2018

yiminc-zz Jul 3, 2018

wxing1292 Jul 3, 2018

wxing1292 Jul 3, 2018

wxing1292 Jun 30, 2018

samarabbas left a comment
