-
Notifications
You must be signed in to change notification settings - Fork 813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement server side retry for workflow/child workflow #915
Conversation
newRunExecutionInfo.NextEventID = nextEventID | ||
newRunExecutionInfo.LastFirstEventID = startedEvent.GetEventId() | ||
// Set the history from replication task on the newStateBuilder | ||
newRunStateBuilder.SetHistoryBuilder(newHistoryBuilderFromEvents(newRunHistory.Events, b.logger)) | ||
sourceClusterName := b.clusterMetadata.ClusterNameForFailoverVersion(startedEvent.GetVersion()) | ||
newRunStateBuilder.UpdateReplicationStateLastEventID(sourceClusterName, startedEvent.GetVersion(), nextEventID-1) | ||
|
||
b.newRunTransferTasks = append(b.newRunTransferTasks, b.scheduleDecisionTransferTask(domainID, | ||
b.getTaskList(newRunStateBuilder), di.ScheduleID)) | ||
if startedAttributes.GetAttempt() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if attempt is !=0, then no decision is scheduled?
this can be fine if the workflow is being replicated, however, when doing a failover and the new attempt workflow is picked up by the previous standby cluster (now active), the no worker can pick this workflow up.
this workflow will be invisible until the decision task scheduled event is replicated from active cluster (now standby after the failover). however, there is no guarantee that this decision task scheduled event will be replicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first scheduled event will be created by a backoff timer, the timer should fire and because it is now become active, it would create the scheduled event locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the stand by timer queue process will need to be updated to handle that case. will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yiminc the standby processor only does verification, as long as the timer task is created, the failover will handle the task processing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the failover processing logic is the same as the active processing logic
service/history/historyEngine.go
Outdated
@@ -1392,6 +1423,9 @@ Update_History_Loop: | |||
// the history and try the operation again. | |||
var updateErr error | |||
if continueAsNewBuilder != nil { | |||
if msBuilder.GetContinueAsNew() != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this will always be non nil after checking the continueAsNewBuilder != nil
ref: https://github.com/uber/cadence/pull/915/files#diff-01045b0895719962d1f13440e8795d1bR2416
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'm ok landing this after cutting the release for XDC.
083f300
to
fb6aae4
Compare
…cadence-workflow#910)" This reverts commit 492faf0. Fix continueAsNew replication issue update doc for retry policy in idl remove unnecessary nil check handle WorkflowRetryTimerTask correctly on standby side update test for cassandra tools
5ea2bc2
to
84e65fd
Compare
This PR added server side retry for workflow/child workflow.
The change was landed #885 but reverted because of regression when doing replication for ContinueAsNew event.
The regression is fixed.