Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement server side retry for workflow/child workflow #915

Merged
merged 1 commit into from
Aug 23, 2018

Conversation

yiminc-zz
Copy link

@yiminc-zz yiminc-zz commented Jun 29, 2018

This PR added server side retry for workflow/child workflow.
The change was landed #885 but reverted because of regression when doing replication for ContinueAsNew event.
The regression is fixed.

newRunExecutionInfo.NextEventID = nextEventID
newRunExecutionInfo.LastFirstEventID = startedEvent.GetEventId()
// Set the history from replication task on the newStateBuilder
newRunStateBuilder.SetHistoryBuilder(newHistoryBuilderFromEvents(newRunHistory.Events, b.logger))
sourceClusterName := b.clusterMetadata.ClusterNameForFailoverVersion(startedEvent.GetVersion())
newRunStateBuilder.UpdateReplicationStateLastEventID(sourceClusterName, startedEvent.GetVersion(), nextEventID-1)

b.newRunTransferTasks = append(b.newRunTransferTasks, b.scheduleDecisionTransferTask(domainID,
b.getTaskList(newRunStateBuilder), di.ScheduleID))
if startedAttributes.GetAttempt() == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if attempt is !=0, then no decision is scheduled?
this can be fine if the workflow is being replicated, however, when doing a failover and the new attempt workflow is picked up by the previous standby cluster (now active), the no worker can pick this workflow up.
this workflow will be invisible until the decision task scheduled event is replicated from active cluster (now standby after the failover). however, there is no guarantee that this decision task scheduled event will be replicated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first scheduled event will be created by a backoff timer, the timer should fire and because it is now become active, it would create the scheduled event locally.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the stand by timer queue process will need to be updated to handle that case. will update it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yiminc the standby processor only does verification, as long as the timer task is created, the failover will handle the task processing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the failover processing logic is the same as the active processing logic

@@ -1392,6 +1423,9 @@ Update_History_Loop:
// the history and try the operation again.
var updateErr error
if continueAsNewBuilder != nil {
if msBuilder.GetContinueAsNew() != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this will always be non nil after checking the continueAsNewBuilder != nil
ref: https://github.com/uber/cadence/pull/915/files#diff-01045b0895719962d1f13440e8795d1bR2416

Copy link
Contributor

@samarabbas samarabbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'm ok landing this after cutting the release for XDC.

@yiminc-zz yiminc-zz changed the title Fix ContinueAsNew event replication Implement server side retry for child workflow Aug 22, 2018
@yiminc-zz yiminc-zz changed the title Implement server side retry for child workflow Implement server side retry for workflow/child workflow Aug 22, 2018
…cadence-workflow#910)"

This reverts commit 492faf0.

Fix continueAsNew replication issue

update doc for retry policy in idl

remove unnecessary nil check

handle WorkflowRetryTimerTask correctly on standby side

update test for cassandra tools
@yiminc-zz yiminc-zz merged commit c183384 into cadence-workflow:master Aug 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants