Add verification of replication tasks in force replication #4630

hehaifengcn · 2023-07-13T23:44:42Z

What changed?
Add a verification step that checks whether the workflow executions for which replication tasks were generated exist on the target cluster.

Why?
To ensure that all generated replication tasks have been successfully applied on the target cluster.

How did you test it?
Unit tests + cluster tests

Potential risks

Is hotfix candidate?
No

hehaifengcn · 2023-07-13T23:52:34Z

service/worker/migration/activities.go

-	case *serviceerror.NotFound:
-		return nil
-	default:
+	if err != nil {
 		return err


Surface NotFound so that verification can skip such workflows.

wxing1292 · 2023-07-14T00:46:49Z

service/worker/migration/activities.go

+		we := request.Executions[i]
+		tags := []tag.Tag{tag.WorkflowType(forceReplicationWorkflowName), tag.WorkflowNamespaceID(request.NamespaceID), tag.WorkflowID(we.WorkflowId), tag.WorkflowRunID(we.RunId)}
+
+		resp, err := a.historyClient.DescribeMutableState(ctx, &historyservice.DescribeMutableStateRequest{


a.historyClient -> a.localHistoryClient?

or source history client

I think historyClient implicitly means the local client, as elsewhere in the codebase. I can change it if you feel strongly about it.

wxing1292 · 2023-07-14T00:47:59Z

service/worker/migration/activities.go

+
+		switch err.(type) {
+		case nil:
+			if resp.GetCacheMutableState().GetExecutionState().GetState() == enumsspb.WORKFLOW_EXECUTION_STATE_ZOMBIE {


resp.GetCacheMutableState() may return nil here; use GetDatabaseMutableState instead.

Applied. What is the difference between the two? Doesn't GetDatabaseMutableState also just read from the response object?

GetCacheMutableState returns the cached version of the mutable state, which can be nil (not cached).
GetDatabaseMutableState returns the mutable state loaded directly from the DB.
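
To illustrate why the cached state is risky for the zombie check, here is a small sketch. It assumes the generated getters are nil-receiver-safe, as Go protobuf getters normally are, so a nil cached state does not panic but quietly yields the zero enum value. The types below are hand-written, hypothetical stand-ins for the generated ones.

```go
package main

import "fmt"

type state int

const (
	stateUnspecified state = iota
	stateZombie
)

type mutableState struct{ State state }

// GetState mimics a generated getter: it is safe to call on a nil receiver.
func (m *mutableState) GetState() state {
	if m == nil {
		return stateUnspecified
	}
	return m.State
}

type describeResponse struct {
	CacheMutableState    *mutableState // nil when the mutable state is not cached
	DatabaseMutableState *mutableState // loaded from persistence
}

func (r *describeResponse) GetCacheMutableState() *mutableState {
	if r == nil {
		return nil
	}
	return r.CacheMutableState
}

func (r *describeResponse) GetDatabaseMutableState() *mutableState {
	if r == nil {
		return nil
	}
	return r.DatabaseMutableState
}

// isZombie reports whether the given mutable state is in the zombie state.
func isZombie(ms *mutableState) bool { return ms.GetState() == stateZombie }

func main() {
	// A zombie workflow whose mutable state is not cached.
	resp := &describeResponse{DatabaseMutableState: &mutableState{State: stateZombie}}
	fmt.Println("cache check:", isZombie(resp.GetCacheMutableState()))
	fmt.Println("database check:", isZombie(resp.GetDatabaseMutableState()))
}
```

For an uncached zombie workflow, the cache-based check reports false while the database-based check reports true, so the cache-based version would miss the zombie without raising any error.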

wxing1292

overall idea LGTM

The implementation can be improved, e.g. by breaking the giant for loop into a smaller loop plus function invocations.
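
A rough sketch of the suggested shape, under stated assumptions: the per-execution logic moves into its own function and the loop only iterates and collects results. `execution`, `verifyOne`, `verifyAll`, and the `lookup` callback are hypothetical; `lookup` stands in for the calls to the history service and the target cluster.

```go
package main

import "fmt"

type execution struct{ WorkflowID, RunID string }

type verifyStatus int

const (
	notVerified verifyStatus = iota
	verified
	verifySkipped
)

// verifyOne holds the per-execution logic that would otherwise live inside
// the loop body. lookup is a hypothetical stand-in for the remote calls.
func verifyOne(we execution, lookup func(execution) (found, zombie bool)) verifyStatus {
	found, zombie := lookup(we)
	switch {
	case zombie:
		return verifySkipped // zombie workflows are skipped
	case found:
		return verified
	default:
		return notVerified
	}
}

// verifyAll is the remaining loop: iterate, delegate, collect.
func verifyAll(executions []execution, lookup func(execution) (found, zombie bool)) []verifyStatus {
	results := make([]verifyStatus, len(executions))
	for i, we := range executions {
		results[i] = verifyOne(we, lookup)
	}
	return results
}

func main() {
	lookup := func(we execution) (bool, bool) {
		return we.WorkflowID == "wf-found", we.WorkflowID == "wf-zombie"
	}
	executions := []execution{{WorkflowID: "wf-found"}, {WorkflowID: "wf-zombie"}, {WorkflowID: "wf-missing"}}
	fmt.Println(verifyAll(executions, lookup))
}
```

Splitting it this way also makes the per-execution branch easy to unit test without setting up a whole batch.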

service/worker/migration/activities.go

yux0 · 2023-07-14T16:30:42Z

service/worker/migration/activities.go

+	replicationTasksHeartbeatDetails struct {
+		Results                       []VerifyResult
+		CheckPoint                    time.Time
+		LastNotFoundWorkflowExecution commonpb.WorkflowExecution


What is the purpose of this? Should we use a fixed-length slice?

VerifyResult keeps the replication status of each workflow execution in the input, which is a variable-length array:

service/worker/migration/activities.go

yux0 · 2023-07-14T16:40:45Z

common/metrics/metric_defs.go

@@ -1563,6 +1563,12 @@ var (
 	ScheduleCancelWorkflowErrors                      = NewCounterDef("schedule_cancel_workflow_errors")
 	ScheduleTerminateWorkflowErrors                   = NewCounterDef("schedule_terminate_workflow_errors")

+	// Force replication


Do we care about verification failure?

I think we can rely on the activity failure metrics. We can always add a dedicated metric later if needed.

yux0 · 2023-07-14T16:49:23Z

service/worker/migration/activities.go

+				a.forceReplicationMetricsHandler.Counter(metrics.EncounterZombieWorkflowCount.GetMetricName()).Record(1)
+				a.logger.Info("createReplicationTasks skip Zombie workflow", tags...)
+
+				r.Status = VERIFY_SKIPPED


It seems we skip generating the replication task here, not just verification. Should we filter this out in the force replication API?


hehaifengcn marked this pull request as ready for review July 14, 2023 00:20

hehaifengcn requested a review from a team as a code owner July 14, 2023 00:20

hehaifengcn requested review from yux0 and wxing1292 July 14, 2023 00:21

hehaifengcn commented Jul 14, 2023


wxing1292 reviewed Jul 14, 2023


wxing1292 approved these changes Jul 14, 2023


hehaifengcn force-pushed the haifengh/force-replication-master-pr branch from 27300fd to f138cb9 Compare July 14, 2023 03:33

hehaifengcn enabled auto-merge (squash) July 14, 2023 03:35

yux0 reviewed Jul 14, 2023


yux0 approved these changes Jul 14, 2023


hehaifengcn added 6 commits July 14, 2023 14:24

Add verification of replication tasks in force replication

7814aeb

update comments

a921c08

update

4a0e00c

lint

9781abd

address PR comment

13f5029

add license header

bf5dfa0

hehaifengcn force-pushed the haifengh/force-replication-master-pr branch from ae0f352 to bf5dfa0 Compare July 14, 2023 21:24

hehaifengcn merged commit 7efba9f into master Jul 14, 2023

hehaifengcn deleted the haifengh/force-replication-master-pr branch July 14, 2023 22:16

hehaifengcn added the release/1.21.3 label Jul 21, 2023
