Execute VerifyReplicationTasks as an individual activity #4656

hehaifengcn · 2023-07-20T05:57:57Z

What changed?
Split the GenerateAndVerifyReplicationTasks activity into two activities: GenerateReplicationTasks (reusing the existing activity) and VerifyReplicationTasks.

Why?
Based on cluster tests, GenerateReplicationTasks is expensive (10ms latency for each GenerateLastHistoryReplicationTasks call). In the previous implementation, verification ran after GenerateReplicationTasks within the same activity, and we only got ~60 RPS for GenerateAndVerifyReplicationTasks. By separating the two, a single VerifyReplicationTasks activity can achieve ~100 RPS (the bottleneck is still GenerateReplicationTasks because of its 10ms latency).

Also moved the special handling of workflows that are not found on the target into VerifyReplicationTasks, which reduces the number of DescribeMutableState calls on the source cluster. In the previous implementation, DescribeMutableState was called for every replication task. Now we only call DescribeMutableState if the workflow was not found on the target (which should be rare in steady state). The downside is that we can potentially replicate a Zombie workflow from source to target. This should be avoidable by eliminating Zombies during the migration process (i.e., deleting the workflow on the target if migration is incomplete).

How did you test it?
Unit tests and cluster tests.

Potential risks
Low. The feature is disabled by default and only affects the force replication workflow.

Is hotfix candidate?
No.

service/worker/migration/activities.go

wxing1292 · 2023-07-20T22:54:42Z

service/worker/migration/force_replication_workflow.go

@@ -374,41 +374,49 @@ func enqueueReplicationTasks(ctx workflow.Context, workflowExecutionsCh workflow
 	var lastActivityErr error

 	for workflowExecutionsCh.Receive(ctx, &workflowExecutions) {


Do you want to control the activity concurrency?

If so, here is an example: https://github.com/uber/cadence/blob/master/canary/concurrentExec.go

Explained offline. The code already handles concurrency control.

yux0 reviewed Jul 20, 2023

service/worker/migration/activities.go (outdated, resolved)

hehaifengcn marked this pull request as ready for review July 20, 2023 18:58

hehaifengcn requested a review from a team as a code owner July 20, 2023 18:58

hehaifengcn requested a review from wxing1292 July 20, 2023 18:58

hehaifengcn force-pushed the haifengh/v1.21.2-verify-test-parallel-pr branch from fa6fdd6 to e94efbb July 20, 2023 19:07

wxing1292 reviewed Jul 20, 2023

wxing1292 approved these changes Jul 20, 2023

hehaifengcn added 2 commits July 20, 2023 20:38

Execute VerifyReplicationTasks as an individual activity

d9ca331

update

990079d

hehaifengcn force-pushed the haifengh/v1.21.2-verify-test-parallel-pr branch from e94efbb to 990079d July 21, 2023 03:38

meiliang86 approved these changes Jul 21, 2023

yux0 approved these changes Jul 21, 2023

hehaifengcn merged commit 00587f6 into master Jul 21, 2023

hehaifengcn deleted the haifengh/v1.21.2-verify-test-parallel-pr branch July 21, 2023 05:04

meiliang86 added the release/1.21.3 label Jul 21, 2023
