Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execute VerifyReplicationTasks as an individual activity #4656

Merged
merged 2 commits into from
Jul 21, 2023

Conversation

hehaifengcn
Copy link
Contributor

@hehaifengcn hehaifengcn commented Jul 20, 2023

What changed?
Divide GenerateAndVerifyReplicationTasks activity into two activities: GenerateReplicationTasks (reuse previous one) and VerifyReplicationTasks

Why?
Based on cluster tests, GenerateReplicationTasks is expensive (10ms latency for GenerateLastHistoryReplicationTasks call). In previous implementation, VerificationTasks runs after GenerateReplicationTasks and we only get ~60 RPS for GenerateAndVerifyReplicationTasks. By dividing the two, we can achieve ~100 RPS VerifyReplicationTasks for a single activity (bottleneck is still GenerateReplicationTasks because of 10ms latency).

Also moved the special handling of WF not_found on target to VerifyReplicationTasks, which reduced # of DescribeMutableState call on source cluster. In previous implementation, DescribeMutableState is called for every replication task. Now we only call DescribeMutableState if WF was not found on target (which should be rare for steady state). The downside is that we can potentially replicate Zombie WF from source to target. But it should be avoidable by eliminating Zombie during migration process (i.e., delete WF on target if migration is incomplete).

How did you test it?
Unit test & cluster tests.

Potential risks
Low, the feature is disabled by default and only affect force replication workflow.

Is hotfix candidate?
No.

@hehaifengcn hehaifengcn marked this pull request as ready for review July 20, 2023 18:58
@hehaifengcn hehaifengcn requested a review from a team as a code owner July 20, 2023 18:58
@hehaifengcn hehaifengcn requested a review from wxing1292 July 20, 2023 18:58
@hehaifengcn hehaifengcn force-pushed the haifengh/v1.21.2-verify-test-parallel-pr branch from fa6fdd6 to e94efbb Compare July 20, 2023 19:07
@@ -374,41 +374,49 @@ func enqueueReplicationTasks(ctx workflow.Context, workflowExecutionsCh workflow
var lastActivityErr error

for workflowExecutionsCh.Receive(ctx, &workflowExecutions) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to control the activity concurrency?

if so, here is a example: https://github.com/uber/cadence/blob/master/canary/concurrentExec.go

Copy link
Contributor Author

@hehaifengcn hehaifengcn Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explained offline. code already handle concurrency control.

@hehaifengcn hehaifengcn force-pushed the haifengh/v1.21.2-verify-test-parallel-pr branch from e94efbb to 990079d Compare July 21, 2023 03:38
@hehaifengcn hehaifengcn merged commit 00587f6 into master Jul 21, 2023
@hehaifengcn hehaifengcn deleted the haifengh/v1.21.2-verify-test-parallel-pr branch July 21, 2023 05:04
wxing1292 pushed a commit that referenced this pull request Jul 21, 2023
<!-- Describe what has changed in this PR -->
**What changed?**
Divide GenerateAndVerifyReplicationTasks activity into two activities:
GenerateReplicationTasks (reuse previous one) and VerifyReplicationTasks

<!-- Tell your future self why have you made these changes -->
**Why?**
Based on cluster tests, GenerateReplicationTasks is expensive (10ms
latency for `GenerateLastHistoryReplicationTasks` call). In previous
implementation, VerificationTasks runs after GenerateReplicationTasks
and we only get ~60 RPS for GenerateAndVerifyReplicationTasks. By
dividing the two, we can achieve ~100 RPS VerifyReplicationTasks for a
single activity (bottleneck is still GenerateReplicationTasks because of
10ms latency).

Also moved the special handling of WF not_found on target to
VerifyReplicationTasks, which reduced # of `DescribeMutableState` call
on source cluster. In previous implementation, `DescribeMutableState` is
called for every replication task. Now we only call
`DescribeMutableState` if WF was not found on target (which should be
rare for steady state). The downside is that we can potentially
replicate Zombie WF from source to target. But it should be avoidable by
eliminating Zombie during migration process (i.e., delete WF on target
if migration is incomplete).

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
Unit test & cluster tests. 

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
Low, the feature is disabled by default and only affect force
replication workflow.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
No.
hehaifengcn added a commit that referenced this pull request Jul 21, 2023
<!-- Describe what has changed in this PR -->
**What changed?**
Divide GenerateAndVerifyReplicationTasks activity into two activities:
GenerateReplicationTasks (reuse previous one) and VerifyReplicationTasks

<!-- Tell your future self why have you made these changes -->
**Why?**
Based on cluster tests, GenerateReplicationTasks is expensive (10ms
latency for `GenerateLastHistoryReplicationTasks` call). In previous
implementation, VerificationTasks runs after GenerateReplicationTasks
and we only get ~60 RPS for GenerateAndVerifyReplicationTasks. By
dividing the two, we can achieve ~100 RPS VerifyReplicationTasks for a
single activity (bottleneck is still GenerateReplicationTasks because of
10ms latency).

Also moved the special handling of WF not_found on target to
VerifyReplicationTasks, which reduced # of `DescribeMutableState` call
on source cluster. In previous implementation, `DescribeMutableState` is
called for every replication task. Now we only call
`DescribeMutableState` if WF was not found on target (which should be
rare for steady state). The downside is that we can potentially
replicate Zombie WF from source to target. But it should be avoidable by
eliminating Zombie during migration process (i.e., delete WF on target
if migration is incomplete).

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
Unit test & cluster tests. 

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
Low, the feature is disabled by default and only affect force
replication workflow.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
No.
hehaifengcn added a commit that referenced this pull request Jul 22, 2023
<!-- Describe what has changed in this PR -->
**What changed?**
Divide GenerateAndVerifyReplicationTasks activity into two activities:
GenerateReplicationTasks (reuse previous one) and VerifyReplicationTasks

<!-- Tell your future self why have you made these changes -->
**Why?**
Based on cluster tests, GenerateReplicationTasks is expensive (10ms
latency for `GenerateLastHistoryReplicationTasks` call). In previous
implementation, VerificationTasks runs after GenerateReplicationTasks
and we only get ~60 RPS for GenerateAndVerifyReplicationTasks. By
dividing the two, we can achieve ~100 RPS VerifyReplicationTasks for a
single activity (bottleneck is still GenerateReplicationTasks because of
10ms latency).

Also moved the special handling of WF not_found on target to
VerifyReplicationTasks, which reduced # of `DescribeMutableState` call
on source cluster. In previous implementation, `DescribeMutableState` is
called for every replication task. Now we only call
`DescribeMutableState` if WF was not found on target (which should be
rare for steady state). The downside is that we can potentially
replicate Zombie WF from source to target. But it should be avoidable by
eliminating Zombie during migration process (i.e., delete WF on target
if migration is incomplete).

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
Unit test & cluster tests. 

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
Low, the feature is disabled by default and only affect force
replication workflow.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
No.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants