Execute VerifyReplicationTasks as an individual activity #4656
Conversation
Force-pushed from fa6fdd6 to e94efbb
@@ -374,41 +374,49 @@ func enqueueReplicationTasks(ctx workflow.Context, workflowExecutionsCh workflow
	var lastActivityErr error

	for workflowExecutionsCh.Receive(ctx, &workflowExecutions) {
Do you want to control the activity concurrency?
If so, here is an example: https://github.com/uber/cadence/blob/master/canary/concurrentExec.go
Explained offline. The code already handles concurrency control.
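The concurrency question above is about bounding how many verification activities are in flight at once. As a plain-Go analogue (not Temporal SDK workflow code; `runBounded` and the item type are illustrative only), a buffered channel used as a counting semaphore is the usual pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded executes fn for each item, with at most maxConcurrent
// invocations in flight at any time.
func runBounded(items []int, maxConcurrent int, fn func(int)) {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup
	for _, it := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent tasks are running
		go func(v int) {
			defer wg.Done()
			defer func() { <-sem }() // release a slot when done
			fn(v)
		}(it)
	}
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	total := 0
	runBounded([]int{1, 2, 3, 4, 5}, 2, func(v int) {
		mu.Lock()
		total += v
		mu.Unlock()
	})
	fmt.Println(total) // prints 15
}
```

In deterministic Temporal workflow code the same shape is expressed with `workflow.Go` and a workflow-safe channel rather than raw goroutines, which is what the PR's loop over `workflowExecutionsCh` does.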
Force-pushed from e94efbb to 990079d
What changed?
Divide the GenerateAndVerifyReplicationTasks activity into two activities: GenerateReplicationTasks (reusing the previous one) and VerifyReplicationTasks.
Why?
Based on cluster tests, GenerateReplicationTasks is expensive (10ms latency per `GenerateLastHistoryReplicationTasks` call). In the previous implementation, verification ran after GenerateReplicationTasks, so we only got ~60 RPS for GenerateAndVerifyReplicationTasks. By splitting the two, we can achieve ~100 RPS for VerifyReplicationTasks with a single activity (the bottleneck is still GenerateReplicationTasks because of the 10ms latency). Also moved the special handling of workflow not_found on the target into VerifyReplicationTasks, which reduces the number of `DescribeMutableState` calls on the source cluster. In the previous implementation, `DescribeMutableState` was called for every replication task. Now we only call `DescribeMutableState` if the workflow was not found on the target (which should be rare in steady state). The downside is that we can potentially replicate a zombie workflow from source to target, but this should be avoidable by eliminating zombies during the migration process (i.e., delete the workflow on the target if migration is incomplete).
How did you test it?
Unit tests & cluster tests.
Potential risks
Low; the feature is disabled by default and only affects the force replication workflow.
Is hotfix candidate?
No.
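The not_found optimization described above can be sketched in plain Go. This is a simplified illustration, not the PR's actual code: `describeOnTarget` and `describeMutableStateOnSource` are hypothetical stand-ins for the real target/source cluster RPCs, but the control flow matches the description, with the expensive source-cluster lookup happening only when the workflow is missing on the target:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("workflow not found")

// describeOnTarget stands in for checking whether the workflow
// already exists on the target cluster.
func describeOnTarget(wfID string) error {
	if wfID == "missing-wf" {
		return errNotFound
	}
	return nil
}

var sourceCalls = 0

// describeMutableStateOnSource stands in for the expensive
// DescribeMutableState RPC against the source cluster.
func describeMutableStateOnSource(wfID string) {
	sourceCalls++
}

// verify mirrors the new behavior: fall back to the source-cluster
// lookup only when the workflow is not found on the target.
func verify(wfIDs []string) {
	for _, id := range wfIDs {
		if err := describeOnTarget(id); errors.Is(err, errNotFound) {
			describeMutableStateOnSource(id)
		}
	}
}

func main() {
	verify([]string{"wf-1", "wf-2", "missing-wf"})
	fmt.Println(sourceCalls) // prints 1: only the missing workflow hits the source
}
```

Under the old behavior the source-cluster call count would equal the number of workflows verified; here it scales only with the (rare, in steady state) not-found cases.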