Unexpected lag between TimerStarted and TimerFired when switching a namespace's active cluster #433

dhiaayachi · 2024-09-05T12:46:26Z

Expected Behavior

When:

- Making use of Workflow.sleep()
- Using multi-cluster replication
- Switching the active cluster for a namespace

Then the time between TimerStarted and TimerFired should be minimal.

Actual Behavior

I've observed the time between TimerStarted and TimerFired to be more than 10 minutes.

Steps to Reproduce the Problem

1. Have two Temporal clusters, cluster-a and cluster-b, with multi-cluster replication enabled
2. Have one Java service with two SDK clients, one for each cluster
3. Make a workflow with the following steps:
   a. Execute an activity that returns immediately
   b. Workflow.sleep(500)
   c. Execute an activity that returns immediately
   d. Workflow.sleep(500)
   e. Execute an activity that returns immediately
   f. Workflow.sleep(500)
   g. Execute an activity that returns immediately
   h. Workflow.sleep(500)
   i. Execute an activity that returns immediately
   j. Workflow.sleep(500)
4. Schedule that workflow on a short cron:

   tctl --namespace sandbox workflow start --taskqueue sandbox --workflow_type DummyWorkflow --cron "@every 10s"

5. Change the active cluster for the namespace in the middle of workflow execution:
   ```
   tctl --namespace sandbox namespace update --active_cluster cluster-b
   ```
6. Repeat the previous step until the workflow's event history appears to be stuck waiting for TimerFired (the last event in history is TimerStarted). Only repeat the step about every 60 seconds so the thrash isn't crippling.
7. Wait ~10 minutes
8. Observe that TimerFired did eventually fire
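
To put a number on the delay, one option is to pair each TimerStarted event with its TimerFired event and compute the gap between them. The following is only a rough sketch of that idea: `HistoryEvent` is a hypothetical, simplified class invented for this example (not a Temporal SDK type), and the timestamps in `main` are made-up illustrative values, not data from my clusters. How the events are exported from the real history is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimerLag {

    // Hypothetical, simplified view of one history event (NOT a Temporal SDK type).
    static final class HistoryEvent {
        final String type;       // e.g. "TimerStarted" or "TimerFired"
        final String timerId;    // identifier linking a started timer to its fired event
        final Instant timestamp; // event time as recorded in the history

        HistoryEvent(String type, String timerId, Instant timestamp) {
            this.type = type;
            this.timerId = timerId;
            this.timestamp = timestamp;
        }
    }

    // Returns, per timer id, the time elapsed between TimerStarted and TimerFired.
    // Timers that have started but not fired yet are omitted from the result.
    static Map<String, Duration> timerLags(List<HistoryEvent> events) {
        Map<String, Instant> started = new HashMap<>();
        Map<String, Duration> lags = new LinkedHashMap<>();
        for (HistoryEvent e : events) {
            if ("TimerStarted".equals(e.type)) {
                started.put(e.timerId, e.timestamp);
            } else if ("TimerFired".equals(e.type)) {
                Instant start = started.remove(e.timerId);
                if (start != null) {
                    lags.put(e.timerId, Duration.between(start, e.timestamp));
                }
            }
        }
        return lags;
    }

    public static void main(String[] args) {
        // Made-up timestamps for illustration only.
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        List<HistoryEvent> events = Arrays.asList(
            new HistoryEvent("TimerStarted", "timer-1", t0),
            new HistoryEvent("TimerFired", "timer-1", t0.plusMillis(500)),
            new HistoryEvent("TimerStarted", "timer-2", t0.plusSeconds(1)),
            new HistoryEvent("TimerFired", "timer-2", t0.plusSeconds(1).plus(Duration.ofMinutes(10))));

        // Print the elapsed time for each timer in milliseconds.
        for (Map.Entry<String, Duration> entry : timerLags(events).entrySet()) {
            System.out.println(entry.getKey() + " lag: " + entry.getValue().toMillis() + " ms");
        }
    }
}
```

For a `Workflow.sleep(500)` timer, anything far above 500 ms in this output would correspond to the lag described above.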

Specifications

Version: local temporal/auto-setup:1.19.0 with Java SDK v1.18.2
Platform: macOS Ventura v13.2.1 Intel, Docker v20.10.23

I'm hoping there is a simple answer to this behavior, such as a timeout I'm missing. I'm not setting explicit timeouts in the above tctl commands, and I'm not setting explicit activity timeouts in the workflow code. The UI doesn't show a timeout for timer tasks in the way it does for workflow tasks, so I'm not positive I can affect this behavior.
