Unexpected lag between TimerStarted and TimerFired when switching a namespace's active cluster #433

Open
dhiaayachi opened this issue Sep 5, 2024 · 0 comments


Expected Behavior

When:

  • Making use of Workflow.sleep()
  • Using multi-cluster replication
  • Switching the active cluster for a namespace

Then the time between TimerStarted and TimerFired should be close to the requested sleep duration (500 ms in the reproduction below).

Actual Behavior

I've observed the time between TimerStarted and TimerFired to be more than 10 minutes.
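
To put a number on that gap, the timestamps of matching TimerStarted and TimerFired events can be compared. The following is a minimal sketch in plain Java, not the Temporal SDK's history API: the `Event` class, its fields, and the sample timestamps are hypothetical stand-ins for whatever an exported event history provides.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimerLag {

    /** Hypothetical, simplified view of one history event (not an SDK type). */
    static final class Event {
        final String type;    // e.g. "TimerStarted" or "TimerFired"
        final String timerId; // identifies which timer the event belongs to
        final Instant time;   // timestamp recorded for the event

        Event(String type, String timerId, Instant time) {
            this.type = type;
            this.timerId = timerId;
            this.time = time;
        }
    }

    /** For each fired timer, the time elapsed between TimerStarted and TimerFired. */
    static List<Duration> timerGaps(List<Event> history) {
        Map<String, Instant> started = new HashMap<>();
        List<Duration> gaps = new ArrayList<>();
        for (Event e : history) {
            if ("TimerStarted".equals(e.type)) {
                started.put(e.timerId, e.time);
            } else if ("TimerFired".equals(e.type)) {
                Instant start = started.remove(e.timerId);
                if (start != null) {
                    gaps.add(Duration.between(start, e.time));
                }
            }
        }
        return gaps;
    }

    /** Gaps that exceed the requested sleep by more than the given tolerance. */
    static List<Duration> lateTimers(List<Event> history, Duration requested, Duration tolerance) {
        List<Duration> late = new ArrayList<>();
        for (Duration gap : timerGaps(history)) {
            if (gap.compareTo(requested.plus(tolerance)) > 0) {
                late.add(gap);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-09-05T12:00:00Z"); // made-up timestamps
        List<Event> history = new ArrayList<>();
        history.add(new Event("TimerStarted", "1", t0));
        history.add(new Event("TimerFired", "1", t0.plusMillis(510)));
        history.add(new Event("TimerStarted", "2", t0.plusSeconds(1)));
        history.add(new Event("TimerFired", "2", t0.plusSeconds(1).plus(Duration.ofMinutes(10))));

        // Flag timers that fired more than 1 s later than the 500 ms requested.
        System.out.println("late timers: " + lateTimers(history, Duration.ofMillis(500), Duration.ofSeconds(1)));
    }
}
```

A timer that never fired (TimerStarted is the last event in the history) produces no gap here, so a stuck workflow has to be detected separately, for example by checking the type of the last event.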

Steps to Reproduce the Problem

  1. Have two Temporal clusters, cluster-a and cluster-b, with multi-cluster replication enabled

  2. Have one Java service with two SDK clients, one for each cluster

  3. Make a workflow with the following steps:
    a. Execute an activity that returns immediately
    b. Workflow.sleep(500)
    c. Execute an activity that returns immediately
    d. Workflow.sleep(500)
    e. Execute an activity that returns immediately
    f. Workflow.sleep(500)
    g. Execute an activity that returns immediately
    h. Workflow.sleep(500)
    i. Execute an activity that returns immediately
    j. Workflow.sleep(500)

  4. Schedule that workflow on a short cron:

    tctl --namespace sandbox workflow start --taskqueue sandbox --workflow_type DummyWorkflow --cron "@every 10s"

  5. Change the active cluster for the namespace in the middle of workflow execution:

    tctl --namespace sandbox namespace update --active_cluster cluster-b

  6. Repeat the previous step, alternating the target between cluster-a and cluster-b, until the workflow's event history appears to be stuck waiting for TimerFired (the last event in the history is TimerStarted). Repeat it no more than once every ~60 seconds so that the failover thrash isn't crippling.

  7. Wait ~10 minutes

  8. Observe that the TimerFired event is eventually recorded

Specifications

  • Version: local temporal/auto-setup:1.19.0 with Java SDK v1.18.2
  • Platform: macOS Ventura v13.2.1 Intel, Docker v20.10.23

I'm hoping there is a simple answer to this behavior, such as a timeout I'm missing. I'm not setting explicit timeouts in the above tctl commands, and I'm not setting explicit activity timeouts in the workflow code. The UI doesn't show a timeout for timer tasks in the way it does for workflow tasks, so I'm not positive I can affect this behavior.
