From 1d93e55caf84eaab3a95142e7baec3dcbd08e0f6 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Wed, 4 Dec 2024 13:20:45 +0100 Subject: [PATCH 01/14] [Wf-Diagnostics] add troubleshooting guide for activity and workflow retries --- src/.vuepress/config.js | 1 + .../01-timeouts.md | 21 ++++++++++-- .../08-workflow-troubleshooting/03-retries.md | 33 +++++++++++++++++++ 3 files changed, 53 insertions(+), 2 deletions(-) create mode 100644 src/docs/08-workflow-troubleshooting/03-retries.md diff --git a/src/.vuepress/config.js b/src/.vuepress/config.js index ce8ed8fc3..b0f37ea6e 100644 --- a/src/.vuepress/config.js +++ b/src/.vuepress/config.js @@ -178,6 +178,7 @@ module.exports = { '08-workflow-troubleshooting/', '08-workflow-troubleshooting/01-timeouts', '08-workflow-troubleshooting/02-activity-failures', + '08-workflow-troubleshooting/03-retries', ], }, { diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index cd8cf9c5f..e792ed902 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -27,7 +27,7 @@ Optionally you can also increase the number of pollers per worker by providing t [Link to options in go client](https://pkg.go.dev/go.uber.org/cadence@v1.2.9/internal#WorkerOptions) [Link to options in java client](https://github.com/uber/cadence-java-client/blob/master/src/main/java/com/uber/cadence/internal/worker/PollerOptions.java#L124) -## Timeouts without heartbeating enabled +## Timeouts without heartbeat timeout or retry policy configured Activities time out StartToClose or ScheduleToClose if the activity took longer than the configured timeout. @@ -35,12 +35,29 @@ Activities time out StartToClose or ScheduleToClose if the activity took longer For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. 
Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. -Mitigation: Consider enabling heartbeating +Mitigation: Consider configuring heartbeat timeout and a retry policy [Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) +[Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) For short running activities, heart beating is not required but maybe consider increasing the timeout value to suit the actual activity execution time. +## Timeouts without heartbeat timeout configured but a retry policy configured + +Retry policies are good to be configured so that activities can be retried after timeouts or failures. For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Enabling heartbeating would cause the activity to timeout earlier and will be retried on another worker. + +Mitigation: Consider configuring heartbeat timeout + +[Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) + +## Timeouts with heartbeating enabled but without a retry policy configured + +Heartbeat timeouts are used to detect when a worker died or restarted during deployments. With heartbeat timeout enabled, the activity will timeout faster. But without a retry policy, it will not be scheduled again on a healthy worker. 
+ +Mitigation: Consider adding retry policy to an activity + +[Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) + ## Heartbeat Timeouts after enabling heartbeating Activity has enabled heart beating but the activity timed out with heart beat timeout. This is because the server did not receive a heart beat in the time interval configured as the heart beat timeout. diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md new file mode 100644 index 000000000..861f4fe2e --- /dev/null +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -0,0 +1,33 @@ +--- +layout: default +title: Retries +permalink: /docs/workflow-troubleshooting/retries +--- + +# Retries + +Cadence has a retry feature where a retry policy can be configured so that an activity or a workflow can be retried when it fails or times out. + +Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/activities/#retries) and [workflow retries](https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries) + +## Workflow execution history of retries + +One thing to note is how activity retries and workflow retries are shown in the Cadence Web UI. All the activity retries are not part of workflow execution history and only the last attempt is shown with the attempt number. + +Moreover, attempt number starts from 0, so Attempt:0 refers to the first and original attempt or Attempt:1 refers to the second attempt or first retried attempt. + +For workflow retries, when a workflow fails or times out and is retried, it completes the previous execution with a ContinuedAsNew event and a new execution is started with Attempt 1. The ContinuedAsNew event holds the details of the failure reason. + +## Configuration of activity retries and workflow retries + +Some of the configurable values could be misconfigured and a result will not have the intended behaviour. These are listed here. 
+ +## MaximumAttempts set to 1 + +In both activity retries and workflow retries it is sufficient to mention a maximum number of attempts or an expiration interval. However, the maximum number of attempts counts the original attempt of the activity also. As a result, setting maximum number of attempts to 1 means the activity or workflow will not be retried. + +## ExpirationIntervalInSeconds less than InitialIntervalInSeconds + +In both activity retries and workflow retries it is sufficient to mention a maximum number of attempts or an expiration interval. The first retry attempt waits for the InitialIntervalInSeconds before starting and when an expiration interval is set lower than the initial interval, the retry policy becomes invalid and the activity or workflow will not be retried. + + From b1395b910084a8b6233d45e1a7a286993980bd92 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Wed, 4 Dec 2024 13:35:12 +0100 Subject: [PATCH 02/14] Update 01-timeouts.md --- .../01-timeouts.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index e792ed902..89fbf8095 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -27,7 +27,7 @@ Optionally you can also increase the number of pollers per worker by providing t [Link to options in go client](https://pkg.go.dev/go.uber.org/cadence@v1.2.9/internal#WorkerOptions) [Link to options in java client](https://github.com/uber/cadence-java-client/blob/master/src/main/java/com/uber/cadence/internal/worker/PollerOptions.java#L124) -## Timeouts without heartbeat timeout or retry policy configured +## No heartbeat timeout or retry policy configured Activities time out StartToClose or ScheduleToClose if the activity took longer than the configured timeout. 
@@ -37,30 +37,30 @@ For long running activities, while the activity is executing, the worker can die Mitigation: Consider configuring heartbeat timeout and a retry policy -[Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) +[Example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) [Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) For short running activities, heart beating is not required but maybe consider increasing the timeout value to suit the actual activity execution time. -## Timeouts without heartbeat timeout configured but a retry policy configured +## Retry policy configured without setting heartbeat timeout -Retry policies are good to be configured so that activities can be retried after timeouts or failures. For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Enabling heartbeating would cause the activity to timeout earlier and will be retried on another worker. +Retry policies are good to be configured so that activities can be retried after timeouts or failures. For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Configuring heartbeat timeout would cause the activity to timeout earlier and will be retried on another worker. 
Mitigation: Consider configuring heartbeat timeout -[Configuring heartbeat timeout example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) +[Example](https://github.com/uber-common/cadence-samples/blob/df6f7bdba978d6565ad78e9f86d9cd31dfac9f78/cmd/samples/expense/workflow.go#L23) -## Timeouts with heartbeating enabled but without a retry policy configured +## Heartbeat timeout configured without a retry policy -Heartbeat timeouts are used to detect when a worker died or restarted during deployments. With heartbeat timeout enabled, the activity will timeout faster. But without a retry policy, it will not be scheduled again on a healthy worker. +Heartbeat timeouts are used to detect when a worker died or restarted during deployments. With heartbeat timeout configured, the activity will timeout faster. But without a retry policy, it will not be scheduled again on a healthy worker. Mitigation: Consider adding retry policy to an activity [Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) -## Heartbeat Timeouts after enabling heartbeating +## Heartbeat Timeouts after configuring heartbeat timeout -Activity has enabled heart beating but the activity timed out with heart beat timeout. This is because the server did not receive a heart beat in the time interval configured as the heart beat timeout. +Activity has configured heartbeat timeout and the activity timed out with heart beat timeout. This is because the server did not receive a heart beat in the time interval configured as the heart beat timeout. Mitigation: Once heartbeat timeout is configured in activity options, you need to make sure the activity periodically sends a heart beat to the server to make sure the server is aware of the activity being alive. 
@@ -68,4 +68,4 @@ Mitigation: Once heartbeat timeout is configured in activity options, you need t In go client, there is an option to register the activity with auto heart beating so that it is done automatically -[Enabling auto heart beat during activity registration example](https://pkg.go.dev/go.uber.org/cadence@v1.2.9/internal#WorkerOptions) +[Configuring auto heart beat during activity registration example](https://pkg.go.dev/go.uber.org/cadence@v1.2.9/internal#WorkerOptions) From 54bc161bede8c3f9184fadd461883dd4bcb3c38b Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Wed, 4 Dec 2024 13:35:54 +0100 Subject: [PATCH 03/14] Update 01-timeouts.md --- src/docs/08-workflow-troubleshooting/01-timeouts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index 89fbf8095..b15d38144 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -58,7 +58,7 @@ Mitigation: Consider adding retry policy to an activity [Check retry policy for activity](https://cadenceworkflow.io/docs/concepts/activities/#retries) -## Heartbeat Timeouts after configuring heartbeat timeout +## Heartbeat timeout seen after configuring heartbeat timeout Activity has configured heartbeat timeout and the activity timed out with heart beat timeout. This is because the server did not receive a heart beat in the time interval configured as the heart beat timeout. 
From d55d3b60ad30020002e96a2508f0cf956074fea1 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:32:27 +0100 Subject: [PATCH 04/14] Update src/docs/08-workflow-troubleshooting/01-timeouts.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/01-timeouts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index b15d38144..f79ed9da5 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -44,7 +44,7 @@ For short running activities, heart beating is not required but maybe consider i ## Retry policy configured without setting heartbeat timeout -Retry policies are good to be configured so that activities can be retried after timeouts or failures. For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Configuring heartbeat timeout would cause the activity to timeout earlier and will be retried on another worker. +Retry policies are configured so activities can be retried after timeouts or failures. For long-running activities, the worker can die while the activity is executing, e.g. due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Configuring heartbeat timeout would cause the activity to timeout earlier so it can be retried on another worker. 
Mitigation: Consider configuring heartbeat timeout From 798cb2e5188f2147cbe1a66aeb3f9e45703b0eec Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:32:42 +0100 Subject: [PATCH 05/14] Update src/docs/08-workflow-troubleshooting/01-timeouts.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/01-timeouts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index f79ed9da5..2e9365e48 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -52,7 +52,7 @@ Mitigation: Consider configuring heartbeat timeout ## Heartbeat timeout configured without a retry policy -Heartbeat timeouts are used to detect when a worker died or restarted during deployments. With heartbeat timeout configured, the activity will timeout faster. But without a retry policy, it will not be scheduled again on a healthy worker. +Heartbeat timeouts are used to detect when a worker died or restarted. With heartbeat timeout configured, the activity will timeout faster. But without a retry policy, it will not be scheduled again on a healthy worker. 
Mitigation: Consider adding retry policy to an activity From c7466aafe6b2085f6a0045ff3819bc831ea25cbe Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:33:05 +0100 Subject: [PATCH 06/14] Update src/docs/08-workflow-troubleshooting/03-retries.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/03-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md index 861f4fe2e..08abdd896 100644 --- a/src/docs/08-workflow-troubleshooting/03-retries.md +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -6,7 +6,7 @@ permalink: /docs/workflow-troubleshooting/retries # Retries -Cadence has a retry feature where a retry policy can be configured so that an activity or a workflow can be retried when it fails or times out. +Cadence has a retry feature where a retry policy can be configured so that an activity or a workflow will be retried when it fails or times out. 
Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/activities/#retries) and [workflow retries](https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries) From 228e1348af5dd5c032d7208fb78f13217301cd73 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:33:17 +0100 Subject: [PATCH 07/14] Update src/docs/08-workflow-troubleshooting/03-retries.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/03-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md index 08abdd896..e43f64615 100644 --- a/src/docs/08-workflow-troubleshooting/03-retries.md +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -8,7 +8,7 @@ permalink: /docs/workflow-troubleshooting/retries Cadence has a retry feature where a retry policy can be configured so that an activity or a workflow will be retried when it fails or times out. -Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/activities/#retries) and [workflow retries](https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries) +Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/activities/#retries) and [workflow retries](https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries). 
## Workflow execution history of retries From 61136bbbcb18a81566536f9100d70c30d51811dd Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:33:36 +0100 Subject: [PATCH 08/14] Update src/docs/08-workflow-troubleshooting/03-retries.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/03-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md index e43f64615..810c6e17d 100644 --- a/src/docs/08-workflow-troubleshooting/03-retries.md +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -12,7 +12,7 @@ Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/acti ## Workflow execution history of retries -One thing to note is how activity retries and workflow retries are shown in the Cadence Web UI. All the activity retries are not part of workflow execution history and only the last attempt is shown with the attempt number. +One thing to note is how activity retries and workflow retries are shown in the Cadence Web UI. Information about activity retries is not stored in Cadence. Only the last attempt is shown with the attempt number. Moreover, attempt number starts from 0, so Attempt:0 refers to the first and original attempt or Attempt:1 refers to the second attempt or first retried attempt. 
From 3019dfd3ececc5f3b25dcd0ecbbc17efa8588f2d Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:33:48 +0100 Subject: [PATCH 09/14] Update src/docs/08-workflow-troubleshooting/03-retries.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/03-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md index 810c6e17d..7ed4f0e75 100644 --- a/src/docs/08-workflow-troubleshooting/03-retries.md +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -14,7 +14,7 @@ Read more about [activity retries](https://cadenceworkflow.io/docs/concepts/acti One thing to note is how activity retries and workflow retries are shown in the Cadence Web UI. Information about activity retries is not stored in Cadence. Only the last attempt is shown with the attempt number. -Moreover, attempt number starts from 0, so Attempt:0 refers to the first and original attempt or Attempt:1 refers to the second attempt or first retried attempt. +Moreover, attempt number starts from 0, so `Attempt: 0` refers to the first and original attempt, `Attempt: 1` refers to the second attempt or first retried attempt. For workflow retries, when a workflow fails or times out and is retried, it completes the previous execution with a ContinuedAsNew event and a new execution is started with Attempt 1. The ContinuedAsNew event holds the details of the failure reason. 
From b01daa04aa77bdb5fe13278bc166722279dc9146 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:34:02 +0100 Subject: [PATCH 10/14] Update src/docs/08-workflow-troubleshooting/03-retries.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/03-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md index 7ed4f0e75..219371305 100644 --- a/src/docs/08-workflow-troubleshooting/03-retries.md +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -20,7 +20,7 @@ For workflow retries, when a workflow fails or times out and is retried, it comp ## Configuration of activity retries and workflow retries -Some of the configurable values could be misconfigured and a result will not have the intended behaviour. These are listed here. +Some of the configurable values could be misconfigured and as a result will not have the intended behaviour. These are listed here. 
## MaximumAttempts set to 1 From 31b3c4a088842f0dea0043b36ef73aeaedf2fe56 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:34:17 +0100 Subject: [PATCH 11/14] Update src/docs/08-workflow-troubleshooting/03-retries.md Co-authored-by: Jakob Haahr Taankvist --- src/docs/08-workflow-troubleshooting/03-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/03-retries.md b/src/docs/08-workflow-troubleshooting/03-retries.md index 219371305..97088bc27 100644 --- a/src/docs/08-workflow-troubleshooting/03-retries.md +++ b/src/docs/08-workflow-troubleshooting/03-retries.md @@ -28,6 +28,6 @@ In both activity retries and workflow retries it is sufficient to mention a maxi ## ExpirationIntervalInSeconds less than InitialIntervalInSeconds -In both activity retries and workflow retries it is sufficient to mention a maximum number of attempts or an expiration interval. The first retry attempt waits for the InitialIntervalInSeconds before starting and when an expiration interval is set lower than the initial interval, the retry policy becomes invalid and the activity or workflow will not be retried. +In both activity retries and workflow retries it is sufficient to specify a maximum number of attempts or an expiration interval. The first retry attempt waits for the InitialIntervalInSeconds before starting and when an expiration interval is set lower than the initial interval, the retry policy becomes invalid and the activity or workflow will not be retried. 
From 7323e1dfc291c1f9b865e82be7be96bd8f7899d8 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:37:27 +0100 Subject: [PATCH 12/14] Update 01-timeouts.md --- src/docs/08-workflow-troubleshooting/01-timeouts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index 2e9365e48..80432d1cd 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -60,7 +60,7 @@ Mitigation: Consider adding retry policy to an activity ## Heartbeat timeout seen after configuring heartbeat timeout -Activity has configured heartbeat timeout and the activity timed out with heart beat timeout. This is because the server did not receive a heart beat in the time interval configured as the heart beat timeout. +The activity has a heartbeat timeout configured and timed out with a heartbeat timeout. This is because the server did not receive a heartbeat within the interval configured as the heartbeat timeout. This can happen if the activity is not actually executing or if the activity is not sending periodic heartbeats. The first case is the intended behaviour, since the activity now times out instead of being stuck until the StartToClose or ScheduleToClose timeout kicks in. The second case needs a fix. Mitigation: Once heartbeat timeout is configured in activity options, you need to make sure the activity periodically sends a heart beat to the server to make sure the server is aware of the activity being alive. 
From d05690aa0ca62d22b5909a9945f30516263168c6 Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:38:20 +0100 Subject: [PATCH 13/14] Update src/docs/08-workflow-troubleshooting/01-timeouts.md Co-authored-by: Adhitya Mamallan --- src/docs/08-workflow-troubleshooting/01-timeouts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index 80432d1cd..4519b9411 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -33,7 +33,7 @@ Activities time out StartToClose or ScheduleToClose if the activity took longer [Link to description of timeouts](https://cadenceworkflow.io/docs/concepts/activities/#timeouts) -For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. +For long running activities, while the activity is executing, the worker can die due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. 
Mitigation: Consider configuring heartbeat timeout and a retry policy From c9cb2e2d549606f5f243f47d48f6cc2da0dea3da Mon Sep 17 00:00:00 2001 From: sankari gopalakrishnan Date: Thu, 5 Dec 2024 11:38:30 +0100 Subject: [PATCH 14/14] Update src/docs/08-workflow-troubleshooting/01-timeouts.md Co-authored-by: Adhitya Mamallan --- src/docs/08-workflow-troubleshooting/01-timeouts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/docs/08-workflow-troubleshooting/01-timeouts.md b/src/docs/08-workflow-troubleshooting/01-timeouts.md index 4519b9411..f1306c208 100644 --- a/src/docs/08-workflow-troubleshooting/01-timeouts.md +++ b/src/docs/08-workflow-troubleshooting/01-timeouts.md @@ -44,7 +44,7 @@ For short running activities, heart beating is not required but maybe consider i ## Retry policy configured without setting heartbeat timeout -Retry policies are configured so activities can be retried after timeouts or failures. For long-running activities, the worker can die while the activity is executing, e.g. due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Configuring heartbeat timeout would cause the activity to timeout earlier so it can be retried on another worker. +Retry policies are configured so activities can be retried after timeouts or failures. For long-running activities, the worker can die while the activity is executing, e.g. due to regular deployments or host restarts or failures. Cadence doesn't know about this and will wait for StartToClose or ScheduleToClose timeouts to kick in. The retry is attempted only after this timeout. Configuring heartbeat timeout would cause the activity to timeout earlier so it can be retried on another worker. Mitigation: Consider configuring heartbeat timeout