Improves history handler error metrics and logs #5438

davidporter-id-au · 2023-11-01T20:26:29Z

What changed?

Some bad data caused a tasklist to get stuck while trying to update a workflow that was in a bad state. This was harder to identify than it should have been because the RunID was missing from the error logs. This change adds that information to the logs and also protects some of the metrics emission from misclassifying wrapped errors.

This is not an attempt to do all error handling / wrapping correctly; it's just an incremental change, of which many more are needed.

Why?

How did you test it?

Unit tests

Potential risks

Release notes

Documentation Changes

taylanisikdemir · 2023-11-01T22:21:47Z

service/history/handler.go

+
+	} else if errors.As(err, &yarpcE) {
+
+		if yarpcE.Code() == yarpcerrors.CodeDeadlineExceeded {
 			scope.IncCounter(metrics.CadenceErrContextTimeoutCounter)
 		}
 		scope.IncCounter(metrics.CadenceFailures)


Not related to your change: why do we increment both CadenceErrContextTimeoutCounter and CadenceFailures in this case? If there are dashboards looking at the sum of these failures, we will see inflated counts.

I don't have context there to be honest

@Groxx do we have dashboards summing these up and showing as total error count?


This was originally added (and not working) with #5438; this followup corrects it and adds some actual metrics tests to ensure such a miss doesn't happen again. To reiterate, two things drove this change:

- More concretely, some types of invalid data are annoyingly difficult to track down because the errors lack runID information. Obviously the oncall can dig around in the DB for the workflow and guess, but it's operationally quite a lot of work in a fast-moving environment. Some problems with invalid workflows without a current runID in this state blocked a preproduction environment for quite a while.
- Zooming out, a goal @Groxx and others have had is to make errors much richer by wrapping them. However, conveying more useful information such as the stacktrace and debug info requires more than case-switching on types: any logic which does type or equality matching on errors needs to start properly using errors.Is/As. This is a small part of that initiative (albeit with a few bumps).

davidporter-id-au added 2 commits November 1, 2023 13:24

Adding better debugging

71079e2

Fix obvious error

03608cd

davidporter-id-au marked this pull request as ready for review November 1, 2023 20:54

davidporter-id-au changed the title ~~Adding better debugging~~ Improves history handler error metrics and logs Nov 1, 2023

allenchen2244 approved these changes Nov 1, 2023

taylanisikdemir approved these changes Nov 1, 2023

davidporter-id-au merged commit 4ece98a into cadence-workflow:master Nov 1, 2023

davidporter-id-au deleted the bugfix/debugging-stuck-tasklist-ii branch November 1, 2023 22:51

davidporter-id-au added a commit to davidporter-id-au/cadence that referenced this pull request Dec 5, 2023

Revert "Improves history handler error metrics and logs (cadence-work…

990cb10

…flow#5438)" This reverts commit 4ece98a.

davidporter-id-au added a commit to davidporter-id-au/cadence that referenced this pull request Dec 5, 2023

Revert "Improves history handler error metrics and logs (cadence-work…

ff190df

…flow#5438)" This reverts commit 4ece98a.

davidporter-id-au added a commit that referenced this pull request Dec 6, 2023

Revert "Improves history handler error metrics and logs (#5438)" (#5467)

558780b

This reverts commit 4ece98a.

davidporter-id-au added a commit to davidporter-id-au/cadence that referenced this pull request Dec 6, 2023

Revert "Revert "Improves history handler error metrics and logs (cade…

77891c4

…nce-workflow#5438)" (cadence-workflow#5467)" This reverts commit 558780b.

davidporter-id-au mentioned this pull request Dec 6, 2023

Improves metric and error handling for history #5469

Merged
