-
Notifications
You must be signed in to change notification settings - Fork 812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improves history handler error metrics and logs #5438
Improves history handler error metrics and logs #5438
Conversation
|
||
} else if errors.As(err, &yarpcE) { | ||
|
||
if yarpcE.Code() == yarpcerrors.CodeDeadlineExceeded { | ||
scope.IncCounter(metrics.CadenceErrContextTimeoutCounter) | ||
} | ||
scope.IncCounter(metrics.CadenceFailures) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to your change: why do we increment both CadenceErrContextTimeoutCounter
and CadenceFailures
in this case? if there are dashboards looking at sum of these failures we will see extra numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have context there to be honest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Groxx do we have dashboards summing these up and showing as total error count?
…flow#5438)" This reverts commit 4ece98a.
…flow#5438)" This reverts commit 4ece98a.
This was originally added (and not working) with #5438 and this followup corrects it and adds some actual metrics tests to ensure such a miss doesn't happen again. The driver of this change, to reiterate was two things: - More concretely, there are some times of invalid data are annoyingly difficult to track down because it lacks runID information. Obviously the oncall can dig around in the DB for the workflow and guess, but it's operationally quite a lot of work in a fast-moving environment. Some problems with invalid workflows without a current runID in this state blocked a preproduction environment for quite a while. - Zooming out, a goal @Groxx and others have had, is to make errors able to be much richer by wrapping them. However, this requires more than case-switching on types in order to convey more useful information such as the stacktrace and debug info. However, to do so requires any logic which does type or equality matching on errors to start properly using errors.Is/As. This is a small part of that initiative (albeit with a few bumps)
What changed?
Some bad data caused a tasklist to get stuck while trying to update a workflow which was in a bad state. This was more difficult to identify than it should be because the RunID was missing information from the error logs. This change includes this additional information and also protects some of the metrics emission from wrapped errors being misclassified.
This is not an attempt to do all error handling / wrapping correctly, it's just an incremental change, of which many more are needed.
Why?
How did you test it?
Unit tests
Potential risks
Release notes
Documentation Changes