Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improves history handler error metrics and logs #5438

Conversation

davidporter-id-au
Copy link
Member

@davidporter-id-au davidporter-id-au commented Nov 1, 2023

What changed?

Some bad data caused a tasklist to get stuck while trying to update a workflow which was in a bad state. This was more difficult to identify than it should be because the RunID was missing information from the error logs. This change includes this additional information and also protects some of the metrics emission from wrapped errors being misclassified.

This is not an attempt to do all error handling / wrapping correctly, it's just an incremental change, of which many more are needed.

Why?

How did you test it?

Unit tests

Potential risks

Release notes

Documentation Changes

@davidporter-id-au davidporter-id-au marked this pull request as ready for review November 1, 2023 20:54
@davidporter-id-au davidporter-id-au changed the title Adding better debugging Improves history handler error metrics and logs Nov 1, 2023

} else if errors.As(err, &yarpcE) {

if yarpcE.Code() == yarpcerrors.CodeDeadlineExceeded {
scope.IncCounter(metrics.CadenceErrContextTimeoutCounter)
}
scope.IncCounter(metrics.CadenceFailures)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to your change: why do we increment both CadenceErrContextTimeoutCounter and CadenceFailures in this case? if there are dashboards looking at sum of these failures we will see extra numbers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have context there to be honest

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Groxx do we have dashboards summing these up and showing as total error count?

@davidporter-id-au davidporter-id-au merged commit 4ece98a into cadence-workflow:master Nov 1, 2023
@davidporter-id-au davidporter-id-au deleted the bugfix/debugging-stuck-tasklist-ii branch November 1, 2023 22:51
davidporter-id-au added a commit to davidporter-id-au/cadence that referenced this pull request Dec 5, 2023
davidporter-id-au added a commit to davidporter-id-au/cadence that referenced this pull request Dec 5, 2023
davidporter-id-au added a commit to davidporter-id-au/cadence that referenced this pull request Dec 6, 2023
davidporter-id-au added a commit that referenced this pull request Dec 6, 2023
This was originally added (and not working) with #5438 and this followup corrects it and adds some actual metrics tests to ensure such a miss doesn't happen again.

The driver of this change, to reiterate was two things:
- More concretely, there are some times of invalid data are annoyingly difficult to track down because it lacks runID information. Obviously the oncall can dig around in the DB for the workflow and guess, but it's operationally quite a lot of work in a fast-moving environment. Some problems with invalid workflows without a current runID in this state blocked a preproduction environment for quite a while.
- Zooming out, a goal @Groxx and others have had, is to make errors able to be much richer by wrapping them. However, this requires more than case-switching on types in order to convey more useful information such as the stacktrace and debug info. However, to do so requires any logic which does type or equality matching on errors to start properly using errors.Is/As. This is a small part of that initiative (albeit with a few bumps)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants