Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps][flapping] change action behavior when flapping #147810

Merged

Conversation

doakalexi
Copy link
Contributor

@doakalexi doakalexi commented Dec 19, 2022

Resolves #143445

Summary

This pr changes the behavior of alerting actions respecting flapping.
To recover a flapping alert the alert must be reported as recovered for 4 (this will be configurable eventually) consecutive runs before the framework makes the alert as recovered.

Link to a spreadsheet that helps vizualize the changes: https://docs.google.com/spreadsheets/d/1amoUDFx-Sdxirnabd_E4OSnyur57sLhY6Ce79oDYgJA/edit#gid=0

Also Resolves #147205 to recover those alerts

Checklist

To verify

  • Create an alert and let it run until it gets into a flapping state (changes from active/recovered at least 4 times)
  • Once flapping, verify the following cases:
  • Verify that a recovered alert will be returned as an active notification 3 times before finally recovering
  • Verify that an alert that does not recover 3 times before an active alert is reported never reports as a recovered alert

@doakalexi doakalexi changed the title Alerting/change action behavior when flapping [ResponseOps][flapping] change action behavior when flapping Dec 20, 2022
@doakalexi doakalexi added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:skip Skip the PR/issue when compiling release notes labels Dec 20, 2022
@doakalexi doakalexi marked this pull request as ready for review January 12, 2023 18:06
@doakalexi doakalexi requested review from a team as code owners January 12, 2023 18:06
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@weltenwort weltenwort self-requested a review January 13, 2023 15:17
Copy link
Member

@weltenwort weltenwort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infra plugin code changes LGTM

Copy link
Contributor

@CoenWarmer CoenWarmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AO changes LGTM

ymao1 added a commit that referenced this pull request Jan 17, 2023
## Summary

As part of [this PR](#148751) to
encapsulate alert related functionality into a `LegacyAlertsClient`, we
removed some generics for `ActionGroupId` because they did not seem
necessary. However, they are needed to support scheduling actions, which
we add in [this PR](#147810). This
PR is just for restoring the generics for alert related functions.
Copy link
Contributor

@ersin-erdal ersin-erdal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Tested locally, works as expected.

@doakalexi doakalexi enabled auto-merge (squash) January 23, 2023 15:07
Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM

Only thing I'm curious about is the new pendingRecoveredCount property on the alert meta. Seems like this would be highly correlated with flappingHistory. Like, the number of true entries is probably close to the pendingRecoveredCount. But since we don't know which state was flipped to in the flapping history, just that it changed between active and recovered, we can't actually calculate the precise number that pendingRecoveredCount is tracking. Is that right?

@doakalexi
Copy link
Contributor Author

But since we don't know which state was flipped to in the flapping history, just that it changed between active and recovered, we can't actually calculate the precise number that pendingRecoveredCount is tracking. Is that right?

Yes that is correct! We also need to reset the value once it's reported active or finally recovered, and I don't think we can use the flapping history to do that.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 40.7KB 40.7KB +31.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@doakalexi doakalexi merged commit 000ba78 into elastic:main Jan 23, 2023
@kibanamachine kibanamachine added v8.7.0 backport:skip This commit does not require backporting labels Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.7.0
Projects
None yet
8 participants