[ResponseOps][flapping] trim collections of recovered alerts persisted for flapping #147205
Labels: `bug` (Fixes for quality problems that affect the customer experience), `Feature:Alerting/RulesFramework` (Issues related to the Alerting Rules Framework), `Feature:Alerting`, `Team:ResponseOps` (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments
Pinging @elastic/response-ops (Team:ResponseOps)
Here's the rule type I used to test it. Dropped it in `lots_active_once.ts`:

```ts
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */
import { RuleType } from '@kbn/alerting-plugin/server';
import { ALERTING_EXAMPLE_APP_ID } from '../../common/constants';

const ACTIVE_COUNT = 1000;

export const alertType: RuleType<never, never, { runNumber?: number }, never, never, string> = {
  id: 'example.lots-active-once',
  name: 'Always firing',
  actionGroups: [{ id: 'default', name: 'Default' }],
  defaultActionGroupId: 'default',
  minimumLicenseRequired: 'basic',
  isExportable: true,
  async executor({ services, state }) {
    const runNumber = (state.runNumber ?? 0) + 1;
    for (let i = 0; i < ACTIVE_COUNT; i++) {
      const instNumber = `${i}`.padStart(4, '0');
      const alertId = `run:${runNumber}--inst:${instNumber}`;
      services.alertFactory.create(alertId).scheduleActions('default');
    }
    return {
      runNumber,
    };
  },
  producer: ALERTING_EXAMPLE_APP_ID,
};

/*
  kbn-alert create example.lots-active-once 'lots-active-once' 1s '{}' '[]'
*/
```
doakalexi added a commit that referenced this issue on Jan 23, 2023:
Resolves #143445

## Summary

This PR changes the behavior of alerting actions to respect flapping. To recover a flapping alert, the alert must be reported as recovered for 4 consecutive runs (this will be configurable eventually) before the framework marks the alert as recovered. Link to a spreadsheet that helps visualize the changes: https://docs.google.com/spreadsheets/d/1amoUDFx-Sdxirnabd_E4OSnyur57sLhY6Ce79oDYgJA/edit#gid=0

Also resolves #147205 by recovering those alerts.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

### To verify

- Create an alert and let it run until it gets into a flapping state (changes between active/recovered at least 4 times)
- Once flapping, verify the following cases:
  - A recovered alert is returned as an active notification 3 times before finally recovering
  - An alert that does not recover 3 times before an active alert is reported never reports as a recovered alert

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Ying Mao <[email protected]>
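The recovery-confirmation rule described in the PR summary can be sketched as follows. This is a hedged illustration, not the actual Kibana implementation: the constant and the function name `isConfirmedRecovered` are invented for this example.

```typescript
// Sketch of "delay recovery until N consecutive recovered runs".
// RECOVERY_CONFIRMATION_RUNS is hard-coded here; per the PR it will
// eventually be configurable. Names are illustrative, not Kibana's.
const RECOVERY_CONFIRMATION_RUNS = 4;

// `recoveredRuns` holds the trailing run results for one flapping alert:
// true = the rule reported it recovered on that run.
function isConfirmedRecovered(recoveredRuns: boolean[]): boolean {
  if (recoveredRuns.length < RECOVERY_CONFIRMATION_RUNS) {
    return false;
  }
  // Only the most recent N runs matter, and all of them must be "recovered".
  return recoveredRuns.slice(-RECOVERY_CONFIRMATION_RUNS).every(Boolean);
}
```

Under this scheme, an alert that flips back to active at any point within the confirmation window resets the countdown, which matches the "returned as an active notification 3 times before finally recovering" behavior in the verification steps.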
I did some stress testing on flapping with a custom-built rule that creates 1000 unique alerts each run, where each alert is only active for one run. That means on every run but the first, the previously active 1000 alerts are recovered and added to the containers of recovered alerts we persist in the rule's task store.
The relevant data structures start at the task SO document, which has a field `task.state` that is a JSON-encoded string of the rule's task state. It contains both rule-type-specific info in an `alertTypeState` property and framework-controlled info in the `alertInstances` and `alertRecoveredInstances` properties. Both are objects keyed by alert id, and each entry contains the state-change history of booleans added for flapping detection.

For rule types using the Rule Registry, the `alertTypeState` is actually wrapped by them into the properties `trackedAlerts`, `trackedAlertsRecovered`, and `wrapped`. The first two are analogs of `alertInstances` and `alertRecoveredInstances` (which we will eventually get rid of via issue ????) and the third is the "real" rule-type state.

When a rule becomes active and then recovers, we move the alert from the "active" container (`alertInstances` or `trackedAlerts`) into the corresponding "recovered" one, so we can track it for flapping purposes. Unfortunately, we don't seem to have a limit set on these, like we do with the "active" alerts (existing circuit breakers). So the worst case is a rule that generates 1000 (the current circuit breaker) unique alerts every run, where those alerts are only active once: the "recovered" containers end up holding 20 (the current flapping history length) * 1000 entries.

I think we want to cap these, as the size ends up becoming problematic. A test run of this scenario got the task document up to 3 MB.
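As an illustration of the structures described above, a decoded `task.state` for a framework rule type might look roughly like this. The container names come from this issue; the per-alert shape (`meta.flappingHistory`) and the sample values are simplified and hypothetical.

```typescript
// Simplified, illustrative shape of the JSON-decoded `task.state` for a
// framework rule type. Per-alert details are hypothetical.
interface TrackedAlert {
  meta: { flappingHistory: boolean[] }; // state-change booleans for flapping detection
}

const taskState: {
  alertTypeState: Record<string, unknown>;
  alertInstances: Record<string, TrackedAlert>;
  alertRecoveredInstances: Record<string, TrackedAlert>;
} = {
  alertTypeState: { runNumber: 3 }, // rule-type-specific state
  alertInstances: {
    // currently active alerts, bounded by the existing circuit breaker
    'run:0003--inst:0000': { meta: { flappingHistory: [true] } },
  },
  alertRecoveredInstances: {
    // recovered alerts still tracked for flapping -- currently unbounded
    'run:0002--inst:0000': { meta: { flappingHistory: [true, true] } },
    'run:0001--inst:0000': { meta: { flappingHistory: [true, true, true] } },
  },
};
```

With the `lots-active-once` rule above, `alertRecoveredInstances` accumulates a fresh batch of 1000 ids on every run until the flapping history window ages them out, which is what inflates the document.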
Let's say we cap the "recovered" containers at the existing "active" circuit breaker value; that seems reasonable.
This will need to be done for both the alerting framework and rule registry versions.
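A minimal sketch of what such a trim step could look like, assuming we cap at the active-alert circuit breaker and prefer to keep the entries with the richest flapping history (the actual selection policy, and all names here, are assumptions, not the eventual implementation):

```typescript
// Hypothetical trim step for the recovered-alerts container. The cap value
// and the keep-most-flapping policy are illustrative assumptions.
const MAX_RECOVERED_ALERTS = 1000; // assume the same cap as the active circuit breaker

interface RecoveredAlertState {
  flappingHistory: boolean[]; // recent state-change booleans for this alert
}

function trimRecoveredAlerts(
  recovered: Record<string, RecoveredAlertState>,
  max: number = MAX_RECOVERED_ALERTS
): Record<string, RecoveredAlertState> {
  const ids = Object.keys(recovered);
  if (ids.length <= max) {
    return recovered; // under the cap, nothing to do
  }
  // Keep the alerts with the most recorded state changes, i.e. the ones most
  // likely to actually be flapping; drop the rest.
  const byFlapCount = ids.sort(
    (a, b) =>
      recovered[b].flappingHistory.filter(Boolean).length -
      recovered[a].flappingHistory.filter(Boolean).length
  );
  return Object.fromEntries(byFlapCount.slice(0, max).map((id) => [id, recovered[id]]));
}
```

An alternative policy would be to evict the oldest-recovered entries first; either way, the trim would run once per rule execution, in both the framework and Rule Registry code paths.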