
[ResponseOps][flapping] trim collections of recovered alerts persisted for flapping #147205

Closed
pmuellr opened this issue Dec 7, 2022 · 2 comments · Fixed by #147810
Assignees: doakalexi
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@pmuellr
Member

pmuellr commented Dec 7, 2022

I did some stress testing on flapping with a custom-built rule which creates 1000 unique alerts each run, where each alert is only active for one run. That means on each run but the first, the previously active 1000 alerts will be recovered and added to the containers of recovered alerts we persist in the rule's task store.

The relevant data structures start at the task SO document, which has a field task.state containing a JSON-encoded string of the rule's task state. It holds both rule-type-specific info in an alertTypeState property and framework-controlled info in the alertInstances and alertRecoveredInstances properties. The latter two are objects keyed by alert id, whose values contain the history of state-change booleans added for flapping detection.
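For reference, here's a minimal sketch of that decoded state. The alert ids, history values, and the meta/flappingHistory field names are illustrative guesses at the persisted shape, not copied verbatim from the framework:

// Decoded task.state for a rule's task (illustrative shape only)
{
  "alertTypeState": { /* rule-type specific state */ },
  "alertInstances": {
    "run:2--inst:0001": { "meta": { "flappingHistory": [true, false] } }
  },
  "alertRecoveredInstances": {
    "run:1--inst:0001": { "meta": { "flappingHistory": [true, true, false] } }
  }
}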

For rule types using the Rule Registry, the alertTypeState is actually a wrapper with properties trackedAlerts, trackedAlertsRecovered, and wrapped. The first two are analogs of alertInstances and alertRecoveredInstances (which we will eventually get rid of via issue ????), and the third is the "real" rule type state.
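So for Rule Registry rule types the same decoded state looks roughly like this (again illustrative, based on the description above):

{
  "alertTypeState": {
    "wrapped": { /* the "real" rule type state */ },
    "trackedAlerts": { /* analog of alertInstances */ },
    "trackedAlertsRecovered": { /* analog of alertRecoveredInstances */ }
  },
  "alertInstances": { /* ... */ },
  "alertRecoveredInstances": { /* ... */ }
}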

When an alert becomes active and then recovers, we move it from the "active" container (alertInstances or trackedAlerts) into the corresponding "recovered" one, so we can track it for flapping purposes. Unfortunately, we don't seem to have a limit on these like we do with the "active" alerts (the existing circuit breakers). So the worst case is a rule that generates 1000 (the current circuit breaker) unique alerts every run, where each alert is only active once: the "recovered" containers end up holding 20 (the current flapping history length) × 1000 entries.

I think we want to cap these, as the size ends up becoming problematic. A test run of this scenario got the task document up to 3MB.
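Back-of-envelope, using the numbers above: 20 × 1000 = 20,000 tracked recovered entries, and a 3MB document across 20,000 entries works out to roughly 150 bytes per entry, so document size grows linearly with however many recovered alerts we keep around.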

Let's say we cap "recovered" at the existing "active" circuit breaker. Seems reasonable. So the processing would probably be (see the sketch after this list):

  • add the alerts that just recovered to the "recovered" container
  • trim the "recovered" container by removing the alerts that have been recovered longest
  • those "recovered-early" alerts would go through the same processing as other flapping alerts that eventually recover and then stop flapping
  • do we denote these in some special way? Do we need more flags like "this was flapping and seems like it's done but we had to stop looking at it because SO many alerts!"? I think we'd at least want a warning that we dropped X alerts because they exceeded some limit.
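A minimal sketch of what that trimming step could look like, in TypeScript; TrackedAlert, recoveredAt, and MAX_TRACKED_RECOVERED are hypothetical names for illustration, not the framework's actual API:

// Hypothetical trimming pass over the "recovered" container.
interface TrackedAlert {
  flappingHistory: boolean[]; // true = active/recovered state changed on that run
  recoveredAt: number; // run number when the alert recovered
}

const MAX_TRACKED_RECOVERED = 1000; // reuse the existing "active" circuit breaker

function trimRecovered(recovered: Record<string, TrackedAlert>): {
  kept: Record<string, TrackedAlert>;
  earlyRecoveredIds: string[];
} {
  const entries = Object.entries(recovered);
  if (entries.length <= MAX_TRACKED_RECOVERED) {
    return { kept: recovered, earlyRecoveredIds: [] };
  }
  // Sort oldest recovery first, so the alerts recovered longest are trimmed.
  entries.sort(([, a], [, b]) => a.recoveredAt - b.recoveredAt);
  const dropCount = entries.length - MAX_TRACKED_RECOVERED;
  return {
    kept: Object.fromEntries(entries.slice(dropCount)),
    earlyRecoveredIds: entries.slice(0, dropCount).map(([id]) => id),
  };
}

The returned earlyRecoveredIds would then flow through the same "stopped flapping" recovery path as everything else, and we'd log a warning with the drop count.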

This will need to be done for both the alerting framework and rule registry versions.

@pmuellr pmuellr added bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Dec 7, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Dec 7, 2022

Here's the rule type I used to test it. I dropped it in x-pack/examples/alerting_example/server/alert_types as lots_active_once.ts, and then updated the plugin to register it via alerting.registerType(lotsActiveOnceAlert).

lots_active_once.ts
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { RuleType } from '@kbn/alerting-plugin/server';
import { ALERTING_EXAMPLE_APP_ID } from '../../common/constants';

const ACTIVE_COUNT = 1000;

export const lotsActiveOnceAlert: RuleType<never, never, { runNumber?: number }, never, never, string> = {
  id: 'example.lots-active-once',
  name: 'Lots active once',
  actionGroups: [{ id: 'default', name: 'Default' }],
  defaultActionGroupId: 'default',
  minimumLicenseRequired: 'basic',
  isExportable: true,
  async executor({ services, state }) {
    const runNumber = (state.runNumber ?? 0) + 1;
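    // Bake the run number into each alert id so every run creates a fresh
    // batch of ids: each alert is active for exactly one run, then recovers
    // on the next run.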
    for (let i = 0; i < ACTIVE_COUNT; i++) {
      const instNumber = `${i}`.padStart(4, '0');
      const alertId = `run:${runNumber}--inst:${instNumber}`;
      services.alertFactory.create(alertId).scheduleActions('default');
    }
    return {
      runNumber,
    };
  },
  producer: ALERTING_EXAMPLE_APP_ID,
};

/*
kbn-alert create example.lots-active-once 'lots-active-once' 1s '{}' '[]'
*/

@mikecote mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Dec 8, 2022
@doakalexi doakalexi self-assigned this Dec 12, 2022
@doakalexi doakalexi moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Dec 13, 2022
@doakalexi doakalexi moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Dec 15, 2022
doakalexi added a commit that referenced this issue Jan 23, 2023
Resolves #143445

## Summary

This PR changes the behavior of alerting actions with respect to flapping.
To recover a flapping alert, the alert must be reported as recovered for
4 (this will be configurable eventually) consecutive runs before the
framework marks the alert as recovered.

Link to a spreadsheet that helps visualize the changes:
https://docs.google.com/spreadsheets/d/1amoUDFx-Sdxirnabd_E4OSnyur57sLhY6Ce79oDYgJA/edit#gid=0

Also resolves #147205 by recovering those alerts early.


### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios


### To verify
- Create an alert and let it run until it gets into a flapping state (changes from active/recovered at least 4 times)
- Once flapping, verify the following cases:
  - Verify that a recovered alert will be returned as an active notification 3 times before finally recovering
  - Verify that an alert that does not recover 3 times before an active alert is reported never reports as a recovered alert

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Ying Mao <[email protected]>