
[ResponseOps][flapping] trim collections of recovered alerts persisted for flapping #147205

Closed
pmuellr opened this issue Dec 7, 2022 · 2 comments · Fixed by #147810
Assignees: doakalexi
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@pmuellr
Member

pmuellr commented Dec 7, 2022

I did some stress testing on flapping with a custom-built rule which creates 1000 unique alerts each run, where each alert is only active for one run. That means on each run but the first, the previously active 1000 alerts will be recovered and added to the containers of recovered alerts we persist in the rule's task store.

The relevant data structures start at the task SO document, which has a field task.state containing a JSON-encoded string of the rule's task state. It holds both rule-type-specific info in an alertTypeState property and framework-controlled info in the alertInstances and alertRecoveredInstances properties. The latter two are objects keyed by alert id, whose values contain the history of state-change booleans added for flapping detection.
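For reference, here's a minimal sketch of that decoded state. The alert ids, history values, and the meta/flappingHistory field names are illustrative guesses at the persisted shape, not copied verbatim from the framework:

// Decoded task.state for a rule's task (illustrative shape only)
{
  "alertTypeState": { /* rule-type specific state */ },
  "alertInstances": {
    "run:2--inst:0001": { "meta": { "flappingHistory": [true, false] } }
  },
  "alertRecoveredInstances": {
    "run:1--inst:0001": { "meta": { "flappingHistory": [true, true, false] } }
  }
}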

For rule types using the Rule Registry, the alertTypeState is actually a wrapper with properties trackedAlerts, trackedAlertsRecovered, and wrapped. The first two are analogs of alertInstances and alertRecoveredInstances (which we will eventually get rid of via issue ????), and the third is the "real" rule type state.
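So for Rule Registry rule types the same decoded state looks roughly like this (again illustrative, based on the description above):

{
  "alertTypeState": {
    "wrapped": { /* the "real" rule type state */ },
    "trackedAlerts": { /* analog of alertInstances */ },
    "trackedAlertsRecovered": { /* analog of alertRecoveredInstances */ }
  },
  "alertInstances": { /* ... */ },
  "alertRecoveredInstances": { /* ... */ }
}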

When an alert becomes active and then recovers, we move it from the "active" container (alertInstances or trackedAlerts) into the corresponding "recovered" one, so we can track it for flapping purposes. Unfortunately, we don't seem to have a limit on these like we do with the "active" alerts (the existing circuit breakers). So the worst case is a rule that generates 1000 (the current circuit breaker) unique alerts every run, where each alert is only active once: the "recovered" containers end up holding 20 (the current flapping history length) × 1000 entries.

I think we want to cap these, as the size ends up becoming problematic. A test run of this scenario got the task document up to 3MB.
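Back-of-envelope, using the numbers above: 20 × 1000 = 20,000 tracked recovered entries, and a 3MB document across 20,000 entries works out to roughly 150 bytes per entry, so document size grows linearly with however many recovered alerts we keep around.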

Let's say we cap "recovered" at the existing "active" circuit breaker. Seems reasonable. So the processing would probably be (see the sketch after this list):

  • add the alerts that just recovered to the "recovered" container
  • trim the "recovered" container by removing the alerts that have been recovered longest
  • those "recovered-early" alerts would go through the same processing as other flapping alerts that eventually recover and then stop flapping
  • do we denote these in some special way? Do we need more flags like "this was flapping and seems like it's done but we had to stop looking at it because SO many alerts!"? I think we'd at least want a warning that we dropped X alerts because they exceeded some limit.
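A minimal sketch of what that trimming step could look like, in TypeScript; TrackedAlert, recoveredAt, and MAX_TRACKED_RECOVERED are hypothetical names for illustration, not the framework's actual API:

// Hypothetical trimming pass over the "recovered" container.
interface TrackedAlert {
  flappingHistory: boolean[]; // true = active/recovered state changed on that run
  recoveredAt: number; // run number when the alert recovered
}

const MAX_TRACKED_RECOVERED = 1000; // reuse the existing "active" circuit breaker

function trimRecovered(recovered: Record<string, TrackedAlert>): {
  kept: Record<string, TrackedAlert>;
  earlyRecoveredIds: string[];
} {
  const entries = Object.entries(recovered);
  if (entries.length <= MAX_TRACKED_RECOVERED) {
    return { kept: recovered, earlyRecoveredIds: [] };
  }
  // Sort oldest recovery first, so the alerts recovered longest are trimmed.
  entries.sort(([, a], [, b]) => a.recoveredAt - b.recoveredAt);
  const dropCount = entries.length - MAX_TRACKED_RECOVERED;
  return {
    kept: Object.fromEntries(entries.slice(dropCount)),
    earlyRecoveredIds: entries.slice(0, dropCount).map(([id]) => id),
  };
}

The returned earlyRecoveredIds would then flow through the same "stopped flapping" recovery path as everything else, and we'd log a warning with the drop count.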

This will need to be done for both the alerting framework and rule registry versions.

@pmuellr pmuellr added bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Dec 7, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Dec 7, 2022

Here's the rule type I used to test it. I dropped it in x-pack/examples/alerting_example/server/alert_types as lots_active_once.ts, and then updated the plugin to register it via alerting.registerType(lotsActiveOnceAlert).

lots_active_once.ts
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { RuleType } from '@kbn/alerting-plugin/server';
import { ALERTING_EXAMPLE_APP_ID } from '../../common/constants';

const ACTIVE_COUNT = 1000;

export const lotsActiveOnceAlert: RuleType<never, never, { runNumber?: number }, never, never, string> = {
  id: 'example.lots-active-once',
  name: 'Lots active once',
  actionGroups: [{ id: 'default', name: 'Default' }],
  defaultActionGroupId: 'default',
  minimumLicenseRequired: 'basic',
  isExportable: true,
  async executor({ services, state }) {
    const runNumber = (state.runNumber ?? 0) + 1;
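    // Bake the run number into each alert id so every run creates a fresh
    // batch of ids: each alert is active for exactly one run, then recovers
    // on the next run.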
    for (let i = 0; i < ACTIVE_COUNT; i++) {
      const instNumber = `${i}`.padStart(4, '0');
      const alertId = `run:${runNumber}--inst:${instNumber}`;
      services.alertFactory.create(alertId).scheduleActions('default');
    }
    return {
      runNumber,
    };
  },
  producer: ALERTING_EXAMPLE_APP_ID,
};

/*
kbn-alert create example.lots-active-once 'lots-active-once' 1s '{}' '[]'
*/

@mikecote mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Dec 8, 2022
@doakalexi doakalexi self-assigned this Dec 12, 2022
@doakalexi doakalexi moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Dec 13, 2022
@doakalexi doakalexi moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Dec 15, 2022
doakalexi added a commit that referenced this issue Jan 23, 2023
Resolves #143445

## Summary

This PR changes the behavior of alerting actions with respect to flapping.
To recover a flapping alert, the alert must be reported as recovered for
4 (this will be configurable eventually) consecutive runs before the
framework marks the alert as recovered.

Link to a spreadsheet that helps visualize the changes:
https://docs.google.com/spreadsheets/d/1amoUDFx-Sdxirnabd_E4OSnyur57sLhY6Ce79oDYgJA/edit#gid=0

Also resolves #147205 by recovering those alerts early.


### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios


### To verify
- Create an alert and let it run until it gets into a flapping state (changes from active/recovered at least 4 times)
- Once flapping, verify the following cases:
  - Verify that a recovered alert will be returned as an active notification 3 times before finally recovering
  - Verify that an alert that does not recover 3 times before an active alert is reported never reports as a recovered alert

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Ying Mao <[email protected]>