Investigate how alerting handles index pattern field changes and removals #93501

Closed
mikecote opened this issue Mar 3, 2021 · 7 comments
Labels
chore Feature:Actions Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Mar 3, 2021

See #92753.

Investigate how alerting handles index pattern field changes and removals.

@mikecote mikecote added chore Feature:Alerting Feature:Actions Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 3, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Mar 9, 2021

I'm thinking this is about rule type authoring vs an alerting framework thing? So we need to make sure all the rule type executors are "safe" when it comes to dealing with runtime fields?

@mikecote
Contributor Author

mikecote commented Mar 9, 2021

We should inform #92753 about other rule types.

@pmuellr
Member

pmuellr commented Mar 9, 2021

I posted a comment on #92753 regarding the other rule types.

@ymao1
Contributor

ymao1 commented Mar 25, 2021

To investigate this, I used the es-apm-sys-sim to generate data and added the following runtime fields to the mapping:

{
  "runtime": {
    "second_timestamp": {
      "type": "date",
      "script": {
        "source": "emit(doc['@timestamp'].getValue().getMillis())"
      }
    },
    "free_memory": {
      "type": "double",
      "script": {
        "source": "emit(100 * doc['system.memory.actual.free'].value / doc['system.memory.total'].value)"
      }
    },
    "day_of_week": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
      }
    }
  }
}

Index threshold rule type

There are 3 places to specify fields within the Index threshold rule: timestamp field, metric aggregation field and group by field.

On runtime field create
List of runtime fields showed up in create rule flyout with no additional intervention. The preview chart correctly populated when using runtime fields.

On runtime field update
After creating rules using the runtime fields, I updated the runtime field mapping definitions so that they each had a different type than they started out with:

  • Timestamp field - When the selected runtime field was updated to a non-date type, rule execution would fail with a search_phase_execution_exception error. The actual error from ES is a little more descriptive:

When updating from date type to double type:

{
  "type": "query_shard_exception",
  "reason": "failed to create query: For input string: \"2021-03-25T18:35:54.545Z\"",
  "index_uuid": "znV1kqQrTEuOWgad1KOBBw",
  "index": "es-apm-sys-sim",
  "caused_by": {
    "type": "number_format_exception",
    "reason": "For input string: \"2021-03-25T18:35:54.545Z\""
  }
}

When updating from date type to keyword type:

{
  "type": "illegal_argument_exception",
  "reason": "Field [second_timestamp] of type [keyword] does not support custom formats",
  "caused_by": {
    "type": "illegal_argument_exception",
    "reason": "Field [second_timestamp] of type [keyword] does not support custom formats"
  }
}
  • Metric agg field
    • When the selected runtime field was updated from double type to date type, rule execution would continue and the date as epoch millis would be used as the value for the metric agg.
    • When the selected runtime field was updated from double type to keyword type, rule execution would fail with a search_phase_execution_exception error. The actual error from ES:
{
  "type": "illegal_argument_exception",
  "reason": "Field [free_memory] of type [keyword] is not supported for aggregation [avg]",
  "caused_by": {
    "type": "illegal_argument_exception",
    "reason": "Field [free_memory] of type [keyword] is not supported for aggregation [avg]"
  }
}
  • Group by field - When the selected runtime field was updated to a non-keyword type, rule execution would succeed, but the group by buckets would be unexpected (epoch millis or a numeric value).

On runtime field delete
Rule execution looks normal, no errors in any logs. Query just doesn't return any results. This would lead to a confusing experience for the user since the rule is not failing but they would not be getting alerted when expected.

Elasticsearch query rule type

There are 2 places to specify fields within the Elasticsearch query rule: the timestamp field and the query DSL. Using runtime fields in the timestamp field exhibited the same behavior as described above for the Index threshold rule.

Behavior of the DSL query itself depended on the content of the query and the type of runtime field mapping update. For example, I set up a rule with a range query on a numeric runtime field and then updated the mapping to keyword; the query executed without errors (just no hits). I also set up a rule with a term query on a keyword runtime field and then updated the mapping to numeric, and received a search_phase_execution_exception where the underlying ES error was:

{
  "type": "query_shard_exception",
  "reason": "failed to create query: For input string: \"Thursday\"",
  "index_uuid": "znV1kqQrTEuOWgad1KOBBw",
  "index": "es-apm-sys-sim",
  "caused_by": {
    "type": "number_format_exception",
    "reason": "For input string: \"Thursday\""
  }
}

@ymao1
Contributor

ymao1 commented Mar 25, 2021

I think in the short term, we should be logging better error messages when rule execution fails due to this condition. search_phase_execution_exception is not very descriptive and it would be more helpful to capture the more descriptive error message from ES in the event log.
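
A minimal sketch of what surfacing the more descriptive message could look like, assuming the error body has the same type / reason / caused_by shape as the ES errors quoted above. The names innermostCause and describeEsError are hypothetical helpers for illustration, not existing Kibana functions:

```typescript
// Shape of an Elasticsearch error cause, as seen in the error bodies quoted in this thread.
interface EsErrorCause {
  type: string;
  reason: string;
  caused_by?: EsErrorCause;
}

// Hypothetical helper: walk the caused_by chain and return the innermost
// cause, which in the examples above carries the most specific reason.
function innermostCause(cause: EsErrorCause): EsErrorCause {
  let current = cause;
  while (current.caused_by !== undefined) {
    current = current.caused_by;
  }
  return current;
}

// Hypothetical formatter: keep the generic top-level error type, but append
// the innermost type and reason so the logged message is actionable.
function describeEsError(topLevelType: string, cause: EsErrorCause): string {
  const inner = innermostCause(cause);
  return `${topLevelType}: [${inner.type}] ${inner.reason}`;
}

// The term query failure from the Elasticsearch query rule investigation.
const example: EsErrorCause = {
  type: 'query_shard_exception',
  reason: 'failed to create query: For input string: "Thursday"',
  caused_by: {
    type: 'number_format_exception',
    reason: 'For input string: "Thursday"',
  },
};

console.log(describeEsError('search_phase_execution_exception', example));
// prints: search_phase_execution_exception: [number_format_exception] For input string: "Thursday"
```

Taking only the innermost cause is a simplification; joining every reason in the chain would retain more context at the cost of longer log lines.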

In the long term, would we want to validate the fields used in the query before executing the query? Seems like overkill to do it on each rule execution, but when/where would we want to do this? Maybe this could be something that is done as part of the explain feature if we implement that? At least at that point, the user could see the underlying query that is run for a rule.
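
A rough sketch of what such a validation step might check, assuming the input is shaped like the fields section of a field capabilities response (field name, then type name, then capabilities). The roles, the accepted type lists and the validateRuleFields function are all assumptions for illustration, derived from the Index threshold behavior observed above, not an existing API:

```typescript
// Subset of a field-capabilities-style response: field name -> type name -> capabilities.
type FieldCaps = Record<string, Record<string, { type: string }>>;

// Hypothetical roles, matching the three field slots of the Index threshold rule.
type FieldRole = 'timestamp' | 'metric' | 'groupBy';

// Assumed expectations per role. Date is deliberately not accepted for the
// metric role, since the rule kept running but aggregated over epoch millis.
const ACCEPTED_TYPES: Record<FieldRole, string[]> = {
  timestamp: ['date', 'date_nanos'],
  metric: ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float'],
  groupBy: ['keyword'],
};

// Returns one human-readable problem per field that is missing or has a
// type outside the accepted list for its role. Empty array means no problems.
function validateRuleFields(
  caps: FieldCaps,
  fields: Partial<Record<FieldRole, string>>
): string[] {
  const problems: string[] = [];
  for (const role of Object.keys(fields) as FieldRole[]) {
    const name = fields[role];
    if (name === undefined) continue;
    const entry = caps[name];
    const types = entry ? Object.keys(entry) : [];
    if (types.length === 0) {
      // Covers the runtime field delete case, which otherwise fails silently.
      problems.push(`${role} field [${name}] no longer exists`);
      continue;
    }
    const unexpected = types.filter((t) => ACCEPTED_TYPES[role].indexOf(t) === -1);
    if (unexpected.length > 0) {
      problems.push(`${role} field [${name}] has unexpected type [${unexpected.join(', ')}]`);
    }
  }
  return problems;
}
```

For example, passing caps where second_timestamp is now keyword and day_of_week has been removed would report both problems while leaving a double free_memory metric field alone. Where and how often such a check would run is still the open question.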

@ymao1
Contributor

ymao1 commented Mar 26, 2021

Created #95523, #95520 and #95516 to capture work that might be done as an outcome of this investigation. Closing this investigation issue.
