
[Alerting] Show more user friendly ES error message when executor fails #96254

Merged: 12 commits merged into elastic:master from the alerting/es_error_parser branch on Apr 15, 2021

Conversation

chrisronline
Contributor

@chrisronline chrisronline commented Apr 5, 2021

Resolves #95516

This PR attempts to identify the root cause of an ES exception and display that in the error message. Unfortunately, error formats do not appear to be normalized across all ES APIs, so we need to look in a few places for the correct root cause.

The logic in this PR is borrowed from work by the ES UI team -> https://github.com/elastic/kibana/blob/master/src/plugins/es_ui_shared/__packages_do_not_import__/errors/es_error_parser.ts. I decided to copy and modify it, since we aren't using all of the logic and may need to change it over time for our own purposes.
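To give a sense of the approach (a rough sketch, not the exact code merged in this PR): the parser walks the ES error body recursively, collecting each nested caused_by.reason and, when there is no caused_by, falling back to per-shard failure reasons. The location of the error body on the caught exception below is an assumption about the client's error shape:

```ts
// Sketch only; hypothetical types, not the exact implementation in this PR.
interface EsCause {
  reason?: string;
  caused_by?: EsCause;
  failed_shards?: Array<{ reason?: EsCause }>;
}

function getEsCause(obj: EsCause = {}, causes: string[] = []): string[] {
  const updated = [...causes];

  if (obj.caused_by) {
    if (obj.caused_by.reason) {
      updated.push(obj.caused_by.reason);
    }
    // Recursively find all the "caused by" reasons
    return getEsCause(obj.caused_by, updated);
  }

  // Some APIs only report the root cause per shard, not via caused_by
  if (obj.failed_shards && obj.failed_shards.length) {
    for (const failure of obj.failed_shards) {
      updated.push(...getEsCause(failure.reason ?? {}));
    }
  }

  return updated.filter(Boolean);
}

export function getEsErrorMessage(error: any): string {
  let message = error?.message ?? '';
  // The body location here is an assumption about the ES client error shape.
  const esError = error?.meta?.body?.error;
  if (esError) {
    message += `, caused by: "${getEsCause(esError).join(',')}"`;
  }
  return message;
}
```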

Here is the difference in the server log and UI:

When a runtime field changes type to something incompatible with the existing query

server log [10:02:56.549] [error][alerting][alerting][plugins][plugins] Executing Alert "d7913b90-9633-11eb-ac94-471ab4f6ffdf" has resulted in Error: search_phase_execution_exception, caused by: "cannot implicitly cast def [org.elasticsearch.index.fielddata.ScriptDocValues.Dates] to long"

[Screenshot: Screen Shot 2021-04-06 at 10 03 23 AM]

When the provided query is invalid

server log [10:05:08.594] [error][alerting][alerting][plugins][plugins] Executing Alert "d7913b90-9633-11eb-ac94-471ab4f6ffdf" has resulted in Error: x_content_parse_exception, caused by: "unknown query [range2] did you mean [range]?,[1:119] unknown field [range2]"

[Screenshot: Screen Shot 2021-04-06 at 10 05 22 AM]

We do have a little more information available, so please feel free to suggest how the messaging could be better and what other information we should show.

@chrisronline chrisronline self-assigned this Apr 5, 2021
@chrisronline chrisronline marked this pull request as ready for review April 6, 2021 14:39
@chrisronline chrisronline requested a review from a team as a code owner April 6, 2021 14:39
@chrisronline chrisronline added the Feature:Alerting, review, Team:ResponseOps, v7.13.0, and v8.0.0 labels Apr 6, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@chrisronline chrisronline added the release_note:skip label Apr 6, 2021
@chrisronline
Contributor Author

@elasticmachine merge upstream

@ymao1
Contributor

ymao1 commented Apr 7, 2021

I created an index threshold rule with a date-type runtime field as the timestamp, then changed the type of the runtime field to keyword, and I am just seeing

server    log   [15:58:48.385] [warning][plugins][stackAlerts] indexThreshold timeSeriesQuery: callCluster error: search_phase_execution_exception
server    log   [15:59:06.397] [warning][plugins][stackAlerts] indexThreshold timeSeriesQuery: callCluster error: search_phase_execution_exception
server    log   [15:59:24.408] [warning][plugins][stackAlerts] indexThreshold timeSeriesQuery: callCluster error: search_phase_execution_exception
server    log   [15:59:43.484] [warning][plugins][stackAlerts] indexThreshold timeSeriesQuery: callCluster error: search_phase_execution_exception

in the logs. It looks like the timeSeriesQuery called by the index threshold rule executor does its own try/catch around the ES query to handle and log errors without actually bubbling the error up to the framework, so you may need to update the error logging there as well.
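Purely to illustrate the suggestion (the wrapper name and signature below are stand-ins, not the actual stack_alerts code): the executor-side catch could attach the parsed root cause and rethrow instead of only logging, so the framework surfaces the message in the UI as well.

```ts
import { getEsErrorMessage } from './es_error_parser'; // path assumed

// Hypothetical wrapper around the ES call made by timeSeriesQuery.
async function runTimeSeriesQuery(
  callCluster: (params: unknown) => Promise<unknown>,
  params: unknown
) {
  try {
    return await callCluster(params);
  } catch (err) {
    // Before: log only the top-level exception type and swallow the error.
    // Suggested: attach the parsed root cause and rethrow so the alerting
    // framework can log it and show it in the rule details UI.
    throw new Error(
      `indexThreshold timeSeriesQuery: callCluster error: ${getEsErrorMessage(err)}`
    );
  }
}
```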

@chrisronline
Contributor Author

@ymao1 Thank you! I'll look into that

@pmuellr
Member

pmuellr commented Apr 8, 2021

I created an index threshold rule with a date type runtime field as the timestamp and then changed the type of the runtime field to keyword and I am just seeing

For some reason I'm thinking there is a built-in restriction on using a runtime field as the timestamp field in index patterns. That would obviously result in a full scan for every query that uses date ranges, which many of the alerts use, which is why I think it's disallowed.

So - it might be possible/better to special-case this in alerting as well: if we can determine that the field used for the date is actually a runtime field, we could potentially disallow that usage when validating the alert params during create/edit. And if, after the alert is created/edited, the user later changes the field to a runtime field, we could detect this situation and provide an explicit error message instead of trying to infer the error state from the Elasticsearch response.
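A rough sketch of what such a validation check might look like, assuming runtime fields defined in the index mapping appear under its runtime section; the getMapping call is a real client API, but the helper and the commented call site are hypothetical:

```ts
import { Client } from '@elastic/elasticsearch';

// Returns true if `fieldName` is declared as a runtime field in any matched index.
// Sketch only: wildcard handling, runtime fields defined per-search, and error
// handling are all omitted.
async function isRuntimeField(client: Client, index: string, fieldName: string): Promise<boolean> {
  const resp: any = await client.indices.getMapping({ index });
  const body = resp.body ?? resp; // 7.x client wraps the result in .body; 8.x returns it directly
  return Object.values(body).some((m: any) => m?.mappings?.runtime?.[fieldName] != null);
}

// Hypothetical call site during create/edit validation:
// if (await isRuntimeField(client, params.index, params.timeField)) {
//   throw new Error(`"${params.timeField}" is a runtime field and cannot be used as the time field`);
// }
```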

@chrisronline
Contributor Author

@ymao1 All fixed up and ready for another review!

@chrisronline
Contributor Author

@elasticmachine merge upstream

Contributor

@ymao1 ymao1 left a comment


LGTM! Just some comments about additional tests and a question about deduping the reason string.

import { getEsErrorMessage } from './es_error_parser';

describe('ES error parser', () => {
test('should return all the cause of the error', () => {

Can we add a test for when there are nested caused_by reasons?
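For example, roughly (the error-body shape follows the sketch earlier in this conversation and is an assumption, not necessarily the fixture format used in the PR's tests):

```ts
test('returns every nested caused_by reason', () => {
  const error = {
    message: 'search_phase_execution_exception',
    meta: {
      body: {
        error: {
          caused_by: {
            reason: 'top level reason',
            caused_by: { reason: 'deeply nested reason' },
          },
        },
      },
    },
  };
  const message = getEsErrorMessage(error);
  expect(message).toContain('top level reason');
  expect(message).toContain('deeply nested reason');
});
```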

return getEsCause(obj.caused_by, updated);
}

if (obj.failed_shards && obj.failed_shards.length) {

Can we add a test for this case, where there is no caused_by but there is an array of failed_shards?
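A sketch of what that case could look like, again assuming the error-body shape from the earlier sketch:

```ts
test('falls back to failed_shards reasons when there is no caused_by', () => {
  const error = {
    message: 'search_phase_execution_exception',
    meta: {
      body: {
        error: {
          failed_shards: [
            { reason: { caused_by: { reason: 'shard-level root cause' } } },
          ],
        },
      },
    },
  };
  expect(getEsErrorMessage(error)).toContain('shard-level root cause');
});
```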

}

// Recursively find all the "caused by" reasons
return getEsCause(obj.caused_by, updated);

nit: It looks like sometimes the recursively nested caused_by reasons can be pretty redundant. Could we possibly dedupe them?
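If the reasons end up as a flat string array before being joined, one simple way to dedupe them while preserving first-seen order (illustrative only):

```ts
// Drop repeated reason strings, keeping their first occurrence in order.
function dedupeReasons(reasons: string[]): string[] {
  return Array.from(new Set(reasons));
}

// e.g. `${getEsCause(esError).join(',')}` becomes
//      `${dedupeReasons(getEsCause(esError)).join(',')}`
```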

Contributor

@YulNaumenko YulNaumenko left a comment


LGTM

@chrisronline
Contributor Author

Thanks all. I'm finishing some last-minute items and will get back to this PR later this week.

@chrisronline chrisronline force-pushed the alerting/es_error_parser branch from 94e000f to 0320100 Compare April 15, 2021 12:05
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @chrisronline

@chrisronline chrisronline merged commit 9a76d34 into elastic:master Apr 15, 2021
@chrisronline chrisronline deleted the alerting/es_error_parser branch April 15, 2021 14:17
chrisronline added a commit to chrisronline/kibana that referenced this pull request Apr 15, 2021
…ls (elastic#96254)

* WIP for ES error parser

* Fix tests

* Ensure the error shows up in the UI too

* wip

* Handle multiple types here

* Fix tests

* PR feedback

Co-authored-by: Kibana Machine <[email protected]>
chrisronline added a commit that referenced this pull request Apr 15, 2021
…ls (#96254) (#97259)

* WIP for ES error parser

* Fix tests

* Ensure the error shows up in the UI too

* wip

* Handle multiple types here

* Fix tests

* PR feedback

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Kibana Machine <[email protected]>