-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Resolve Clusters API] Add option to configure cluster timeout #114020
Labels
>enhancement
:Search Foundations/CCS
Team:Search Foundations
Meta label for the Search Foundations team in Elasticsearch
Comments
Pinging @elastic/es-search (Team:Search) |
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
davismcphee
added a commit
to elastic/kibana
that referenced
this issue
Nov 20, 2024
…a to hang (#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes #200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]>
kibanamachine
pushed a commit
to kibanamachine/kibana
that referenced
this issue
Nov 20, 2024
…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]> (cherry picked from commit 96fd4b6)
kibanamachine
pushed a commit
to kibanamachine/kibana
that referenced
this issue
Nov 20, 2024
…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]> (cherry picked from commit 96fd4b6)
kibanamachine
pushed a commit
to kibanamachine/kibana
that referenced
this issue
Nov 20, 2024
…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]> (cherry picked from commit 96fd4b6)
kibanamachine
added a commit
to elastic/kibana
that referenced
this issue
Nov 20, 2024
… can cause Kibana to hang (#200476) (#201025) # Backport This will backport the following commits from `main` to `8.x`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Davis McPhee","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-11-20T18:52:47Z","message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589","branchLabelMapping":{"^v9.0.0$":"main","^v8.17.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","v9.0.0","Team:DataDiscovery","backport:prev-major"],"title":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang","number":200476,"url":"https://github.com/elastic/kibana/pull/200476","mergeCommit":{"message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/200476","number":200476,"mergeCommit":{"message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}}]}] BACKPORT--> Co-authored-by: Davis McPhee <[email protected]>
kibanamachine
added a commit
to elastic/kibana
that referenced
this issue
Nov 20, 2024
…k can cause Kibana to hang (#200476) (#201024) # Backport This will backport the following commits from `main` to `8.16`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Davis McPhee","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-11-20T18:52:47Z","message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589","branchLabelMapping":{"^v9.0.0$":"main","^v8.17.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","v9.0.0","Team:DataDiscovery","backport:prev-major"],"title":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang","number":200476,"url":"https://github.com/elastic/kibana/pull/200476","mergeCommit":{"message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/200476","number":200476,"mergeCommit":{"message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}}]}] BACKPORT--> Co-authored-by: Davis McPhee <[email protected]>
kibanamachine
added a commit
to elastic/kibana
that referenced
this issue
Nov 20, 2024
…k can cause Kibana to hang (#200476) (#201023) # Backport This will backport the following commits from `main` to `8.15`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Davis McPhee","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-11-20T18:52:47Z","message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589","branchLabelMapping":{"^v9.0.0$":"main","^v8.17.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","v9.0.0","Team:DataDiscovery","backport:prev-major"],"title":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang","number":200476,"url":"https://github.com/elastic/kibana/pull/200476","mergeCommit":{"message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/200476","number":200476,"mergeCommit":{"message":"[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the `has_es_data` check can hang when\r\nsome remote clusters are unresponsive, leaving users stuck in a loading\r\nstate in some apps (e.g. Discover and Dashboard) until the request times\r\nout. There are two main changes that help mitigate this issue:\r\n- The `resolve/cluster` request in the `has_es_data` endpoint has been\r\nsplit into two requests -- one for local data first, then another for\r\nremote data second. In cases where remote clusters are unresponsive but\r\nthere is data available in the local cluster, the remote check is never\r\nperformed and the check completes quickly. This likely resolves the\r\nmajority of cases and is also likely faster in general than checking\r\nboth local and remote clusters in a single request.\r\n- In cases where there is no local data and the remote `resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout` config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to abort the request after a\r\nshort delay. This scenario is handled in the front end by displaying an\r\nerror toast to the user informing them of the issue, and assuming there\r\nis data available to avoid blocking them. When this occurs, a warning is\r\nalso logged to the Kibana server logs.\r\n\r\n\r\n\r\nFixes #200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the `has_es_data` endpoint in this\r\nway should be backward compatible since the behaviour should remain\r\nunchanged from before when the client and server versions don't match\r\n(please validate if this seems accurate during review).\r\n- For a long term fix, the ES team is investigating the issue with\r\n`resolve/cluster` and will aim to have it behave like `resolve/index`,\r\nwhich fails quickly when remote clusters are unresponsive. They may also\r\nimplement other mitigations like a configurable timeout in ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The purpose of\r\nthis PR is to provide an immediate solution in Kibana that mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing another performant method for checking if\r\nindices exist instead of `resolve/cluster`, Kibana should migrate to\r\nthat. More details in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n### Testing notes\r\n\r\nTo reproduce the issue locally, follow these steps:\r\n- Follow [these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto set up a local CCS environment.\r\n- Stop the remote cluster process.\r\n- Use Netcat on the remote cluster port to listen to requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive\r\ncluster. See elastic/elasticsearch#32678 for\r\nmore context.\r\n- Navigate to Discover and observe that the `has_es_data` request hangs.\r\nWhen testing in this PR branch, the request will only wait for 5 seconds\r\nbefore assuming data exists and displaying a toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing), uses\r\nsentence case text and includes [i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n- [ ]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas added for features that require explanation or tutorials\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n- [ ] If a plugin configuration key changed, check if it needs to be\r\nallowlisted in the cloud and added to the [docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n- [x] This was checked for breaking HTTP API changes, and any breaking\r\nchanges have been approved by the breaking-change committee. The\r\n`release_note:breaking` label should be applied in these situations.\r\n- [ ] [Flaky Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was\r\nused on any tests changed\r\n- [x] The PR description includes the appropriate Release Notes section,\r\nand the correct `release_node:*` label is applied per the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine <[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}}]}] BACKPORT--> --------- Co-authored-by: Davis McPhee <[email protected]>
TattdCodeMonkey
pushed a commit
to TattdCodeMonkey/kibana
that referenced
this issue
Nov 21, 2024
…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]>
paulinashakirova
pushed a commit
to paulinashakirova/kibana
that referenced
this issue
Nov 26, 2024
…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]>
CAWilson94
pushed a commit
to CAWilson94/kibana
that referenced
this issue
Dec 12, 2024
…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.  Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
>enhancement
:Search Foundations/CCS
Team:Search Foundations
Meta label for the Search Foundations team in Elasticsearch
Description
In Kibana, we recently implemented the Resolve Cluster API to provide more accurate connection status for clusters in the Remote Clusters UI. However, we've encountered an issue: when a host is unresponsive, the API waits for it to time out, which causes delays.
For example:
This becomes especially problematic when we make these API calls for 10 clusters simultaneously. The cumulative timeouts significantly increase the UI's loading time, making the loading screen display much longer than expected.
Is there any way we could have a way to configure the timeout used for each cluster?
The text was updated successfully, but these errors were encountered: