Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable _mb suffixed stack monitoring func tests #98354

Merged

Conversation

jasonrhodes
Copy link
Member

These tests were disabled temporarily in #98238 because of intermittent failures in master.

Details on running this locally with a subset of tests TBD.

Note: DO NOT MERGE just because this PR is green. The tests are flaky and we need to understand why before we merge.

These tests were disabled temporarily in elastic#98238 because of intermittent failures in master.
@jasonrhodes jasonrhodes self-assigned this Apr 26, 2021
@neptunian
Copy link
Contributor

neptunian commented May 25, 2021

@jasonrhodes the description in #96205 says elastic/elasticsearch#71233 needs to be merged before all functional test will consistently pass. Can we try merging that first?

@jasonrhodes
Copy link
Member Author

@neptunian wow, I did not see that at all. I don't totally understand how a user permission error would cause intermittent failures rather than consistent ones, but based on that note, we should definitely start there and then run the tests on this PR a few times and see what happens.

Looks like that PR requires one last thing?
Just a minor nit, please update ReservedRolesStoreTests#testMonitoringUserRole.

Let's merge it as-is since it's Chris's branch, and then would you mind submitting another PR with just that last requested update? It looks like they are asking for a new test or updated test?

@neptunian
Copy link
Contributor

@elasticmachine merge upstream

@neptunian neptunian added v7.14.0 v8.0.0 release_note:skip Skip the PR/issue when compiling release notes labels May 25, 2021
@jasonrhodes
Copy link
Member Author

jasonrhodes commented May 25, 2021

This failure that just happened is the one that caused all of the intermittent issues when we turned these tests on in master.

It seems like the page just doesn't load sometimes, so the selector never finds the overall ES wrapper on the page. It can happen on lots of different pages and it's always that same timeout selector error, and just for the _mb versions of the tests. I think we should probably spin the archive up and see if it loads fine or if it times out a lot, as a starting point?

@neptunian
Copy link
Contributor

@jasonrhodes Thanks, will look into it further. I'm able to reproduce it locally.

@neptunian
Copy link
Contributor

neptunian commented May 25, 2021

I think there does seem to be a permissions problem as this is the error when it fails. User sets the date range and click the submit button. Server responds with 403 forbidden. From the functional tests:

       │ debg Find.findByCssSelector('[data-test-subj="querySubmitButton"]') with timeout=10000
       │ debg Find.waitForElementStale with timeout=10000
       │ debg browser[DEBUG] http://localhost:5620/9007199254740991/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js 2572:20 "click Metric -> (monitoring:query_submitted):"
       │ debg browser[DEBUG] http://localhost:5620/9007199254740991/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js 2572:20 Object
       │ERROR browser[SEVERE] http://localhost:5620/api/monitoring/v1/elasticsearch_settings/check/cluster - Failed to load resource: the server responded with a status of 403 (Forbidden)
       │ERROR browser[SEVERE] http://localhost:5620/api/monitoring/v1/elasticsearch_settings/check/cluster - Failed to load resource: the server responded with a status of 403 (Forbidden)
       │ERROR browser[SEVERE] http://localhost:5620/api/monitoring/v1/elasticsearch_settings/check/nodes - Failed to load resource: the server responded with a status of 403 (Forbidden)
       │ERROR browser[SEVERE] http://localhost:5620/api/monitoring/v1/elasticsearch_settings/check/cluster - Failed to load resource: the server responded with a status of 403 (Forbidden)
       │ERROR browser[SEVERE] http://localhost:5620/api/monitoring/v1/elasticsearch_settings/check/nodes - Failed to load resource: the server responded with a status of 403 (Forbidden)
       │ERROR browser[SEVERE] http://localhost:5620/api/monitoring/v1/elasticsearch_settings/check/nodes - Failed to load resource: the server responded with a status of 403 (Forbidden

Screen Shot 2021-05-25 at 3 58 30 PM

Full message:
message: "[security_exception] action [cluster:monitor/state] is unauthorized for user [basic_monitoring_user] with roles [monitoring_user,kibana_admin], this action is granted by the cluster privileges [read_ccr,transport_client,manage_ccr,monitor,manage,all]"

@jasonrhodes
Copy link
Member Author

@chrisronline hey would you mind taking a look at this error and see if you have some high level suggestions for how we should debug this? I'm wondering if we need to update the tests to take advantage of the new permissions role?

@neptunian
Copy link
Contributor

I always get 403 forbidden when I try to access stack monitoring with the role of monitoring_user.

GET /_security/role/monitoring_user does not yet have metricbeat-* listed when I try this locally. Perhaps the changes in elastic/elasticsearch#71233 haven't been reflected yet.

{
  "monitoring_user" : {
    "cluster" : [
      "cluster:monitor/main",
      "cluster:monitor/xpack/info",
      "cluster:monitor/remote/info"
    ],
    "indices" : [
      {
        "names" : [
          ".monitoring-*"
        ],
        "privileges" : [
          "read",
          "read_cross_cluster"
        ],
        "allow_restricted_indices" : false
      }
    ],
    "applications" : [
      {
        "application" : "kibana-*",
        "privileges" : [
          "reserved_monitoring"
        ],
        "resources" : [
          "*"
        ]
      }
    ],
    "run_as" : [ ],
    "metadata" : {
      "_reserved" : true
    },
    "transient_metadata" : {
      "enabled" : true
    }
  }
}

@jasonrhodes
Copy link
Member Author

@neptunian interesting, do you have any theories about how this would manifest in functional tests but only some of the time and not every time that user is used? My assumption is we use that user consistently in some tests?

@jasonrhodes
Copy link
Member Author

@elasticmachine merge upstream

@neptunian
Copy link
Contributor

neptunian commented May 26, 2021

I see now that getting the 403 forbidden on these requests isn't the cause of the failure. When the user submits the query a request is made to api/monitoring/v1/clusters with the timerange. Even the non mb test will often initially get a 403 forbidden but still be able to query and get the cluster. The actual reason it fails is because there is no data in the response when requesting api/monitoring/v1/clusters with the time range queries the user submits. When I test this locally with the basic_monitoring_user I also don't get data back. It probably does not get data back due to lack of access to metricbeat-*. When I switch to the elastic user I do get data back using the same date range. Why it sometimes works in the functional tests I'm still trying to figure out.

@neptunian
Copy link
Contributor

I have found that when the mb test DOES pass, it is not using basic_monitoring_user. It is for some reason using test_user which is a superuser and so it gains access to metricbeat-*. Looking into how this is happening.

@neptunian
Copy link
Contributor

@elasticmachine merge upstream

@jasonrhodes
Copy link
Member Author

Looks like we're getting consistent failures now while we wait for the ES snapshot to update, is that right? One question I have is that when these were failing in master, the failures were spread out over at least 5 or 6 of the _mb test files. It looks like only one test is failing in Jenkins right now. Do we understand why that would be?

@neptunian
Copy link
Contributor

neptunian commented May 27, 2021

Looks like we're getting consistent failures now while we wait for the ES snapshot to update, is that right?

Yes.

One question I have is that when these were failing in master, the failures were spread out over at least 5 or 6 of the _mb test files. It looks like only one test is failing in Jenkins right now. Do we understand why that would be?

I don't think it keeps running after it encounters a first failure? And it would be normal that different mb files would fail depending on whether it was logged in as a superuser or not. When I run it locally all the test failures are from mb tests and there are several.

@jasonrhodes
Copy link
Member Author

I don't think it keeps running after it encounters a first failure?

Ok, if this is true, then that makes sense. Thanks!

@neptunian
Copy link
Contributor

@elasticmachine merge upstream

@jasonrhodes
Copy link
Member Author

wow cat

@neptunian
Copy link
Contributor

@jasonrhodes I confirmed the ES snapshot was updated and tests are passing now as expected. Shall we run it a few more times and then merge?

@jasonrhodes
Copy link
Member Author

@jasonrhodes I confirmed the ES snapshot was updated and tests are passing now as expected.

Amazing! Great work!!!

Shall we run it a few more times and then merge?

Sounds like a good plan.

@neptunian
Copy link
Contributor

@elasticmachine merge upstream

@jasonrhodes
Copy link
Member Author

@neptunian can you approve this since I created it?

@neptunian
Copy link
Contributor

@elasticmachine merge upstream

@rashmivkulkarni
Copy link
Contributor

Screen Shot 2021-05-28 at 10 20 44 AM

Just FYI :

      await security.testUser.setRoles(['kibana_admin'], true);
    });

Setting the parameter to true- will refresh the page .

@neptunian neptunian force-pushed the reenable-metricbeat-monitoring-tests branch from cfce762 to 0c0d917 Compare May 28, 2021 23:42
@neptunian
Copy link
Contributor

@neptunian
Copy link
Contributor

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

References to deprecated APIs

id before after diff
canvas 29 25 -4
crossClusterReplication 8 6 -2
fleet 22 20 -2
globalSearch 4 2 -2
indexManagement 12 7 -5
infra 261 149 -112
lens 67 45 -22
licensing 18 15 -3
lists 239 236 -3
maps 286 208 -78
ml 121 115 -6
monitoring 109 56 -53
securitySolution 390 346 -44
stackAlerts 101 95 -6
total -342

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @jasonrhodes

@neptunian neptunian merged commit 8ed2add into elastic:master Jun 1, 2021
neptunian added a commit to neptunian/kibana that referenced this pull request Jun 1, 2021
* Reenabled _mb suffixed stack monitoring func tests

These tests were disabled temporarily in elastic#98238 because of intermittent failures in master.

* use test_user instead of basic_monitoring_user

* remove security service

* remove logout and cleanup

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: neptunian <[email protected]>
neptunian pushed a commit to neptunian/kibana that referenced this pull request Jun 1, 2021
* Reenabled _mb suffixed stack monitoring func tests

These tests were disabled temporarily in elastic#98238 because of intermittent failures in master.

* use test_user instead of basic_monitoring_user

* remove security service

* remove logout and cleanup

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: neptunian <[email protected]>
# Conflicts:
#	x-pack/test/functional/apps/monitoring/index.js
neptunian pushed a commit to neptunian/kibana that referenced this pull request Jun 1, 2021
* Reenabled _mb suffixed stack monitoring func tests

These tests were disabled temporarily in elastic#98238 because of intermittent failures in master.

* use test_user instead of basic_monitoring_user

* remove security service

* remove logout and cleanup

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: neptunian <[email protected]>
# Conflicts:
#	x-pack/test/functional/apps/monitoring/index.js
neptunian added a commit that referenced this pull request Jun 1, 2021
* Reenabled _mb suffixed stack monitoring func tests

These tests were disabled temporarily in #98238 because of intermittent failures in master.

* use test_user instead of basic_monitoring_user

* remove security service

* remove logout and cleanup

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: neptunian <[email protected]>

Co-authored-by: Jason Rhodes <[email protected]>
Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:skip Skip the PR/issue when compiling release notes v7.13.2 v7.14.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants