Make REST API check stricter #882

Merged
27 changes: 19 additions & 8 deletions esrally/client.py
@@ -128,29 +128,40 @@ def create(self):
return elasticsearch.Elasticsearch(hosts=self.hosts, ssl_context=self.ssl_context, **self.client_options)


-def wait_for_rest_layer(es, max_attempts=20):
+def wait_for_rest_layer(es, max_attempts=40):
     """
     Waits for ``max_attempts`` until Elasticsearch's REST API is available.

     :param es: Elasticsearch client to use for connecting.
     :param max_attempts: The maximum number of attempts to check whether the REST API is available.
-    :return: True iff Elasticsearch is available.
+    :return: True iff Elasticsearch's REST API is available.
     """
+    # assume that at least the hosts that we expect to contact should be available. Note that this is not 100%
+    # bullet-proof as a cluster could have e.g. dedicated masters which are not contained in our list of target hosts
+    # but this is still better than just checking for any random node's REST API being reachable.
Contributor:

This is a very useful comment to have.

Question: would it be dangerous to trigger a sniff of the eligible HTTP hosts (e.g. via a helper method in our EsClientFactory invoking elasticsearch.Transport#sniff_hosts())? I was thinking that if we explicitly ask for a fresh list of hosts before the check, then no unavailable hosts should be reachable. The same call could then be invoked before the load driver starts. The caveat with this approach is that it could potentially override the explicit list provided by --target-hosts. Thoughts?
Member Author (@danielmitterdorfer), Jan 29, 2020:

I'd try not to build any smartness into this. Without reading all of the involved code, I don't think we can reason about which nodes the sniff_hosts call will return during cluster bootstrap (assume not all nodes are up yet, or not all of them have opened the HTTP port). I was even considering exposing an explicit command line parameter but thought that this would be a good compromise.

Contributor:

Agreed. Let's keep things simple here.

For the record, I had a look at what the elasticsearch-py client does when a sniff is invoked, and it collects a list of eligible HTTP hosts via /_nodes/_all/http.
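For context, here is a rough sketch of how a sniff could turn the /_nodes/_all/http response into a host list. The helper function and the sample payload below are illustrative assumptions, not elasticsearch-py internals; the field names follow the Elasticsearch nodes-info API:

```python
# Hypothetical sketch: derive HTTP hosts from a /_nodes/_all/http response,
# roughly mirroring what client-side sniffing does.
def http_hosts_from_nodes_info(nodes_info):
    """Extract host/port pairs from a nodes-info API response."""
    hosts = []
    for node in nodes_info.get("nodes", {}).values():
        publish_address = node.get("http", {}).get("publish_address")
        if publish_address:  # nodes whose HTTP layer is not up are skipped
            host, _, port = publish_address.rpartition(":")
            hosts.append({"host": host, "port": int(port)})
    return hosts

# Made-up sample payload in the shape of a nodes-info response.
sample = {
    "nodes": {
        "abc123": {"http": {"publish_address": "10.0.0.1:9200"}},
        "def456": {"http": {"publish_address": "10.0.0.2:9200"}},
        "ghi789": {},  # e.g. a node that has not opened its HTTP port yet
    }
}
print(http_hosts_from_nodes_info(sample))
# → [{'host': '10.0.0.1', 'port': 9200}, {'host': '10.0.0.2', 'port': 9200}]
```

This also illustrates the bootstrap concern raised above: a node without an open HTTP port simply never appears in the sniffed list, so the list can be shorter than --target-hosts.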

+    expected_node_count = len(es.transport.hosts)
     logger = logging.getLogger(__name__)
     for attempt in range(max_attempts):
-        logger.debug("REST API is available after %s attempts", attempt)
         import elasticsearch
         try:
-            es.info()
+            # see also WaitForHttpResource in Elasticsearch tests. Contrary to the ES tests we consider the API also
+            # available when the cluster status is RED (as long as all required nodes are present)
+            es.cluster.health(wait_for_nodes=">={}".format(expected_node_count))
+            logger.info("REST API is available for >= [%s] nodes after [%s] attempts.", expected_node_count, attempt)
             return True
         except elasticsearch.ConnectionError as e:
             if "SSL: UNKNOWN_PROTOCOL" in str(e):
                 raise exceptions.SystemSetupError("Could not connect to cluster via https. Is this an https endpoint?", e)
             else:
-                time.sleep(1)
+                logger.debug("Got connection error on attempt [%s]. Sleeping...", attempt)
+                time.sleep(3)
         except elasticsearch.TransportError as e:
-            if e.status_code == 503:
-                time.sleep(1)
-            elif e.status_code == 401:
-                time.sleep(1)
+            # cluster block, x-pack not initialized yet, our wait condition is not reached
+            if e.status_code in (503, 401, 408):
+                logger.debug("Got status code [%s] on attempt [%s]. Sleeping...", e.status_code, attempt)
+                time.sleep(3)
Contributor:

With the default max_attempts=20, the worst-case scenario would mean waiting 3*20=60s, plus whatever time is spent executing the 20 API calls. Given the (potential) difference in performance between hosts building from source, should we increase this, e.g. to a 6s sleep (2 minutes in total)?

Member Author:

I'm fine with increasing this, although I'd opt for more retries instead of a larger sleep period.

Contributor:

More retries is fine by me too.

Member Author:

I've increased the number of retries to 40 now in cbf6dec.

             else:
+                logger.warning("Got unexpected status code [%s] on attempt [%s].", e.status_code, attempt)
                 raise e
     return False
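The timeout discussed in the review thread is easy to sanity-check: ignoring the duration of the API calls themselves, the time spent sleeping is bounded by attempts × sleep interval. A back-of-the-envelope sketch (the helper name is ours, not esrally's):

```python
# Worst-case sleep time in wait_for_rest_layer: every attempt fails and
# sleeps for the full interval before retrying.
def worst_case_sleep(max_attempts, sleep_seconds):
    return max_attempts * sleep_seconds

print(worst_case_sleep(20, 3))  # → 60   (3s sleep with the old default of 20 attempts)
print(worst_case_sleep(40, 3))  # → 120  (3s sleep with the new default of 40 attempts)
```

This matches the review outcome: keeping the 3-second sleep and doubling the retries to 40 yields the roughly two-minute budget the reviewer asked for.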
29 changes: 20 additions & 9 deletions tests/client_test.py
@@ -277,43 +277,54 @@ def test_create_https_connection_unverified_certificate_present_client_certifica


class RestLayerTests(TestCase):
-    @mock.patch("elasticsearch.Elasticsearch", autospec=True)
+    @mock.patch("elasticsearch.Elasticsearch")
     def test_successfully_waits_for_rest_layer(self, es):
+        es.transport.hosts = [
+            {"host": "node-a.example.org", "port": 9200},
+            {"host": "node-b.example.org", "port": 9200}
+        ]
+
         self.assertTrue(client.wait_for_rest_layer(es, max_attempts=3))

+        es.cluster.health.assert_has_calls([
+            mock.call(wait_for_nodes=">=2"),
+        ])

     # don't sleep in realtime
     @mock.patch("time.sleep")
-    @mock.patch("elasticsearch.Elasticsearch", autospec=True)
+    @mock.patch("elasticsearch.Elasticsearch")
     def test_retries_on_transport_errors(self, es, sleep):
         import elasticsearch

-        es.info.side_effect = [
+        es.cluster.health.side_effect = [
             elasticsearch.TransportError(503, "Service Unavailable"),
             elasticsearch.TransportError(401, "Unauthorized"),
+            elasticsearch.TransportError(408, "Timed Out"),
+            elasticsearch.TransportError(408, "Timed Out"),
             {
                 "version": {
                     "number": "5.0.0",
                     "build_hash": "abc123"
                 }
             }
         ]
-        self.assertTrue(client.wait_for_rest_layer(es, max_attempts=3))
+        self.assertTrue(client.wait_for_rest_layer(es, max_attempts=5))

     # don't sleep in realtime
     @mock.patch("time.sleep")
-    @mock.patch("elasticsearch.Elasticsearch", autospec=True)
-    def test_dont_retries_eternally_on_transport_errors(self, es, sleep):
+    @mock.patch("elasticsearch.Elasticsearch")
+    def test_dont_retry_eternally_on_transport_errors(self, es, sleep):
         import elasticsearch

-        es.info.side_effect = elasticsearch.TransportError(401, "Unauthorized")
+        es.cluster.health.side_effect = elasticsearch.TransportError(401, "Unauthorized")
         self.assertFalse(client.wait_for_rest_layer(es, max_attempts=3))

-    @mock.patch("elasticsearch.Elasticsearch", autospec=True)
+    @mock.patch("elasticsearch.Elasticsearch")
     def test_ssl_error(self, es):
         import elasticsearch
         import urllib3.exceptions

-        es.info.side_effect = elasticsearch.ConnectionError("N/A",
+        es.cluster.health.side_effect = elasticsearch.ConnectionError("N/A",
             "[SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:719)",
             urllib3.exceptions.SSLError(
                 "[SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:719)"))