Elastic client update #34

Merged
GuyAv46 merged 11 commits into multiclient_tool from elastic_client_update
Aug 16, 2022
@GuyAv46 GuyAv46 commented Jul 12, 2022

This adds HNSW support to the Elasticsearch client. It also requires a certificate for connecting to the server (if you think that's redundant, let me know).
To use the Elasticsearch client, pass the username and password with --user and --auth as before, or set the ELASTIC_PASSWORD environment variable.
To pass a certificate file, set the ELASTIC_CA environment variable.
I'm assuming we have a server with security settings enabled. If not, this is redundant, but it should work anyway.
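
The credential handling described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper name `build_es_kwargs` is hypothetical, and the precedence (a value passed with --auth wins over ELASTIC_PASSWORD) is an assumption. `basic_auth` and `ca_certs` are the client keyword arguments that also appear in the suggestion later in this thread.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_es_kwargs(
    conn_params: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Collect client keyword arguments from CLI values and the environment.

    Hypothetical helper: the name and the precedence rules are assumptions.
    """
    env = os.environ if env is None else env
    # Fall back to the built-in 'elastic' user when --user is not given.
    user = conn_params.get("user") or "elastic"
    # Assumed precedence: --auth first, then ELASTIC_PASSWORD, then empty.
    password = conn_params.get("auth") or env.get("ELASTIC_PASSWORD", "")
    kwargs: Dict[str, Any] = {"basic_auth": (user, password)}
    # Only pass ca_certs when ELASTIC_CA points at a certificate file.
    ca_path = env.get("ELASTIC_CA")
    if ca_path:
        kwargs["ca_certs"] = ca_path
    return kwargs
```

The result could then be unpacked into the client constructor, for example `Elasticsearch(url, **build_es_kwargs(conn_params))`. Leaving `ca_certs` out when ELASTIC_CA is unset keeps the sketch usable against a server without a custom CA.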

@GuyAv46 GuyAv46 marked this pull request as ready for review July 17, 2022 16:08
@filipecosta90

@GuyAv46 I noticed we're missing the elasticsearch requirement. Should we add elasticsearch>=8.3.1?

@filipecosta90 filipecosta90 self-requested a review July 18, 2022 13:03
@filipecosta90

@GuyAv46 I suggest we simplify the host/port handling, because connecting currently fails with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_serializer.py", line 93, in loads
    return self.json_loads(data)
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_serializer.py", line 89, in json_loads
    return json.loads(data)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/ann-benchmarks/ann_benchmarks/main.py", line 40, in run_worker
    run(definition, args.dataset, args.count, args.runs, args.batch,
  File "/root/ann-benchmarks/ann_benchmarks/runner.py", line 102, in run
    algo = instantiate_algorithm(definition)
  File "/root/ann-benchmarks/ann_benchmarks/algorithms/definitions.py", line 24, in instantiate_algorithm
    return constructor(*definition.arguments)
  File "/root/ann-benchmarks/ann_benchmarks/algorithms/elasticsearch.py", line 62, in __init__
    es_wait(self.es)
  File "/root/ann-benchmarks/ann_benchmarks/algorithms/elasticsearch.py", line 29, in es_wait
    res = es.cluster.health(wait_for_status='yellow', timeout='1s')
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/cluster.py", line 454, in health
    return self.perform_request(  # type: ignore[return-value]
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 390, in perform_request
    return self._client.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 286, in perform_request
    meta, resp_body = self.transport.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_transport.py", line 348, in perform_request
    data = self.serializers.loads(raw_data, meta.mimetype)
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_serializer.py", line 196, in loads
    return self.get_serializer(mimetype).loads(data)
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_serializer.py", line 95, in loads
    raise SerializationError(
elastic_transport.SerializationError: Unable to deserialize as JSON: b'Client sent an HTTP request to an HTTPS server.\n'

Let's simply use something like:

def __init__(self, metric: str, dimension: int, conn_params, method_param):
    self.name = f"elasticsearch-script-score-query_metric={metric}_dimension={dimension}_params{method_param}"
    self.metric = {"euclidean": 'l2_norm', "angular": 'cosine'}[metric]
    self.method_param = method_param
    self.dimension = dimension
    self.timeout = 60 * 60
    h = conn_params['host'] if conn_params['host'] is not None else 'localhost'
    p = conn_params['port'] if conn_params['port'] is not None else '9200'
    u = conn_params['user'] if conn_params['user'] is not None else 'elastic'
    a = conn_params['auth'] if conn_params['auth'] is not None else ''
    self.index = "ann_benchmark"
    self.es = Elasticsearch(f"{h}:{p}", request_timeout=self.timeout, basic_auth=(u, a), ca_certs=environ.get('ELASTIC_CA', DEFAULT))
    self.batch_res = []
    es_wait(self.es)
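
The "Client sent an HTTP request to an HTTPS server" message in the traceback above indicates the request went out with the wrong scheme. One way to guard against that is to normalize the host and port into a full URL before constructing the client. The sketch below is an assumption-laden illustration, not part of the PR: `build_es_url` is a hypothetical helper, defaulting to `https` is an assumption based on the secured server this PR targets, and IPv6 literals are not handled. As far as I know, the 8.x Python client expects node URLs to carry an explicit scheme, so a bare `localhost:9200` may be rejected.

```python
from urllib.parse import urlsplit


def build_es_url(host=None, port=None, default_scheme="https"):
    """Return a 'scheme://host:port' URL for the Elasticsearch client.

    Hypothetical helper; the https default is an assumption.
    """
    host = host or "localhost"
    # Add a scheme when the caller passed a bare host such as 'localhost'.
    if "://" not in host:
        host = f"{default_scheme}://{host}"
    parts = urlsplit(host)
    # An explicit port argument wins, then a port embedded in the host,
    # then 9200 as the fallback.
    if port is None:
        port = parts.port or 9200
    return f"{parts.scheme}://{parts.hostname}:{int(port)}"
```

With this, `--host https://<AAAAAAAAA> --port 443` and a plain `--host localhost` would both produce a URL that includes a scheme, which could be passed as the first argument to `Elasticsearch(...)`.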

Furthermore, once the benchmark starts running, we hit timeouts after 1-2 minutes:

root@ip-172-31-57-10:~/ann-benchmarks# python3 run.py --algorithm elasticsearch --dataset dbpedia-768 --runs 1 --run-group M-4 --host https://<AAAAAAAAA> --port 443 --auth <AAAAAAA> --local 
Changing the workdir to /root/ann-benchmarks
2022-07-18 13:15:19,517 - annb - INFO - running only elasticsearch algorithms
2022-07-18 13:15:19,517 - annb - INFO - running only M-4 run groups
2022-07-18 13:15:19,573 - annb - INFO - Order: [Definition(algorithm='elasticsearch', run_group='M-4', constructor='ElasticsearchScriptScoreQuery', module='ann_benchmarks.algorithms.elasticsearch', docker_tag='ann-benchmarks-elasticsearch', arguments=['angular', 768, {'host': 'https://vecsim.es.us-east-1.aws.found.io', 'port': 443, 'auth': 'Q0bna41oBWcZsuR785Dr891z', 'user': None, 'cluster': False, 'shards': '1'}, {'m': 4, 'ef_construction': 500, 'type': 'hnsw'}], query_argument_groups=[[10], [20], [40], [80], [120], [200], [400], [600], [800]], disabled=False)]
Trying to instantiate ann_benchmarks.algorithms.elasticsearch.ElasticsearchScriptScoreQuery(['angular', 768, {'host': 'https://vecsim.es.us-east-1.aws.found.io', 'port': 443, 'auth': 'Q0bna41oBWcZsuR785Dr891z', 'user': None, 'cluster': False, 'shards': '1'}, {'m': 4, 'ef_construction': 500, 'type': 'hnsw'}])
Waiting for elasticsearch health endpoint...
Elasticsearch is ready
got a train set of size (1000000 * 768)
2022-07-18 13:22:00,893 - elastic_transport.node_pool - WARNING - Node <Urllib3HttpNode(https://vecsim.es.us-east-1.aws.found.io:443)> has failed for 1 times in a row, putting on 1 second timeout
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/ann-benchmarks/ann_benchmarks/main.py", line 40, in run_worker
    run(definition, args.dataset, args.count, args.runs, args.batch,
  File "/root/ann-benchmarks/ann_benchmarks/runner.py", line 146, in run
    algo.fit(X_train, **fit_kwargs)
  File "/root/ann-benchmarks/ann_benchmarks/algorithms/elasticsearch.py", line 98, in fit
    (_, errors) = bulk(self.es, gen(), chunk_size=500, max_retries=9)
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/helpers/actions.py", line 339, in _process_bulk_chunk
    resp = client.bulk(*args, operations=bulk_actions, **kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/__init__.py", line 693, in bulk
    return self.perform_request(  # type: ignore[return-value]
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 286, in perform_request
    meta, resp_body = self.transport.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_transport.py", line 329, in perform_request
    meta, raw_data = node.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/elastic_transport/_node/_http_urllib3.py", line 199, in perform_request
    raise err from None
elastic_transport.ConnectionTimeout: Connection timed out

So let's also try disabling index refreshes during the bulk load by changing refresh_interval, and then re-enabling them at the end.
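
The refresh_interval idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `refresh_disabled` is a hypothetical helper, `es` is assumed to be an `elasticsearch` 8.x client (whose `indices.put_settings` and `indices.refresh` methods are used here), and restoring to `"1s"` assumes the index was using the Elasticsearch default interval.

```python
from contextlib import contextmanager


@contextmanager
def refresh_disabled(es, index, restore_to="1s"):
    """Turn off periodic refreshes for `index` while the body runs.

    Hypothetical helper; `restore_to` assumes the default 1s interval.
    """
    # A refresh_interval of -1 disables automatic refreshes.
    es.indices.put_settings(
        index=index, settings={"index": {"refresh_interval": "-1"}}
    )
    try:
        yield
    finally:
        # Restore the interval, then refresh once so the bulk-loaded
        # documents become searchable.
        es.indices.put_settings(
            index=index, settings={"index": {"refresh_interval": restore_to}}
        )
        es.indices.refresh(index=index)
```

In `fit`, the existing call could then be wrapped as `with refresh_disabled(self.es, self.index): bulk(self.es, gen(), chunk_size=500, max_retries=9)`. Whether this alone removes the connection timeouts is not established here; it only reduces refresh work during indexing.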

@GuyAv46 GuyAv46 force-pushed the elastic_client_update branch from 92e2afa to e40f17c Compare July 19, 2022 07:56
@GuyAv46 GuyAv46 merged commit 3b5012d into multiclient_tool Aug 16, 2022