Upgrade ES dependencies to match cluster version #3029

sarayourfriend · 2023-09-15T00:50:18Z

Fixes

Related to #3028 by @sarayourfriend
Fixes #2744 by @obulat

Description

Adds an explicit dependency on the Elasticsearch client library, elasticsearch-py, pinned to our cluster version (8.8.2). This also updates the DSL library to the latest version, 8.9, which is the first version to support ES 8. We hardly use the DSL in any significant way here anyway, so I don't think the version mismatch between the DSL (8.9) and the client (8.8.2) will be an issue.

I've done this in the API and the ingestion server. I've also updated the code to work with the new version. I read through the breaking changes and didn't see anything that would affect us, but I need to look again more closely for the ingestion server, with whose ES usage I'm less familiar.

https://www.elastic.co/guide/en/elasticsearch/client/python-api/master/migration.html

Testing Instructions

CI should pass, including the build and the unit and integration tests. To test locally, you must build web and ingestion_server (just build web && just build ingestion_server) to get the new dependencies. Then run just api/up and just api/init, make searches, and test the Django app generally. just api/init exercises the ingestion server in addition to what the explicit integration tests cover.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
[N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

sarayourfriend · 2023-09-15T06:42:09Z

This is marked critical as part of the API memory leak issue. I don't know that this will fix the problem but I wouldn't be surprised, given the changes in ES, if a client version mismatch caused something like this.

I'm out of time today, but if someone else is able to reproduce the memory increase on main and then check if it's fixed here, that would be great. We need to do this update regardless, and deploying it does not need to happen now because any deployment causes the memory to drop back down. My thinking is that we can just deploy this at the start of the week like we usually do, see if the memory leak is fixed, and if not, start digging into other areas (if we aren't able to reproduce it locally and identify the root cause).

In any case, it needs to be reviewed "ASAP" for us to be able to deploy it early next week, so it is still "critical" in that respect. This also needs to be deployed before #3011 to avoid too many significant changes going in at once. That PR still needs infrastructure PR work anyway.

dhruvkb

The code changes look good to me. I'll wait for CI to complete before approving. I also have two small nits.

Update: Accidentally hit "Approve" out of habit 🤦.

api/api/models/media.py

dhruvkb · 2023-09-15T06:38:09Z

api/test/unit/conftest.py

+    # If pook was activated by a test and not deactivated
+    # (usually because the test failed and something prevent
+    # pook from cleaning up after itself), disable here so that


I didn't know there was a possibility that pook could not clean up. That's quite bad for the isolation of the tests.

I've only seen it in some cases, and I wonder if it has to do with async stuff, but yes, I have seen this particular factory run and have errors logged at the end of the test run because pook was still intercepting requests. IIRC it was only on tests that relied on pook.use as a context manager, where the code inside the context raised an exception, so there may be some kind of bug in pook there.

ingestion_server/ingestion_server/es_helpers.py

api/api/models/media.py

sarayourfriend · 2023-09-15T06:45:18Z

api/conf/settings/elasticsearch.py

    es_url = config("ELASTICSEARCH_URL", default="localhost")
    es_port = config("ELASTICSEARCH_PORT", default=9200, cast=int)
    es_aws_region = config("ELASTICSEARCH_AWS_REGION", default="us-east-1")

+    es_endpoint = f"{es_scheme}{es_url}:{es_port}"


It is no longer possible to easily construct the endpoint from the Elasticsearch client internals. We do this in tests. That was hacky anyway. It's much clearer and cleaner to just save and export it from the place it's centrally built anyway.
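As a rough illustration of building the endpoint in one place and exporting it, here is a minimal sketch. It uses os.environ as a stand-in for the config() helper in the diff, and the ELASTICSEARCH_SCHEME variable name and its "http://" default are assumptions, since the diff does not show where es_scheme comes from.

```python
import os


def build_es_endpoint(env=None):
    """Build the Elasticsearch endpoint URL once, from environment-style settings."""
    env = os.environ if env is None else env
    # Assumption: the scheme value carries its own "://" suffix, as the
    # f-string in the diff implies. The variable name is hypothetical.
    scheme = env.get("ELASTICSEARCH_SCHEME", "http://")
    host = env.get("ELASTICSEARCH_URL", "localhost")
    port = int(env.get("ELASTICSEARCH_PORT", 9200))
    return f"{scheme}{host}:{port}"


# Module-level value that both application code and tests can import,
# instead of reconstructing it from client internals.
ES_ENDPOINT = build_es_endpoint()
```

Tests can then import ES_ENDPOINT directly when setting up request matchers.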

Not for this PR, but will we need this endpoint for the ingestion server tests in the future? The functions seem so similar, and the return is one of the biggest differences.

Maybe? The ingestion server tests already construct their own ES connection rather than re-using the code's, so I don't know how relevant each implementation is to the other at the moment (not to mention we don't have the facility to share Python code between projects yet, I don't think).

sarayourfriend · 2023-09-15T06:46:13Z

api/test/factory/models/media.py

+        pook_active = pook.isactive()
+        if pook_active:
+            # Temporarily disable pook so that the calls to ES to create
+            # the factory document don't fail
+            pook.disable()


This change (and the change later on in this function) makes it so we don't need to juggle pook inside a function to avoid it catching factory requests. Now @pook.on can be applied to a whole test function without worry.

I included this in this PR because as part of the ES client's breaking changes, I had to make several updates to tests, mostly removing really flaky and ugly implementation-specific mocks and replacing them with pook matchers. Needing to juggle pook being on or off in all those functions was too tedious.
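The pause-and-restore pattern described here could be sketched as a context manager. To keep the example runnable without dependencies, it uses a hypothetical FakeInterceptor class in place of pook. With pook itself, the corresponding calls would presumably be pook.isactive(), pook.disable(), and pook.activate(), but that mapping is an assumption and not necessarily how the factory in this PR is written.

```python
from contextlib import contextmanager


class FakeInterceptor:
    """Stand-in for an HTTP mocking engine such as pook (hypothetical class)."""

    def __init__(self):
        self.active = False

    def isactive(self):
        return self.active

    def activate(self):
        self.active = True

    def disable(self):
        self.active = False


@contextmanager
def interceptor_paused(interceptor):
    """Disable the interceptor for the duration of the block, then restore it."""
    was_active = interceptor.isactive()
    if was_active:
        interceptor.disable()
    try:
        yield
    finally:
        # Restore only if it was active on entry, even if the block raised.
        if was_active:
            interceptor.activate()
```

Wrapping the factory's real ES calls in such a block would let a test keep the interceptor on for its whole body while the factory's own requests pass through.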

sarayourfriend · 2023-09-15T06:48:18Z

api/test/unit/controllers/test_search_controller.py

@@ -714,6 +716,7 @@ def test_post_process_results_recurses_as_needed(
        .body(re.compile('from":0'))
        .times(1)
        .reply(200)
+        .header("x-elastic-product", "Elasticsearch")


This really is necessary. The client checks for this header and throws an error about "not supporting an unknown product" 😅. It's an easy one-line fix, luckily!
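To illustrate why mocked replies need the header, here is a simplified stand-in for that kind of product check. It is not the elasticsearch-py implementation: the function and exception names are hypothetical, and the real client's logic may differ in detail.

```python
class UnknownProductError(Exception):
    """Raised when a response does not identify itself as Elasticsearch."""


def verify_product(headers):
    """Simplified stand-in for a client-side product check on response headers."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    if normalised.get("x-elastic-product") != "Elasticsearch":
        raise UnknownProductError(
            "response is missing the header X-Elastic-Product: Elasticsearch"
        )
```

A mocked reply without the header fails a check like this, which is why each pook reply in the tests adds it.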

api/api/models/media.py

obulat

So nice that the DSL library is finally updated!
Thank you for opening this PR and explaining the changes to the pook setup. My main blocker here is replacing http_auth with basic_auth. http_auth is deprecated, so it's probably still usable, but it's better to replace it now than to wait until it's completely removed.

obulat · 2023-09-18T13:07:56Z

api/conf/settings/elasticsearch.py

    es_url = config("ELASTICSEARCH_URL", default="localhost")
    es_port = config("ELASTICSEARCH_PORT", default=9200, cast=int)
    es_aws_region = config("ELASTICSEARCH_AWS_REGION", default="us-east-1")

+    es_endpoint = f"{es_scheme}{es_url}:{es_port}"


Not for this PR, but will we need this endpoint for the ingestion server tests in the future? The functions seem so similar, and the return is one of the biggest differences.

obulat · 2023-09-18T13:19:47Z

ingestion_server/test/integration_test.py

+        es = Elasticsearch(
+            endpoint,
+            node_class=RequestsHttpNode,
+            request_timeout=10,


Why was timeout changed to request_timeout here? These seem to be different values.

Also, the http_auth from line 475 is deprecated, and should be replaced with basic_auth (I don't know if it has any effect when the value is None, though)

timeout does not exist any more. request_timeout configures the length of time to wait for a request to ES before timing out: https://www.elastic.co/guide/en/elasticsearch/client/python-api/master/config.html#timeouts

Here's the documentation for the old timeout parameter, which appears to do the same thing, but by a different name: https://elasticsearch-py.readthedocs.io/en/v7.17.11/api.html?highlight=timeout#timeout

Both configure the global request timeout for all requests, as far as I can tell? Are you finding anything different that I'm missing?

Elasticsearch does not support AWS auth directly, and trying to pass it to basic_auth breaks the integration. I've found a nasty workaround, but it would probably be much better to find out if we can remove this bit altogether.

I'll go ahead and remove it and we can see whether staging is still able to communicate with Elasticsearch (I really can't see why it wouldn't, but I could very easily be missing something) and roll back to the workaround if staging fails.

Removed in f138fec

Sorry, you're right. I was reading the __init__ code of the Elasticsearch class, and saw both timeout and request_timeout there, but didn't see the check that shows that timeout is deprecated below that: "The 'timeout' parameter is deprecated in favor of 'request_timeout'".
Both seem to configure the global request timeouts for all requests at init time. It is also possible to set a different request_timeout for the specific request using options: es.options(request_timeout=5).search(): https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/config.html#timeouts

I'll go ahead and remove it and we can see whether staging is still able to communicate with Elasticsearch (I really can't see why it wouldn't, but I could very easily be missing something) and roll back to the workaround if staging fails.
So, the auth failure would be something to watch for when deploying this change, right?
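The parameter changes discussed in this thread (a single endpoint URL instead of separate host and port arguments, and request_timeout instead of timeout) could be summarised with a small translation helper. This is only a sketch: to_v8_client_args is a hypothetical name, and it covers just the two changes mentioned here, not the full migration guide.

```python
def to_v8_client_args(host, port, scheme="http", **legacy_kwargs):
    """Translate 7.x-style connection settings into an 8.x-style (url, kwargs) pair."""
    kwargs = dict(legacy_kwargs)
    # `timeout` is deprecated in favour of `request_timeout`; an explicit
    # `request_timeout` wins if both were passed.
    if "timeout" in kwargs:
        legacy_timeout = kwargs.pop("timeout")
        kwargs.setdefault("request_timeout", legacy_timeout)
    # The endpoint is passed as one URL rather than separate host and port.
    return f"{scheme}://{host}:{port}", kwargs
```

The resulting pair would be passed along the lines of Elasticsearch(url, **kwargs). As noted above, a single request can still override the global value with es.options(request_timeout=5).search().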

obulat · 2023-09-18T13:22:51Z

ingestion_server/ingestion_server/es_helpers.py

-        port=elasticsearch_port,
-        connection_class=RequestsHttpConnection,
+        es_endpoint,
+        node_class=RequestsHttpNode,
        http_auth=auth,


http_auth from line 74 is deprecated, and should be replaced with basic_auth.

Thanks for catching that!

BTW, do you know if we still need this authentication stuff? Our clusters are not accessible on the public internet and we don't need any auth to make connections via our jumpbox, for example 🤔 This is old code, and I'm not sure how to determine whether it's necessary. Something to keep an eye on though in case these parameters keep changing or get more complex.

I've removed it in f138fec as per the other conversation about the parameter.

BTW, do you know if we still need this authentication stuff?

I really don't know, and your explanations sound reasonable. Let's try :)

openverse-bot · 2023-09-19T00:00:12Z

Based on the critical urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@obulat
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 1 day(s) ago. PRs labelled with critical urgency are expected to be reviewed within 1 weekday(s)².

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.


github-actions bot added the 🧱 stack: api Related to the Django API label Sep 15, 2023

openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Sep 15, 2023

sarayourfriend force-pushed the update/es-py-to-match-cluster branch from 2105137 to fac91a2 Compare September 15, 2023 01:10

github-actions bot added the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Sep 15, 2023

sarayourfriend force-pushed the update/es-py-to-match-cluster branch from fac91a2 to edb9d30 Compare September 15, 2023 06:34

sarayourfriend marked this pull request as ready for review September 15, 2023 06:34

sarayourfriend requested a review from a team as a code owner September 15, 2023 06:34

sarayourfriend requested review from krysal and stacimc September 15, 2023 06:34

dhruvkb approved these changes Sep 15, 2023


sarayourfriend commented Sep 15, 2023


dhruvkb reviewed Sep 15, 2023


sarayourfriend added 4 commits September 18, 2023 13:04

Upgrade ES dependencies to match cluster version

503ac82

Fix ingestion server test ES connection instantiation

55b1c3b

Remove unnecessary error catching

fa1501f

Reduce redundant variable names

a20c35a

sarayourfriend force-pushed the update/es-py-to-match-cluster branch from a440206 to a20c35a Compare September 18, 2023 03:05

This was referenced Sep 18, 2023

Django ASGI #2843

Closed

Run the app as ASGI #3011

Merged

obulat requested changes Sep 18, 2023


sarayourfriend requested a review from obulat September 18, 2023 23:20

Replace http_auth with basic_auth

8c96560

Workaround Elasticsearch client's lack of AWS auth support

4e33f83

sarayourfriend added 2 commits September 19, 2023 11:20

Remove IAM based authentication

f138fec

We do not use this in production and Elasticsearch client no longer supports it directly; we will try and see if everything still works if we remove it entirely and if not, we can employ the nasty/hacky workaround

Remove unnecessary transport node specification

60c3c0e

obulat approved these changes Sep 19, 2023


obulat merged commit 95b5911 into main Sep 19, 2023

obulat deleted the update/es-py-to-match-cluster branch September 19, 2023 03:56
