Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade ES dependencies to match cluster version #3029

Merged
merged 8 commits into from
Sep 19, 2023

Conversation

sarayourfriend
Copy link
Collaborator

@sarayourfriend sarayourfriend commented Sep 15, 2023

Fixes

Related to #3028 by @sarayourfriend
Fixes #2744 by @obulat

Description

Adds an explicit dependency on the Elasticsearch client library, elasticsearch-py that is fixed to our cluster version (8.8.2). This also updates the DSL library to the latest version 8.9, which is the first version to support ES 8. We hardly use the DSL in any significant way here anyway, so I don't think the version mismatch will be an issue.

I've done this in the API and the ingestion server. I've also updated the code to work with the new version. I read through the breaking changes and didn't see anything that would affect us, but I need to look again more closely for the ingestion server, with whose ES usage I'm less familiar.

https://www.elastic.co/guide/en/elasticsearch/client/python-api/master/migration.html

Testing Instructions

CI should pass. Build should pass and unit and integration tests. To test locally, you must build web and ingestion_server (just build web && just build ingestion_server) to get the new dependencies. Then run just api/up and just api/init and make searches and test the Django app generally. just api/init exercises the ingestion server in addition to what the explicit integration tests test.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Sep 15, 2023
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Sep 15, 2023
@sarayourfriend sarayourfriend force-pushed the update/es-py-to-match-cluster branch from 2105137 to fac91a2 Compare September 15, 2023 01:10
@github-actions github-actions bot added the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Sep 15, 2023
@sarayourfriend sarayourfriend force-pushed the update/es-py-to-match-cluster branch from fac91a2 to edb9d30 Compare September 15, 2023 06:34
@sarayourfriend sarayourfriend marked this pull request as ready for review September 15, 2023 06:34
@sarayourfriend sarayourfriend requested a review from a team as a code owner September 15, 2023 06:34
@sarayourfriend sarayourfriend added 🟥 priority: critical Must be addressed ASAP 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository dependencies Pull requests that update a dependency file and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Sep 15, 2023
@sarayourfriend
Copy link
Collaborator Author

This is marked critical as part of the API memory leak issue. I don't know that this will fix the problem but I wouldn't be surprised, given the changes in ES, if a client version mismatch caused something like this.

I'm out of time today, but if someone else is able to reproduce the memory increase on main and then check if it's fixed here, that would be great. We need to do this update regardless, and deploying it does not need to happen now because any deployment causes the memory to drop back down. My thinking is that we can just deploy this at the start of the week like we usually do, see if the memory leak is fixed, and if not, start digging into other areas (if we aren't able to reproduce it locally and identify the root cause).

In any case, it needs to be reviewed "ASAP" for us to be able to deploy it early next week, so it is still "critical" in that respect. This also needs to be deployed before #3011 to avoid too many significant changes going in at once. That PR still needs infrastructure PR work anyway.

Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-change look good so to me. I'll wait for CI to complete before. I also have two small nits.

Update: Accidentally hit "Approve" out of habit 🤦.

api/api/models/media.py Outdated Show resolved Hide resolved
Comment on lines +162 to +164
# If pook was activated by a test and not deactivated
# (usually because the test failed and something prevent
# pook from cleaning up after itself), disable here so that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know there was a possibility that pook could not clean up. That's quite bad for the isolation of the tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only seen it in some cases, and I wonder if it has to do with async stuff, but yes, I have seen this particular factory run and have errors logged at the end of the test run because pook was still intercepting requests. IIRC it was only on tests that relied on pook.use as a context manager, where the code inside the context raised an exception, so there may be some kind of bug in pook there.

ingestion_server/ingestion_server/es_helpers.py Outdated Show resolved Hide resolved
api/api/models/media.py Outdated Show resolved Hide resolved
es_url = config("ELASTICSEARCH_URL", default="localhost")
es_port = config("ELASTICSEARCH_PORT", default=9200, cast=int)
es_aws_region = config("ELASTICSEARCH_AWS_REGION", default="us-east-1")

es_endpoint = f"{es_scheme}{es_url}:{es_port}"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is no longer possible to easily construct the endpoint from the Elasticsearch client internals. We do this in tests. That was hacky anyway. It's much clearer and cleaner to just save and export it from the place it's centrally built anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but will we need this endpoint for the ingestion server tests in the future? The functions seem so similar, and the return is one of the biggest differences.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? The ingestion server tests already constructs its own ES connection, it doesn't re-use the code's, so I don't know how relevant each implementation is to the other at the moment (not to mention we don't have the facility to share Python code between projects yet, I don't think).

Comment on lines +99 to +103
pook_active = pook.isactive()
if pook_active:
# Temporarily disable pook so that the calls to ES to create
# the factory document don't fail
pook.disable()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change (and the change later on in this function) makes it so we don't need to juggle pook inside a function to avoid it catching factory requests. Now @pook.on can be applied to a whole test function without worry.

I included this in this PR because as part of the ES client's breaking changes, I had to make several updates to tests, mostly removing really flaky and ugly implementation-specific mocks and replacing them with pook matchers. Needing to juggle pook being on or off in all those functions was too tedious.

@@ -714,6 +716,7 @@ def test_post_process_results_recurses_as_needed(
.body(re.compile('from":0'))
.times(1)
.reply(200)
.header("x-elastic-product", "Elasticsearch")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really is necessary. The client checks for this header and throws an error about "not supporting an unknown product" 😅. It's an easy one trick, luckily!

@sarayourfriend sarayourfriend force-pushed the update/es-py-to-match-cluster branch from a440206 to a20c35a Compare September 18, 2023 03:05
This was referenced Sep 18, 2023
Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nice that the DSL library is finally updated!
Thank you for opening this PR and explaining the changes to the pook setup. My main blocker here is replacement of http_auth with basic_auth. It's deprecated, so probably still usable, but it's better to replace it now than wait until its completely removed.

es_url = config("ELASTICSEARCH_URL", default="localhost")
es_port = config("ELASTICSEARCH_PORT", default=9200, cast=int)
es_aws_region = config("ELASTICSEARCH_AWS_REGION", default="us-east-1")

es_endpoint = f"{es_scheme}{es_url}:{es_port}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but will we need this endpoint for the ingestion server tests in the future? The functions seem so similar, and the return is one of the biggest differences.

es = Elasticsearch(
endpoint,
node_class=RequestsHttpNode,
request_timeout=10,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was timeout changed to request_timeout here? These seem to be different values.

Also, the http_auth from line 475 is deprecated, and should be replaced with basic_auth (I don't know if it has any effect when the value is None, though)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout does not exist any more. request_timeout configures the length of time to wait for a request to ES before timing out: https://www.elastic.co/guide/en/elasticsearch/client/python-api/master/config.html#timeouts

Here's the documentation for the old timeout parameter, which appears to do the same thing, but by a different name: https://elasticsearch-py.readthedocs.io/en/v7.17.11/api.html?highlight=timeout#timeout

Both configure the global request timeout for all requests, as far as I can tell? Are you finding anything different that I'm missing?

Copy link
Collaborator Author

@sarayourfriend sarayourfriend Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elasticsearch does not support AWS auth directly, and trying to pass it to basic_auth breaks the integration. I've found a nasty workaround, but it would probably be much better to find out if we can remove this bit altogether.

I'll go ahead and remove it and we can see whether staging is still able to communicate with Elasticsearch (I really can't see why it wouldn't, but I could very easily be missing something) and roll back to the workaround if staging fails.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in f138fec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you're right. I was reading the __init__ code of the Elasticsearch class, and saw both timeout and request_timeout there, but didn't see the check that shows that timeout is deprecated below that: "The 'timeout' parameter is deprecated in favor of 'request_timeout'".
Both seem to configure the global request timeouts for all requests at init time. It is also possible to set a different request_timeout for the specific request using options: es.options(request_timeout=5).search(): https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/config.html#timeouts

I'll go ahead and remove it and we can see whether staging is still able to communicate with Elasticsearch (I really can't see why it wouldn't, but I could very easily be missing something) and roll back to the workaround if staging fails.
So, the auth failure would be something to watch for when deploying this change, right?

port=elasticsearch_port,
connection_class=RequestsHttpConnection,
es_endpoint,
node_class=RequestsHttpNode,
http_auth=auth,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http_auth from line 74 is deprecated, and should be replaced with basic_auth.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do you know if we still need this authentication stuff? Our clusters are not accessible on the public internet and we don't need any auth to make connections via our jumpbox, for example 🤔 This is old code, and I'm not sure how to determine whether it's necessary. Something to keep an eye on though in case these parameters keep changing or get more complex.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed it in f138fec as per the other conversation about the parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do you know if we still need this authentication stuff?

I really don't know, and your explanations sound reasonable. Let's try :)

@openverse-bot
Copy link
Collaborator

Based on the critical urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@obulat
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 1 day(s) ago. PRs labelled with critical urgency are expected to be reviewed within 1 weekday(s)2.

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

We do not use this in production and Elasticsearch client no longer supports it directly; we will try and see if everything still works if we remove it entirely and if not, we can employ the nasty/hacky workaround
@obulat obulat merged commit 95b5911 into main Sep 19, 2023
@obulat obulat deleted the update/es-py-to-match-cluster branch September 19, 2023 03:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository dependencies Pull requests that update a dependency file 🛠 goal: fix Bug fix 🟥 priority: critical Must be addressed ASAP 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants