-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade ES dependencies to match cluster version #3029
Conversation
2105137
to
fac91a2
Compare
fac91a2
to
edb9d30
Compare
This is marked critical as part of the API memory leak issue. I don't know that this will fix the problem but I wouldn't be surprised, given the changes in ES, if a client version mismatch caused something like this. I'm out of time today, but if someone else is able to reproduce the memory increase on main and then check if it's fixed here, that would be great. We need to do this update regardless, and deploying it does not need to happen now because any deployment causes the memory to drop back down. My thinking is that we can just deploy this at the start of the week like we usually do, see if the memory leak is fixed, and if not, start digging into other areas (if we aren't able to reproduce it locally and identify the root cause). In any case, it needs to be reviewed "ASAP" for us to be able to deploy it early next week, so it is still "critical" in that respect. This also needs to be deployed before #3011 to avoid too many significant changes going in at once. That PR still needs infrastructure PR work anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code-change look good so to me. I'll wait for CI to complete before. I also have two small nits.
Update: Accidentally hit "Approve" out of habit 🤦.
# If pook was activated by a test and not deactivated | ||
# (usually because the test failed and something prevent | ||
# pook from cleaning up after itself), disable here so that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know there was a possibility that pook could not clean up. That's quite bad for the isolation of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only seen it in some cases, and I wonder if it has to do with async stuff, but yes, I have seen this particular factory run and have errors logged at the end of the test run because pook was still intercepting requests. IIRC it was only on tests that relied on pook.use
as a context manager, where the code inside the context raised an exception, so there may be some kind of bug in pook there.
es_url = config("ELASTICSEARCH_URL", default="localhost") | ||
es_port = config("ELASTICSEARCH_PORT", default=9200, cast=int) | ||
es_aws_region = config("ELASTICSEARCH_AWS_REGION", default="us-east-1") | ||
|
||
es_endpoint = f"{es_scheme}{es_url}:{es_port}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is no longer possible to easily construct the endpoint from the Elasticsearch client internals. We do this in tests. That was hacky anyway. It's much clearer and cleaner to just save and export it from the place it's centrally built anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but will we need this endpoint for the ingestion server tests in the future? The functions seem so similar, and the return is one of the biggest differences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? The ingestion server tests already constructs its own ES connection, it doesn't re-use the code's, so I don't know how relevant each implementation is to the other at the moment (not to mention we don't have the facility to share Python code between projects yet, I don't think).
pook_active = pook.isactive() | ||
if pook_active: | ||
# Temporarily disable pook so that the calls to ES to create | ||
# the factory document don't fail | ||
pook.disable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change (and the change later on in this function) makes it so we don't need to juggle pook inside a function to avoid it catching factory requests. Now @pook.on
can be applied to a whole test function without worry.
I included this in this PR because as part of the ES client's breaking changes, I had to make several updates to tests, mostly removing really flaky and ugly implementation-specific mocks and replacing them with pook matchers. Needing to juggle pook being on or off in all those functions was too tedious.
@@ -714,6 +716,7 @@ def test_post_process_results_recurses_as_needed( | |||
.body(re.compile('from":0')) | |||
.times(1) | |||
.reply(200) | |||
.header("x-elastic-product", "Elasticsearch") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really is necessary. The client checks for this header and throws an error about "not supporting an unknown product" 😅. It's an easy one trick, luckily!
a440206
to
a20c35a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So nice that the DSL library is finally updated!
Thank you for opening this PR and explaining the changes to the pook setup. My main blocker here is replacement of http_auth
with basic_auth
. It's deprecated, so probably still usable, but it's better to replace it now than wait until its completely removed.
es_url = config("ELASTICSEARCH_URL", default="localhost") | ||
es_port = config("ELASTICSEARCH_PORT", default=9200, cast=int) | ||
es_aws_region = config("ELASTICSEARCH_AWS_REGION", default="us-east-1") | ||
|
||
es_endpoint = f"{es_scheme}{es_url}:{es_port}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but will we need this endpoint for the ingestion server tests in the future? The functions seem so similar, and the return is one of the biggest differences.
es = Elasticsearch( | ||
endpoint, | ||
node_class=RequestsHttpNode, | ||
request_timeout=10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was timeout
changed to request_timeout
here? These seem to be different values.
Also, the http_auth
from line 475 is deprecated, and should be replaced with basic_auth
(I don't know if it has any effect when the value is None, though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeout
does not exist any more. request_timeout
configures the length of time to wait for a request to ES before timing out: https://www.elastic.co/guide/en/elasticsearch/client/python-api/master/config.html#timeouts
Here's the documentation for the old timeout
parameter, which appears to do the same thing, but by a different name: https://elasticsearch-py.readthedocs.io/en/v7.17.11/api.html?highlight=timeout#timeout
Both configure the global request timeout for all requests, as far as I can tell? Are you finding anything different that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elasticsearch does not support AWS auth directly, and trying to pass it to basic_auth
breaks the integration. I've found a nasty workaround, but it would probably be much better to find out if we can remove this bit altogether.
I'll go ahead and remove it and we can see whether staging is still able to communicate with Elasticsearch (I really can't see why it wouldn't, but I could very easily be missing something) and roll back to the workaround if staging fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in f138fec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you're right. I was reading the __init__
code of the Elasticsearch class, and saw both timeout
and request_timeout
there, but didn't see the check that shows that timeout
is deprecated below that: "The 'timeout' parameter is deprecated in favor of 'request_timeout'"
.
Both seem to configure the global request timeouts for all requests at init
time. It is also possible to set a different request_timeout
for the specific request using options
: es.options(request_timeout=5).search()
: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/config.html#timeouts
I'll go ahead and remove it and we can see whether staging is still able to communicate with Elasticsearch (I really can't see why it wouldn't, but I could very easily be missing something) and roll back to the workaround if staging fails.
So, theauth
failure would be something to watch for when deploying this change, right?
port=elasticsearch_port, | ||
connection_class=RequestsHttpConnection, | ||
es_endpoint, | ||
node_class=RequestsHttpNode, | ||
http_auth=auth, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http_auth
from line 74 is deprecated, and should be replaced with basic_auth
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, do you know if we still need this authentication stuff? Our clusters are not accessible on the public internet and we don't need any auth to make connections via our jumpbox, for example 🤔 This is old code, and I'm not sure how to determine whether it's necessary. Something to keep an eye on though in case these parameters keep changing or get more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed it in f138fec as per the other conversation about the parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, do you know if we still need this authentication stuff?
I really don't know, and your explanations sound reasonable. Let's try :)
Based on the critical urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was ready for review 1 day(s) ago. PRs labelled with critical urgency are expected to be reviewed within 1 weekday(s)2. @sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
We do not use this in production and Elasticsearch client no longer supports it directly; we will try and see if everything still works if we remove it entirely and if not, we can employ the nasty/hacky workaround
Fixes
Related to #3028 by @sarayourfriend
Fixes #2744 by @obulat
Description
Adds an explicit dependency on the Elasticsearch client library,
elasticsearch-py
that is fixed to our cluster version (8.8.2). This also updates the DSL library to the latest version 8.9, which is the first version to support ES 8. We hardly use the DSL in any significant way here anyway, so I don't think the version mismatch will be an issue.I've done this in the API and the ingestion server. I've also updated the code to work with the new version. I read through the breaking changes and didn't see anything that would affect us, but I need to look again more closely for the ingestion server, with whose ES usage I'm less familiar.
https://www.elastic.co/guide/en/elasticsearch/client/python-api/master/migration.html
Testing Instructions
CI should pass. Build should pass and unit and integration tests. To test locally, you must build
web
andingestion_server
(just build web && just build ingestion_server
) to get the new dependencies. Then runjust api/up
andjust api/init
and make searches and test the Django app generally.just api/init
exercises the ingestion server in addition to what the explicit integration tests test.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin