Retry on transient error codes in preload #5982
Conversation
distributed/preloading.py
Outdated
```python
response = urllib.request.urlopen(request)
source = response.read().decode()
duration = 0.2
for i in range(10):
```
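For context, the retry loop under review can be sketched end-to-end roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `fetch_with_retries` and the exact set of status codes are assumptions for the example.

```python
import time
import urllib.error
import urllib.request

# Status codes commonly treated as transient; the exact set is an assumption here.
TRANSIENT_CODES = {429, 502, 503, 504}

def fetch_with_retries(url, attempts=10, base_delay=0.2):
    """Fetch `url`, retrying transient HTTP errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read().decode()
        except urllib.error.HTTPError as exc:
            # Give up immediately on non-transient codes, or on the last attempt.
            if exc.code not in TRANSIENT_CODES or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt)  # 0.2, 0.4, 0.8, ...
```

With these defaults the total sleep time is about 200 seconds, which is the duration being debated below.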
This would probably work but i'd really like to adjust how persistent it would be. For our application we'd probably want about 2-3 minutes of retrying if needed
Perhaps this could be made configurable?
I'm happy to do whatever. I'm curious though, under what conditions would a web application not respond under 3 minutes? That seems like a long time to me. If that application is Coiled, then I'd love to see higher standards applied there if possible rather than making Dask more lenient than is typical.
I don't have hard data, but I suspect there's a long tail of events not-entirely-under-our-control that would justify retrying.
E.g. AWS networking has a hiccup for a little while, etc.
I'm not sure if that really justifies long retries, though.
We might want to reuse existing logic (at the cost of a new dependency): https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.Retry
urllib3 is a very common library (e.g. requests, boto, jupyterlab, and conda depend on it), so chances are high it's installed regardless. We'd need to promote it to a proper dependency, of course.
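A sketch of how the suggested `urllib3.util.Retry` might be wired in. The variable names and the fetched URL are illustrative, not the code this PR merged:

```python
import urllib3

# Retry the listed status codes with exponential backoff before giving up.
retry = urllib3.util.Retry(
    total=10,
    status_forcelist=[429, 502, 503, 504],
    backoff_factor=0.2,
)
http = urllib3.PoolManager(retries=retry)

# Actual fetch (requires network), shown for illustration:
# source = http.request("GET", preload_url).data.decode()
```

One appeal of this approach is that the retry policy lives in a single, well-tested object rather than a hand-rolled loop.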
Cool. Should I still take this on, or is someone else able to get this out quickly? I'm at a conference today, but I can squeeze this in. I'm not particularly better at this than the rest of the team, though, so I'm inclined to offload to someone with more experience if they're available. If it would take another person 24 hours to spin up, though, then I'll just handle it.
I'm totally happy to use urllib3 if it's useful. My sense is that the Retry object that @fjetter points to would give us ...
It wouldn't give us some of the other exceptions that @shughes-uk mentioned. That's fine; we can keep the custom logic for those. I'm fine going either way, and folks seem excited about urllib3. For the small scope that we need, I don't observe a large difference, but possibly there's more in urllib3 that would help us handle the current situation. If I were to go ahead solo right now, I would probably continue down the current path in this PR, extending it and making it configurable. I'm not an expert here, though, and am happy to go another direction.
I might've missed something, but what custom logic would be required? The object below should cover everything.

```python
urllib3.util.Retry(
    status_forcelist=[429, 504, 503, 502],
    backoff_factor=0.2,
)
```

My sense is also that this should be configurable. Probably only a few people actually care, but I've been in enough arguments about which error codes to retry that I know some will. For instance, should 500 be included in the retry? Similar argument about the duration. With these defaults and a backoff of 0.2, this may retry for up to ~200s, which is relatively long:

```python
In [1]: def backoff_factor(attempt, base=0.1):
   ...:     return base * 2**attempt

In [2]: [backoff_factor(attempt, 0.2) for attempt in range(10)]
Out[2]: [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4]

In [3]: sum([backoff_factor(attempt, 0.2) for attempt in range(10)])
Out[3]: 204.60000000000002

In [4]: sum([backoff_factor(attempt, 0.2) for attempt in range(10)]) / 60
Out[4]: 3.4100000000000006
```

I'm also OK with the current version of the code, although I don't have any concerns about adding urllib3 as a dependency.
I threw up a version with urllib3 so that we could see it. It isn't yet configurable. Tests seem to be unhappy about a lingering thread; that might be unrelated, though. If this works well then I'll go ahead with it and add configuration. If tests are unhappy for some reason then I'll revert to using urllib and custom logic.
Configuration here is an interesting question. We can either call out specific options in our config, or we can say "we'll pass through any options to …". @shughes-uk, is there someone who should try this out on your end to make sure that things are happy?
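To make the "call out specific options" alternative concrete, a config fragment for it might look like the following. Every key name here is hypothetical; this is not the configuration schema that was merged:

```yaml
# Hypothetical keys, for illustration only.
distributed:
  preload:
    retry:
      count: 10
      backoff-factor: 0.2
      status-codes: [429, 502, 503, 504]
```

The pass-through alternative would instead forward an arbitrary mapping of keyword arguments to the underlying retry object, trading schema validation for flexibility.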
A Coiled-internal issue for details of testing on our end: https://gitlab.com/coiled/cloud/-/issues/4628
I'm happy with this PR/solution. I've used almost exactly the same code for a different "phone home" and validated it under load testing (which was failing before on account of transient errors).
OK, I'm planning to merge this as-is if there are no further objections today. If folks end up wanting configuration then this will be easy to add in the future.
Closes #5981
cc @jacobtomlinson if you have a moment
also cc @shughes-uk