[jobs] backoff cluster teardown #4562
Conversation
/quicktest-core
/smoke-test managed_jobs
3666879 to cd3f394
/quicktest-core
/smoke-test managed_jobs
will fix the tests tomorrow
Thanks @cg505! LGTM.
sky/jobs/utils.py (Outdated)
# Each attempt may take around 10s, so we back off longer than that.
initial_backoff=15,
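For reference, here is a minimal sketch of the kind of helper a parameter like initial_backoff typically configures; the class name, defaults, and jitter range are illustrative, not necessarily SkyPilot's actual implementation:

```python
import random
import time


class Backoff:
    """Illustrative exponential backoff with jitter (sketch only)."""

    def __init__(self, initial_backoff: float = 15.0, multiplier: float = 1.6,
                 max_backoff: float = 60.0):
        self._next = initial_backoff
        self._multiplier = multiplier
        self._max = max_backoff

    def sleep(self) -> None:
        # +/-20% jitter so parallel callers do not retry in lockstep.
        time.sleep(min(self._next * random.uniform(0.8, 1.2), self._max))
        self._next = min(self._next * self._multiplier, self._max)
```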
Just wondering if we really need to sleep longer than a normal attempt; it seems fine to me to sleep much shorter to ensure better performance.
The key value for throttling is the interval between requests, but only the backoff amount (sleep time) is increased. E.g. with initial_backoff=1, assuming each attempt takes 10s:
- Attempt 1 begins at t=0
- backoff 1s at t=10
- Attempt 2 begins at t=11 (interval: 11s)
- backoff 1.6s at t=21
- Attempt 3 begins at t=22.6 (interval: 11.6s)
- backoff 2.56s at t=32.6
- Attempt 4 begins at t=35.2 (interval: 12.56s)
- backoff 4.1s at t=45.2
- Attempt 5 begins at t=49.3 (interval: 14.1s)
... you get the idea. The backoff doesn't really work as expected.
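A quick simulation of this timeline; the fixed 10s attempt duration and 1.6x multiplier are assumptions for illustration:

```python
def simulate(initial_backoff: float, attempt_time: float = 10.0,
             multiplier: float = 1.6, attempts: int = 5) -> None:
    """Print when each attempt starts and the interval between attempt starts."""
    t = 0.0
    backoff = initial_backoff
    prev_start = 0.0
    for i in range(1, attempts + 1):
        interval = t - prev_start
        suffix = f' (interval: {interval:.2f}s)' if i > 1 else ''
        print(f'Attempt {i} begins at t={t:.1f}{suffix}')
        prev_start = t
        # The next attempt starts after this attempt finishes plus the sleep.
        t += attempt_time + backoff
        backoff *= multiplier


# With initial_backoff=1 the interval only creeps up (~11s, ~11.6s, ~12.6s, ...):
# the attempt time dominates and the exponential part barely matters.
simulate(initial_backoff=1)
```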
I thought the main issue with the parallel teardown is that multiple cloud requests are triggered at the exact same time, and the exponential backoff with random jitter is there to spread those requests across different timestamps? Why is a smaller initial backoff not effective in the example above?
No, I think the issue is just load/throttling. We naturally get a bit of jitter because the controller processes will pick up the cancellation/success/failure at slightly different times anyway.
smoke tests look flaky
Update the managed jobs terminate_cluster function to retry more, and back off exponentially. This should help mitigate issues we see in high-concurrency environments like botocore.exceptions.NoCredentialsError: Unable to locate credentials, which we suspect are due to high cloud API load/throttling. (A rough sketch of the retry pattern is included after the test checklist below.)

Tested (run the relevant ones):
bash format.sh
sky jobs launch echo hi
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
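A rough sketch of the retry pattern described above, with names and defaults chosen for illustration; the actual change lives in sky/jobs/utils.py and may differ:

```python
import time
from typing import Callable


def run_with_backoff(fn: Callable[[], None],
                     max_retries: int = 10,
                     initial_backoff: float = 15.0,
                     multiplier: float = 1.6) -> None:
    """Run fn, retrying with exponential backoff on failures (sketch only)."""
    backoff = initial_backoff
    for attempt in range(max_retries):
        try:
            fn()
            return
        except Exception:  # in practice, catch only the transient cloud errors
            if attempt == max_retries - 1:
                raise
            # Each teardown attempt may itself take ~10s, so the backoff starts
            # above that to keep the inter-request interval meaningfully lower.
            time.sleep(backoff)
            backoff *= multiplier


# Usage (hypothetical call site):
# run_with_backoff(lambda: terminate_cluster(cluster_name))
```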