
[jobs] backoff cluster teardown #4562

Merged · 6 commits · Jan 16, 2025
Conversation

cg505 (Collaborator) commented Jan 15, 2025

Update the managed jobs terminate_cluster function to retry more and back off exponentially. This should help mitigate issues we see in high-concurrency environments, like botocore.exceptions.NoCredentialsError: Unable to locate credentials, which we suspect are caused by high cloud API load/throttling.
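
The retry-with-exponential-backoff pattern described above can be sketched as follows. This is a minimal illustration, not SkyPilot's actual implementation; the helper name `retry_with_backoff` and the defaults for `max_retries` and `multiplier` are assumptions (`initial_backoff=15` matches the value discussed in the review below):

```python
import time


def retry_with_backoff(func, max_retries=4, initial_backoff=15.0,
                       multiplier=1.6):
    """Call func(), retrying on failure with exponentially growing sleeps.

    Hypothetical helper illustrating the pattern; names and defaults are
    assumptions, not SkyPilot's real code.
    """
    backoff = initial_backoff
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            # Give up after the last attempt.
            if attempt == max_retries - 1:
                raise
            # Sleep, then grow the sleep for the next failure.
            time.sleep(backoff)
            backoff *= multiplier
```

A caller would wrap the teardown call, e.g. `retry_with_backoff(lambda: terminate_cluster(name))`, so transient cloud API errors are absorbed rather than failing the job controller immediately.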

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Normal sky jobs launch echo hi
    • Cancelling a job
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • /quicktest-core
    • /smoke-test managed_jobs
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 requested a review from Michaelvll January 15, 2025 03:09
cg505 (Collaborator, Author) commented Jan 15, 2025

/quicktest-core

cg505 (Collaborator, Author) commented Jan 15, 2025

/smoke-test managed_jobs

@cg505 cg505 force-pushed the managed-jobs-backoff branch from 3666879 to cd3f394 Compare January 15, 2025 03:15
cg505 (Collaborator, Author) commented Jan 15, 2025

/quicktest-core

cg505 (Collaborator, Author) commented Jan 15, 2025

/smoke-test managed_jobs

cg505 (Collaborator, Author) commented Jan 15, 2025

will fix the tests tomorrow

Michaelvll (Collaborator) left a comment

Thanks @cg505! LGTM.

Comment on lines 90 to 91
# Each attempt may take around 10s, so we back off longer than that.
initial_backoff=15,
Michaelvll (Collaborator):

Just wondering if we really need to sleep longer than a normal attempt takes; it seems fine to me to sleep much shorter to ensure better performance.

cg505 (Collaborator, Author):

The key value for throttling is the interval between requests, but only the backoff amount (the sleep time) is increased. E.g. with initial_backoff=1, assuming each attempt takes 10s:

  • Attempt 1 begins at t=0
  • backoff 1s at t=10
  • Attempt 2 begins at t=11 (interval: 11s)
  • backoff 1.6s at t=21
  • Attempt 3 begins at t=22.6 (interval: 11.6s)
  • backoff 2.56s at t=32.6
  • Attempt 4 begins at t=35.2 (interval: 12.56s)
  • backoff 4.1s at t=45.2
  • Attempt 5 begins at t=49.3 (interval: 14.1s)

... you get the idea. Because the 10s attempt duration dominates the interval, the backoff doesn't really grow as expected.
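
The timeline above can be reproduced with a small calculation. This is a hypothetical sketch; the 10s attempt duration and the 1.6x multiplier are assumptions taken from the example, not values read from the code:

```python
# Each attempt is assumed to take 10s; sleeps grow by 1.6x per retry.
ATTEMPT_DURATION = 10.0


def intervals(initial_backoff, multiplier=1.6, attempts=5):
    """Return the start-to-start interval before each retry."""
    backoff = initial_backoff
    starts = [0.0]
    for _ in range(attempts - 1):
        # Next attempt starts after the current attempt plus the sleep.
        starts.append(starts[-1] + ATTEMPT_DURATION + backoff)
        backoff *= multiplier
    return [b - a for a, b in zip(starts, starts[1:])]
```

With initial_backoff=1 this yields intervals of roughly 11s, 11.6s, 12.56s, 14.1s, matching the timeline above: the exponential growth in sleep time is swamped by the fixed attempt duration. With initial_backoff=15 the first interval is already 25s, so successive intervals actually spread apart.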

Michaelvll (Collaborator):

I thought the main issue with the parallel teardown is that multiple cloud requests are triggered at exactly the same time, and the exponential backoff with random jitter is meant to spread those requests across different timestamps. Why is a smaller initial backoff not effective in the example above?

cg505 (Collaborator, Author):

No, I think the issue is just load/throttling. We naturally get a bit of jitter because the controller processes will pick up the cancellation/success/failure at slightly different times anyway.

cg505 (Collaborator, Author) commented Jan 15, 2025

smoke tests look flaky

@cg505 cg505 requested a review from Michaelvll January 16, 2025 18:34
@cg505 cg505 enabled auto-merge (squash) January 16, 2025 23:28
@cg505 cg505 merged commit 38ba39f into skypilot-org:master Jan 16, 2025
18 checks passed