
[jobs] backoff cluster teardown #4562

Merged · 6 commits · Jan 16, 2025
Conversation

cg505 (Collaborator) commented Jan 15, 2025

Update the managed jobs terminate_cluster function to retry more and back off exponentially. This should help mitigate issues we see in high-concurrency environments, like botocore.exceptions.NoCredentialsError: Unable to locate credentials, which we suspect are caused by high cloud API load/throttling.
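
The retry-with-exponential-backoff pattern described above can be sketched as follows. This is a minimal illustration, not SkyPilot's actual implementation; the helper name `retry_with_backoff` and the defaults for `max_retries` and `multiplier` are assumptions (`initial_backoff=15` matches the value discussed in the review below):

```python
import time


def retry_with_backoff(func, max_retries=4, initial_backoff=15.0,
                       multiplier=1.6):
    """Call func(), retrying on failure with exponentially growing sleeps.

    Hypothetical helper illustrating the pattern; names and defaults are
    assumptions, not SkyPilot's real code.
    """
    backoff = initial_backoff
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            # Give up after the last attempt.
            if attempt == max_retries - 1:
                raise
            # Sleep, then grow the sleep for the next failure.
            time.sleep(backoff)
            backoff *= multiplier
```

A caller would wrap the teardown call, e.g. `retry_with_backoff(lambda: terminate_cluster(name))`, so transient cloud API errors are absorbed rather than failing the job controller immediately.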

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Normal sky jobs launch echo hi
    • Cancelling a job
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • /quicktest-core
    • /smoke-test managed_jobs
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 requested a review from Michaelvll January 15, 2025 03:09
cg505 (Collaborator, Author) commented Jan 15, 2025

/quicktest-core

cg505 (Collaborator, Author) commented Jan 15, 2025

/smoke-test managed_jobs

@cg505 cg505 force-pushed the managed-jobs-backoff branch from 3666879 to cd3f394 Compare January 15, 2025 03:15
cg505 (Collaborator, Author) commented Jan 15, 2025

/quicktest-core

cg505 (Collaborator, Author) commented Jan 15, 2025

/smoke-test managed_jobs

cg505 (Collaborator, Author) commented Jan 15, 2025

will fix the tests tomorrow

Michaelvll (Collaborator) left a comment

Thanks @cg505! LGTM.

Comment on lines 90 to 91
# Each attempt may take around 10s, so we back off longer than that.
initial_backoff=15,
Michaelvll (Collaborator):

Just wondering if we really need to sleep longer than a normal attempt takes; it seems fine to me to sleep much shorter to ensure better performance.

cg505 (Collaborator, Author):

The key value for throttling is the interval between requests, but only the backoff amount (the sleep time) is increased. E.g. with initial_backoff=1, assuming each attempt takes 10s:

  • Attempt 1 begins at t=0
  • backoff 1s at t=10
  • Attempt 2 begins at t=11 (interval: 11s)
  • backoff 1.6s at t=21
  • Attempt 3 begins at t=22.6 (interval: 11.6s)
  • backoff 2.56s at t=32.6
  • Attempt 4 begins at t=35.2 (interval: 12.56s)
  • backoff 4.1s at t=45.2
  • Attempt 5 begins at t=49.3 (interval: 14.1s)

... you get the idea. Because the 10s attempt duration dominates the interval, the backoff doesn't really grow as expected.
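
The timeline above can be reproduced with a small calculation. This is a hypothetical sketch; the 10s attempt duration and the 1.6x multiplier are assumptions taken from the example, not values read from the code:

```python
# Each attempt is assumed to take 10s; sleeps grow by 1.6x per retry.
ATTEMPT_DURATION = 10.0


def intervals(initial_backoff, multiplier=1.6, attempts=5):
    """Return the start-to-start interval before each retry."""
    backoff = initial_backoff
    starts = [0.0]
    for _ in range(attempts - 1):
        # Next attempt starts after the current attempt plus the sleep.
        starts.append(starts[-1] + ATTEMPT_DURATION + backoff)
        backoff *= multiplier
    return [b - a for a, b in zip(starts, starts[1:])]
```

With initial_backoff=1 this yields intervals of roughly 11s, 11.6s, 12.56s, 14.1s, matching the timeline above: the exponential growth in sleep time is swamped by the fixed attempt duration. With initial_backoff=15 the first interval is already 25s, so successive intervals actually spread apart.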

Michaelvll (Collaborator):

I thought the main issue with the parallel teardown is that multiple cloud requests are triggered at exactly the same time, and the exponential backoff with random jitter is meant to spread those requests across different timestamps. Why is a smaller initial backoff not effective in the example above?

cg505 (Collaborator, Author):

No, I think the issue is just load/throttling. We naturally get a bit of jitter because the controller processes will pick up the cancellation/success/failure at slightly different times anyway.

cg505 (Collaborator, Author) commented Jan 15, 2025

smoke tests look flaky

@cg505 cg505 requested a review from Michaelvll January 16, 2025 18:34
@cg505 cg505 enabled auto-merge (squash) January 16, 2025 23:28
@cg505 cg505 merged commit 38ba39f into skypilot-org:master Jan 16, 2025
18 checks passed