[Core] Enable Scaling Down for Multi-Host TPU Replicas #43470
This is on a critical code path. We should have more testing. Let's discuss it in today's sync.
Signed-off-by: Ryan O'Leary <[email protected]>
This PR was manually tested as follows:
Prerequisites:
Testing:
Sure, I edited the comment to include more detail.
@ryanaoleary could you also rebase your branch to fix the CI error? Thanks!
@can-anyscale could you retry the failed test? It is unrelated to this PR. Thanks!
The RLlib tests still fail after the retry, but I don't think that is related to this PR, because this PR only touches KubeRay. cc @jjyao @can-anyscale
Why are these changes needed?
Adds support for the Ray autoscaler and the KubeRay NodeProvider to scale down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a TPU podslice must be scaled down together. This PR associates nodes with the replica (representing a podslice) of the TPU worker group they belong to using a replicaIndex Pod label, which is set through a GKE webhook. When a TPU node is deleted, the other nodes in that replica (tracked through a mapping) are scheduled for deletion as well.
Related PR: #45105
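The replica-tracking idea described above can be illustrated with a minimal sketch. This is not the code from this PR or from the KubeRay NodeProvider. The function names (`build_replica_map`, `expand_deletions`), the plain-dict representation of Pod labels, and the example label values are all hypothetical. Only the `replicaIndex` label name comes from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Label name taken from the PR description; everything else is illustrative.
REPLICA_INDEX_LABEL = "replicaIndex"


def build_replica_map(pod_labels: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Group node IDs by the value of their replicaIndex label.

    Nodes without the label (for example, non-TPU workers) are skipped.
    """
    replica_to_nodes: Dict[str, Set[str]] = defaultdict(set)
    for node_id, labels in pod_labels.items():
        replica = labels.get(REPLICA_INDEX_LABEL)
        if replica is not None:
            replica_to_nodes[replica].add(node_id)
    return dict(replica_to_nodes)


def expand_deletions(
    to_delete: Iterable[str], pod_labels: Dict[str, Dict[str, str]]
) -> Set[str]:
    """Return the requested nodes plus every node sharing a replica with them."""
    replica_to_nodes = build_replica_map(pod_labels)
    requested = set(to_delete)
    expanded = set(requested)
    for node_id in requested:
        replica = pod_labels.get(node_id, {}).get(REPLICA_INDEX_LABEL)
        if replica is not None:
            # A podslice is atomic: deleting one host implies deleting all of them.
            expanded |= replica_to_nodes[replica]
    return expanded


# Hypothetical cluster: two 2-host TPU replicas and one unlabeled CPU worker.
pods = {
    "tpu-0-a": {REPLICA_INDEX_LABEL: "tpu-group-0"},
    "tpu-0-b": {REPLICA_INDEX_LABEL: "tpu-group-0"},
    "tpu-1-a": {REPLICA_INDEX_LABEL: "tpu-group-1"},
    "tpu-1-b": {REPLICA_INDEX_LABEL: "tpu-group-1"},
    "cpu-0": {},
}
print(sorted(expand_deletions(["tpu-0-a"], pods)))
```

In this sketch, requesting deletion of one host in a replica pulls in its sibling host, while an unlabeled node is deleted on its own.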
Related issue number
Checks
[...] git commit -s) in this PR.
[...] scripts/format.sh to lint the changes in this PR.
[...] method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.