Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ray TPU Webhook Autoscaling Support #740

Merged
merged 6 commits into from
Jul 19, 2024

Conversation

ryanaoleary
Copy link
Collaborator

This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.

This PR has been tested as follows:

@ryanaoleary ryanaoleary requested a review from spencer-p July 15, 2024 23:14
@ryanaoleary ryanaoleary self-assigned this Jul 15, 2024
@ryanaoleary ryanaoleary requested a review from andrewsykim July 15, 2024 23:24
@andrewsykim
Copy link
Collaborator

This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.

Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added or does it only apply for new pods added by the autoscaler?

@ryanaoleary
Copy link
Collaborator Author

This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.

Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added or does it only apply for new pods added by the autoscaler?

We don't need to update existing Pods. We assume that TPU PodSlices are scaled atomically so the TPU_WORKER_HOSTNAMES we add to each Pod include the DNS hostnames for all the Pods in the slice, even if they haven't been created yet.

@ryanaoleary ryanaoleary requested a review from andrewsykim July 19, 2024 18:06
@andrewsykim
Copy link
Collaborator

andrewsykim commented Jul 19, 2024

I think this is good to merge, but let's ensure #723 is also merged before cutting a new tag with the autoscaling changes

@ryanaoleary ryanaoleary merged commit dc3a615 into GoogleCloudPlatform:main Jul 19, 2024
5 checks passed
ryanaoleary added a commit to ryanaoleary/ai-on-gke that referenced this pull request Jul 19, 2024
* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Should not fatal log in deletePod

Signed-off-by: Ryan O'Leary <[email protected]>

* deletePod admission always succeeds

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unused tests make command

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Just return an error instead of logging

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
ryanaoleary added a commit that referenced this pull request Jul 25, 2024
* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Make webhook stateless in between mutate calls

Signed-off-by: Ryan O'Leary <[email protected]>

* Formatting changes

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix bug causing incorrect IDs

* Add cluster role and log formatting changes

Signed-off-by: Ryan O'Leary <[email protected]>

* Filter pods by Ray worker group label

Signed-off-by: Ryan O'Leary <[email protected]>

* Vulnerability fixes

Signed-off-by: Ryan O'Leary <[email protected]>

* Better names and add ServiceAccount

Signed-off-by: Ryan O'Leary <[email protected]>

* Change version back to v1.1

Signed-off-by: Ryan O'Leary <[email protected]>

* Change implementation to use PodInformer

Signed-off-by: Ryan O'Leary <[email protected]>

* Use PodLister

Signed-off-by: Ryan O'Leary <[email protected]>

* updateSliceToWorkerIDs returns error

Signed-off-by: Ryan O'Leary <[email protected]>

* Use mutex lock in updateSliceToWorkerIDs

Signed-off-by: Ryan O'Leary <[email protected]>

* Update unit tests and fix comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove global client var

Signed-off-by: Ryan O'Leary <[email protected]>

* Just return err instead of logging

Signed-off-by: Ryan O'Leary <[email protected]>

* TODO comment

Signed-off-by: Ryan O'Leary <[email protected]>

* Lock when reading from shared sliceToWorkerIDs mapping

Signed-off-by: Ryan O'Leary <[email protected]>

* Switch to using RWMutex

Signed-off-by: Ryan O'Leary <[email protected]>

* Ray TPU Webhook Autoscaling Support (#740)

* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Should not fatal log in deletePod

Signed-off-by: Ryan O'Leary <[email protected]>

* deletePod admission always succeeds

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unused tests make command

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Just return an error instead of logging

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>

* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Close stop channel on webhook termination

Signed-off-by: Ryan O'Leary <[email protected]>

* Refactor webhook to avoid using global vars

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Change service account name

Signed-off-by: Ryan O'Leary <[email protected]>

* Return BadRequest if invalid kind

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Change error messages

Signed-off-by: Ryan O'Leary <[email protected]>

* Fatal log in main

Signed-off-by: Ryan O'Leary <[email protected]>

* Update function comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Refactor to minimize indentations

Signed-off-by: Ryan O'Leary <[email protected]>

* Change sliceToWorkerIDs nil check to use len

Signed-off-by: Ryan O'Leary <[email protected]>

* Write http.Error to header

Signed-off-by: Ryan O'Leary <[email protected]>

* Don't fatal log in validateRayCluster

Signed-off-by: Ryan O'Leary <[email protected]>

* Check for nil admission request

Signed-off-by: Ryan O'Leary <[email protected]>

* Add doc comment

Signed-off-by: Ryan O'Leary <[email protected]>

* Update expected errors

Signed-off-by: Ryan O'Leary <[email protected]>

* Better getNextWorkerID logic

Signed-off-by: Ryan O'Leary <[email protected]>

* Update replicaIndex and nextWorkerID tests

Signed-off-by: Ryan O'Leary <[email protected]>

* Refactor webhook unit tests

Signed-off-by: Ryan O'Leary <[email protected]>

* Create numOfHosts pods for Pod List

Signed-off-by: Ryan O'Leary <[email protected]>

* Log admission request object name

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix nits and go vet output

Signed-off-by: Ryan O'Leary <[email protected]>

* Initial cloudbuil commit

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix vet command

Signed-off-by: Ryan O'Leary <[email protected]>

* Update cloudbuild

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix cloudbuild errors

Signed-off-by: Ryan O'Leary <[email protected]>

* Add dir

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove arg

Signed-off-by: Ryan O'Leary <[email protected]>

* Change to bash command

Signed-off-by: Ryan O'Leary <[email protected]>

* increase timeout time

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix validateRayCluster test

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix nits for cloudbuild

Signed-off-by: Ryan O'Leary <[email protected]>

* Break early in validateRayCluster

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unnecessary args from validateRayCluster test

Signed-off-by: Ryan O'Leary <[email protected]>

* Change break to continue

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unused vars from webhook tests and add edge cases

Signed-off-by: Ryan O'Leary <[email protected]>

* Update helm chart

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
leroyjb pushed a commit to leroyjb/ai-on-gke that referenced this pull request Jan 24, 2025
* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Should not fatal log in deletePod

Signed-off-by: Ryan O'Leary <[email protected]>

* deletePod admission always succeeds

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unused tests make command

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Just return an error instead of logging

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
leroyjb pushed a commit to leroyjb/ai-on-gke that referenced this pull request Jan 24, 2025
* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Make webhook stateless in between mutate calls

Signed-off-by: Ryan O'Leary <[email protected]>

* Formatting changes

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix bug causing incorrect IDs

* Add cluster role and log formatting changes

Signed-off-by: Ryan O'Leary <[email protected]>

* Filter pods by Ray worker group label

Signed-off-by: Ryan O'Leary <[email protected]>

* Vulnerability fixes

Signed-off-by: Ryan O'Leary <[email protected]>

* Better names and add ServiceAccount

Signed-off-by: Ryan O'Leary <[email protected]>

* Change version back to v1.1

Signed-off-by: Ryan O'Leary <[email protected]>

* Change implementation to use PodInformer

Signed-off-by: Ryan O'Leary <[email protected]>

* Use PodLister

Signed-off-by: Ryan O'Leary <[email protected]>

* updateSliceToWorkerIDs returns error

Signed-off-by: Ryan O'Leary <[email protected]>

* Use mutex lock in updateSliceToWorkerIDs

Signed-off-by: Ryan O'Leary <[email protected]>

* Update unit tests and fix comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove global client var

Signed-off-by: Ryan O'Leary <[email protected]>

* Just return err instead of logging

Signed-off-by: Ryan O'Leary <[email protected]>

* TODO comment

Signed-off-by: Ryan O'Leary <[email protected]>

* Lock when reading from shared sliceToWorkerIDs mapping

Signed-off-by: Ryan O'Leary <[email protected]>

* Switch to using RWMutex

Signed-off-by: Ryan O'Leary <[email protected]>

* Ray TPU Webhook Autoscaling Support (GoogleCloudPlatform#740)

* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Should not fatal log in deletePod

Signed-off-by: Ryan O'Leary <[email protected]>

* deletePod admission always succeeds

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unused tests make command

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Just return an error instead of logging

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>

* Generate hostnames at Pod creation

Signed-off-by: Ryan O'Leary <[email protected]>

* Update tests and add error checking

Signed-off-by: Ryan O'Leary <[email protected]>

* Close stop channel on webhook termination

Signed-off-by: Ryan O'Leary <[email protected]>

* Refactor webhook to avoid using global vars

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Change service account name

Signed-off-by: Ryan O'Leary <[email protected]>

* Return BadRequest if invalid kind

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Change error messages

Signed-off-by: Ryan O'Leary <[email protected]>

* Fatal log in main

Signed-off-by: Ryan O'Leary <[email protected]>

* Update function comments

Signed-off-by: Ryan O'Leary <[email protected]>

* Refactor to minimize indentations

Signed-off-by: Ryan O'Leary <[email protected]>

* Change sliceToWorkerIDs nil check to use len

Signed-off-by: Ryan O'Leary <[email protected]>

* Write http.Error to header

Signed-off-by: Ryan O'Leary <[email protected]>

* Don't fatal log in validateRayCluster

Signed-off-by: Ryan O'Leary <[email protected]>

* Check for nil admission request

Signed-off-by: Ryan O'Leary <[email protected]>

* Add doc comment

Signed-off-by: Ryan O'Leary <[email protected]>

* Update expected errors

Signed-off-by: Ryan O'Leary <[email protected]>

* Better getNextWorkerID logic

Signed-off-by: Ryan O'Leary <[email protected]>

* Update replicaIndex and nextWorkerID tests

Signed-off-by: Ryan O'Leary <[email protected]>

* Refactor webhook unit tests

Signed-off-by: Ryan O'Leary <[email protected]>

* Create numOfHosts pods for Pod List

Signed-off-by: Ryan O'Leary <[email protected]>

* Log admission request object name

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix nits and go vet output

Signed-off-by: Ryan O'Leary <[email protected]>

* Initial cloudbuil commit

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix vet command

Signed-off-by: Ryan O'Leary <[email protected]>

* Update cloudbuild

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix cloudbuild errors

Signed-off-by: Ryan O'Leary <[email protected]>

* Add dir

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove arg

Signed-off-by: Ryan O'Leary <[email protected]>

* Change to bash command

Signed-off-by: Ryan O'Leary <[email protected]>

* increase timeout time

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix validateRayCluster test

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix nits for cloudbuild

Signed-off-by: Ryan O'Leary <[email protected]>

* Break early in validateRayCluster

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unnecessary args from validateRayCluster test

Signed-off-by: Ryan O'Leary <[email protected]>

* Change break to continue

Signed-off-by: Ryan O'Leary <[email protected]>

* Remove unused vars from webhook tests and add edge cases

Signed-off-by: Ryan O'Leary <[email protected]>

* Update helm chart

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants