-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ray TPU Webhook Autoscaling Support #740
Ray TPU Webhook Autoscaling Support #740
Conversation
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Will existing pods need to have the |
We don't need to update existing Pods. We assume that TPU PodSlices are scaled atomically so the |
Signed-off-by: Ryan O'Leary <[email protected]>
I think this is good to merge, but let's ensure #723 is also merged before cutting a new tag with the autoscaling changes |
* Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Should not fatal log in deletePod Signed-off-by: Ryan O'Leary <[email protected]> * deletePod admission always succeeds Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused tests make command Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Just return an error instead of logging Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]>
* Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Make webhook stateless in between mutate calls Signed-off-by: Ryan O'Leary <[email protected]> * Formatting changes Signed-off-by: Ryan O'Leary <[email protected]> * Fix bug causing incorrect IDs * Add cluster role and log formatting changes Signed-off-by: Ryan O'Leary <[email protected]> * Filter pods by Ray worker group label Signed-off-by: Ryan O'Leary <[email protected]> * Vulnerability fixes Signed-off-by: Ryan O'Leary <[email protected]> * Better names and add ServiceAccount Signed-off-by: Ryan O'Leary <[email protected]> * Change version back to v1.1 Signed-off-by: Ryan O'Leary <[email protected]> * Change implementation to use PodInformer Signed-off-by: Ryan O'Leary <[email protected]> * Use PodLister Signed-off-by: Ryan O'Leary <[email protected]> * updateSliceToWorkerIDs returns error Signed-off-by: Ryan O'Leary <[email protected]> * Use mutex lock in updateSliceToWorkerIDs Signed-off-by: Ryan O'Leary <[email protected]> * Update unit tests and fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Remove global client var Signed-off-by: Ryan O'Leary <[email protected]> * Just return err instead of logging Signed-off-by: Ryan O'Leary <[email protected]> * TODO comment Signed-off-by: Ryan O'Leary <[email protected]> * Lock when reading from shared sliceToWorkerIDs mapping Signed-off-by: Ryan O'Leary <[email protected]> * Switch to using RWMutex Signed-off-by: Ryan O'Leary <[email protected]> * Ray TPU Webhook Autoscaling Support (#740) * Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Should not fatal log in deletePod Signed-off-by: Ryan O'Leary <[email protected]> * deletePod admission always succeeds Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused tests make command Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Just return an error instead of logging Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]> * Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Close stop channel on webhook termination Signed-off-by: Ryan O'Leary <[email protected]> * Refactor webhook to avoid using global vars Signed-off-by: Ryan O'Leary <[email protected]> * Fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Change service account name Signed-off-by: Ryan O'Leary <[email protected]> * Return BadRequest if invalid kind Signed-off-by: Ryan O'Leary <[email protected]> * Fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Change error messages Signed-off-by: Ryan O'Leary <[email protected]> * Fatal log in main Signed-off-by: Ryan O'Leary <[email protected]> * Update function comments Signed-off-by: Ryan O'Leary <[email protected]> * Refactor to minimize indentations Signed-off-by: Ryan O'Leary <[email protected]> * Change sliceToWorkerIDs nil check to use len Signed-off-by: Ryan O'Leary <[email protected]> * Write http.Error to header Signed-off-by: Ryan O'Leary <[email protected]> * Don't fatal log in validateRayCluster Signed-off-by: Ryan O'Leary <[email protected]> * Check for nil admission request Signed-off-by: Ryan O'Leary <[email protected]> * Add doc comment Signed-off-by: Ryan O'Leary <[email protected]> * Update expected errors Signed-off-by: Ryan O'Leary <[email protected]> * Better getNextWorkerID logic Signed-off-by: Ryan O'Leary <[email protected]> * Update replicaIndex and nextWorkerID tests Signed-off-by: Ryan O'Leary <[email protected]> * Refactor webhook unit tests Signed-off-by: Ryan O'Leary <[email protected]> * Create numOfHosts pods for Pod List Signed-off-by: Ryan O'Leary <[email protected]> * Log admission request object name Signed-off-by: Ryan O'Leary <[email protected]> * Fix nits and go vet output Signed-off-by: Ryan O'Leary <[email protected]> * Initial cloudbuil commit Signed-off-by: Ryan O'Leary <[email protected]> * Fix vet command Signed-off-by: Ryan O'Leary <[email protected]> * Update cloudbuild Signed-off-by: Ryan O'Leary <[email protected]> * Fix cloudbuild errors Signed-off-by: Ryan O'Leary <[email protected]> * Add dir Signed-off-by: Ryan O'Leary <[email protected]> * Remove arg Signed-off-by: Ryan O'Leary <[email protected]> * Change to bash command Signed-off-by: Ryan O'Leary <[email protected]> * increase timeout time Signed-off-by: Ryan O'Leary <[email protected]> * Fix validateRayCluster test Signed-off-by: Ryan O'Leary <[email protected]> * Fix nits for cloudbuild Signed-off-by: Ryan O'Leary <[email protected]> * Break early in validateRayCluster Signed-off-by: Ryan O'Leary <[email protected]> * Remove unnecessary args from validateRayCluster test Signed-off-by: Ryan O'Leary <[email protected]> * Change break to continue Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused vars from webhook tests and add edge cases Signed-off-by: Ryan O'Leary <[email protected]> * Update helm chart Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]>
* Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Should not fatal log in deletePod Signed-off-by: Ryan O'Leary <[email protected]> * deletePod admission always succeeds Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused tests make command Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Just return an error instead of logging Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]>
* Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Make webhook stateless in between mutate calls Signed-off-by: Ryan O'Leary <[email protected]> * Formatting changes Signed-off-by: Ryan O'Leary <[email protected]> * Fix bug causing incorrect IDs * Add cluster role and log formatting changes Signed-off-by: Ryan O'Leary <[email protected]> * Filter pods by Ray worker group label Signed-off-by: Ryan O'Leary <[email protected]> * Vulnerability fixes Signed-off-by: Ryan O'Leary <[email protected]> * Better names and add ServiceAccount Signed-off-by: Ryan O'Leary <[email protected]> * Change version back to v1.1 Signed-off-by: Ryan O'Leary <[email protected]> * Change implementation to use PodInformer Signed-off-by: Ryan O'Leary <[email protected]> * Use PodLister Signed-off-by: Ryan O'Leary <[email protected]> * updateSliceToWorkerIDs returns error Signed-off-by: Ryan O'Leary <[email protected]> * Use mutex lock in updateSliceToWorkerIDs Signed-off-by: Ryan O'Leary <[email protected]> * Update unit tests and fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Remove global client var Signed-off-by: Ryan O'Leary <[email protected]> * Just return err instead of logging Signed-off-by: Ryan O'Leary <[email protected]> * TODO comment Signed-off-by: Ryan O'Leary <[email protected]> * Lock when reading from shared sliceToWorkerIDs mapping Signed-off-by: Ryan O'Leary <[email protected]> * Switch to using RWMutex Signed-off-by: Ryan O'Leary <[email protected]> * Ray TPU Webhook Autoscaling Support (GoogleCloudPlatform#740) * Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Should not fatal log in deletePod Signed-off-by: Ryan O'Leary <[email protected]> * deletePod admission always succeeds Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused tests make command Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Just return an error instead of logging Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]> * Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Close stop channel on webhook termination Signed-off-by: Ryan O'Leary <[email protected]> * Refactor webhook to avoid using global vars Signed-off-by: Ryan O'Leary <[email protected]> * Fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Change service account name Signed-off-by: Ryan O'Leary <[email protected]> * Return BadRequest if invalid kind Signed-off-by: Ryan O'Leary <[email protected]> * Fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Change error messages Signed-off-by: Ryan O'Leary <[email protected]> * Fatal log in main Signed-off-by: Ryan O'Leary <[email protected]> * Update function comments Signed-off-by: Ryan O'Leary <[email protected]> * Refactor to minimize indentations Signed-off-by: Ryan O'Leary <[email protected]> * Change sliceToWorkerIDs nil check to use len Signed-off-by: Ryan O'Leary <[email protected]> * Write http.Error to header Signed-off-by: Ryan O'Leary <[email protected]> * Don't fatal log in validateRayCluster Signed-off-by: Ryan O'Leary <[email protected]> * Check for nil admission request Signed-off-by: Ryan O'Leary <[email protected]> * Add doc comment Signed-off-by: Ryan O'Leary <[email protected]> * Update expected errors Signed-off-by: Ryan O'Leary <[email protected]> * Better getNextWorkerID logic Signed-off-by: Ryan O'Leary <[email protected]> * Update replicaIndex and nextWorkerID tests Signed-off-by: Ryan O'Leary <[email protected]> * Refactor webhook unit tests Signed-off-by: Ryan O'Leary <[email protected]> * Create numOfHosts pods for Pod List Signed-off-by: Ryan O'Leary <[email protected]> * Log admission request object name Signed-off-by: Ryan O'Leary <[email protected]> * Fix nits and go vet output Signed-off-by: Ryan O'Leary <[email protected]> * Initial cloudbuil commit Signed-off-by: Ryan O'Leary <[email protected]> * Fix vet command Signed-off-by: Ryan O'Leary <[email protected]> * Update cloudbuild Signed-off-by: Ryan O'Leary <[email protected]> * Fix cloudbuild errors Signed-off-by: Ryan O'Leary <[email protected]> * Add dir Signed-off-by: Ryan O'Leary <[email protected]> * Remove arg Signed-off-by: Ryan O'Leary <[email protected]> * Change to bash command Signed-off-by: Ryan O'Leary <[email protected]> * increase timeout time Signed-off-by: Ryan O'Leary <[email protected]> * Fix validateRayCluster test Signed-off-by: Ryan O'Leary <[email protected]> * Fix nits for cloudbuild Signed-off-by: Ryan O'Leary <[email protected]> * Break early in validateRayCluster Signed-off-by: Ryan O'Leary <[email protected]> * Remove unnecessary args from validateRayCluster test Signed-off-by: Ryan O'Leary <[email protected]> * Change break to continue Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused vars from webhook tests and add edge cases Signed-off-by: Ryan O'Leary <[email protected]> * Update helm chart Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]>
This PR adds support for autoscaling RayClusters by generating
TPU_WORKER_HOSTNAMES
when intercepting each Pod, rather than when intercepting the RayCluster CR.This PR has been tested as follows: