-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Make Task Exit with System Error When free_objects Receives IOError #48636
Conversation
…riting to local object store Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: Mengjin Yan <[email protected]>
python/ray/_raylet.pyx
Outdated
free_ids, local_only)) | ||
status = CCoreWorkerProcess.GetCoreWorker().Delete(free_ids, local_only) | ||
if status.IsIOError(): | ||
check_status(CRayStatus.UnexpectedSystemExit(status.ToString())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of inspecting it here, can we do if is IOError then UnexpectedSystemExit, else as-is
in CoreWorker::Delete
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts: if there's another code path calling Delete and get broken pipe, it should all be considered as UnexpectedSystemExit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced offline. It is better to add the logic in the CoreWorker::Delete
because it can cover the case for other languages as well. The CoreWorker::Delete
function is intended to be open to all languages to call and is not called in other ray internal code paths.
…urate Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: Mengjin Yan <[email protected]>
@@ -575,14 +575,20 @@ def test_disable_driver_logs_breakpoint(): | |||
@ray.remote | |||
def f(): | |||
while True: | |||
time.sleep(1) | |||
start_time = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test failed on one of the commits. I was not able to reproduce locally on my Mac but I think it is because time.sleep()
could be inaccurate on a VM. So the update here is to use time.time()
to make the sleeping duration more accurate.
…Error (ray-project#48636) In a recent investigation, we found that when we call the `ray._private.internal_api.free()` from a task the same time as a Raylet is gracefully shutting down, the task might fail with application level Broken pipe IOError. This resulted in job failure without any task retries. However, as the Broken pipe happens because the unhealthiness of the local Raylet, the error should be a system level error and should be retried automatically. Updated changes in commit [01f5f11](ray-project@01f5f11): This PR add the logic for the above bahvior: * When IOError is received in the `CoreWorker::Delete`, throw a system error exception so that the task can retry Why not add the exception check in the `free_objects` function? * It is better to add the logic in the `CoreWorker::Delete` because it can cover the case for other languages as well. * The `CoreWorker::Delete` function is intended to be open to all languages to call and is not called in other ray internal code paths. Why not crash the worker when IOError is encountered in the `WriteMessage` function? * `QuickExit()` function will directly exit the process without executing any shutdown logic for the worker. Directly calling the function in the task execution might potentially causing resource leak * At the same time, the write message function is called also on the graceful shutdown scenario and it is possible during the graceful shutdown process that the local Raylet is unreachable. Therefore, in the graceful shutdown scenario, we shouldn't exit early but let the shutdown logic finish. * At the same time, it is not clear in the code regarding the behavior of the graceful vs force shutdown. We might need some effort to make them clear. The todo is added in the PR. Updated changes in commit [2029d36](ray-project@2029d36): > This PR add the logic for the above behavior: > * When IOError is received in the `free_objects()` function, throw a system error exception so that the task can retry Changes in commit ([9d57b29](ray-project@9d57b29)) : > This PR add the logic for the above behavior: > * Today, the internal `free` API deletes the objects from the local Raylet object store by writing a message through a socket > * When the write failed because the local Raylet is terminated, there is already logic to quick exit the task > * However, the current termination check didn't cover the case where the local Raylet process is a Zombie process and IOError happens during write messages. > * This fix update the check criteria and fail the task when the Raylet process is terminated or the write message function returns an IOError~ Signed-off-by: Mengjin Yan <[email protected]>
…Error (ray-project#48636) In a recent investigation, we found that when we call the `ray._private.internal_api.free()` from a task the same time as a Raylet is gracefully shutting down, the task might fail with application level Broken pipe IOError. This resulted in job failure without any task retries. However, as the Broken pipe happens because the unhealthiness of the local Raylet, the error should be a system level error and should be retried automatically. Updated changes in commit [01f5f11](ray-project@01f5f11): This PR add the logic for the above bahvior: * When IOError is received in the `CoreWorker::Delete`, throw a system error exception so that the task can retry Why not add the exception check in the `free_objects` function? * It is better to add the logic in the `CoreWorker::Delete` because it can cover the case for other languages as well. * The `CoreWorker::Delete` function is intended to be open to all languages to call and is not called in other ray internal code paths. Why not crash the worker when IOError is encountered in the `WriteMessage` function? * `QuickExit()` function will directly exit the process without executing any shutdown logic for the worker. Directly calling the function in the task execution might potentially causing resource leak * At the same time, the write message function is called also on the graceful shutdown scenario and it is possible during the graceful shutdown process that the local Raylet is unreachable. Therefore, in the graceful shutdown scenario, we shouldn't exit early but let the shutdown logic finish. * At the same time, it is not clear in the code regarding the behavior of the graceful vs force shutdown. We might need some effort to make them clear. The todo is added in the PR. Updated changes in commit [2029d36](ray-project@2029d36): > This PR add the logic for the above behavior: > * When IOError is received in the `free_objects()` function, throw a system error exception so that the task can retry Changes in commit ([9d57b29](ray-project@9d57b29)) : > This PR add the logic for the above behavior: > * Today, the internal `free` API deletes the objects from the local Raylet object store by writing a message through a socket > * When the write failed because the local Raylet is terminated, there is already logic to quick exit the task > * However, the current termination check didn't cover the case where the local Raylet process is a Zombie process and IOError happens during write messages. > * This fix update the check criteria and fail the task when the Raylet process is terminated or the write message function returns an IOError~ Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: mohitjain2504 <[email protected]>
…Error (ray-project#48636) In a recent investigation, we found that when we call the `ray._private.internal_api.free()` from a task the same time as a Raylet is gracefully shutting down, the task might fail with application level Broken pipe IOError. This resulted in job failure without any task retries. However, as the Broken pipe happens because the unhealthiness of the local Raylet, the error should be a system level error and should be retried automatically. Updated changes in commit [01f5f11](ray-project@01f5f11): This PR add the logic for the above bahvior: * When IOError is received in the `CoreWorker::Delete`, throw a system error exception so that the task can retry Why not add the exception check in the `free_objects` function? * It is better to add the logic in the `CoreWorker::Delete` because it can cover the case for other languages as well. * The `CoreWorker::Delete` function is intended to be open to all languages to call and is not called in other ray internal code paths. Why not crash the worker when IOError is encountered in the `WriteMessage` function? * `QuickExit()` function will directly exit the process without executing any shutdown logic for the worker. Directly calling the function in the task execution might potentially causing resource leak * At the same time, the write message function is called also on the graceful shutdown scenario and it is possible during the graceful shutdown process that the local Raylet is unreachable. Therefore, in the graceful shutdown scenario, we shouldn't exit early but let the shutdown logic finish. * At the same time, it is not clear in the code regarding the behavior of the graceful vs force shutdown. We might need some effort to make them clear. The todo is added in the PR. Updated changes in commit [2029d36](ray-project@2029d36): > This PR add the logic for the above behavior: > * When IOError is received in the `free_objects()` function, throw a system error exception so that the task can retry Changes in commit ([9d57b29](ray-project@9d57b29)) : > This PR add the logic for the above behavior: > * Today, the internal `free` API deletes the objects from the local Raylet object store by writing a message through a socket > * When the write failed because the local Raylet is terminated, there is already logic to quick exit the task > * However, the current termination check didn't cover the case where the local Raylet process is a Zombie process and IOError happens during write messages. > * This fix update the check criteria and fail the task when the Raylet process is terminated or the write message function returns an IOError~ Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: hjiang <[email protected]>
Why are these changes needed?
In a recent investigation, we found that when we call the
ray._private.internal_api.free()
from a task the same time as a Raylet is gracefully shutting down, the task might fail with application level Broken pipe IOError. This resulted in job failure without any task retries.However, as the Broken pipe happens because the unhealthiness of the local Raylet, the error should be a system level error and should be retried automatically.
Updated changes in commit 01f5f11:
This PR add the logic for the above bahvior:
CoreWorker::Delete
, throw a system error exception so that the task can retryWhy not add the exception check in the
free_objects
function?CoreWorker::Delete
because it can cover the case for other languages as well.CoreWorker::Delete
function is intended to be open to all languages to call and is not called in other ray internal code paths.Why not crash the worker when IOError is encountered in the
WriteMessage
function?QuickExit()
function will directly exit the process without executing any shutdown logic for the worker. Directly calling the function in the task execution might potentially causing resource leakUpdated changes in commit 2029d36:
Changes in commit (9d57b29) :
Related issue number
Closes #48628
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.