[Core] Make Task Exit with System Error When free_objects Receives IOError #48636

MengjinYan · 2024-11-07T22:21:09Z

Why are these changes needed?

In a recent investigation, we found that when ray._private.internal_api.free() is called from a task at the same time as a Raylet is gracefully shutting down, the task might fail with an application-level Broken pipe IOError. This resulted in job failure without any task retries.

However, because the Broken pipe is caused by the unhealthy local Raylet, the error should be a system-level error and should be retried automatically.
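To illustrate why the error classification matters, here is a minimal, self-contained sketch of a retry loop that retries system-level failures but lets application-level failures surface immediately. The names `WorkerSystemError`, `ApplicationError`, and `run_with_retries` are hypothetical and are not Ray APIs; the sketch only mirrors the retry behavior described above.

```python
class ApplicationError(Exception):
    """Hypothetical: an error raised by user code; not retried in this sketch."""


class WorkerSystemError(Exception):
    """Hypothetical: an infrastructure failure (e.g. an unhealthy local daemon)."""


def run_with_retries(task, max_retries=3):
    """Run `task`, retrying only on WorkerSystemError.

    Returns (result, attempts). ApplicationError propagates on the first failure.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except WorkerSystemError:
            # System-level failure: retry until the retry budget is used up.
            if attempts > max_retries:
                raise


def make_flaky_task(failures, error_type):
    """Build a task that raises `error_type` `failures` times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise error_type("Broken pipe")
        return "ok"

    return task, state
```

With this shape, a Broken pipe surfaced as an application-level error fails the job on the first attempt, while the same failure surfaced as a system-level error is retried transparently.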

Updated changes in commit 01f5f11:
This PR adds the logic for the above behavior:

When an IOError is received in CoreWorker::Delete, throw a system error exception so that the task can be retried.

Why not add the exception check in the free_objects function?

It is better to add the logic in CoreWorker::Delete because it covers the case for other languages as well.
The CoreWorker::Delete function is intended to be open for all languages to call and is not called in other Ray internal code paths.
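The idea of reclassifying the error at a single shared entry point can be sketched as follows. This is not the actual C++ implementation; `Code`, `Status`, `delete`, and `send_free_request` are hypothetical names used only to show the mapping from an IO error to a system-level status.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Sequence


class Code(Enum):
    OK = auto()
    IO_ERROR = auto()
    UNEXPECTED_SYSTEM_EXIT = auto()


@dataclass(frozen=True)
class Status:
    code: Code
    message: str = ""


def delete(object_ids: Sequence[str],
           send_free_request: Callable[[Sequence[str]], Status]) -> Status:
    """Hypothetical stand-in for a language-neutral delete entry point."""
    status = send_free_request(object_ids)
    if status.code is Code.IO_ERROR:
        # Reclassify here so every caller, in any language binding,
        # sees a system-level failure instead of an application error.
        return Status(Code.UNEXPECTED_SYSTEM_EXIT, status.message)
    # Any other status is passed through unchanged.
    return status
```

Because the mapping lives in the shared function, each language frontend only has to surface the returned status rather than repeat the IOError check.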

Why not crash the worker when IOError is encountered in the WriteMessage function?

The QuickExit() function exits the process directly without executing any of the worker's shutdown logic. Calling it during task execution could potentially cause a resource leak.
The write message function is also called in the graceful shutdown scenario, and the local Raylet may be unreachable during graceful shutdown. Therefore, in that scenario we shouldn't exit early but should let the shutdown logic finish.
In addition, the code is not clear about the behavior of graceful vs. forced shutdown. It might take some effort to make the distinction clear. A TODO is added in the PR.
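The concern about skipping shutdown logic can be shown with a Python analogy. This sketch does not call Ray's QuickExit(); it assumes `os._exit` as a rough stand-in for an immediate process exit and uses an `atexit` handler as a stand-in for the worker's cleanup logic.

```python
import subprocess
import sys

# Child program: registers a cleanup handler, then exits in one of two ways.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran", flush=True))
if sys.argv[1] == "quick":
    os._exit(1)  # immediate exit: atexit handlers are skipped
sys.exit(1)      # normal exit: atexit handlers run
"""


def run_child(mode: str) -> str:
    """Run the child in `mode` ("quick" or "graceful") and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

In the "quick" mode the cleanup handler never runs, which is the kind of skipped shutdown work the answer above is worried about.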

Updated changes in commit 2029d36:

This PR adds the logic for the above behavior:

When an IOError is received in the free_objects() function, throw a system error exception so that the task can be retried.

Changes in commit 9d57b29:

This PR adds the logic for the above behavior:

Today, the internal free API deletes objects from the local Raylet object store by writing a message through a socket.

When the write fails because the local Raylet is terminated, there is already logic to quick-exit the task.

However, the current termination check doesn't cover the case where the local Raylet process is a zombie process and an IOError happens while writing the message.

This fix updates the check criteria and fails the task when the Raylet process is terminated or the write message function returns an IOError.
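For readers unfamiliar with how a Broken pipe arises, the following sketch reproduces the failure mode with a plain socket pair. It is an illustration only: `socket.socketpair()` stands in for the worker-to-Raylet connection, and the function name is hypothetical. On POSIX systems the write typically fails with `BrokenPipeError`, a subclass of `OSError` (of which `IOError` is an alias in Python 3).

```python
import socket
from typing import Optional


def write_to_closed_peer(payload: bytes = b"free") -> Optional[OSError]:
    """Write to a socket whose peer is already closed; return the error, if any."""
    writer, peer = socket.socketpair()
    try:
        peer.close()  # simulate the process on the other end going away
        try:
            writer.sendall(payload)
        except OSError as err:  # IOError is an alias of OSError in Python 3
            return err
        return None
    finally:
        writer.close()
```

The writer only learns that the peer is gone when it attempts the write, which is why checking the write result catches cases that a separate "is the process terminated" check can miss.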

Related issue number

Closes #48628

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

rynewang · 2024-11-12T21:49:09Z

python/ray/_raylet.pyx

-                free_ids, local_only))
+            status = CCoreWorkerProcess.GetCoreWorker().Delete(free_ids, local_only)
+            if status.IsIOError():
+                check_status(CRayStatus.UnexpectedSystemExit(status.ToString()))


Instead of inspecting it here, can we do this in CoreWorker::Delete: if the status is an IOError, return UnexpectedSystemExit; otherwise return it as-is?

Thoughts: if another code path calls Delete and gets a broken pipe, it should also be considered an UnexpectedSystemExit.

Synced offline. It is better to add the logic in the CoreWorker::Delete because it can cover the case for other languages as well. The CoreWorker::Delete function is intended to be open to all languages to call and is not called in other ray internal code paths.

rynewang · 2024-11-13T17:43:23Z

python/ray/tests/test_output.py

@@ -575,14 +575,20 @@ def test_disable_driver_logs_breakpoint():
 @ray.remote
 def f():
    while True:
-        time.sleep(1)
+        start_time = time.time()


What is this change?

This test failed on one of the commits. I was not able to reproduce it locally on my Mac, but I think it is because time.sleep() can be inaccurate on a VM. So the update here uses time.time() to make the sleeping duration more accurate.
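The full test change is not shown in the diff excerpt, so the following is only an assumed shape of the technique: instead of trusting a single sleep call, measure elapsed wall-clock time with time.time() and keep waiting until the target duration has passed. The helper name `wait_at_least` and the `poll_s` parameter are hypothetical.

```python
import time


def wait_at_least(duration_s: float, poll_s: float = 0.01) -> float:
    """Block until at least `duration_s` seconds have elapsed; return the elapsed time."""
    start_time = time.time()
    while time.time() - start_time < duration_s:
        # Sleep in short slices and re-check the clock each time, so an
        # individual sleep returning early or late does not decide the total wait.
        time.sleep(poll_s)
    return time.time() - start_time
```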

MengjinYan added 5 commits November 7, 2024 10:21

Make core worker exit with system error when receiving IOError from writing to local object store

9d57b29

Signed-off-by: Mengjin Yan <[email protected]>

Update the code to throw system exception upon IOError in free_objects

2029d36

Signed-off-by: Mengjin Yan <[email protected]>

fix lint

89dfa7c

Signed-off-by: Mengjin Yan <[email protected]>

revert comment change

d654c2f

Signed-off-by: Mengjin Yan <[email protected]>

fix lint

8148938

Signed-off-by: Mengjin Yan <[email protected]>

MengjinYan changed the title ~~[Core] Make Task Exit with System Error the Local Raylet is Unreachable~~ [Core] Make Task Exit with System Error When free_objects Receives IOError Nov 9, 2024

MengjinYan added the go add ONLY when ready to merge, run all tests label Nov 9, 2024

MengjinYan marked this pull request as ready for review November 9, 2024 23:03

jjyao assigned rynewang Nov 12, 2024

rynewang reviewed Nov 12, 2024

MengjinYan added 2 commits November 12, 2024 13:54

make time calculation in test_disable_driver_logs_breakpoint more accurate

eeba114

Signed-off-by: Mengjin Yan <[email protected]>

Move the error check to core_worker.cc

01f5f11

Signed-off-by: Mengjin Yan <[email protected]>

rynewang approved these changes Nov 12, 2024

rynewang enabled auto-merge (squash) November 12, 2024 23:12

fix lint

f202ac7

Signed-off-by: Mengjin Yan <[email protected]>

github-actions bot disabled auto-merge November 13, 2024 01:51

MengjinYan added 2 commits November 12, 2024 18:05

Merge branch 'master' into issue-48628

07d4cf1

Merge branch 'master' into issue-48628

76fd34d

rynewang reviewed Nov 13, 2024

rynewang enabled auto-merge (squash) November 13, 2024 19:43

rynewang merged commit 5788c4b into master Nov 13, 2024
6 checks passed

rynewang deleted the issue-48628 branch November 13, 2024 19:45
