Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Make Task Exit with System Error When free_objects Receives IOError #48636

Merged
merged 10 commits into from
Nov 13, 2024

Conversation

MengjinYan
Copy link
Collaborator

@MengjinYan MengjinYan commented Nov 7, 2024

Why are these changes needed?

In a recent investigation, we found that when we call the ray._private.internal_api.free() from a task the same time as a Raylet is gracefully shutting down, the task might fail with application level Broken pipe IOError. This resulted in job failure without any task retries.

However, as the Broken pipe happens because the unhealthiness of the local Raylet, the error should be a system level error and should be retried automatically.

Updated changes in commit 01f5f11:
This PR add the logic for the above bahvior:

  • When IOError is received in the CoreWorker::Delete, throw a system error exception so that the task can retry

Why not add the exception check in the free_objects function?

  • It is better to add the logic in the CoreWorker::Delete because it can cover the case for other languages as well.
  • The CoreWorker::Delete function is intended to be open to all languages to call and is not called in other ray internal code paths.

Why not crash the worker when IOError is encountered in the WriteMessage function?

  • QuickExit() function will directly exit the process without executing any shutdown logic for the worker. Directly calling the function in the task execution might potentially causing resource leak
  • At the same time, the write message function is called also on the graceful shutdown scenario and it is possible during the graceful shutdown process that the local Raylet is unreachable. Therefore, in the graceful shutdown scenario, we shouldn't exit early but let the shutdown logic finish.
  • At the same time, it is not clear in the code regarding the behavior of the graceful vs force shutdown. We might need some effort to make them clear. The todo is added in the PR.

Updated changes in commit 2029d36:

This PR add the logic for the above behavior:

  • When IOError is received in the free_objects() function, throw a system error exception so that the task can retry

Changes in commit (9d57b29) :

This PR add the logic for the above behavior:

  • Today, the internal free API deletes the objects from the local Raylet object store by writing a message through a socket
  • When the write failed because the local Raylet is terminated, there is already logic to quick exit the task
  • However, the current termination check didn't cover the case where the local Raylet process is a Zombie process and IOError happens during write messages.
  • This fix update the check criteria and fail the task when the Raylet process is terminated or the write message function returns an IOError~

Related issue number

Closes #48628

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@MengjinYan MengjinYan changed the title [Core] Make Task Exit with System Error the Local Raylet is Unreachable [Core] Make Task Exit with System Error When free_objects Receives IOError Nov 9, 2024
@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Nov 9, 2024
@MengjinYan MengjinYan marked this pull request as ready for review November 9, 2024 23:03
free_ids, local_only))
status = CCoreWorkerProcess.GetCoreWorker().Delete(free_ids, local_only)
if status.IsIOError():
check_status(CRayStatus.UnexpectedSystemExit(status.ToString()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of inspecting it here, can we do if is IOError then UnexpectedSystemExit, else as-is in CoreWorker::Delete ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts: if there's another code path calling Delete and get broken pipe, it should all be considered as UnexpectedSystemExit

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline. It is better to add the logic in the CoreWorker::Delete because it can cover the case for other languages as well. The CoreWorker::Delete function is intended to be open to all languages to call and is not called in other ray internal code paths.

@rynewang rynewang enabled auto-merge (squash) November 12, 2024 23:12
Signed-off-by: Mengjin Yan <[email protected]>
@github-actions github-actions bot disabled auto-merge November 13, 2024 01:51
@@ -575,14 +575,20 @@ def test_disable_driver_logs_breakpoint():
@ray.remote
def f():
while True:
time.sleep(1)
start_time = time.time()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test failed on one of the commits. I was not able to reproduce locally on my Mac but I think it is because time.sleep() could be inaccurate on a VM. So the update here is to use time.time() to make the sleeping duration more accurate.

@rynewang rynewang enabled auto-merge (squash) November 13, 2024 19:43
@rynewang rynewang merged commit 5788c4b into master Nov 13, 2024
6 checks passed
@rynewang rynewang deleted the issue-48628 branch November 13, 2024 19:45
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…Error (ray-project#48636)

In a recent investigation, we found that when we call the
`ray._private.internal_api.free()` from a task the same time as a Raylet
is gracefully shutting down, the task might fail with application level
Broken pipe IOError. This resulted in job failure without any task
retries.

However, as the Broken pipe happens because the unhealthiness of the
local Raylet, the error should be a system level error and should be
retried automatically.

Updated changes in commit
[01f5f11](ray-project@01f5f11):
This PR add the logic for the above bahvior:
* When IOError is received in the `CoreWorker::Delete`, throw a system
error exception so that the task can retry

Why not add the exception check in the `free_objects` function?
* It is better to add the logic in the `CoreWorker::Delete` because it
can cover the case for other languages as well.
* The `CoreWorker::Delete` function is intended to be open to all
languages to call and is not called in other ray internal code paths.

Why not crash the worker when IOError is encountered in the
`WriteMessage` function?
* `QuickExit()` function will directly exit the process without
executing any shutdown logic for the worker. Directly calling the
function in the task execution might potentially causing resource leak
* At the same time, the write message function is called also on the
graceful shutdown scenario and it is possible during the graceful
shutdown process that the local Raylet is unreachable. Therefore, in the
graceful shutdown scenario, we shouldn't exit early but let the shutdown
logic finish.
* At the same time, it is not clear in the code regarding the behavior
of the graceful vs force shutdown. We might need some effort to make
them clear. The todo is added in the PR.

Updated changes in commit
[2029d36](ray-project@2029d36):
> This PR add the logic for the above behavior:
> * When IOError is received in the `free_objects()` function, throw a
system error exception so that the task can retry

Changes in commit
([9d57b29](ray-project@9d57b29))
:
> This PR add the logic for the above behavior:
> * Today, the internal `free` API deletes the objects from the local
Raylet object store by writing a message through a socket
> * When the write failed because the local Raylet is terminated, there
is already logic to quick exit the task
> * However, the current termination check didn't cover the case where
the local Raylet process is a Zombie process and IOError happens during
write messages.
> * This fix update the check criteria and fail the task when the Raylet
process is terminated or the write message function returns an IOError~


Signed-off-by: Mengjin Yan <[email protected]>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…Error (ray-project#48636)

In a recent investigation, we found that when we call the
`ray._private.internal_api.free()` from a task the same time as a Raylet
is gracefully shutting down, the task might fail with application level
Broken pipe IOError. This resulted in job failure without any task
retries.

However, as the Broken pipe happens because the unhealthiness of the
local Raylet, the error should be a system level error and should be
retried automatically.

Updated changes in commit
[01f5f11](ray-project@01f5f11):
This PR add the logic for the above bahvior:
* When IOError is received in the `CoreWorker::Delete`, throw a system
error exception so that the task can retry

Why not add the exception check in the `free_objects` function?
* It is better to add the logic in the `CoreWorker::Delete` because it
can cover the case for other languages as well.
* The `CoreWorker::Delete` function is intended to be open to all
languages to call and is not called in other ray internal code paths.

Why not crash the worker when IOError is encountered in the
`WriteMessage` function?
* `QuickExit()` function will directly exit the process without
executing any shutdown logic for the worker. Directly calling the
function in the task execution might potentially causing resource leak
* At the same time, the write message function is called also on the
graceful shutdown scenario and it is possible during the graceful
shutdown process that the local Raylet is unreachable. Therefore, in the
graceful shutdown scenario, we shouldn't exit early but let the shutdown
logic finish.
* At the same time, it is not clear in the code regarding the behavior
of the graceful vs force shutdown. We might need some effort to make
them clear. The todo is added in the PR.

Updated changes in commit
[2029d36](ray-project@2029d36):
> This PR add the logic for the above behavior:
> * When IOError is received in the `free_objects()` function, throw a
system error exception so that the task can retry

Changes in commit
([9d57b29](ray-project@9d57b29))
:
> This PR add the logic for the above behavior:
> * Today, the internal `free` API deletes the objects from the local
Raylet object store by writing a message through a socket
> * When the write failed because the local Raylet is terminated, there
is already logic to quick exit the task
> * However, the current termination check didn't cover the case where
the local Raylet process is a Zombie process and IOError happens during
write messages.
> * This fix update the check criteria and fail the task when the Raylet
process is terminated or the write message function returns an IOError~

Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: mohitjain2504 <[email protected]>
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
…Error (ray-project#48636)

In a recent investigation, we found that when we call the
`ray._private.internal_api.free()` from a task the same time as a Raylet
is gracefully shutting down, the task might fail with application level
Broken pipe IOError. This resulted in job failure without any task
retries.

However, as the Broken pipe happens because the unhealthiness of the
local Raylet, the error should be a system level error and should be
retried automatically.

Updated changes in commit
[01f5f11](ray-project@01f5f11):
This PR add the logic for the above bahvior:
* When IOError is received in the `CoreWorker::Delete`, throw a system
error exception so that the task can retry

Why not add the exception check in the `free_objects` function?
* It is better to add the logic in the `CoreWorker::Delete` because it
can cover the case for other languages as well.
* The `CoreWorker::Delete` function is intended to be open to all
languages to call and is not called in other ray internal code paths.

Why not crash the worker when IOError is encountered in the
`WriteMessage` function?
* `QuickExit()` function will directly exit the process without
executing any shutdown logic for the worker. Directly calling the
function in the task execution might potentially causing resource leak
* At the same time, the write message function is called also on the
graceful shutdown scenario and it is possible during the graceful
shutdown process that the local Raylet is unreachable. Therefore, in the
graceful shutdown scenario, we shouldn't exit early but let the shutdown
logic finish.
* At the same time, it is not clear in the code regarding the behavior
of the graceful vs force shutdown. We might need some effort to make
them clear. The todo is added in the PR.

Updated changes in commit
[2029d36](ray-project@2029d36):
> This PR add the logic for the above behavior:
> * When IOError is received in the `free_objects()` function, throw a
system error exception so that the task can retry

Changes in commit
([9d57b29](ray-project@9d57b29))
:
> This PR add the logic for the above behavior:
> * Today, the internal `free` API deletes the objects from the local
Raylet object store by writing a message through a socket
> * When the write failed because the local Raylet is terminated, there
is already logic to quick exit the task
> * However, the current termination check didn't cover the case where
the local Raylet process is a Zombie process and IOError happens during
write messages.
> * This fix update the check criteria and fail the task when the Raylet
process is terminated or the write message function returns an IOError~


Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: hjiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] Instance Preemption Causing Application Level Broken Pipe Issue
2 participants