Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes balancing issues under heavy load. #51

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rafa-be
Copy link
Collaborator

@rafa-be rafa-be commented Jan 29, 2025

I noticed 2 issues with our current balancing system under heavy load:

1) Forced task cancellation during balancing

The current balancing message isTaskCancel(force=True). This forces the task to be cancelled even when these are processing.

This creates some issues if the task being cancelled started sub-tasks. If this happens, an explosive creation of tasks can occur as sub-tasks will be created multiple times on re-routed workers.

The current PR changes the message to TaskCancel(force=False). If the task is running (or suspended because of an higher-priority task), the worker will answer with the new TaskStatus.CancelFailed status.

2) Workers can have multiple balancing message queued

Under heavy-load, the worker can accumulate multiple TaskCancel messages from the scheduler's balancing routine.

If the task is cancelable (i.e. not running), the worker will answer to the first TaskCancel message with TaskStatus.Cancelled and will then return TaskStatus.NotFound messages. This TaskStatus.NotFound status would then be mishandled by the scheduler, which will either corrupt the scheduler's internal state or propagate the TaskStatus.NotFound status to the client.

The current PR makes the handling of task results more resilient by ignoring messages originating from workers that do not currently handle the task (according to the scheduler's state).

@rafa-be rafa-be requested a review from sharpener6 January 29, 2025 17:48
@rafa-be rafa-be force-pushed the fix_heavy_load_balancing branch 2 times, most recently from 03125a7 to 6b60407 Compare January 29, 2025 17:58
Copy link
Contributor

@1597463007 1597463007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the PR but I'm a bit concerned of the lack of visibility from the scheduler with regards to workers processing tasks because if there are workers that can run tasks concurrently the scheduler will inadvertently steal/withhold tasks from certain workers.

This might require a worker to send the number of tasks that can be processed concurrently.

scaler/protocol/capnp/common.capnp Outdated Show resolved Hide resolved
scaler/scheduler/worker_manager.py Outdated Show resolved Hide resolved
@rafa-be
Copy link
Collaborator Author

rafa-be commented Jan 30, 2025

Not related to the PR but I'm a bit concerned of the lack of visibility from the scheduler with regards to workers processing tasks because if there are workers that can run tasks concurrently the scheduler will inadvertently steal/withhold tasks from certain workers.

This might require a worker to send the number of tasks that can be processed concurrently.

That's totally right.

However, we decided to implemented the sub-tasks system without any visibility from the scheduler, to keep the scheduler as simple as possible. That indeed creates some inefficiencies, but these seems to be reasonable from what I experienced.

@rafa-be rafa-be force-pushed the fix_heavy_load_balancing branch from 6b60407 to e988520 Compare January 30, 2025 16:34
@rafa-be
Copy link
Collaborator Author

rafa-be commented Jan 30, 2025

@rafa-be , in my original design, the cancel will never fail, but since now TaskCancel can have force=True/False, so it is possible that task cannot be canceled, then scheduler need wait until received task cancel ack, but that will raise another problem, so before receive the cancel ack, will receive multiple TaskCancel request, so we should add a new task status called Canceling, if in such state, we should ignore further task cancel request

@sharpener6 I like the idea, and thought it would not be that hard to implement: just keep a "cancelling" set in the scheduler.

But there is actually a tricky case in which we might want to send two TaskCancel messages anyway:

  1. the scheduler sends a TaskCancel(force=False) message because of balancing;
  2. the client cancels the task with a TaskCancel(force=True) before the worker can process the 1)st message.

In the PR, depending if the task is processing, the scheduler will either receive from the worker TaskStatus.Canceled then TaskStatus.NotFound (the 2nd will be ignored), or TaskStatus.CancelFailed then TaskStatus.Canceled (the 1st will be ignored). This works fine as the task will be ultimately canceled anyway. The client always immediately receives a TaskStatus.Canceled message.

If we do not allow multiple TaskCancel messages to be queued, the implementation of the scheduler becomes significantly more complex as it has to wait until the first cancel's ack before it can send the 2nd cancel message.

f"might due to worker get disconnected or canceled"
)
return

if worker != assigned_worker:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafa-be,

So there are Following situations:

  • force=True, this case TaskResultStatus will be always Canceled, in this case, reroute the task

The problem is below, too complicated, but let me put here

  • force=False
    • if in the queue, TaskResultStatus is Canceled, reroute the task
    • if processing, TaskResultStatus is CancelFailed, in this case, CancelFailed task should not be reroute, and after task finished, will be another TaskResult coming, can be Success and Failed

So in general, in which case this condition will happen?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* force=False
  
  * if in the queue, TaskResultStatus is Canceled, reroute the task
  * if processing, TaskResultStatus is CancelFailed, `in this case, CancelFailed task should not be reroute`, and after task finished, will be another TaskResult coming, can be Success and Failed

This.

The scheduler wants to cancel the task for balancing (with force=False), and then immediately after receives a TaskCancel(force=True) from the client.

If we decide to not send multiple TaskCancel messages, the scheduler will either wait for:

  • Canceled. Then ignore the message: the task is canceled as requested by the client;
  • CancelFailed. Then send an additional TaskCancel(force=True) message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants