Fixes balancing issues under heavy load. #51

rafa-be · 2025-01-29T17:48:42Z

I noticed 2 issues with our current balancing system under heavy load:

1) Forced task cancellation during balancing

The current balancing message is TaskCancel(force=True). This forces the task to be cancelled even when it is processing.

This causes problems if the cancelled task has already started sub-tasks: the re-routed task creates its sub-tasks again on the new worker, so the number of tasks can grow explosively.

The current PR changes the message to TaskCancel(force=False). If the task is running (or suspended because of a higher-priority task), the worker will answer with the new TaskStatus.CancelFailed status.
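As an illustration of the behaviour described above, here is a minimal sketch of how a worker might answer a cancel request. The `Worker` class, its `queued`/`started` sets and the `on_task_cancel` method are hypothetical names used for illustration only, and `TaskStatus` is a local stand-in for the project's enum, not the actual implementation.

```python
from enum import Enum, auto


class TaskStatus(Enum):
    # Local stand-in for the project's TaskStatus; only the members discussed here.
    Canceled = auto()
    CancelFailed = auto()
    NotFound = auto()


class Worker:
    """Toy worker (hypothetical): tracks queued task ids and task ids that have started."""

    def __init__(self):
        self.queued = set()
        self.started = set()  # running, or suspended by a higher-priority task

    def on_task_cancel(self, task_id, force):
        if task_id in self.queued:
            # Not started yet: always safe to cancel.
            self.queued.discard(task_id)
            return TaskStatus.Canceled
        if task_id in self.started:
            if not force:
                # Balancing must not interrupt a task that may have spawned sub-tasks.
                return TaskStatus.CancelFailed
            self.started.discard(task_id)
            return TaskStatus.Canceled
        # Unknown task, e.g. already cancelled by an earlier message.
        return TaskStatus.NotFound
```

With this shape, only a forced cancel (for example one requested by the client) can stop a task that has already started.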

2) Workers can have multiple balancing messages queued

Under heavy load, the worker can accumulate multiple TaskCancel messages from the scheduler's balancing routine.

If the task is cancelable (i.e. not running), the worker will answer the first TaskCancel message with TaskStatus.Cancelled and every subsequent one with TaskStatus.NotFound. This TaskStatus.NotFound status is then mishandled by the scheduler, which either corrupts its internal state or propagates the TaskStatus.NotFound status to the client.

The current PR makes the handling of task results more resilient by ignoring messages originating from workers that do not currently handle the task (according to the scheduler's state).
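A minimal sketch of this filtering, assuming the scheduler keeps a mapping from task id to the worker it currently believes owns the task. `Scheduler`, `task_to_worker` and `on_task_result` are hypothetical names, and statuses are plain strings to keep the example self-contained.

```python
class Scheduler:
    """Toy scheduler (hypothetical): drops results from workers that no longer own the task."""

    def __init__(self):
        self.task_to_worker = {}  # task_id -> worker currently assigned (scheduler's view)
        self.handled = []         # (task_id, status) pairs that were actually processed

    def on_task_result(self, worker, task_id, status):
        assigned_worker = self.task_to_worker.get(task_id)
        if assigned_worker is None:
            # Task unknown to the scheduler (finished, cancelled, or worker disconnected).
            return False
        if worker != assigned_worker:
            # Stale answer, e.g. NotFound replying to a duplicate balancing TaskCancel.
            return False
        self.handled.append((task_id, status))
        return True


scheduler = Scheduler()
scheduler.task_to_worker["task-1"] = "worker-A"

# First cancel answer arrives from the owner; the task is then rerouted.
scheduler.on_task_result("worker-A", "task-1", "Canceled")
scheduler.task_to_worker["task-1"] = "worker-B"

# The answer to the duplicate cancel comes from the previous owner and is ignored.
scheduler.on_task_result("worker-A", "task-1", "NotFound")
```

The stale NotFound therefore neither changes the scheduler's state nor reaches the client.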

1597463007

Not related to the PR, but I'm a bit concerned about the scheduler's lack of visibility into which tasks workers are processing: if some workers can run tasks concurrently, the scheduler will inadvertently steal or withhold tasks from certain workers.

This might require a worker to send the number of tasks that can be processed concurrently.

scaler/protocol/capnp/common.capnp

scaler/scheduler/worker_manager.py

rafa-be · 2025-01-30T14:22:40Z

> Not related to the PR, but I'm a bit concerned about the scheduler's lack of visibility into which tasks workers are processing: if some workers can run tasks concurrently, the scheduler will inadvertently steal or withhold tasks from certain workers.
>
> This might require a worker to send the number of tasks that can be processed concurrently.

That's totally right.

However, we decided to implement the sub-tasks system without any visibility from the scheduler, to keep the scheduler as simple as possible. That indeed creates some inefficiencies, but these seem reasonable from what I have experienced.

Signed-off-by: rafa-be <[email protected]>

rafa-be · 2025-01-30T17:13:02Z

> @rafa-be, in my original design a cancel never fails. Now that TaskCancel can have force=True/False, it is possible that a task cannot be canceled, so the scheduler needs to wait until it has received the task cancel ack. That raises another problem: multiple TaskCancel requests can arrive before the cancel ack is received. So we should add a new task status called Canceling; while a task is in that state, further task cancel requests should be ignored.

@sharpener6 I like the idea, and thought it would not be that hard to implement: just keep a "cancelling" set in the scheduler.

But there is actually a tricky case in which we might want to send two TaskCancel messages anyway:

1) the scheduler sends a TaskCancel(force=False) message because of balancing;
2) the client cancels the task with a TaskCancel(force=True) before the worker can process the 1st message.

In the PR, depending on whether the task is processing, the scheduler will receive from the worker either TaskStatus.Canceled then TaskStatus.NotFound (the 2nd will be ignored), or TaskStatus.CancelFailed then TaskStatus.Canceled (the 1st will be ignored). This works fine, as the task is ultimately canceled either way. The client always immediately receives a TaskStatus.Canceled message.

If we do not allow multiple TaskCancel messages to be queued, the implementation of the scheduler becomes significantly more complex as it has to wait until the first cancel's ack before it can send the 2nd cancel message.
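To make the two reply sequences concrete, here is a small sketch that replays TaskCancel(force=False) followed by TaskCancel(force=True) against a toy worker model. The `worker_replies` function and the string statuses are illustrative assumptions, not the project's code.

```python
def worker_replies(task_is_processing):
    """Statuses a toy worker returns for TaskCancel(force=False) then TaskCancel(force=True)."""
    state = "processing" if task_is_processing else "queued"
    replies = []
    for force in (False, True):
        if state == "gone":
            # The task was already cancelled by the previous message.
            replies.append("NotFound")
        elif state == "queued" or force:
            # Queued tasks can always be cancelled; processing ones only when forced.
            state = "gone"
            replies.append("Canceled")
        else:
            replies.append("CancelFailed")
    return replies


print(worker_replies(task_is_processing=False))  # ['Canceled', 'NotFound']
print(worker_replies(task_is_processing=True))   # ['CancelFailed', 'Canceled']
```

In both sequences exactly one reply is Canceled, which is why ignoring the other one still leaves the task canceled.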

sharpener6 · 2025-01-30T01:59:20Z

scaler/scheduler/worker_manager.py

                f"might due to worker get disconnected or canceled"
            )
            return

+        if worker != assigned_worker:


@rafa-be,

So there are the following situations:

* force=True: TaskResultStatus will always be Canceled; in this case, reroute the task

The problem is the case below. It is complicated, but let me put it here:

* force=False
  * if the task is in the queue, TaskResultStatus is Canceled; reroute the task
  * if the task is processing, TaskResultStatus is CancelFailed; in this case, the CancelFailed task should not be rerouted, and after the task finishes another TaskResult will arrive, which can be Success or Failed

So in general, in which case will this condition happen?

> * force=False
>   * if in the queue, TaskResultStatus is Canceled, reroute the task
>   * if processing, TaskResultStatus is CancelFailed, `in this case, CancelFailed task should not be reroute`, and after task finished, will be another TaskResult coming, can be Success and Failed

This.

The scheduler wants to cancel the task for balancing (with force=False), and then immediately after receives a TaskCancel(force=True) from the client.

If we decide not to send multiple TaskCancel messages, the scheduler will have to wait for one of:

Canceled. Then ignore the message: the task is canceled as requested by the client;

CancelFailed. Then send an additional TaskCancel(force=True) message.
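A rough sketch of what this single-in-flight alternative could look like on the scheduler side. `CancelTracker`, `request_cancel`, `on_ack` and the `send` callback are hypothetical names introduced for illustration; the PR itself does not implement this.

```python
class CancelTracker:
    """Allows at most one in-flight TaskCancel per task (hypothetical helper)."""

    def __init__(self, send):
        self._send = send      # callable(task_id, force) that delivers a TaskCancel
        self._in_flight = {}   # task_id -> True if a forced cancel must follow the ack

    def request_cancel(self, task_id, force):
        if task_id in self._in_flight:
            # A cancel is already pending: only remember whether a forced one was asked for.
            self._in_flight[task_id] = self._in_flight[task_id] or force
            return
        self._in_flight[task_id] = False
        self._send(task_id, force)

    def on_ack(self, task_id, status):
        escalate = self._in_flight.pop(task_id, False)
        if status == "CancelFailed" and escalate:
            # The non-forced cancel failed but a forced cancel was requested meanwhile.
            self._in_flight[task_id] = False
            self._send(task_id, True)


sent = []
tracker = CancelTracker(lambda task_id, force: sent.append((task_id, force)))
tracker.request_cancel("task-1", force=False)  # balancing
tracker.request_cancel("task-1", force=True)   # client cancel, held back until the ack
tracker.on_ack("task-1", "CancelFailed")       # now the forced cancel is sent
```

Even in this reduced form the scheduler has to keep per-task pending state and act on acks, which is the extra complexity mentioned above.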

rafa-be requested a review from sharpener6 January 29, 2025 17:48

rafa-be force-pushed the fix_heavy_load_balancing branch 2 times, most recently from 03125a7 to 6b60407 Compare January 29, 2025 17:58

1597463007 reviewed Jan 29, 2025


scaler/protocol/capnp/common.capnp (outdated)

scaler/scheduler/worker_manager.py (outdated)

Fixes balancing issues under heavy load.

e988520

Signed-off-by: rafa-be <[email protected]>

rafa-be force-pushed the fix_heavy_load_balancing branch from 6b60407 to e988520 Compare January 30, 2025 16:34

sharpener6 reviewed Jan 30, 2025


gxuu mentioned this pull request Jan 31, 2025

Basic Implementation Making Queue Size Configurable by Workers #53

Open
