Fixes balancing issues under heavy load. #51
Not related to the PR, but I'm a bit concerned about the scheduler's lack of visibility into the tasks that workers are processing: if some workers can run tasks concurrently, the scheduler will inadvertently steal or withhold tasks from certain workers.
This might require each worker to report the number of tasks it can process concurrently.
That's totally right. However, we decided to implement the sub-task system without any visibility from the scheduler, to keep the scheduler as simple as possible. That does create some inefficiencies, but they seem reasonable from what I have experienced.
Signed-off-by: rafa-be <[email protected]>
@sharpener6 I like the idea, and thought it would not be that hard to implement: just keep a "cancelling" set in the scheduler. But there is actually a tricky case in which we might want to send two
In the PR, depending on whether the task is processing, the scheduler will either receive from the worker If we do not allow multiple
f"might due to worker get disconnected or canceled"
)
return

if worker != assigned_worker:
So there are the following situations:
- force=True: TaskResultStatus will always be Canceled; in this case, reroute the task
The problem is the case below, which is more complicated, but let me put it here:
- force=False
  - if the task is in the queue, TaskResultStatus is Canceled: reroute the task
  - if the task is processing, TaskResultStatus is CancelFailed: in this case, the CancelFailed task should not be rerouted, and after the task finishes, another TaskResult will arrive, which can be Success or Failed
So in general, in which case will this condition happen?
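For reference, the cases listed above could be sketched roughly as follows. This is only a minimal illustration, not the PR's code: the enum mirrors the status names used in this thread, and `cancel_outcome` and `should_reroute` are hypothetical helper names.

```python
from enum import Enum, auto


class TaskResultStatus(Enum):
    # Hypothetical mirror of the statuses named in this thread.
    Success = auto()
    Failed = auto()
    Canceled = auto()
    CancelFailed = auto()


def cancel_outcome(force: bool, processing: bool) -> TaskResultStatus:
    """Status a worker would report for a TaskCancel, per the cases above."""
    if force or not processing:
        # force=True always cancels; a queued task can always be cancelled.
        return TaskResultStatus.Canceled
    # force=False on a processing task: the cancel is refused.
    return TaskResultStatus.CancelFailed


def should_reroute(status: TaskResultStatus) -> bool:
    """Only a successfully cancelled task may be rerouted to another worker."""
    # After CancelFailed, the task keeps running and a final
    # Success/Failed result is expected later.
    return status is TaskResultStatus.Canceled
```

With these assumptions, only the `force=False` and processing combination leaves the task where it is.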
> * force=False
>   * if in the queue, TaskResultStatus is Canceled, reroute the task
>   * if processing, TaskResultStatus is CancelFailed, `in this case, CancelFailed task should not be reroute`, and after task finished, will be another TaskResult coming, can be Success and Failed
This.
The scheduler wants to cancel the task for balancing (with force=False), and then immediately afterwards receives a TaskCancel(force=True) from the client.
If we decide not to send multiple TaskCancel messages, the scheduler will wait for either:
- Canceled. Then ignore the message: the task is canceled as requested by the client;
- CancelFailed. Then send an additional TaskCancel(force=True) message.
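A rough sketch of that decision, assuming the scheduler remembers that a client cancel arrived while the balancing cancel was in flight. The `client_cancel_pending` flag and the `follow_up_message` function are hypothetical names, and the returned string only stands in for the real message object.

```python
from enum import Enum, auto
from typing import Optional


class TaskResultStatus(Enum):
    # Hypothetical mirror of the two statuses discussed above.
    Canceled = auto()
    CancelFailed = auto()


def follow_up_message(
    status: TaskResultStatus, client_cancel_pending: bool
) -> Optional[str]:
    """Return the extra message to send to the worker, or None if nothing is needed."""
    if not client_cancel_pending:
        # Plain balancing reply: no client request to honour.
        return None
    if status is TaskResultStatus.CancelFailed:
        # The task is still processing: escalate to a forced cancel.
        return "TaskCancel(force=True)"
    # Canceled: the task is already cancelled, as the client requested.
    return None
```

The point of the sketch is that at most one additional `TaskCancel` is ever sent per task, and only after the worker has answered the first one.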
I noticed 2 issues with our current balancing system under heavy load:
1) Forced task cancellation during balancing
The current balancing message is TaskCancel(force=True). This forces tasks to be cancelled even when they are processing. This creates issues if the task being cancelled has started sub-tasks: an explosive creation of tasks can occur, as sub-tasks will be created multiple times on re-routed workers.
The current PR changes the message to TaskCancel(force=False). If the task is running (or suspended because of a higher-priority task), the worker will answer with the new TaskStatus.CancelFailed status.
2) Workers can have multiple balancing messages queued
Under heavy load, the worker can accumulate multiple TaskCancel messages from the scheduler's balancing routine.
If the task is cancelable (i.e. not running), the worker will answer the first TaskCancel message with TaskStatus.Cancelled and will then return TaskStatus.NotFound for the subsequent ones. This TaskStatus.NotFound status would then be mishandled by the scheduler, which will either corrupt the scheduler's internal state or propagate the TaskStatus.NotFound status to the client.
The current PR makes the handling of task results more resilient by ignoring messages originating from workers that do not currently handle the task (according to the scheduler's state).
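The resilience described in the last paragraph could look roughly like the sketch below. It is a simplified illustration rather than the PR's implementation: the `Scheduler` class, its `assigned` mapping, the `to_reroute` and `to_client` lists, and `on_task_result` are all hypothetical names.

```python
from enum import Enum, auto
from typing import Dict, List, Tuple


class TaskStatus(Enum):
    # Hypothetical mirror of the statuses named in the description.
    Success = auto()
    Cancelled = auto()
    CancelFailed = auto()
    NotFound = auto()


class Scheduler:
    def __init__(self) -> None:
        self.assigned: Dict[str, str] = {}  # task_id -> worker_id
        self.to_reroute: List[str] = []  # tasks freed by a balancing cancel
        self.to_client: List[Tuple[str, TaskStatus]] = []  # forwarded results

    def on_task_result(self, worker: str, task_id: str, status: TaskStatus) -> bool:
        """Handle a worker's result; return False if the message was ignored."""
        if self.assigned.get(task_id) != worker:
            # The worker does not hold this task according to our state,
            # e.g. a NotFound answer to a duplicate TaskCancel: drop it.
            return False
        if status is TaskStatus.CancelFailed:
            # Task is running: keep the assignment and wait for the final result.
            return True
        del self.assigned[task_id]
        if status is TaskStatus.Cancelled:
            self.to_reroute.append(task_id)
        else:
            self.to_client.append((task_id, status))
        return True
```

For example, if a worker answers two queued `TaskCancel` messages with `Cancelled` and then `NotFound`, the first reply releases the task for rerouting and the second is dropped, because the worker no longer owns the task in the scheduler's state.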