Migrate ensure_computing transitions to new WorkerState event mechanism - part 1 #6003
Conversation
distributed/worker.py
Outdated
    self._async_instruction_callback,
    self.execute(inst.key, stimulus_id=inst.stimulus_id),
    stimulus_id=inst.stimulus_id,
)
This differs from the design document, which instead was creating a fire-and-forget asyncio.Task. As far as I understand, the difference between the two is purely cosmetic.
Added value would be given from spawning a task, tracking it e.g. in a set Worker.running_asyncio_tasks, and then cancelling it in Worker.close(). Even if desirable, however, I think this is best left to a future PR.
Added value would be given from spawning a task, tracking it e.g. in a set Worker.running_asyncio_tasks, and then cancelling it in Worker.close(). Even if desirable, however, I think this is best left to a future PR.
That was my intention, I just didn't specify everything in my pseudo code. I was actually hoping we'd implement this already as part of #5922. I'm OK with postponing this to a follow-up but I would like to get this done sooner rather than later.
I would like to avoid using add_callback if at all possible, since tracking the tasks would actually allow us to, e.g., deal with the exception.
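For illustration, a minimal sketch of the task-tracking alternative discussed above; the attribute name Worker.running_asyncio_tasks comes from the comment, while the method names and the shape of the inst object are assumptions:

# Illustrative sketch only; method names and the `inst` object are assumptions.
import asyncio


class Worker:
    def __init__(self):
        # Hypothetical set of in-flight instruction tasks (see comment above).
        self.running_asyncio_tasks = set()

    def _start_async_instruction(self, inst) -> None:
        # Spawn a fire-and-forget task instead of IOLoop.add_callback,
        # but keep a reference so it can later be cancelled.
        task = asyncio.create_task(
            self.execute(inst.key, stimulus_id=inst.stimulus_id)
        )
        self.running_asyncio_tasks.add(task)
        # Drop the reference once the task finishes.
        task.add_done_callback(self.running_asyncio_tasks.discard)

    async def close(self) -> None:
        # Cancel and await any still-running instruction tasks on shutdown,
        # so their exceptions are not silently dropped.
        for task in self.running_asyncio_tasks:
            task.cancel()
        await asyncio.gather(*self.running_asyncio_tasks, return_exceptions=True)

    async def execute(self, key, *, stimulus_id):
        ...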
distributed/worker.py
Outdated
msg = error_message(exc)
recommendations = {ts: tuple(msg.values())}

return recommendations, []
This differs from #5895:
- The Worker.execute method is modified such that it no longer performs any transition but instead returns appropriate StateMachineEvents that trigger the necessary handlers. For instance:
  - TaskFinished
  - Rescheduled
  - TaskErred

I could not find any benefit in implementing those events vs. just returning recommendations?
The benefit is that we want to log the events and keep the recommendations as an internal detail of the to-be-defined WorkerState class. Not using recommendations here is one of the more important points of the design proposals.
Please have a look now
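To make the trade-off above concrete, here is a rough, non-authoritative sketch of what an event-returning execute() could look like; the event names come from #5895, while the base class, the fields, and the _run_task helper are illustrative assumptions:

from dataclasses import dataclass


@dataclass
class StateMachineEvent:
    stimulus_id: str


@dataclass
class TaskFinished(StateMachineEvent):
    key: str
    nbytes: int


@dataclass
class TaskErred(StateMachineEvent):
    key: str
    exception: BaseException


@dataclass
class Rescheduled(StateMachineEvent):
    key: str


async def execute(self, key, *, stimulus_id):
    # Method-body sketch (would live on Worker): perform no transitions here,
    # only report what happened as an event. handle_stimulus() then logs the
    # event and lets the registered handler translate it into recommendations,
    # which stay an internal detail of the WorkerState class.
    try:
        _, nbytes = await self._run_task(key)  # hypothetical helper
    except Exception as exc:
        return TaskErred(key=key, exception=exc, stimulus_id=stimulus_id)
    return TaskFinished(key=key, nbytes=nbytes, stimulus_id=stimulus_id)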
distributed/worker.py
Outdated
self._async_instruction_callback, | ||
self.execute(inst.key, stimulus_id=inst.stimulus_id), | ||
stimulus_id=inst.stimulus_id, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added value would be given from spawning a task, track it e.g. in a set Worker.running_asyncio_tasks, and then cancel it in Worker.close(). Even if desirable, however, I think this is best left to a future PR.
That was my intention, I just didn't specify everything in my pseudo code. I was actually hoping we'd implement this already as part of #5922. I'm OK with postponing this to a follow up but I would like to get this done rather sooner than later.
I would like to avoid using add_callback
if at all possible since tracking the tasks would actually allow us to, e.g. deal with the exception
distributed/worker.py
Outdated
msg = error_message(exc) | ||
recommendations = {ts: tuple(msg.values())} | ||
|
||
return recommendations, [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit is that we want to log the events and keep the recommendations as an internal detail of the to-be-defined WorkerState
class. Not using recommendations here is one of the more important points of the design proposals
Force-pushed from b96cb5a to 70a7ba6
Unit Test Results: 17 files (-1), 17 suites (-1), 8h 42m 40s ⏱️ (-28m 20s). For more details on these failures, see this check. Results for commit f83df7a. ± Comparison against base commit ccb0362.
Force-pushed from 5dcee7e to fdb59c5
# yet.
assert not ts.dependents
self.transition(ts, "released", stimulus_id=stimulus_id)
self.handle_stimulus(CancelComputeEvent(key=key, stimulus_id=stimulus_id))
In the future we should consider sending event objects directly from the scheduler
Yes, I could see this very nicely being integrated in our RPC framework 👍
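Purely as an illustration of what sending events from the scheduler could look like (none of these helpers exist; names and message shapes are assumptions): the event travels as a plain dict over the existing comms and the worker reconstructs the dataclass before feeding it to handle_stimulus().

from dataclasses import asdict, dataclass, fields


@dataclass
class CancelComputeEvent:
    key: str
    stimulus_id: str


# Hypothetical registry of event classes that may travel over the wire.
EVENT_TYPES = {cls.__name__: cls for cls in (CancelComputeEvent,)}


def event_to_msg(ev) -> dict:
    # Scheduler side: serialize the event into a comm-friendly dict.
    return {"cls": type(ev).__name__, **asdict(ev)}


def msg_to_event(msg: dict):
    # Worker side: rebuild the event, then call worker.handle_stimulus(event).
    cls = EVENT_TYPES[msg["cls"]]
    return cls(**{f.name: msg[f.name] for f in fields(cls)})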
I think there is some confusion about the CancelComputeEvent, but otherwise this already looks good.
distributed/worker.py
Outdated
if not ts:
    return None
if ts.state == "cancelled":
    return CancelComputeEvent(key=ts.key, stimulus_id=stimulus_id)
I would suggest a name like AlreadyCancelledEvent to distinguish this from Client.cancel events passed through the scheduler. We don't have this yet, but there is some potential for confusion.
I split them now
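For context, a minimal sketch of what the split could look like, assuming dataclass-based events; the AlreadyCancelledEvent name comes from the suggestion above and the fields are illustrative:

from dataclasses import dataclass


@dataclass
class CancelComputeEvent:
    # The scheduler explicitly asked this worker to cancel the compute.
    key: str
    stimulus_id: str


@dataclass
class AlreadyCancelledEvent:
    # execute() started and found the task already in state "cancelled".
    key: str
    stimulus_id: str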
recommendations: Recs = {}
self.active_keys.discard(key)

self.threads[key] = result["thread"]
Do we want to move this to the result handlers as well? This will be a duplicated line, but we'd cleanly separate result handling from the execution and would only mutate state in a single place.
I understand that neither the Worker ABC nor the WorkerStateMachine will ever have a concept of threads. That's why I left it here.
if result["op"] == "task-finished": | ||
ts.nbytes = result["nbytes"] | ||
ts.type = result["type"] | ||
recommendations[ts] = ("memory", value) | ||
if self.digests is not None: | ||
self.digests["task-duration"].add(result["stop"] - result["start"]) |
Same here. Do we want to move this to the result/event handler?
self.digests is populated with a wealth of information that should remain alien to the state machine:
- latency
- transfer-bandwidth
- get-data-send-duration
- disk-load-duration
- profile-duration
so I think it should remain in Worker?
We can also nuke the digests if they're slowing us down. They're not commonly used today.
distributed/worker.py
Outdated
raise TypeError(ev)  # pragma: nocover

# TODO Set return type annotation of all handle_event implementations
# to tuple[Recs, Instructions] (requires Python >=3.9)
Is this a functools limitation or why do we need py3.9? Do we even need individual annotations if they are all the same?
Is this a functools limitation or why do we need py3.9?

@functools.singledispatchmethod exec()'s the delayed annotation, which however requires Python 3.9+. I worked around it better now.

Do we even need individual annotations if they are all the same?

mypy is not smart enough to reserve special treatment for singledispatch functions, so yes.
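To illustrate the workaround (a sketch, not the exact code in this PR): with postponed annotation evaluation, register() presumably resolves the registered method's annotations via typing.get_type_hints(), which evaluates all of them, so keeping the return type in a comment avoids evaluating tuple[Recs, Instructions] on Python 3.8, where builtin generics cannot be subscripted.

from __future__ import annotations

import functools
from dataclasses import dataclass


@dataclass
class CancelComputeEvent:
    key: str
    stimulus_id: str


class WorkerState:
    @functools.singledispatchmethod
    def handle_event(self, ev):
        raise TypeError(ev)  # pragma: nocover

    # register() inspects the annotation of `ev`; with postponed evaluation
    # this resolves every annotation on the function, so the return type is
    # kept in a comment rather than written as tuple[Recs, Instructions].
    @handle_event.register
    def _(self, ev: CancelComputeEvent):  # -> tuple[Recs, Instructions]
        return {}, []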
# to tuple[Recs, Instructions] (requires Python >=3.9)
@handle_event.register
def _(self, ev: CancelComputeEvent):  # -> tuple[Recs, Instructions]:
    ts = self.tasks.get(ev.key)
The task must always exist. There is no way it could've been dropped earlier. If it has been, that would be a severe bug.
I had conflated the implementation of handle_cancel_compute coming from the scheduler and the code that dealt with a task being already cancelled by the time execute starts. They are now separate.
self.log.append((ev.key, "cancel-compute", ev.stimulus_id, time()))
# All possible dependents of ts should not be in state Processing on
# scheduler side and therefore should not be assigned to a worker, yet.
assert not ts.dependents
I suggest not introducing any new assertions. For this specific handler there is no need to assert this condition, is there?
I'm not even sure this assert is correct; I've seen such race conditions happening. For instance, there is a network partition such that this worker briefly disconnects. Another worker gets this task assigned and finishes earlier. In the meantime, this worker reconnects; the task is still executing (or not yet executing) but has had a dependent assigned.
See also the warning on the scheduler side: Unexpected worker completed task [...]
I had conflated the implementation of handle_cancel_compute coming from the scheduler and the code that dealt with a task being already cancelled by the time execute starts. They are now separate.
It should now be functionally identical to main
distributed/worker.py
Outdated
# TODO Set return type annotation of all handle_event implementations
# to tuple[Recs, Instructions] (requires Python >=3.9)
@handle_event.register
def _(self, ev: CancelComputeEvent):  # -> tuple[Recs, Instructions]:
I'm a bit confused about the body of this handler. I don't think we should mix up the "CancelTask" event with an execute response that says "Didn't Execute. Task already cancelled".
I split them now
ts.startstops.append({"action": "compute", "start": ev.start, "stop": ev.stop})
ts.nbytes = ev.nbytes
ts.type = ev.type
return {ts: ("memory", ev.value)}, []
We are losing stimulus_ids here, aren't we?
We aren't; see handle_stimulus:
recs, instructions = self.handle_event(stim)
self.transitions(recs, stimulus_id=stim.stimulus_id)
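For completeness, a condensed sketch of that flow; the surrounding method and attribute names (stimulus_log, _handle_instructions) are illustrative:

def handle_stimulus(self, stim) -> None:
    # Method sketch: the event itself carries the stimulus_id, so handlers
    # don't need to return it. Every recommendation produced by the event
    # is applied under that same stimulus_id.
    self.stimulus_log.append(stim)  # hypothetical event log
    recs, instructions = self.handle_event(stim)
    self.transitions(recs, stimulus_id=stim.stimulus_id)
    self._handle_instructions(instructions)  # hypothetical helper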
Force-pushed from 9f18059 to 0b25660
Force-pushed from d056752 to 9d28bde
This reverts commit 9d28bde.
@fjetter I did the work on _ensure_computing (see 9d28bde) but there are still some pretty conceptual failures caused by resource counting, so I'm moderately inclined to merge this PR as it is. I reverted it for now.

In scope for this PR
Out of scope for the PR but in scope for the issue (to be implemented immediately after this PR)
Out of scope for the issue
LGTM
Partially closes #5895
See comment below for scope of this PR