Worker State Machine refactor: redesign TaskState and scheduler messages #5922

crusaderky · 2022-03-10T00:24:02Z

Partially addresses StateMachine event dispatch mechanism #5894

In scope

Moved state-related classes out of worker.py to a separate module
Replaced Smsgs dicts with a rigorous data model
TaskState is now a lot more compact
TaskState._to_dict produces a lot terser output
TaskState now uses __slots__ in Python >=3.10. This should lead to substantial savings in unmanaged memory.
TaskState.state is now a Literal. I would much rather have Enum for worker TaskState names #5444 but this was very cheap and non-intrusive.

Out of scope

instruction_id, as described in (Worker) State Machine determinism and replayability #5736 - left to later discussion
reimplement calls to compute() and gather_deps() as Instructions, as designed in StateMachine event dispatch mechanism #5894
yank transition logic out of Worker and into the new module, as designed in (Worker) State Machine determinism and replayability #5736

github-actions · 2022-03-10T03:21:03Z

Unit Test Results

    12 files  ± 0       12 suites ±0    5h 38m 49s ⏱️ +3m 56s
 2 647 tests  + 8    2 564 ✔️ +10     80 💤 -1    2 ❌ -2    1 🔥 +1
13 005 runs   +28   12 364 ✔️ +31    636 💤 -3    4 ❌ -1    1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 297bed6. ± Comparison against base commit 85bf1be.

♻️ This comment has been updated with latest results.

fjetter · 2022-03-10T10:18:23Z

CI still seems to be pretty upset, but otherwise the changes look good so far.

crusaderky · 2022-03-10T14:14:55Z

All test failures are unrelated. This is ready for review and merge.

fjetter · 2022-03-10T14:52:29Z

docs/source/worker.rst

+.. autoclass:: distributed.worker_state_machine.UniqueTaskHeap
+   :members:
+


I'm not sure if this should be publicly documented

makes sense - removing it

fjetter · 2022-03-10T14:53:50Z

distributed/scheduler.py

@@ -5516,7 +5520,7 @@ def handle_task_erred(self, key=None, **msg):
        recommendations: dict
        client_msgs: dict
        worker_msgs: dict
-        r: tuple = self.stimulus_task_erred(key=key, **msg)
+        r: tuple = self.stimulus_task_erred(key=key, status="error", **msg)


what is the status kwarg for?

Cleaned up and broken out to #5926, which this PR incorporates.

crusaderky · 2022-03-10T00:32:36Z

distributed/worker.py

+    TaskFinishedMsg,
+    TaskState,
+    UniqueTaskHeap,
+)


Quite ugly but transitory. I expect that all *Msg classes and the state sets won't need to be imported after we move the state machine to the other module.

crusaderky · 2022-03-10T00:38:09Z

distributed/worker_state_machine.py

+        "rescheduled",
+        "resumed",
+        "waiting",
+    ]


I'd rather have #5444 but this is the next best thing
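As a rough illustration of what a Literal-typed state buys, here is a minimal sketch. The alias name TaskStateState comes from the diff in this PR, but the list of states below is a shortened, hypothetical subset, and check_state is an illustrative helper that does not exist in distributed.

```python
from typing import Literal, get_args

# Hypothetical, shortened subset of state names; the full list lives in
# distributed/worker_state_machine.py and is not reproduced here.
TaskStateState = Literal["executing", "rescheduled", "resumed", "waiting"]

# get_args() recovers the literal values at runtime, so the same
# declaration can back both static type checking and a runtime check.
VALID_STATES = frozenset(get_args(TaskStateState))


def check_state(state: str) -> str:
    """Illustrative helper: reject names that are not declared states."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown task state: {state!r}")
    return state
```

Unlike an Enum, the values stay plain strings, which is why the change can be non-intrusive: existing comparisons such as `ts.state == "waiting"` keep working unchanged.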

crusaderky · 2022-03-10T00:40:14Z

distributed/worker_state_machine.py

+    Not to be confused with :class:`distributed.scheduler.TaskState`, which holds
+    similar information on the scheduler side.
+    """
+


Reduced the size of the attributes declaration by a factor of 3 (docstring + class annotations + init method -> just the class annotations)

+1 for the dataclass

Re doc string I'm fine with this but we should be aware that this obviously removes any sphinx rendering. I know scheduler state tasks are rendered atm

crusaderky · 2022-03-10T00:41:05Z

distributed/worker_state_machine.py

+    #: The previous state of the task. This is a state machine implementation detail.
+    _previous: TaskStateState | None = None
+    #: The next state of the task. This is a state machine implementation detail.
+    _next: TaskStateState | None = None


@fjetter wanna chip in on these two?

There is some documentation about this here

distributed/distributed/worker.py
Lines 2415 to 2436 in 925c610

    def _transition_from_resumed(
        self, ts: TaskState, finish: str, *, stimulus_id: str
    ) -> tuple[Recs, Smsgs]:
        """`resumed` is an intermediate degenerate state which splits further up
        into two states depending on what the last signal / next state is
        intended to be. There are only two viable choices depending on whether
        the task is required to be fetched from another worker `resumed(fetch)`
        or the task shall be computed on this worker `resumed(waiting)`.

        The only viable state transitions ending up here are

        flight -> cancelled -> resumed(waiting)

        or

        executing -> cancelled -> resumed(fetch)

        depending on the origin. Equally, only `fetch`, `waiting` or `released`
        are allowed output states.

        See also `transition_resumed_waiting`
        """

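The role of _previous and _next in the two paths described by that docstring can be sketched roughly as follows. This is a simplified model under stated assumptions, not the worker's transition code: the Task class, the RESUME_TARGET table, and the cancel and resume helpers are all hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical, minimal stand-in for the worker-side TaskState."""

    key: str
    state: str
    _previous: str | None = None
    _next: str | None = None


# Origin state -> state a resumed task is heading towards, following the
# two paths listed in the docstring quoted above.
RESUME_TARGET = {"flight": "waiting", "executing": "fetch"}


def cancel(ts: Task) -> None:
    # Remember where the task came from before marking it cancelled.
    ts._previous, ts.state = ts.state, "cancelled"


def resume(ts: Task) -> None:
    # The intended next state depends only on the remembered origin.
    ts._next = RESUME_TARGET[ts._previous]
    ts.state = "resumed"
```

In this sketch a task cancelled while in flight ends up as resumed with _next set to "waiting", and one cancelled while executing ends up as resumed with _next set to "fetch".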
crusaderky · 2022-03-10T14:19:36Z

distributed/worker_state_machine.py

+dc_slots = {"slots": True} if sys.version_info >= (3, 10) else {}
+
+
+@dataclass(repr=False, eq=False, **dc_slots)


The only way I could find to get __slots__ in Python 3.8/3.9 was to not use @dataclass at all. In my opinion @dataclass offers much bigger rewards in terms of readability, so slots are only enabled on Python >=3.10.

slots are a nice performance boost for attribute access but I don't think it's required on the worker side. I'm fine with this, shouldn't cause any problems.
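The effect of the conditional slots argument can be checked with a small sketch. The dc_slots line mirrors the diff above; the Task class is a hypothetical stand-in for TaskState with made-up fields.

```python
import sys
from dataclasses import dataclass

# Pass slots=True only on interpreters whose @dataclass accepts it (3.10+).
dc_slots = {"slots": True} if sys.version_info >= (3, 10) else {}


@dataclass(repr=False, eq=False, **dc_slots)
class Task:
    """Hypothetical stand-in for TaskState; the fields are illustrative."""

    key: str
    priority: int = 0


t = Task("x")

# With slots enabled the instance carries no per-object __dict__, which is
# where the memory saving comes from. On 3.8/3.9 the __dict__ is still there.
has_dict = hasattr(t, "__dict__")
```

Because eq=False is passed, instances keep identity-based hashing and can still be stored in sets and used as dict keys on every supported Python version.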

crusaderky · 2022-03-10T18:18:17Z

Blocked by #5926

crusaderky · 2022-03-11T11:13:29Z

#5926 no longer blocks this issue

crusaderky · 2022-03-11T11:14:42Z

distributed/worker.py

+            traceback=ts.traceback,
+            exception_text=ts.exception_text,
+            traceback_text=ts.traceback_text,
+        )


Keys "status", "thread", and "startstops" were ignored by the scheduler

They are not. They are subtly picked up by extensions such as TaskStream and EventStream. -__-

crusaderky · 2022-03-11T15:01:51Z

All test failures are unrelated. Ready for final review and merge.

fjetter · 2022-03-14T11:09:06Z

distributed/worker_state_machine.py

+@lru_cache
+def _default_data_size() -> int:
+    return parse_bytes(dask.config.get("distributed.scheduler.default-data-size"))


What's the reason for this?

To read the config on first use instead of when loading the module like it was before, thus avoiding headaches related to module load order.

fjetter · 2022-03-14T11:19:46Z

distributed/worker_state_machine.py

+# Note: as of Python 3.10.2, @dataclass(slots=True) doesn't work with __init__subclass__
+# https://bugs.python.org/issue46970
+@dataclass
+class TaskFinishedMsg(SendMessageToScheduler, op="task-finished"):


Is there a functional difference from the case where we simply define op in our subclasses, e.g.

    class TaskFinishedMsg(SendMessageToScheduler):
        op = "task-finished"

The usage of metaclasses feels a bit complex. Is there anything else going on that I'm not aware of, or is this a style question?

No, it's just style. Happy to remove it.
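To make the comparison concrete, here is a minimal sketch of both spellings. All class names and the to_dict helper are hypothetical and are not the classes in this PR. The first variant uses the __init_subclass__ hook to accept op as a class keyword, the second sets it as a plain class attribute.

```python
from dataclasses import asdict, dataclass
from typing import ClassVar


class MsgViaHook:
    """Variant 1: op is a class keyword, stored by __init_subclass__."""

    op: ClassVar[str]

    def __init_subclass__(cls, op: str, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.op = op


@dataclass
class FinishedViaHook(MsgViaHook, op="task-finished"):
    key: str


class MsgViaAttr:
    """Variant 2: each subclass sets op as a plain class attribute."""

    op: ClassVar[str] = ""


@dataclass
class FinishedViaAttr(MsgViaAttr):
    op = "task-finished"  # unannotated, so @dataclass does not make it a field
    key: str


def to_dict(msg) -> dict:
    """Illustrative serializer: the class-level op plus the instance fields."""
    return {"op": msg.op, **asdict(msg)}
```

Both variants serialize to the same dict, which matches the answer above that the difference is stylistic. The second also avoids the hook entirely, which is relevant given the slots=True limitation noted in the diff comment.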

crusaderky · 2022-03-14T13:24:13Z

Re doc string I'm fine with this but we should be aware that this obviously removes any sphinx rendering. I know scheduler state tasks are rendered atm

It renders fine

fjetter · 2022-03-14T13:52:19Z

Re doc string I'm fine with this but we should be aware that this obviously removes any sphinx rendering. I know scheduler state tasks are rendered atm

It renders fine

Interesting. Thanks for verifying. This is indeed much better and compact

crusaderky · 2022-03-14T17:03:36Z

@fjetter are there any outstanding points?

Refactor worker scheduler messages and TaskState

ff4987c

crusaderky force-pushed the workerstate branch from d69bac7 to ff4987c Compare March 10, 2022 00:43

crusaderky added 2 commits March 10, 2022 12:03

fix slots

3ae3686

move TaskState in imports

3c8abb8

crusaderky marked this pull request as ready for review March 10, 2022 14:14

crusaderky self-assigned this Mar 10, 2022

fjetter reviewed Mar 10, 2022

crusaderky added 3 commits March 10, 2022 16:02

fix sphinx docs

3f15676

Merge branch 'main' into workerstate

771f8f5

Merge branch 'main' into workerstate

6a88d1e

crusaderky commented Mar 10, 2022

crusaderky added 2 commits March 10, 2022 16:40

Don't expose UniqueTaskHeap as public API

2f75034

Don't swallow kwargs

5af52de

crusaderky mentioned this pull request Mar 10, 2022

Scheduler transitions should not swallow kwargs #5926

Closed

crusaderky marked this pull request as draft March 10, 2022 18:45

revert revert revert

349dd15

crusaderky force-pushed the workerstate branch from 10a52ab to 349dd15 Compare March 11, 2022 10:48

crusaderky commented Mar 11, 2022

crusaderky added 2 commits March 11, 2022 11:57

fix regressions

6b4f310

Merge branch 'main' into workerstate

64d8262

crusaderky marked this pull request as ready for review March 11, 2022 15:01

crusaderky added a commit to crusaderky/distributed that referenced this pull request Mar 11, 2022

Redesign TaskState and scheduler messages (dask#5922)

183959c

crusaderky mentioned this pull request Mar 11, 2022

Prevent data duplication on unspill #5936

Merged

fjetter reviewed Mar 14, 2022

crusaderky added 2 commits March 14, 2022 13:13

Merge branch 'rel_imports' into workerstate

a787758

Merge branch 'main' into workerstate

94acdc5

crusaderky added a commit to crusaderky/distributed that referenced this pull request Mar 14, 2022

Redesign TaskState and scheduler messages (dask#5922)

2c42823

Remove metaclass magic

297bed6

crusaderky added a commit to crusaderky/distributed that referenced this pull request Mar 14, 2022

Redesign TaskState and scheduler messages (dask#5922)

cc0bdfc

fjetter approved these changes Mar 14, 2022

fjetter merged commit 2fffe74 into dask:main Mar 14, 2022

crusaderky deleted the workerstate branch March 17, 2022 17:30

fjetter mentioned this pull request Mar 28, 2022

Migrate ensure_computing transitions to new WorkerState event mechanism - part 1 #6003

Merged
