Rework some tests related to gather_dep #6472

crusaderky · 2022-05-30T14:30:47Z

Cosmetic tweaks to a few tests.

This has been salvaged from #6462 following controversy.

crusaderky · 2022-05-30T14:32:41Z

distributed/diagnostics/tests/test_eventstream.py

    for name, color in zip(lists["name"], lists["color"]):
-        if name == "transfer":
-            assert color == "red"
+        assert (name == "transfer-sum") == (color == "red")


Tightened tested conditions
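
As a minimal sketch of why the new condition is tighter (the sample lists below are made up for illustration and are not taken from the test): the old `if` form only checks that transfers are red, while the equality form also rejects red on any other entry.

```python
def old_check(names, colors):
    """Original condition: every entry named "transfer" must be red."""
    return all(
        color == "red"
        for name, color in zip(names, colors)
        if name == "transfer"
    )


def new_check(names, colors):
    """Tightened condition: an entry is red if and only if it is "transfer-sum"."""
    return all(
        (name == "transfer-sum") == (color == "red")
        for name, color in zip(names, colors)
    )


# Hypothetical sample data: a compute task wrongly coloured red.
names = ["transfer-sum", "sum"]
colors = ["red", "red"]

old_result = old_check(names, colors)  # passes vacuously: nothing is named "transfer"
new_result = new_check(names, colors)  # fails: "sum" must not be red
```

The old check passes on this data because no entry matches `"transfer"` exactly, so the assertion inside the `if` never runs.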

distributed/tests/test_scheduler.py

crusaderky · 2022-05-30T14:35:30Z

distributed/tests/test_worker.py

+        assert ev[0] == "request-dep"
+        assert len(ev[2]) == 5
+    for ev in story[20:]:
+        assert ev[0] == "receive-dep"


Tighten tested condition + clarifications

crusaderky · 2022-05-30T14:36:23Z

distributed/tests/test_worker.py

-@gen_cluster(client=True, Worker=Nanny)
-async def test_acquire_replicas_already_in_flight(c, s, *nannies):
+@gen_cluster(client=True, nthreads=[("", 1)])
+async def test_acquire_replicas_already_in_flight(c, s, a):


Test is now deterministic and much faster

crusaderky · 2022-05-30T14:37:59Z

distributed/tests/test_worker.py

+@gen_cluster(client=True, nthreads=[("", 1)])
+async def test_gather_dep_cancelled_rescheduled(c, s, a):
+    """A task transitions flight->cancelled->fetch->flight, all while gather_dep is
+    waiting for the data of the initial flight.


This test can now be drastically simplified after #6371.
Remove assumption that both dependencies of a task will be gathered in a single fetch (this will no longer be the case in #6462).

> This test can now be drastically simplified after #6371.

I don't think we should simplify tests just because we know that internals changed.

> Remove assumption that both dependencies of a task will be gathered in a single fetch

Where do we assume this in the previous test?

> I don't think we should simplify tests just because we know that internals changed.

The previous test was so complicated because it had to have two checkpoints:

- after you enter gather_dep, but before you run the preamble
- while you're in the comms

I think that being able to say "it doesn't really matter in which point of gather_dep you pinch it; nothing fancy will happen before comms" is a very reasonable thing to say?

> Where do we assume this in the previous test?

    fut4 = c.submit(sum, fut1, fut2, workers=[b.address], key="f4")

This sends a compute-task event to b, which in turn sends f1 and f2 into fetch at the same time.
As they're dependencies of the same task, they have the exact same priority.

After #6462, there are two separate fetches. f1 is fetched first 50% of the time, at random. The two keys most likely go through a set at some point during the process, and the iteration order of a set of strings can change every time you restart the Python interpreter.
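
The effect of string hash randomization on set ordering can be sketched as follows. This is not code from distributed: `set_order` and `SNIPPET` are names made up for the illustration, and whether the worker really routes the two keys through a set is, as said above, only likely.

```python
import os
import subprocess
import sys

# Code run in a fresh interpreter: print the iteration order of a two-key set.
SNIPPET = 'print(list({"f1", "f2"}))'


def set_order(seed: int) -> str:
    """Start a new interpreter with a fixed string-hash seed and return
    the order in which it iterates over the set of keys."""
    env = {**os.environ, "PYTHONHASHSEED": str(seed)}
    proc = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.strip()


# Each seed stands in for one interpreter restart with a random hash seed.
orders = {seed: set_order(seed) for seed in range(8)}
```

With a fixed seed the order is reproducible; without one, each interpreter start picks a random seed, so a test that depends on which key comes first can pass or fail from run to run.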

> I think that being able to say "it doesn't really matter in which point of gather_dep you pinch it; nothing fancy will happen before comms" is a very reasonable thing to say?

Will this be true forever? How do we ensure that this stays true? This test asks the question
"Is it harmful to cancel a task after a fetch is scheduled (gather-dep) but before we're actually stepping into this coroutine?"
The only way to answer this question with certainty is to have this test, or to ensure there is not a single await before the get_data message is sent, which is hard or impossible.

While this is a seemingly artificial question to ask, this caused a deadlock in the past (#5525), and I don't think the pre-filtering optimization was entirely unreasonable.

Tests are also there to protect us from reintroducing regressions, and this test is written at a high enough abstraction level (we only rely on task state names, e.g. flight and cancelled, and on the existence of the gather_dep/get_data methods) that I am confident it won't bother us a lot. The test is not known to be flaky, it runs very fast (0.1s), and it still works with your proposed change in #6462 (at least for me locally).

I don't see a reason why we should change anything about this test.
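
For readers unfamiliar with the window being discussed, here is a minimal asyncio sketch of "scheduled but not yet stepped". It is not the worker's code: `gather_dep_like` and the `state` dict are hypothetical stand-ins.

```python
import asyncio


async def main():
    state = {"x": "flight"}
    observed = []

    async def gather_dep_like():
        # The body only starts when the event loop first steps the task,
        # not at the moment the task is created.
        observed.append(state["x"])

    task = asyncio.create_task(gather_dep_like())  # scheduled, not yet running
    state["x"] = "cancelled"  # the state changes inside the window
    await task  # only now does the coroutine body run
    return observed


observed = asyncio.run(main())
```

The coroutine sees the state as it is when it is first stepped, so anything it assumed at scheduling time may already be stale.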

crusaderky · 2022-05-30T14:38:09Z

distributed/tests/test_worker.py

+    """A task transitions from flight to cancelled while gather_dep is waiting for the
+    data.
+
+    See also test_gather_dep_cancelled_rescheduled


Remove assumption that both dependencies of a task will be gathered in a single fetch (this will no longer be the case in #6462).

Where do we assume this in the previous test?

same as above

The original test still passes for me w/ your changes in #6462

It does not.
50% of the time you get:

    >       assert b.tasks[fut2.key].state == "flight"
    E       AssertionError: assert 'fetch' == 'flight'

github-actions · 2022-05-30T17:45:08Z

Unit Test Results

    15 files  +12           15 suites  +12     6h 16m 7s ⏱️  +5h 28m 21s
 2 832 tests  +1 636     2 748 ✔️  +1 586      81 💤  +47     2 ❌  +2    1 🔥  +1
20 987 runs   +17 402   20 038 ✔️  +16 555    946 💤  +844    2 ❌  +2    1 🔥  +1

For more details on these failures and errors, see this check.

Results for commit 8d5d50e. ± Comparison against base commit 69b798d.

♻️ This comment has been updated with latest results.

fjetter

I would like us to use stories in tests only if necessary and would always prefer to use something more high level.
Most of these tests were written in a way that did not need stories before, and I do not see added value in having a redundant, low-level assert. I left comments everywhere so we can discuss individually whether each one is necessary.

distributed/tests/test_scheduler.py

fjetter · 2022-06-01T17:05:27Z

distributed/tests/test_worker.py

    """
    a.total_out_connections = 2
    futures = await c.scatter(
        {f"x{i}": i for i in range(100)},
        workers=[w.address for w in workers],
    )
-    assert all(w.data for w in workers)
+    assert all(len(w.data) == 5 for w in workers)


I feel this is too granular for this test. Does this test require the distribution to be homogeneous? I think we should not test a "decide_worker" logic in this test.

How is 5 even a correct number when we're scattering 100 tasks to 21 workers?

> I feel this is too granular for this test.

Ok, relaxing it.

> How is 5 even a correct number when we're scattering 100 tasks to 21 workers?

No, we're scattering 100 tasks to 20 workers.
See the function signature: c, s, a, *workers

Renamed workers to snd_workers for clarification.
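
The arithmetic behind the 20 versus 21 confusion, as a small sketch. `receive_args` is a hypothetical stand-in for the test function, the strings are placeholders for worker objects, and the even split is an assumption about how scatter happens to distribute the keys here.

```python
def receive_args(c, s, a, *snd_workers):
    """Mimic how the test receives the client, the scheduler and the workers."""
    return a, snd_workers


# Placeholder names standing in for the 21 workers started for the test.
all_workers = [f"w{i}" for i in range(21)]
a, snd_workers = receive_args("client", "scheduler", *all_workers)

# 100 keys are scattered to the remaining workers only, not to `a`.
n_keys = 100
keys_per_worker, remainder = divmod(n_keys, len(snd_workers))
```

The first worker is bound to `a`, so only 20 workers receive data, and 100 keys split evenly across them gives 5 each.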

fjetter · 2022-06-01T17:10:13Z

distributed/tests/test_worker.py

+        while a.log[-1][:5] != ("x", "flight", "fetch", "flight", {}):
+            await asyncio.sleep(0.01)


Isn't there an event we can listen to? I consider events much more robust than the transition logs and much easier to read.

I would even prefer a plain sleep in here over listening to the transition log.

Changed to trigger the event directly on the worker so that no wait is involved.
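
The difference between polling a log and waiting on an event can be sketched with plain asyncio. Nothing here is distributed's API: `worker_like`, `in_flight` and `log` are hypothetical names.

```python
import asyncio


async def main():
    in_flight = asyncio.Event()  # hypothetical checkpoint set by the code under test
    log = []

    async def worker_like():
        log.append("fetch")
        await asyncio.sleep(0)  # stand-in for real work
        log.append("flight")
        in_flight.set()  # signal the checkpoint explicitly

    task = asyncio.create_task(worker_like())
    # Wait on the event instead of polling log[-1] in a sleep loop;
    # the timeout turns a missed signal into a clear failure.
    await asyncio.wait_for(in_flight.wait(), timeout=5)
    snapshot = list(log)
    await task
    return snapshot


snapshot = asyncio.run(main())
```

The test resumes as soon as the checkpoint is reached, and it does not depend on the exact shape of a log entry.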

fjetter · 2022-06-01T17:12:44Z

distributed/tests/test_worker.py

+@gen_cluster(client=True, nthreads=[("", 1)])
+async def test_gather_dep_cancelled_rescheduled(c, s, a):
+    """A task transitions flight->cancelled->fetch->flight, all while gather_dep is
+    waiting for the data of the initial flight.


> This test can now be drastically simplified after #6371.

I don't think we should simplify tests just because we know that internals changed.

> Remove assumption that both dependencies of a task will be gathered in a single fetch

Where do we assume this in the previous test?

fjetter · 2022-06-01T17:13:03Z

distributed/tests/test_worker.py

+        assert_story(
+            a.story(fut1.key),
+            [
+                (fut1.key, "fetch", "flight", "flight", {}),
+                (fut1.key, "flight", "released", "cancelled", {}),
+                (fut1.key, "cancelled", "fetch", "flight", {}),
+                (fut1.key, "flight", "memory", "memory", {"f2": "ready"}),
+            ],
+        )
+        # Test that the data transfer only happens once
+        assert_story(
+            a.story("request-dep"),
+            [
+                ("request-dep", b.address, {fut1.key}),
+            ],
+            strict=True,
+        )


I don't think these stories add a lot of value to the test

What would you propose? I think it's important to test that we sent out only one transfer request.

There is outgoing_transfer_log and incoming_transfer_log or even more coarse, there are counters for both.

e.g.

Suggested change

    -        assert_story(
    -            a.story(fut1.key),
    -            [
    -                (fut1.key, "fetch", "flight", "flight", {}),
    -                (fut1.key, "flight", "released", "cancelled", {}),
    -                (fut1.key, "cancelled", "fetch", "flight", {}),
    -                (fut1.key, "flight", "memory", "memory", {"f2": "ready"}),
    -            ],
    -        )
    -        # Test that the data transfer only happens once
    -        assert_story(
    -            a.story("request-dep"),
    -            [
    -                ("request-dep", b.address, {fut1.key}),
    -            ],
    -            strict=True,
    -        )
    +        assert a.incoming_count == 1
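
To make the trade-off concrete, here is a toy comparison of a coarse counter check and a strict event-list check. Both helpers are hypothetical and are not distributed's `assert_story` or `incoming_count`; the address and key are placeholders.

```python
def count_requests(story):
    """Coarse check: how many transfer requests appear in the story."""
    return sum(1 for ev in story if ev[0] == "request-dep")


def story_is_exactly(story, expected):
    """Strict check: the story must equal the expected events, in order."""
    return [tuple(ev) for ev in story] == [tuple(ev) for ev in expected]


addr = "tcp://127.0.0.1:1234"  # placeholder address
expected = [("request-dep", addr, {"f1"})]

single = [("request-dep", addr, {"f1"})]
doubled = single * 2  # the same request sent twice
wrong_key = [("request-dep", addr, {"f2"})]

results = {
    "single": (count_requests(single), story_is_exactly(single, expected)),
    "doubled": (count_requests(doubled), story_is_exactly(doubled, expected)),
    "wrong_key": (count_requests(wrong_key), story_is_exactly(wrong_key, expected)),
}
```

Both checks catch a duplicated request. Only the strict one notices a request for the wrong key, at the price of pinning the exact event layout.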

fjetter · 2022-06-01T17:16:28Z

distributed/tests/test_worker.py

+        assert_story(
+            b.story(fut1.key),
+            [
+                (fut1.key, "flight", "released", "cancelled", {}),
+                (fut1.key, "cancelled", "memory", "released", {fut1.key: "forgotten"}),
+                (fut1.key, "released", "forgotten", "forgotten", {}),
+            ],
+        )


I think this story is redundant. We're asserting all of this above already

fjetter · 2022-06-01T17:18:53Z

distributed/tests/test_worker.py

+    """A task transitions from flight to cancelled while gather_dep is waiting for the
+    data.
+
+    See also test_gather_dep_cancelled_rescheduled


Where do we assume this in the previous test?

fjetter

I'm struggling to reproduce issues that are supposedly introduced with #6462.
If there are any issues that connect to that PR specifically, I suggest discussing them over there. I appreciate the effort of breaking up larger changes, but if the changes are directly connected, I actually have a harder time reviewing them when they are split up.

fjetter · 2022-06-03T08:56:58Z

distributed/tests/test_worker.py

+    """A task transitions from flight to cancelled while gather_dep is waiting for the
+    data.
+
+    See also test_gather_dep_cancelled_rescheduled


The original test still passes for me w/ your changes in #6462

fjetter · 2022-06-03T09:05:31Z

distributed/tests/test_worker.py

+@gen_cluster(client=True, nthreads=[("", 1)])
+async def test_gather_dep_cancelled_rescheduled(c, s, a):
+    """A task transitions flight->cancelled->fetch->flight, all while gather_dep is
+    waiting for the data of the initial flight.


> I think that being able to say "it doesn't really matter in which point of gather_dep you pinch it; nothing fancy will happen before comms" is a very reasonable thing to say?

Will this be true forever? How do we ensure that this stays true? This test asks the question
"Is it harmful to cancel a task after a fetch is scheduled (gather-dep) but before we're actually stepping into this coroutine?"
The only way to answer this question with certainty is to have this test, or to ensure there is not a single await before the get_data message is sent, which is hard or impossible.

While this is a seemingly artificial question to ask, this caused a deadlock in the past (#5525), and I don't think the pre-filtering optimization was entirely unreasonable.

Tests are also there to protect us from reintroducing regressions, and this test is written at a high enough abstraction level (we only rely on task state names, e.g. flight and cancelled, and on the existence of the gather_dep/get_data methods) that I am confident it won't bother us a lot. The test is not known to be flaky, it runs very fast (0.1s), and it still works with your proposed change in #6462 (at least for me locally).

I don't see a reason why we should change anything about this test.

crusaderky · 2022-06-09T16:54:15Z

Salvaged PR. This is again ready for review and merge.

github-actions · 2022-06-09T18:52:27Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    15 files  ±0       15 suites  ±0     6h 28m 20s ⏱️  -21m 18s
 2 865 tests  ±0    2 784 ✔️  +33     80 💤  -1    1 ❌  -29
21 224 runs   ±0   20 287 ✔️  +38    936 💤  -3    1 ❌  -32

For more details on these failures, see this check.

Results for commit 51a3ba9. ± Comparison against base commit 344868a.

♻️ This comment has been updated with latest results.

crusaderky marked this pull request as ready for review May 30, 2022 14:38

crusaderky mentioned this pull request May 30, 2022

Remove EnsureCommunicatingAfterTransitions #6462

Merged

crusaderky self-assigned this May 30, 2022

crusaderky mentioned this pull request May 30, 2022

Yank state machine out of Worker class #6476

Closed

crusaderky linked an issue May 30, 2022 that may be closed by this pull request

Yank state machine out of Worker class #6476

Closed

crusaderky force-pushed the WSMR/clustered_transfers_tests branch from d0d03ee to e59a138 Compare June 1, 2022 12:13

Revisit some tests related to gather_dep

efb2dd9

crusaderky force-pushed the WSMR/clustered_transfers_tests branch from e59a138 to efb2dd9 Compare June 1, 2022 14:32

fjetter requested changes Jun 1, 2022

crusaderky added 3 commits June 1, 2022 22:41

Code review

d8315eb

Merge branch 'main' into WSMR/clustered_transfers_tests

7c2a0c1

Merge branch 'main' into WSMR/clustered_transfers_tests

8d5d50e

fjetter reviewed Jun 3, 2022

crusaderky removed a link to an issue Jun 6, 2022

Yank state machine out of Worker class #6476

Closed

jrbourbeau mentioned this pull request Jun 7, 2022

Release 2022.6.0 dask/community#252

Closed

9 tasks

crusaderky marked this pull request as draft June 9, 2022 15:56

crusaderky added 2 commits June 9, 2022 17:49

Merge branch 'main' into WSMR/clustered_transfers_tests

0920c47

Rescope

8136bae

crusaderky marked this pull request as ready for review June 9, 2022 16:54

Merge branch 'main' into WSMR/clustered_transfers_tests

51a3ba9

fjetter approved these changes Jun 15, 2022

fjetter merged commit cb88e3b into dask:main Jun 15, 2022

crusaderky deleted the WSMR/clustered_transfers_tests branch June 16, 2022 09:48
