
Use tempfile directory in cluster fixture #5825

Merged · 3 commits · Mar 22, 2022

Conversation

fjetter (Member) commented Feb 16, 2022

The diff is huge, but in the end I'm using a tempfile directory instead of letting the worker create the directory itself. This is mostly cosmetic: I see a lot of directories being created in the repo when running tests.

Edit: This escalated into a larger refactoring of this function to ensure all processes are properly cleaned up, etc.
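The pattern can be sketched with a minimal, hypothetical helper (the name `cluster_tmpdir` is illustrative, not distributed's actual fixture code): the scratch directory is created up front and removed automatically, instead of being left behind by the worker.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def cluster_tmpdir():
    # Create the worker scratch directory ourselves instead of letting the
    # worker create it inside the repo; TemporaryDirectory removes the whole
    # tree automatically when the context exits.
    with tempfile.TemporaryDirectory(prefix="_dask_test_worker") as tmpdir:
        yield tmpdir
```

A pytest fixture wrapping this guarantees the directory is gone after each test rather than accumulating `dask-worker-space` directories in the working tree.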

github-actions bot (Contributor) commented Feb 17, 2022

Unit Test Results

12 files ±0  ·  12 suites ±0  ·  7h 17m 33s ⏱️ +1h 6m 3s
2 666 tests −1  ·  2 582 ✔️ ±0  ·  80 💤 ±0  ·  4 failed −1
15 908 runs +1 439  ·  15 048 ✔️ +1 378  ·  856 💤 +64  ·  4 failed −3

For more details on these failures, see this check.

Results for commit cf62bba. ± Comparison against base commit 1da5199.

♻️ This comment has been updated with latest results.

fjetter (Member, Author) commented Feb 17, 2022

Interesting, this causes some permission errors on Windows. I don't have time to look into this right now, but anybody else is welcome to pick this up.

@fjetter force-pushed the tempfile_dir_cluster_fixture branch from 6fa676d to 42ca0f0 on February 17, 2022
Comment on lines -205 to -207
    -async def test_gen_test_pytest_fixture(tmp_path, c):
    +async def test_gen_test_pytest_fixture(tmp_path):
         assert isinstance(tmp_path, pathlib.Path)
    -    assert isinstance(c, Client)
fjetter (Member, Author):

The c / client fixture creates the client in a synchronous context. This actually blocks the event loop, causing both this test and its teardown to block for 2 × connect-timeout.
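The hazard here is generic asyncio behavior, not specific to distributed: any synchronous, blocking call made from inside a coroutine starves every other task on the loop until it returns. A minimal, self-contained sketch (the `heartbeat` task stands in for background loop work such as connection handshakes):

```python
import asyncio
import time

async def heartbeat(stopped):
    # Stand-in for background event-loop work that a blocking call starves.
    ticks = 0
    while not stopped.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks

async def main():
    stopped = asyncio.Event()
    task = asyncio.create_task(heartbeat(stopped))
    await asyncio.sleep(0)  # let the heartbeat start
    time.sleep(0.2)         # blocking call: the loop makes no progress here
    stopped.set()
    return await task

ticks = asyncio.run(main())
# ticks stays tiny: the heartbeat never ran during the blocking sleep,
# even though 0.2 s would otherwise allow ~20 iterations.
```

Creating a synchronous Client inside an async test blocks the loop in exactly this way, which is why the connect and teardown each stall for the full timeout.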

Comment on lines -720 to -752
scheduler.terminate()
scheduler_q.close()
scheduler_q._reader.close()
scheduler_q._writer.close()

for w in workers:
    w["proc"].terminate()
    w["queue"].close()
    w["queue"]._reader.close()
    w["queue"]._writer.close()

scheduler.join(2)
del scheduler
for proc in [w["proc"] for w in workers]:
    proc.join(timeout=30)
fjetter (Member, Author):

If the disconnects above time out, these processes are never closed and leak. That happens on any such timeout, not just when there is an xfail.
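One way to make that leak impossible is to register cleanup for each process the moment it is started, so every exit path reaps it. This is a sketch of the pattern using `contextlib.ExitStack` with dummy subprocess workers (`start_workers` and the sleep command are hypothetical, not the PR's actual code):

```python
import contextlib
import subprocess
import sys

def start_workers(n):
    """Start n dummy workers; closing the returned stack terminates and
    reaps every one of them, even if the test body raised in between."""
    stack = contextlib.ExitStack()
    procs = []
    for _ in range(n):
        p = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )
        # Callbacks run in reverse registration order on close():
        # terminate() fires first, then wait() reaps the process.
        stack.callback(p.wait, timeout=30)
        stack.callback(p.terminate)
        procs.append(p)
    return procs, stack
```

Running the test body inside `with stack:` guarantees the callbacks fire on success, failure, or timeout alike.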

fjetter (Member, Author) commented Feb 18, 2022

I am still getting permission errors on Windows when closing, even with ExitStack + tempdir. I assume this was never an issue before because we suppressed the OSError on delete.

Comment on lines 633 to 635
def _terminate_join(proc):
    proc.terminate()
    proc.join(timeout=30)
fjetter (Member, Author):

How long does it typically take for a process to terminate? Is it worth splitting this up into "terminate all" then "join all" to speed things up?
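The split the comment asks about can be sketched like this, with `subprocess` used as a stand-in for the test's worker processes (an illustration of the idea, not the PR's code):

```python
import subprocess
import sys

def terminate_join_all(procs, timeout=30):
    # Signal every process first, then reap them. The shutdowns overlap,
    # so total wall time is roughly the slowest single shutdown rather
    # than the sum of all of them, as with per-process terminate+join.
    for p in procs:
        p.terminate()
    for p in procs:
        p.wait(timeout=timeout)
```

With N slow-exiting processes, the sequential terminate+join pattern costs up to N × shutdown-time, while this version costs roughly one shutdown-time.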

fjetter (Member, Author) commented Mar 3, 2022

All tests green 🎉

fjetter (Member, Author) commented Mar 4, 2022

@graingert care to provide a final review?

@fjetter fjetter self-assigned this Mar 4, 2022
graingert (Member) commented Mar 8, 2022

looks very weird

https://github.com/dask/distributed/runs/5465174886?check_suite_focus=true#step:12:1744

    def _rmtree_unsafe(path, onerror):
        try:
>           with os.scandir(path) as scandir_it:
E           NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\_dask_test_workerqken5oys\\dask-worker-space\\worker-gqox5et3.dirlock'

edit: ah, it's just rmtree's error handling kicking in after hitting a PermissionError:

E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\_dask_test_workerqken5oys\\dask-worker-space\\worker-gqox5et3.dirlock'

C:\Miniconda3\envs\dask-distributed\lib\tempfile.py:805: PermissionError

During handling of the above exception, another exception occurred:

fjetter (Member, Author) commented Mar 15, 2022

Adding the process.close seems to have done the trick, for now. Hard to tell, since I'm hitting test_nanny_worker_port_range on most Windows builds (#5925).

@fjetter force-pushed the tempfile_dir_cluster_fixture branch from 3806413 to 4e16eb6 on March 21, 2022
fjetter (Member, Author) commented Mar 22, 2022

I tried reproducing the permission error on a Windows machine, but it appears to be another of those situations where it only triggers once in a few hundred or thousand invocations, or possibly only in combination with other test runs. I don't think tracking this down is worth it right now, and I would like to get this in since it includes several valuable fixes. For now I went ahead and ignored the PermissionError. If nothing else pops up, I'll merge.

@@ -766,12 +783,6 @@ def cluster(
else:
    client.close()

start = time()
while any(proc.is_alive() for proc in ws):
fjetter (Member, Author):

The procs are closed, i.e. is_alive is no longer possible. In fact it even raises an exception, since you can't interact with the proc object at all after it has been closed.
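`multiprocessing.Process.close()` is documented to release the process object's resources and make most of its methods, including `is_alive()`, raise `ValueError` afterwards. A standalone sketch of the behavior described above (`reap_and_close` is an illustrative name):

```python
import multiprocessing
import time

def reap_and_close(proc):
    proc.terminate()
    proc.join(timeout=30)
    proc.close()  # releases the handle; the object is now unusable

if __name__ == "__main__":
    p = multiprocessing.Process(target=time.sleep, args=(60,))
    p.start()
    reap_and_close(p)
    try:
        p.is_alive()
    except ValueError:
        print("is_alive() raises after close()")
```

This is why the old `while any(proc.is_alive() for proc in ws)` wait loop had to be removed once the fixture started closing its processes.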

@fjetter fjetter merged commit fe862ad into dask:main Mar 22, 2022