
Use tempfile directory in cluster fixture #5825

Merged · 3 commits · Mar 22, 2022

Conversation

fjetter (Member) commented Feb 16, 2022

The diff is huge, but in the end I'm using a tempfile directory instead of letting the worker create the directory itself. This is mostly cosmetic: I see a lot of directories being created in the repo when running tests.

Edit: This escalated into a larger refactoring of this function to ensure all processes are properly cleaned up, etc.
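The pattern can be sketched with a minimal, hypothetical helper (the name `cluster_tmpdir` is illustrative, not distributed's actual fixture code): the scratch directory is created up front and removed automatically, instead of being left behind by the worker.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def cluster_tmpdir():
    # Create the worker scratch directory ourselves instead of letting the
    # worker create it inside the repo; TemporaryDirectory removes the whole
    # tree automatically when the context exits.
    with tempfile.TemporaryDirectory(prefix="_dask_test_worker") as tmpdir:
        yield tmpdir
```

A pytest fixture wrapping this guarantees the directory is gone after each test rather than accumulating `dask-worker-space` directories in the working tree.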

github-actions bot (Contributor) commented Feb 17, 2022

Unit Test Results

12 files ±0  ·  12 suites ±0  ·  7h 17m 33s ⏱️ +1h 6m 3s
2 666 tests −1  ·  2 582 ✔️ ±0  ·  80 💤 ±0  ·  4 failed −1
15 908 runs +1 439  ·  15 048 ✔️ +1 378  ·  856 💤 +64  ·  4 failed −3

For more details on these failures, see this check.

Results for commit cf62bba. ± Comparison against base commit 1da5199.

♻️ This comment has been updated with latest results.

fjetter (Member, Author) commented Feb 17, 2022

Interesting, this causes some permission errors on Windows. I don't have time to look into this right now, but anybody else is welcome to pick this up.

@fjetter force-pushed the tempfile_dir_cluster_fixture branch from 6fa676d to 42ca0f0 on February 17, 2022
Comment on lines -205 to -207
    -async def test_gen_test_pytest_fixture(tmp_path, c):
    +async def test_gen_test_pytest_fixture(tmp_path):
         assert isinstance(tmp_path, pathlib.Path)
    -    assert isinstance(c, Client)
fjetter (Member, Author):

The c / client fixture creates the client in a synchronous context. This actually blocks the event loop, causing both this test and its teardown to block for 2 × connect-timeout.
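The hazard here is generic asyncio behavior, not specific to distributed: any synchronous, blocking call made from inside a coroutine starves every other task on the loop until it returns. A minimal, self-contained sketch (the `heartbeat` task stands in for background loop work such as connection handshakes):

```python
import asyncio
import time

async def heartbeat(stopped):
    # Stand-in for background event-loop work that a blocking call starves.
    ticks = 0
    while not stopped.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks

async def main():
    stopped = asyncio.Event()
    task = asyncio.create_task(heartbeat(stopped))
    await asyncio.sleep(0)  # let the heartbeat start
    time.sleep(0.2)         # blocking call: the loop makes no progress here
    stopped.set()
    return await task

ticks = asyncio.run(main())
# ticks stays tiny: the heartbeat never ran during the blocking sleep,
# even though 0.2 s would otherwise allow ~20 iterations.
```

Creating a synchronous Client inside an async test blocks the loop in exactly this way, which is why the connect and teardown each stall for the full timeout.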

Comment on lines -720 to -752
scheduler.terminate()
scheduler_q.close()
scheduler_q._reader.close()
scheduler_q._writer.close()

for w in workers:
    w["proc"].terminate()
    w["queue"].close()
    w["queue"]._reader.close()
    w["queue"]._writer.close()

scheduler.join(2)
del scheduler
for proc in [w["proc"] for w in workers]:
    proc.join(timeout=30)
fjetter (Member, Author):

If the disconnects above time out, these processes are never closed and leak. That happens on any such timeout, not just when there is an xfail.
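One way to make that leak impossible is to register cleanup for each process the moment it is started, so every exit path reaps it. This is a sketch of the pattern using `contextlib.ExitStack` with dummy subprocess workers (`start_workers` and the sleep command are hypothetical, not the PR's actual code):

```python
import contextlib
import subprocess
import sys

def start_workers(n):
    """Start n dummy workers; closing the returned stack terminates and
    reaps every one of them, even if the test body raised in between."""
    stack = contextlib.ExitStack()
    procs = []
    for _ in range(n):
        p = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )
        # Callbacks run in reverse registration order on close():
        # terminate() fires first, then wait() reaps the process.
        stack.callback(p.wait, timeout=30)
        stack.callback(p.terminate)
        procs.append(p)
    return procs, stack
```

Running the test body inside `with stack:` guarantees the callbacks fire on success, failure, or timeout alike.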

fjetter (Member, Author) commented Feb 18, 2022

I am still getting permission errors on Windows when closing, even with ExitStack + tempdir. I assume this was never an issue before because we suppressed the OSError on delete.

Comment on lines 633 to 635
def _terminate_join(proc):
    proc.terminate()
    proc.join(timeout=30)
fjetter (Member, Author):

How long does it typically take for a process to terminate? Is it worth splitting this up into "terminate all" then "join all" to speed things up?
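The split the comment asks about can be sketched like this, with `subprocess` used as a stand-in for the test's worker processes (an illustration of the idea, not the PR's code):

```python
import subprocess
import sys

def terminate_join_all(procs, timeout=30):
    # Signal every process first, then reap them. The shutdowns overlap,
    # so total wall time is roughly the slowest single shutdown rather
    # than the sum of all of them, as with per-process terminate+join.
    for p in procs:
        p.terminate()
    for p in procs:
        p.wait(timeout=timeout)
```

With N slow-exiting processes, the sequential terminate+join pattern costs up to N × shutdown-time, while this version costs roughly one shutdown-time.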

fjetter (Member, Author) commented Mar 3, 2022

All tests green 🎉

fjetter (Member, Author) commented Mar 4, 2022

@graingert care to provide a final review?

@fjetter fjetter self-assigned this Mar 4, 2022
graingert (Member) commented Mar 8, 2022

looks very weird

https://github.com/dask/distributed/runs/5465174886?check_suite_focus=true#step:12:1744

    def _rmtree_unsafe(path, onerror):
        try:
>           with os.scandir(path) as scandir_it:
E           NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\_dask_test_workerqken5oys\\dask-worker-space\\worker-gqox5et3.dirlock'

edit: ah, it's just rmtree's error handling kicking in after hitting a PermissionError:

E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\_dask_test_workerqken5oys\\dask-worker-space\\worker-gqox5et3.dirlock'

C:\Miniconda3\envs\dask-distributed\lib\tempfile.py:805: PermissionError

During handling of the above exception, another exception occurred:

fjetter (Member, Author) commented Mar 15, 2022

Adding the process.close seems to have done the trick, for now. Hard to tell, since I'm hitting test_nanny_worker_port_range on most Windows builds (#5925).

@fjetter force-pushed the tempfile_dir_cluster_fixture branch from 3806413 to 4e16eb6 on March 21, 2022
fjetter (Member, Author) commented Mar 22, 2022

I tried reproducing the permission error on a Windows machine, but it appears to be another of those situations where it only triggers once in a few hundred or thousand invocations, or possibly only in combination with other test runs. I don't think tracking this down is worth it right now, and I would like to get this in since it includes several valuable fixes. For now I went ahead and ignored the PermissionError. If nothing else pops up, I'll merge.

@@ -766,12 +783,6 @@ def cluster(
else:
    client.close()

start = time()
while any(proc.is_alive() for proc in ws):
fjetter (Member, Author):

The procs are closed, i.e. is_alive is no longer possible. In fact it even raises an exception, since you can't interact with the proc object at all after it has been closed.
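`multiprocessing.Process.close()` is documented to release the process object's resources and make most of its methods, including `is_alive()`, raise `ValueError` afterwards. A standalone sketch of the behavior described above (`reap_and_close` is an illustrative name):

```python
import multiprocessing
import time

def reap_and_close(proc):
    proc.terminate()
    proc.join(timeout=30)
    proc.close()  # releases the handle; the object is now unusable

if __name__ == "__main__":
    p = multiprocessing.Process(target=time.sleep, args=(60,))
    p.start()
    reap_and_close(p)
    try:
        p.is_alive()
    except ValueError:
        print("is_alive() raises after close()")
```

This is why the old `while any(proc.is_alive() for proc in ws)` wait loop had to be removed once the fixture started closing its processes.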

@fjetter fjetter merged commit fe862ad into dask:main Mar 22, 2022