Stop dask scheduler gracefully #3332
Thank you for the fix, @habibutsu! I've tried this locally and confirmed both the original bug and that this fixes the issue. I'm actually fairly surprised that this bug was here in the first place; I thought that we had things working nicely here not too long ago. I'm curious, do you have any thoughts on how we might test for this, to make sure that things don't revert to the poor behavior in the future?
Hrm, the test failure in |
(cherry picked from commit b6abd558dc64e360f2b2190b0f91d5433ff28731)
Hopefully d5cd0c9 fixed the test. That's roughly doing
The difference is that we now (IMO correctly) tell the workers to close when the scheduler closes, so the worker cleanly exited and didn't attempt to reconnect. By closing the scheduler with |
Interestingly, this is failing after #3706. I'm not sure why that would be. @crusaderky do you have any guesses why that would be? If not, I can dig into things some more.

edit: I suppose the changes at https://github.com/dask/distributed/pull/3706/files#diff-048ee949e66792811aa13d7ef8a7229aL53 are the likely relevant changes.

edit2: yeah, this diff gets us passing again

diff --git a/distributed/cli/utils.py b/distributed/cli/utils.py
index c1bff051..05a9de6f 100644
--- a/distributed/cli/utils.py
+++ b/distributed/cli/utils.py
@@ -49,11 +49,14 @@ def install_signal_handlers(loop=None, cleanup=None):
     old_handlers = {}
+    from tornado import gen
+
     def handle_signal(sig, frame):
-        async def cleanup_and_stop():
+        @gen.coroutine
+        def cleanup_and_stop():
             try:
                 if cleanup is not None:
-                    await cleanup(sig)
+                    yield cleanup(sig)
             finally:
                 loop.stop()

So we need to figure out what the difference in behavior is there, such that the TimeoutError is raised with the asyncio variant

distributed.scheduler - INFO - End scheduler at 'tcp://192.168.7.20:8786'
Traceback (most recent call last):
  File "/Users/taugspurger/.virtualenvs/dask-dev/bin/dask-scheduler", line 11, in <module>
    load_entry_point('distributed', 'console_scripts', 'dask-scheduler')()
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_scheduler.py", line 230, in go
    main()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_scheduler.py", line 222, in main
    loop.run_sync(run)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/tornado/ioloop.py", line 531, in run_sync
    raise TimeoutError("Operation timed out after %s seconds" % timeout)
tornado.util.TimeoutError: Operation timed out after None seconds
Yes, I noticed a few cases where you just can't replace the gen.coroutines with async def functions. I'm afraid I don't know enough about tornado.gen to understand why.
asyncio signal handling is very particular. Maybe tornado has its own, incompatible, approach?
OK thanks. For now I'm OK with reverting the asyncio changes in that function.
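For readers following the thread, the pattern under discussion (run an optional cleanup on the event loop when a signal arrives, then stop the loop) can be sketched with plain asyncio. This is only an illustration under stated assumptions: `make_signal_handler` and `demo` are hypothetical names, the sketch does not use tornado, and it neither reproduces nor explains the `TimeoutError` above. The handler is invoked directly from a loop callback rather than by a real signal, to keep the example deterministic.

```python
import asyncio
import signal


def make_signal_handler(loop, cleanup=None):
    """Return a handler that runs ``cleanup(sig)`` on ``loop`` and then stops it."""

    def handle_signal(sig, frame):
        async def cleanup_and_stop():
            try:
                if cleanup is not None:
                    await cleanup(sig)
            finally:
                # Stop the loop even if cleanup raises.
                loop.stop()

        # A signal handler cannot await, so the coroutine is handed to the
        # loop in a thread-safe way and runs on the loop's next iteration.
        loop.call_soon_threadsafe(lambda: loop.create_task(cleanup_and_stop()))

    return handle_signal


def demo():
    """Drive the handler by hand and return the signals seen by cleanup."""
    loop = asyncio.new_event_loop()
    seen = []

    async def cleanup(sig):
        await asyncio.sleep(0)  # stands in for real asynchronous shutdown work
        seen.append(sig)

    handler = make_signal_handler(loop, cleanup)
    # Simulate signal delivery once the loop is running.
    loop.call_soon(handler, signal.SIGTERM, None)
    loop.run_forever()  # returns once cleanup_and_stop() calls loop.stop()
    loop.close()
    return seen
```

In a real program the handler would be registered with `signal.signal(sig, handler)` for each signal of interest, as `install_signal_handlers` does in the diff above.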
@@ -50,10 +51,13 @@ def install_signal_handlers(loop=None, cleanup=None):
     old_handlers = {}

     def handle_signal(sig, frame):
-        async def cleanup_and_stop():
+        @gen.coroutine
@gen.coroutine
# FIXME: this breaks when changing to async def... await
@gen.coroutine
The test I added is failing on Windows. Apparently the process may not be exiting cleanly there. If anyone is able to debug that, it'd be welcome, but I've skipped the assertions on Windows for now.
Currently, if you terminate dask-scheduler, the following exception occurs: