Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop dask scheduler gracefully #3332

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

habibutsu
Copy link

Currently if you terminating dask-scheduler following exception occurs:

Traceback (most recent call last):
  File "/usr/local/bin/dask-scheduler", line 8, in <module>
    sys.exit(go())
  File "/usr/local/lib/python3.6/dist-packages/distributed/cli/dask_scheduler.py", line 248, in go
    main()
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/distributed/cli/dask_scheduler.py", line 237, in main
    loop.run_sync(run)
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 531, in run_sync
    raise TimeoutError("Operation timed out after %s seconds" % timeout)
tornado.util.TimeoutError: Operation timed out after None seconds

@mrocklin
Copy link
Member

Thank you for the fix @habibutsu ! I've tried this locally and confirmed both the original bug and that this fixes the issue. I'm actually fairly surprised that this bug was here in the first place. I thought that we had things working nicely here not too long ago. I'm curious, do you have any thoughts on how we might test for this to make sure that things don't revert to the poor behavior in the future?

@mrocklin
Copy link
Member

Hrm, the test failure in distributed/tests/test_client.py::test_reconnect does seem to be genuine.

(cherry picked from commit b6abd558dc64e360f2b2190b0f91d5433ff28731)
@TomAugspurger
Copy link
Member

Hopefully d5cd0c9 fixed the test. That's roughly doing

- create worker
- create scheduler
- close scheduler
- create scheduler
- do stuff assuming worker has reconnected

The difference is that we now (IMO correctly) tell the workers to close when the scheduler closes, so the worker cleanly exited and didn't attempt to reconnect.

By closing the scheduler with signal.SIGKILL, which we don't have a handler for, we restore the assumption behind the test.

@TomAugspurger
Copy link
Member

TomAugspurger commented Apr 20, 2020

Interestingly, this is failing after #3706. I'm not sure why that would be. Even dask-scheduler is exiting with a non-zero exit code now. If I revert #3706 on this branch, then things fail.

@crusaderky do you have any guesses why that would be? If not, I can dig into things some more.

edit: I suppose the changes at https://github.com/dask/distributed/pull/3706/files#diff-048ee949e66792811aa13d7ef8a7229aL53 are the likely relevant changes.

edit2: yeah, this diff gets us passing again

diff --git a/distributed/cli/utils.py b/distributed/cli/utils.py
index c1bff051..05a9de6f 100644
--- a/distributed/cli/utils.py
+++ b/distributed/cli/utils.py
@@ -49,11 +49,14 @@ def install_signal_handlers(loop=None, cleanup=None):
 
     old_handlers = {}
 
+    from tornado import gen
+
     def handle_signal(sig, frame):
-        async def cleanup_and_stop():
+        @gen.coroutine
+        def cleanup_and_stop():
             try:
                 if cleanup is not None:
-                    await cleanup(sig)
+                    yield cleanup(sig)
             finally:
                 loop.stop()

So we need to figure out what the difference in behavior is there, such that the TimeoutError is raised with the asyncio variant

distributed.scheduler - INFO - End scheduler at 'tcp://192.168.7.20:8786'
Traceback (most recent call last):
  File "/Users/taugspurger/.virtualenvs/dask-dev/bin/dask-scheduler", line 11, in <module>
    load_entry_point('distributed', 'console_scripts', 'dask-scheduler')()
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_scheduler.py", line 230, in go
    main()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_scheduler.py", line 222, in main
    loop.run_sync(run)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/tornado/ioloop.py", line 531, in run_sync
    raise TimeoutError("Operation timed out after %s seconds" % timeout)
tornado.util.TimeoutError: Operation timed out after None seconds

@crusaderky
Copy link
Collaborator

yes, I noticed a few cases where you just can't replace the gen.coroutines with async def functions. Afraid I don't know enough about tornado.gen to understand why.

@crusaderky
Copy link
Collaborator

asyncio signal handling is very particular. Maybe tornado has its own, incompatible, approach?

@TomAugspurger
Copy link
Member

OK thanks. For now I'm OK with reverting the asyncio changes in that function.

@@ -50,10 +51,13 @@ def install_signal_handlers(loop=None, cleanup=None):
old_handlers = {}

def handle_signal(sig, frame):
async def cleanup_and_stop():
@gen.coroutine
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@gen.coroutine
# FIXME: this breaks when changing to async def... await
@gen.coroutine

@TomAugspurger
Copy link
Member

The test I added is failing on windows. Apparently the process may not be exiting cleanly there. If anyone is able to debug that it'd be welcome, but I've skipped the assertions on Windows for now.

Base automatically changed from master to main March 8, 2021 19:04
@habibutsu habibutsu requested a review from fjetter as a code owner January 23, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants