Deadlock - task not running #5366
Comments
FYI -- if I go to the "Call Stack" for the task in the Dask UI, I find that it says "Task not actively running. It may be finished or not yet started".
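For reference, the same information can be pulled from a client session. This is a minimal sketch (not from the issue); the scheduler address and task key are placeholders:

from distributed import Client

client = Client("<scheduler-address>")  # placeholder; connect through the gateway as usual

# Client.call_stack asks the workers for the frames of tasks they are
# currently executing; an empty entry for a key matches the dashboard's
# "Task not actively running" message.
print(client.call_stack())                      # all currently executing tasks
print(client.call_stack(keys=["<task-key>"]))   # or a specific key (placeholder)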
FWIW, if I turn off autoscaling and keep the clusters full size, I haven't seen deadlocks. Could it be that the extra communication from autoscaling is interfering? I will investigate if I can adjust the rate at which the autoscaling operates.
Thanks for the report and the cluster dump. After a brief look at the dump, this looks like the issue I'm currently investigating (reported offline). This is very helpful! I'll let you know once there is a fix available. How reliably can you reproduce this? Does this happen on every run or only every ~100th?
I'm glad it is useful. It seems to happen fairly regularly if I'm letting the cluster fluctuate in size. Another data point: I did a test with 2021.9.1 with autoscaling set with min=max so the cluster would remain a constant size, and I didn't see any deadlocking. I thought this might disambiguate whether the problem was the auto-scale communication or the up/down of workers. But if you may already have a root cause, that's better news.
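For context, here is roughly how that constant-size test looks with dask-gateway; a sketch only, with illustrative worker counts rather than the reporter's actual settings:

from dask_gateway import Gateway

gateway = Gateway()              # uses the local gateway configuration
cluster = gateway.new_cluster()

# Normal runs: autoscale between a minimum and maximum number of workers.
cluster.adapt(minimum=10, maximum=50)

# Constant-size test: pin the adaptive bounds (min == max), or request a fixed
# worker count outright, so the cluster never changes size mid-graph.
cluster.adapt(minimum=50, maximum=50)
# ...or simply:
cluster.scale(50)

client = cluster.get_client()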
@chrisroat if you get a chance, could you try again with the latest?
I will try it out sometime this week, before your planned release on Friday. |
I'm sorry to report that I'm still seeing deadlocking, using the repos as of a few hours ago. I see deadlocks using 2021.09.1 as well. My clusters autoscale, and I find I can mitigate deadlocking by making my clusters full size and turning off autoscaling (at the expense of running a lot of resources if there are straggler tasks).
dask: 2021.09.1+50.g7b1f3069
I'm attaching the worker/scheduler dumps that @fjetter's script writes out.
FYI, I am using a dask-gateway setup when I'm talking about autoscaling. I'm not sure how the various dask distributed parameters interact with similarly named ones in the gateway, and if there is anything I could tweak in either set that might affect the problems I'm seeing.
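As one thing to experiment with, these are the adaptive-scaling knobs on the distributed side (key names as in distributed's default configuration). Whether dask-gateway honours or overrides them depends on the gateway backend, and they need to be set where the scheduler actually runs, so treat this as a sketch of the open question rather than a known answer:

import dask

# Distributed-side adaptive settings; defaults noted in the comments.
dask.config.set({
    "distributed.adaptive.interval": "5s",          # default "1s": how often scaling is re-evaluated
    "distributed.adaptive.wait-count": 6,           # default 3: checks a worker must be unneeded before removal
    "distributed.adaptive.target-duration": "60s",  # default "5s": runtime the cluster is sized to target
})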
We do not expect current […] to fix this. I'll have a look at the dump you provided to see if I can relate it somehow.
I just realized that you've been using the branch of #5426 already. I can see a lot of exceptions popping up because the scheduler instructs a worker to release a key too soon. That causes a KeyError and likely gets that worker stuck. I'll need a bit more time to understand why this key is released too soon, but the cluster dump is helpful!
From the cluster dump, I would assume that you're not stuck but the computations fail hard. Am I wrong about this?
The workflow is actually stuck. A few workers have many tasks, and many workers have no tasks, but there is no movement on the task graph. I believe I am using main, so I'm not sure I have anything related to #5426.
In that case, would you mind trying #5426 out? There was another user who reported similar-sounding deadlock behavior which #5426 fixes (see #5426 (comment)).
Right, I saw a change in the logs, but that was merged into main already. I'm unfortunately struggling to make sense of the cluster dump you provided. For instance, worker […]: has this worker been reused? Scheduler restarted, the same worker connects to a new scheduler, and a new computation was scheduled on the new scheduler? From the logs it doesn't look like it; at least the scheduler hostname + IP remained the same, but that could just as well indicate a proxy between scheduler and worker. I'm not familiar with how dask-gateway works.
OK, I re-ran using your branch. I am still seeing deadlocking. On the dashboard, I see a lot of tasks piling up on a few workers.

I had two clusters running, using workers with 6 threads. The clusters start with 10 workers and can autoscale up to 50. A full scale-up is triggered immediately based on the workload, but it can take a few minutes for GCP to spin up the nodes.

The first cluster hit a KilledWorker on its first graph. This happens sometimes because the work on 10 nodes does not distribute well when nodes are added. (I could add more workers and/or reduce memory consumption.) This first cluster then ran a second graph. Since it was "warm" and nodes were provisioned, it distributed work nicely from the start. But it still deadlocked, and 1 worker is sitting with 500 tasks. Debug info from first cluster

The 2nd cluster ran its first graph up until it deadlocked. Its debug info: scheduler_20211021-2.pkl.gz

[Note that worker logs are on GDrive, as I think I was hitting max file size limits on GitHub]
@fjetter Is there anything I can do to help/explore the issue here?
@fjetter - Could this issue be somehow specific to dask-gateway setups?
Sorry for the delay. Parsing these logs typically requires a bit of time. I had a peek at the 2nd cluster you provided logs for and could see a few interesting exceptions.
I am working on that over in #5507 but am still missing a reproducer, so the "fix" might not be sufficient yet.
Thanks for taking a look. Do you have anywhere on the web where I can tip you a few coffees for your work on this? Let me know if I can help test in some way (i.e. merge 5507 and 5457?)
Keep that thought in case we ever meet at a conference ;) In the meantime, don't worry about it. I'm not doing this on my own time! I'll ping here again once I have something of substance.
I just ran 11 of the graphs with 96d4fd4 (2012.2.0+), and haven't seen a deadlock. This is great! Typically, well over half of these graphs get stuck/deadlocked. While it's hard to declare victory in cluster situations, as the background situation evolves based on availability/pre-emption, etc., I am hopeful! I do appreciate all the work @fjetter is doing! Kudos. I will be running more workloads over the coming days, and will report back.
Happy New Year! On a sadder note, I wanted to provide an update that I do have a graph that just hit a deadlock. It has 36k tasks, and deadlocked for a couple of hours with 1 task processing and 4 waiting. (Being near the end, I thought maybe it was an I/O task that got wedged talking to GCS. Alas, it is a computational step.) The worker logs have:

2022-01-01 13:11:32.141 PST distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 488, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 417, in deserialize
    return loads(header, frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 56, in dask_loads
    loads = dask_deserialize.dispatch(typ)
  File "/opt/conda/lib/python3.8/site-packages/dask/utils.py", line 614, in dispatch
    raise TypeError(f"No dispatch for {cls}")
TypeError: No dispatch for <class 'numpy.ndarray'>

2022-01-01 13:11:32.143 PST distributed.worker - ERROR - No dispatch for <class 'numpy.ndarray'>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 681, in log_errors
    yield
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2971, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 4259, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 4239, in _get_data
    response = await send_recv(
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 663, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 233, in read
    msg = await from_frames(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 76, in from_frames
    res = await offload(_from_frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 1317, in offload
    return await loop.run_in_executor(
  File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 1318, in <lambda>
    _offload_executor, lambda: context.run(fn, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default

I'm also attaching dumps from scheduler and worker:
I have now had several deadlocks occur, so I don't think we are out of the woods yet. In addition to the previous comment, here is another one. This one does not have the dispatch errors. The scheduler just has some messages like:
dump_ba2351b92f024e2ca986819e8a4d34af_20220105.msgpack.gz
I was able to get the graph to complete by deleting the pods of two workers that were processing tasks.
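A rough client-side equivalent of that manual pod deletion, sketched here as an assumption rather than something tried in the issue (the scheduler address is a placeholder):

from distributed import Client

client = Client("<scheduler-address>")  # placeholder; connect through the gateway as usual

# Client.processing() maps each worker address to the task keys the scheduler
# believes it is working on; workers that sit here without making progress are
# the ones whose pods were deleted by hand above.
stuck = [addr for addr, keys in client.processing().items() if keys]

# Retiring a worker asks the scheduler to move its data and reschedule its
# tasks elsewhere, which is roughly what deleting the pod forces to happen.
client.retire_workers(workers=stuck)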
What happened:
When running on an auto-scaling GKE cluster using dask-gateway, I sometimes find computation halting mid-graph. One or more workers will have tasks, but will not actually be doing any work. Often the logs contain a traceback from some sort of failure.
@fjetter
What you expected to happen:
Graph to run to completion.
Minimal Complete Verifiable Example:
I don't have a reproducible example, as the graph can sometimes succeed, and will often succeed if I kill the worker(s) that are stuck. I am including scheduler and worker info as per @fjetter's script (#5068), as well as GKE logs showing a traceback earlier in the process.
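As an aside for anyone hitting this later: newer distributed releases ship a built-in equivalent of that dump script, so the same state can be captured straight from the client. A sketch, assuming a version recent enough to have Client.dump_cluster_state, with the scheduler address as a placeholder:

from distributed import Client

client = Client("<scheduler-address>")  # placeholder; connect through the gateway as usual

# Writes scheduler and worker state (tasks, transition logs, worker logs) to a
# single archive that can be attached to an issue like this one.
client.dump_cluster_state("dask-cluster-dump", format="msgpack")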
Anything else we need to know?:
K8s log
Environment:
scheduler.pkl.gz
worker.pkl.gz