Deadlock triggered on nng_close #1236
For those interested, see codypiersall/pynng#62 for discussion on pynng's issue tracker.
What platform are you encountering this on?
Ubuntu 18.04 64-bit. I tested tcp and ipc, and only tcp had this issue. Some other folks confirmed it using the Python bindings as well. I don't expect this bug to come up in real usage, at least not frequently, but my unit tests kept triggering the deadlock :-\
Also reproducible on macOS 10.15.3 in both straight C and Python bindings.
@gdamore it appears to be deadlocking waiting on the
Thank you for your test code. I've reproduced this in the debugger, and I can see that the problem is that one of the pipes is still there. (The dialers and listeners have exited.) I need to analyze further. I have some ideas though.
Thanks for looking into this Garrett. I really appreciate the work you're doing on nng!
I've not had enough time to really dig in deep. What I've found is that the thread that is blocked is waiting on the pipe close, and specifically that pipe, created by a dialer, is blocked trying to stop the receive aio. This is in the context of the pipe_reap function.
Ok, I think this is something super subtle. Essentially, it relates to the handling of closing the TCP pipe underneath, and our reliance on getting a callback from that when shutting down the upper layer (SP layer in this case) pipe. Basically, in order to prevent an infinite recursion when shutting down the aio (in this case the receive aio), we simply discard attempts to perform I/O on the closed aio structure. Unfortunately, this means that if the pipe is closed in precisely this way, the callback from the lower TCP pipe won't get called, and we need that to know that we can safely release the upper pipe. (This is to prevent use after free.)

While this seems to be only impacting one code path, I believe that it's "fundamental" in nature, meaning that other I/O paths can experience the same bug. I will need to think about how best to fix this -- it isn't trivial. The problem is that the simplified calling convention means that a submitter won't know about a "closed" AIO. I think probably we just need to take some extra precaution to avoid closing the AIOs on the lower level pipe -- instead the upper layer needs to take ownership of that.

Remember also that there are three layers. At the bottom is TCP, then there is the NNG SP TCP transport, then there is the pair protocol. (Actually there's a layer below TCP as well, but I don't think it is involved in this problem.)
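To make the shape of that dependency concrete, here is a small self-contained sketch (not nng source; all the names are made up) of an upper layer that waits for the lower layer's close callback before freeing itself. If that callback is dropped because the aio was already closed, the wait in upper_pipe_reap() never returns, which is the hang seen at nng_close():

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    bool            lower_done; /* set by the lower pipe's close callback */
} upper_pipe;

/* Lower layer's close callback: must run exactly once per close. */
static void *lower_close_cb(void *arg)
{
    upper_pipe *p = arg;
    pthread_mutex_lock(&p->mtx);
    p->lower_done = true;
    pthread_cond_signal(&p->cv);
    pthread_mutex_unlock(&p->mtx);
    return NULL;
}

/* Upper layer teardown: safe against use-after-free only because it
 * waits for the lower layer to confirm it has finished. */
static void upper_pipe_reap(upper_pipe *p)
{
    pthread_mutex_lock(&p->mtx);
    while (!p->lower_done) {
        /* If lower_close_cb() is silently discarded (I/O on a closed
         * aio is dropped), this wait never completes: the deadlock. */
        pthread_cond_wait(&p->cv, &p->mtx);
    }
    pthread_mutex_unlock(&p->mtx);
    /* ... only now is it safe to free p ... */
}

int main(void)
{
    upper_pipe p = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false };
    pthread_t  t;

    pthread_create(&t, NULL, lower_close_cb, &p); /* the lower layer closing */
    upper_pipe_reap(&p);                          /* the upper layer waiting */
    pthread_join(t, NULL);
    printf("reaped cleanly\n");
    return 0;
}
```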
Please see that commit / PR. @codypiersall if you have the ability to see if this fixes it, I'd appreciate it. My above analysis wasn't quite correct. I think what is happening is a race if we wind up stopping an aio between the time it gets started with nni_aio_begin() and the time it runs in nni_aio_schedule(). Basically in that case nni_aio_schedule() will return an error. Unfortunately that can lead to a situation where the task will never complete. The change in my PR attempts to release the task if nni_aio_schedule() does not succeed.
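For context, a transport's submit path roughly follows the pattern below. This is a simplified paraphrase, not code from the repository: the demo_pipe names are invented, and it uses nng-internal calls (nni_aio_begin, nni_aio_schedule, nni_aio_finish_error, nni_mtx_*), so it only builds inside the nng source tree. The fix described above amounts to making sure the failing nni_aio_schedule() branch actually completes the aio:

```c
#include "core/nng_impl.h" // nng-internal header; in-tree builds only

typedef struct {
	nni_mtx  mtx;
	nni_list recvq;
} demo_pipe;

static void
demo_recv_cancel(nni_aio *aio, void *arg, int rv)
{
	demo_pipe *p = arg;

	nni_mtx_lock(&p->mtx);
	if (nni_aio_list_active(aio)) {
		nni_aio_list_remove(aio);
		nni_aio_finish_error(aio, rv);
	}
	nni_mtx_unlock(&p->mtx);
}

static void
demo_pipe_recv(void *arg, nni_aio *aio)
{
	demo_pipe *p = arg;
	int        rv;

	// Phase 1: register interest; fails if the aio is already stopping.
	if (nni_aio_begin(aio) != 0) {
		return;
	}
	nni_mtx_lock(&p->mtx);
	// Phase 2: arm cancellation and commit to completing the aio later.
	if ((rv = nni_aio_schedule(aio, demo_recv_cancel, p)) != 0) {
		nni_mtx_unlock(&p->mtx);
		// Without this completion the task begun above would never
		// finish -- this is the "release the task" step from the PR.
		nni_aio_finish_error(aio, rv);
		return;
	}
	nni_aio_list_append(&p->recvq, aio); // completed later by the I/O callback
	nni_mtx_unlock(&p->mtx);
}
```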
Just tested this and I'm getting the same deadlock :-. Here's the main thread's backtrace:
Looks like it's waiting on the
Thanks. Bummer. I suspect that the changes I've made still fix a problem, but maybe not all of them. The CV is just a condition variable. If you have the ability to look at this deeper in a debugger, you can probably see that the actual condition(s) aren't met -- previously it was a pipe that wasn't getting fully cleaned up. Diagnosing that requires looking at the state of other threads. It will take some time to assess further, I think.
I think I've figured it out (famous last words). It looks like it's a race that can occur when cancellation arises between nni_aio_begin() and nni_aio_schedule(). I'm pretty sure I know how to fix it -- basically we need cancellation to mark the aio as canceled, and nni_aio_schedule() should check that. Unlike the aio->a_stop flag, this one should be cleared automatically by nni_aio_begin(). For anyone paying attention, it's hard to hit this bug (it takes many tens of thousands of trials to hit it), and it isn't restricted to any platform or transport -- it's a bug in the core AIO framework. I think this is a regression introduced in the 1.3.x series, as a result of some changes I made to reduce contention and improve performance.
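A toy model of that idea (made-up names, not the actual nng change): cancellation records a mark on the aio, nni_aio_begin() clears the mark on each fresh submission, and nni_aio_schedule() refuses a marked aio so the submitter can fail the operation instead of leaving a task that never completes.

```c
// Toy model only; hypothetical names, not the nng implementation.
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool a_stop;      // set once the aio is being shut down for good
    bool a_canceled;  // set by cancellation, cleared on each begin
} toy_aio;

static int toy_aio_begin(toy_aio *aio)
{
    if (aio->a_stop) {
        return -1;            // refuse new work after close
    }
    aio->a_canceled = false;  // fresh submission
    return 0;
}

static void toy_aio_cancel(toy_aio *aio)
{
    aio->a_canceled = true;   // remembered even if not yet scheduled
}

static int toy_aio_schedule(toy_aio *aio)
{
    if (aio->a_stop || aio->a_canceled) {
        return -1;            // submitter must complete the aio with an error
    }
    return 0;
}

int main(void)
{
    toy_aio aio = { false, false };

    if (toy_aio_begin(&aio) == 0) {
        toy_aio_cancel(&aio);                 // cancellation races in here
        if (toy_aio_schedule(&aio) != 0) {
            printf("schedule refused; complete the aio with an error\n");
        }
    }
    return 0;
}
```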
Well, I did say famous last words... Coming back at this later, the condition I described above doesn't seem like it should occur. We do call nni_aio_close() first, and that should have resulted in the a_stop field being set. That should have prevented the task from being scheduled at all. I may need to add more debugging state to the aio.
Holy crap. I think this is a stupid bug in the TCP cancellation. I think it probably only affects TCP, and only cancellation on the receive path. Stay tuned.
So I've updated the PR with another commit. Right now I'm not able to reproduce the hang with these changes. @codypiersall can you retest?
Ran the test program 100 times, 10,000 iterations each: no hang -> apparently the bug is corrected 👍
Awesome. I can also confirm that this bug is fixed. Thanks!
Creator of pynng here. Under certain circumstances our unit tests were triggering a deadlock when closing sockets, and I was able to eventually reproduce it in plain C. I think the key point here is that
This is likely the same issue as #1219.
It seems relevant that the deadlock does not happen if the nng_send and nng_recv calls are removed.
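For anyone who wants to poke at this later, here is a minimal loop along the lines of the reproduction described in this issue. It is a sketch only: the choice of pair0, the address, and the iteration count are assumptions, not the original test program that triggered the hang.

```c
// Minimal reproduction sketch (assumptions noted above), using the
// public nng API. The reported hang occurred intermittently in nng_close().
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

#define ADDR "tcp://127.0.0.1:13131"

static void check(int rv, const char *what)
{
    if (rv != 0) {
        fprintf(stderr, "%s: %s\n", what, nng_strerror(rv));
        exit(1);
    }
}

int main(void)
{
    for (int i = 0; i < 10000; i++) {
        nng_socket s0;
        nng_socket s1;
        char       msg[] = "ping";
        char      *buf = NULL;
        size_t     sz;

        check(nng_pair0_open(&s0), "open s0");
        check(nng_pair0_open(&s1), "open s1");
        check(nng_listen(s0, ADDR, NULL, 0), "listen");
        check(nng_dial(s1, ADDR, NULL, 0), "dial");

        // Removing this exchange made the deadlock disappear.
        check(nng_send(s1, msg, sizeof(msg), 0), "send");
        check(nng_recv(s0, &buf, &sz, NNG_FLAG_ALLOC), "recv");
        nng_free(buf, sz);

        check(nng_close(s0), "close s0");
        check(nng_close(s1), "close s1");
    }
    printf("completed without hanging\n");
    return 0;
}
```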