Deadlock triggered on nng_close #1236

Closed
codypiersall opened this issue Apr 27, 2020 · 18 comments

@codypiersall
Contributor

Creator of pynng here. Under certain circumstances our unit tests were triggering a deadlock when closing sockets, and I was eventually able to reproduce it in plain C. I think the key point here is that a message is actually sent and received right before the sockets are closed (see the note after the code).

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

#define CHECK(x) do { \
    int ret = (x); \
    if (ret != 0) { \
        printf("error %d: %s\n", ret, nng_strerror(ret));\
        abort(); \
    } \
} while (0)

char addr[] = "tcp://localhost:9899";
int main() {
    char buf[50];
    for (int i = 0; ; i++) {
        size_t got = sizeof buf;
        nng_socket s0, s1;
        CHECK(nng_pair_open(&s0));
        CHECK(nng_listen(s0, addr, NULL, 0));
        CHECK(nng_pair_open(&s1));
        CHECK(nng_dial(s1, addr, NULL, 0));

        CHECK(nng_send(s0, "hello", 5, 0));
        CHECK(nng_recv(s1, buf, &got, 0));
        printf("%d\n", i);
        nng_close(s1);
        nng_close(s0);
    }

    return 0;
}

This is likely the same issue as #1219.


It seems relevant that the deadlock does not happen if the nng_send and nng_recv calls are removed.
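
For reference, saved as repro.c (name arbitrary) and with nng installed system-wide, the program should build with something along the lines of

cc repro.c -o repro -lnng -lpthread

It prints the iteration count on each pass until it eventually hangs inside nng_close.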

@codypiersall
Contributor Author

For those interested, see codypiersall/pynng#62 for discussion on pynng's issue tracker.

@gdamore
Contributor

gdamore commented May 3, 2020

What platform are you encountering this on?

@codypiersall
Contributor Author

Ubuntu 18.04 64-bit. I tested tcp and ipc, and only tcp had this issue. Some other folks confirmed it using the Python bindings as well. I don't expect this bug to come up in real usage, at least not frequently, but my unit tests kept triggering the deadlock :-\

@wtfuzz

wtfuzz commented May 3, 2020

Also reproducible on macOS 10.15.3 in both straight C and Python bindings

@wtfuzz

wtfuzz commented May 3, 2020

@gdamore it appears to be deadlocked waiting on the sock->s_cv condition variable, in the loop that checks whether all listeners, dialers, and pipes have been removed.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread'
  * frame #0: 0x00007fff71ab4ce6 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff71b76185 libsystem_pthread.dylib`_pthread_cond_wait + 701
    frame #2: 0x0000000100035abd main`nni_pthread_cond_wait(c=0x0000000101802850, m=0x0000000101802810) at posix_thread.c:120:12
    frame #3: 0x0000000100035a8d main`nni_plat_cv_wait(cv=0x0000000101802850) at posix_thread.c:181:2
    frame #4: 0x000000010002ae45 main`nni_cv_wait(cv=0x0000000101802850) at thread.c:51:2
    frame #5: 0x0000000100024390 main`nni_sock_shutdown(sock=0x0000000101802800) at socket.c:755:3
    frame #6: 0x0000000100024505 main`nni_sock_close(s=0x0000000101802800) at socket.c:781:2
    frame #7: 0x00000001000017b7 main`nng_close(s=(id = 26)) at nng.c:45:2
    frame #8: 0x0000000100001718 main`main at main.c:30:9
    frame #9: 0x00007fff719717fd libdyld.dylib`start + 1
    frame #10: 0x00007fff719717fd libdyld.dylib`start + 1
(lldb) f 5
frame #5: 0x0000000100024390 main`nni_sock_shutdown(sock=0x0000000101802800) at socket.c:755:3
   752 		    (!nni_list_empty(&sock->s_listeners)) ||
   753 		    (!nni_list_empty(&sock->s_dialers))) {
   754 	    printf("WAIT\n");
-> 755 			nni_cv_wait(&sock->s_cv);
   756 		}
   757
   758 		sock->s_sock_ops.sock_close(sock->s_data);
(lldb)

@gdamore
Contributor

gdamore commented May 4, 2020

Thank you for your test code.

I've reproduced this in the debugger, and I can see that the problem is that one of the pipes is still there. (The dialers and listeners have exited.) I need to analyze further. I have some ideas though.

@codypiersall
Contributor Author

Thanks for looking into this, Garrett. I really appreciate the work you're doing on nng!

@gdamore
Contributor

gdamore commented May 5, 2020

I've not had enough time to really dig in deep. What I've found is that the thread that is blocked is waiting on the pipe close; specifically, that pipe, created by a dialer, is blocked trying to stop the receive aio. This is in the context of the pipe_reap function.

@gdamore
Contributor

gdamore commented May 5, 2020

Ok, I think this is something super subtle.

Essentially, it relates to how the underlying TCP pipe is closed, and to our reliance on getting a callback from it when shutting down the upper-layer pipe (the SP layer in this case).

Basically, in order to prevent infinite recursion when shutting down the aio (in this case the receive aio), we simply discard attempts to perform I/O on the closed aio structure. Unfortunately, this means that if the pipe is closed in precisely this way, the callback from the TCP pipe (the lower layer) won't get called, and we need that callback to know that we can safely release the upper pipe. (This is to prevent use after free.)

While this seems to impact only one code path, I believe it's "fundamental" in nature, meaning that other I/O paths can experience the same bug.

I will need to think about how best to fix this -- it isn't trivial. The problem is that the simplified calling conventions mean that a submitter won't know about a "closed" aio.

I think we probably just need to take some extra precaution to avoid closing the AIOs on the lower-level pipe -- instead, the upper layer needs to take ownership of that.

Remember also that there are three layers: at the bottom is TCP, then the NNG SP TCP transport, then the pair protocol. (Actually there's a layer below TCP as well, but I don't think it is involved in this problem.)
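
To make the failure mode concrete, here is a minimal, self-contained illustration with invented names (upper_pipe, lower_submit, and so on are not nng's internals): the upper layer may only release its pipe after the lower layer's completion callback has fired, so if the lower layer silently discards a request against a closed aio, the callback never runs and the close path waits on its condition variable forever.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    bool            lower_done; /* set only by the lower layer's callback */
} upper_pipe;

/* Completion callback: the only thing that lets the upper layer proceed. */
static void lower_cb(void *arg)
{
    upper_pipe *p = arg;
    pthread_mutex_lock(&p->mtx);
    p->lower_done = true;
    pthread_cond_signal(&p->cv);
    pthread_mutex_unlock(&p->mtx);
}

/* Lower layer: even when it is already closed, it must still complete the
 * request somehow.  Silently dropping it (an early return here instead of
 * invoking lower_cb) is the behavior that deadlocks the waiter below. */
static void lower_submit(upper_pipe *p, bool closed)
{
    if (closed) {
        lower_cb(p); /* complete it, even if only with an error */
        return;
    }
    lower_cb(p); /* normal completion path */
}

/* Upper layer close: analogous to nni_sock_shutdown() waiting on sock->s_cv
 * until every pipe has been reaped. */
static void upper_close(upper_pipe *p)
{
    pthread_mutex_lock(&p->mtx);
    while (!p->lower_done) {
        pthread_cond_wait(&p->cv, &p->mtx);
    }
    pthread_mutex_unlock(&p->mtx);
    /* only now is it safe to free the pipe (no use-after-free) */
}

int main(void)
{
    upper_pipe p = { .lower_done = false };
    pthread_mutex_init(&p.mtx, NULL);
    pthread_cond_init(&p.cv, NULL);
    lower_submit(&p, true); /* the "already closed" path must still complete */
    upper_close(&p);
    printf("upper pipe released safely\n");
    return 0;
}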

gdamore added a commit that referenced this issue May 7, 2020
@gdamore
Contributor

gdamore commented May 7, 2020

Please see that commit / PR. @codypiersall, if you have the ability to check whether this fixes it, I'd appreciate it.

My above analysis wasn't quite correct. I think what is happening is a race if we wind up stopping an aio between the time it gets started with nni_aio_begin() and the time it is scheduled with nni_aio_schedule(). Basically, in that case nni_aio_schedule() will return an error.

Unfortunately that can lead to a situation where the task will never complete. The change in my PR attempts to release the task if nni_aio_schedule() does not succeed.
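
Roughly, the shape of the race and of that approach, as a toy sketch (toy_aio, toy_begin, and so on are invented names, not the real nni_aio implementation or the PR itself):

typedef struct {
    int stopped; /* set when the aio is being shut down                     */
    int done;    /* set once the operation completes (successfully or not)  */
    int error;
} toy_aio;

static int  toy_begin(toy_aio *a)           { return a->stopped ? -1 : 0; }
static int  toy_schedule(toy_aio *a)        { return a->stopped ? -1 : 0; }
static void toy_finish(toy_aio *a, int err) { a->done = 1; a->error = err; }

static void toy_submit(toy_aio *a)
{
    if (toy_begin(a) != 0) {
        return;            /* never started, so there is nothing to finish */
    }
    /* the aio can be stopped right here, between begin and schedule ...   */
    if (toy_schedule(a) != 0) {
        toy_finish(a, -1); /* the fix: fail the aio rather than dropping it;
                            * otherwise nothing ever marks it done and the
                            * close path waits on it forever               */
        return;
    }
    /* otherwise the I/O completion callback calls toy_finish() later      */
}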

@gdamore gdamore self-assigned this May 7, 2020
@gdamore gdamore added the bug label May 7, 2020
@codypiersall
Contributor Author

Just tested this and I'm getting the same deadlock :-(. Here's the main thread's backtrace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./a.out'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fc7fe7649f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x562ee4197bd0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
[Current thread is 1 (Thread 0x7fc7feb6d740 (LWP 14160))]
(gdb) bt
#0  0x00007fc7fe7649f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x562ee4197bd0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x562ee4197b80, cond=0x562ee4197ba8)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x562ee4197ba8, mutex=0x562ee4197b80) at pthread_cond_wait.c:655
#3  0x0000562ee243222b in nni_pthread_cond_wait (c=0x562ee4197ba8, m=0x562ee4197b80)
    at ../src/platform/posix/posix_thread.c:120
#4  0x0000562ee243239e in nni_plat_cv_wait (cv=0x562ee4197ba8)
    at ../src/platform/posix/posix_thread.c:181
#5  0x0000562ee242f21e in nni_cv_wait (cv=0x562ee4197ba8) at ../src/core/thread.c:51
#6  0x0000562ee242af70 in nni_sock_shutdown (sock=0x562ee4197b70) at ../src/core/socket.c:754
#7  0x0000562ee242b047 in nni_sock_close (s=0x562ee4197b70) at ../src/core/socket.c:780
#8  0x0000562ee241b02a in nng_close (s=...) at ../src/nng.c:45
#9  0x0000562ee241afc1 in main ()

Looks like it's waiting on the s_cv condition to be met.

@gdamore
Contributor

gdamore commented May 7, 2020

Thanks.

Bummer. I suspect that the changes I've made still fix a problem, but maybe not all of them.

The CV is just a condition variable. If you have the ability to look at this deeper in a debugger, you can probably see that the actual condition(s) aren't met -- previously it was a pipe that wasn't getting fully cleaned up. Diagnosing that requires looking at the state of other threads.

It will take some time to assess further, I think.

@gdamore
Contributor

gdamore commented May 17, 2020

I think I've figured it out (famous last words). It looks like it's a race that can occur when cancellation arises between nni_aio_begin() and nni_aio_schedule(). I'm pretty sure I know how to fix it -- basically we need cancellation to mark the aio as canceled, and nni_aio_schedule() should check that. Unlike the aio->a_stop flag, this one should be cleared automatically by nni_aio_begin().

For anyone paying attention, it's hard to hit this bug (it takes me many tens of thousands of trials), and it isn't restricted to any platform or transport -- it's a bug in the core AIO framework. I think this is a regression introduced in the 1.3.x series, as a result of some changes I made to reduce contention and improve performance.
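
As a toy model of that proposal (again invented names, not nng source), extending the earlier sketch with a per-operation canceled flag alongside the sticky stop flag:

typedef struct {
    int stopped;  /* sticky: set once the aio is being torn down for good  */
    int canceled; /* transient: the current operation has been canceled    */
} toy_aio2;

static int toy2_begin(toy_aio2 *a)
{
    if (a->stopped) {
        return -1;
    }
    a->canceled = 0;  /* cleared automatically for each new operation */
    return 0;
}

static void toy2_cancel(toy_aio2 *a)
{
    a->canceled = 1;  /* may land between begin() and schedule() */
}

static int toy2_schedule(toy_aio2 *a)
{
    /* refuse both a stopped aio and one whose current operation was
     * canceled; the submitter then completes the aio with an error,
     * as in the earlier sketch */
    return (a->stopped || a->canceled) ? -1 : 0;
}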

@gdamore
Contributor

gdamore commented May 18, 2020

Well, I did say famous last words... Coming back at this later, the condition I described above doesn't seem like it should occur. We do call nni_aio_close() first, and that should have resulted in the a_stop field being set, which should have prevented the task from being scheduled at all. I may need to add more debugging state to the aio.

@gdamore
Contributor

gdamore commented May 18, 2020

Holy crap. I think this is a stupid bug in the TCP cancellation. I think it probably only affects TCP, and only cancellation on the receive path. Stay tuned.

@gdamore
Contributor

gdamore commented May 18, 2020

So I've updated the PR with another commit. Right now I'm not able to reproduce the hang with these changes. @codypiersall can you retest?

@JoelStienlet

Ran the test program 100 times, 10,000 iterations each: no hang -> apparently the bug is corrected 👍

@codypiersall
Contributor Author

Awesome. I can also confirm that this bug is fixed. Thanks!
