Deadlock triggered on nng_close #1236

Closed
codypiersall opened this issue Apr 27, 2020 · 18 comments

@codypiersall
Contributor

Creator of pynng here. Under certain circumstances our unit tests were triggering a deadlock when closing sockets, and I was eventually able to reproduce it in plain C. I think the key point here is that a message is actually sent and received right before the sockets are closed (see the note after the code).

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

#define CHECK(x) do { \
    int ret = (x); \
    if (ret != 0) { \
        printf("error %d: %s\n", ret, nng_strerror(ret));\
        abort(); \
    } \
} while (0)

char addr[] = "tcp://localhost:9899";
int main() {
    char buf[50];
    for (int i = 0; ; i++) {
        size_t got = sizeof buf;
        nng_socket s0, s1;
        CHECK(nng_pair_open(&s0));
        CHECK(nng_listen(s0, addr, NULL, 0));
        CHECK(nng_pair_open(&s1));
        CHECK(nng_dial(s1, addr, NULL, 0));

        CHECK(nng_send(s0, "hello", 5, 0));
        CHECK(nng_recv(s1, buf, &got, 0));
        printf("%d\n", i);
        nng_close(s1);
        nng_close(s0);
    }

    return 0;
}

This is likely the same issue as #1219.


It seems relevant that the deadlock does not happen if the nng_send and nng_recv calls are removed.
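
For reference, saved as repro.c (name arbitrary) and with nng installed system-wide, the program should build with something along the lines of

cc repro.c -o repro -lnng -lpthread

It prints the iteration count on each pass until it eventually hangs inside nng_close.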

@codypiersall
Contributor Author

For those interested, see codypiersall/pynng#62 for discussion on pynng's issue tracker.

@gdamore
Contributor

gdamore commented May 3, 2020

What platform are you encountering this on?

@codypiersall
Contributor Author

Ubuntu 18.04 64-bit. I tested tcp and ipc, and only tcp had this issue. Some other folks confirmed it using the Python bindings as well. I don't expect this bug to come up in real usage, at least not frequently, but my unit tests kept triggering the deadlock :-\

@wtfuzz

wtfuzz commented May 3, 2020

Also reproducible on macOS 10.15.3 in both straight C and Python bindings

@wtfuzz

wtfuzz commented May 3, 2020

@gdamore it appears to be deadlocked waiting on the sock->s_cv condition variable, in the loop that checks whether all listeners, dialers, and pipes have been removed.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread'
  * frame #0: 0x00007fff71ab4ce6 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff71b76185 libsystem_pthread.dylib`_pthread_cond_wait + 701
    frame #2: 0x0000000100035abd main`nni_pthread_cond_wait(c=0x0000000101802850, m=0x0000000101802810) at posix_thread.c:120:12
    frame #3: 0x0000000100035a8d main`nni_plat_cv_wait(cv=0x0000000101802850) at posix_thread.c:181:2
    frame #4: 0x000000010002ae45 main`nni_cv_wait(cv=0x0000000101802850) at thread.c:51:2
    frame #5: 0x0000000100024390 main`nni_sock_shutdown(sock=0x0000000101802800) at socket.c:755:3
    frame #6: 0x0000000100024505 main`nni_sock_close(s=0x0000000101802800) at socket.c:781:2
    frame #7: 0x00000001000017b7 main`nng_close(s=(id = 26)) at nng.c:45:2
    frame #8: 0x0000000100001718 main`main at main.c:30:9
    frame #9: 0x00007fff719717fd libdyld.dylib`start + 1
    frame #10: 0x00007fff719717fd libdyld.dylib`start + 1
(lldb) f 5
frame #5: 0x0000000100024390 main`nni_sock_shutdown(sock=0x0000000101802800) at socket.c:755:3
   752 		    (!nni_list_empty(&sock->s_listeners)) ||
   753 		    (!nni_list_empty(&sock->s_dialers))) {
   754 	    printf("WAIT\n");
-> 755 			nni_cv_wait(&sock->s_cv);
   756 		}
   757
   758 		sock->s_sock_ops.sock_close(sock->s_data);
(lldb)

@gdamore
Contributor

gdamore commented May 4, 2020

Thank you for your test code.

I've reproduced this in the debugger, and I can see that the problem is that one of the pipes is still there. (The dialers and listeners have exited.) I need to analyze further. I have some ideas though.

@codypiersall
Contributor Author

Thanks for looking into this, Garrett. I really appreciate the work you're doing on nng!

@gdamore
Contributor

gdamore commented May 5, 2020

I've not had enough time to really dig in deep. What I've found is that the thread that is blocked is waiting on the pipe close; specifically, that pipe, created by a dialer, is blocked trying to stop the receive aio. This is in the context of the pipe_reap function.

@gdamore
Contributor

gdamore commented May 5, 2020

Ok, I think this is something super subtle.

Essentially, it relates to how the underlying TCP pipe is closed, and to our reliance on getting a callback from it when shutting down the upper-layer pipe (the SP layer in this case).

Basically, in order to prevent infinite recursion when shutting down the aio (in this case the receive aio), we simply discard attempts to perform I/O on the closed aio structure. Unfortunately, this means that if the pipe is closed in precisely this way, the callback from the TCP pipe (the lower layer) won't get called, and we need that callback to know that we can safely release the upper pipe. (This is to prevent use after free.)

While this seems to impact only one code path, I believe it's "fundamental" in nature, meaning that other I/O paths can experience the same bug.

I will need to think about how best to fix this -- it isn't trivial. The problem is that the simplified calling conventions mean that a submitter won't know about a "closed" aio.

I think we probably just need to take some extra precaution to avoid closing the AIOs on the lower-level pipe -- instead, the upper layer needs to take ownership of that.

Remember also that there are three layers: at the bottom is TCP, then the NNG SP TCP transport, then the pair protocol. (Actually there's a layer below TCP as well, but I don't think it is involved in this problem.)
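
To make the failure mode concrete, here is a minimal, self-contained illustration with invented names (upper_pipe, lower_submit, and so on are not nng's internals): the upper layer may only release its pipe after the lower layer's completion callback has fired, so if the lower layer silently discards a request against a closed aio, the callback never runs and the close path waits on its condition variable forever.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    bool            lower_done; /* set only by the lower layer's callback */
} upper_pipe;

/* Completion callback: the only thing that lets the upper layer proceed. */
static void lower_cb(void *arg)
{
    upper_pipe *p = arg;
    pthread_mutex_lock(&p->mtx);
    p->lower_done = true;
    pthread_cond_signal(&p->cv);
    pthread_mutex_unlock(&p->mtx);
}

/* Lower layer: even when it is already closed, it must still complete the
 * request somehow.  Silently dropping it (an early return here instead of
 * invoking lower_cb) is the behavior that deadlocks the waiter below. */
static void lower_submit(upper_pipe *p, bool closed)
{
    if (closed) {
        lower_cb(p); /* complete it, even if only with an error */
        return;
    }
    lower_cb(p); /* normal completion path */
}

/* Upper layer close: analogous to nni_sock_shutdown() waiting on sock->s_cv
 * until every pipe has been reaped. */
static void upper_close(upper_pipe *p)
{
    pthread_mutex_lock(&p->mtx);
    while (!p->lower_done) {
        pthread_cond_wait(&p->cv, &p->mtx);
    }
    pthread_mutex_unlock(&p->mtx);
    /* only now is it safe to free the pipe (no use-after-free) */
}

int main(void)
{
    upper_pipe p = { .lower_done = false };
    pthread_mutex_init(&p.mtx, NULL);
    pthread_cond_init(&p.cv, NULL);
    lower_submit(&p, true); /* the "already closed" path must still complete */
    upper_close(&p);
    printf("upper pipe released safely\n");
    return 0;
}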

gdamore added a commit that referenced this issue May 7, 2020
@gdamore
Contributor

gdamore commented May 7, 2020

Please see that commit / PR. @codypiersall, if you have the ability to check whether this fixes it, I'd appreciate it.

My above analysis wasn't quite correct. I think what is happening is a race if we wind up stopping an aio between the time it gets started with nni_aio_begin() and the time it is scheduled with nni_aio_schedule(). Basically, in that case nni_aio_schedule() will return an error.

Unfortunately that can lead to a situation where the task will never complete. The change in my PR attempts to release the task if nni_aio_schedule() does not succeed.
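
Roughly, the shape of the race and of that approach, as a toy sketch (toy_aio, toy_begin, and so on are invented names, not the real nni_aio implementation or the PR itself):

typedef struct {
    int stopped; /* set when the aio is being shut down                     */
    int done;    /* set once the operation completes (successfully or not)  */
    int error;
} toy_aio;

static int  toy_begin(toy_aio *a)           { return a->stopped ? -1 : 0; }
static int  toy_schedule(toy_aio *a)        { return a->stopped ? -1 : 0; }
static void toy_finish(toy_aio *a, int err) { a->done = 1; a->error = err; }

static void toy_submit(toy_aio *a)
{
    if (toy_begin(a) != 0) {
        return;            /* never started, so there is nothing to finish */
    }
    /* the aio can be stopped right here, between begin and schedule ...   */
    if (toy_schedule(a) != 0) {
        toy_finish(a, -1); /* the fix: fail the aio rather than dropping it;
                            * otherwise nothing ever marks it done and the
                            * close path waits on it forever               */
        return;
    }
    /* otherwise the I/O completion callback calls toy_finish() later      */
}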

@gdamore gdamore self-assigned this May 7, 2020
@gdamore gdamore added the bug label May 7, 2020
@codypiersall
Contributor Author

Just tested this and I'm getting the same deadlock :-(. Here's the main thread's backtrace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./a.out'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fc7fe7649f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x562ee4197bd0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
[Current thread is 1 (Thread 0x7fc7feb6d740 (LWP 14160))]
(gdb) bt
#0  0x00007fc7fe7649f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x562ee4197bd0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x562ee4197b80, cond=0x562ee4197ba8)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x562ee4197ba8, mutex=0x562ee4197b80) at pthread_cond_wait.c:655
#3  0x0000562ee243222b in nni_pthread_cond_wait (c=0x562ee4197ba8, m=0x562ee4197b80)
    at ../src/platform/posix/posix_thread.c:120
#4  0x0000562ee243239e in nni_plat_cv_wait (cv=0x562ee4197ba8)
    at ../src/platform/posix/posix_thread.c:181
#5  0x0000562ee242f21e in nni_cv_wait (cv=0x562ee4197ba8) at ../src/core/thread.c:51
#6  0x0000562ee242af70 in nni_sock_shutdown (sock=0x562ee4197b70) at ../src/core/socket.c:754
#7  0x0000562ee242b047 in nni_sock_close (s=0x562ee4197b70) at ../src/core/socket.c:780
#8  0x0000562ee241b02a in nng_close (s=...) at ../src/nng.c:45
#9  0x0000562ee241afc1 in main ()

Looks like it's waiting on the s_cv condition to be met.

@gdamore
Contributor

gdamore commented May 7, 2020

Thanks.

Bummer. I suspect that the changes I've made still fix a problem, but maybe not all of them.

The CV is just a condition variable. If you have the ability to look at this deeper in a debugger, you can probably see that the actual condition(s) aren't met -- previously it was a pipe that wasn't getting fully cleaned up. Diagnosing that requires looking at the state of other threads.

It will take some time to assess further, I think.

@gdamore
Contributor

gdamore commented May 17, 2020

I think I've figured it out (famous last words). It looks like it's a race that can occur when cancellation arises between nni_aio_begin() and nni_aio_schedule(). I'm pretty sure I know how to fix it -- basically we need cancellation to mark the aio as canceled, and nni_aio_schedule() should check that. Unlike the aio->a_stop flag, this one should be cleared automatically by nni_aio_begin().

For anyone paying attention, it's hard to hit this bug (it takes me many tens of thousands of trials), and it isn't restricted to any platform or transport -- it's a bug in the core AIO framework. I think this is a regression introduced in the 1.3.x series, as a result of some changes I made to reduce contention and improve performance.
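
As a toy model of that proposal (again invented names, not nng source), extending the earlier sketch with a per-operation canceled flag alongside the sticky stop flag:

typedef struct {
    int stopped;  /* sticky: set once the aio is being torn down for good  */
    int canceled; /* transient: the current operation has been canceled    */
} toy_aio2;

static int toy2_begin(toy_aio2 *a)
{
    if (a->stopped) {
        return -1;
    }
    a->canceled = 0;  /* cleared automatically for each new operation */
    return 0;
}

static void toy2_cancel(toy_aio2 *a)
{
    a->canceled = 1;  /* may land between begin() and schedule() */
}

static int toy2_schedule(toy_aio2 *a)
{
    /* refuse both a stopped aio and one whose current operation was
     * canceled; the submitter then completes the aio with an error,
     * as in the earlier sketch */
    return (a->stopped || a->canceled) ? -1 : 0;
}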

@gdamore
Contributor

gdamore commented May 18, 2020

Well, I did say famous last words... Coming back at this later, the condition I described above doesn't seem like it should occur. We do call nni_aio_close() first, and that should have resulted in the a_stop field being set, which should have prevented the task from being scheduled at all. I may need to add more debugging state to the aio.

@gdamore
Contributor

gdamore commented May 18, 2020

Holy crap. I think this is a stupid bug in the TCP cancellation. I think it probably only affects TCP, and only cancellation on the receive path. Stay tuned.

@gdamore
Contributor

gdamore commented May 18, 2020

So I've updated the PR with another commit. Right now I'm not able to reproduce the hang with these changes. @codypiersall can you retest?

@JoelStienlet

Ran the test program 100 times, 10,000 iterations each: no hang -> apparently the bug is corrected 👍

@codypiersall
Contributor Author

Awesome. I can also confirm that this bug is fixed. Thanks!
