
Deadlock during nng_close() - multi platform #1813

Closed
mikisch81 opened this issue Apr 16, 2024 · 23 comments · Fixed by #1824

Comments

@mikisch81

Describe the bug

Continued from the closed #1543

When calling nng_close() there is sometimes a deadlock which causes nng_close() to hang.
This also happens when using only sync APIs (no AIOs).

Expected behavior

nng_close should finish successfully.

Actual Behavior

nng_close() hangs.

To Reproduce

I created a modified version of the reqrep example code here (I use it with IPC transport): https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

In the modified example, in the client code, right before calling nng_recv() to wait for the reply from the server, I start a thread which just calls nng_close(). After a couple of successful runs the deadlock happens (a minimal sketch of this pattern follows the log below):

mischwar@tlv-mpawy reqrep % ./reqrep client ipc:///tmp/reqrep_test
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
...
...
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET. <--- The thread is stuck in nng_close here
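
A minimal sketch of the close-from-another-thread pattern described above (hypothetical code, not the exact gist; the helper names, the use of pthreads, and the error handling are my own assumptions):

// Sketch: send a request, then close the socket from a second thread while
// the main thread blocks in nng_recv(). Expected: nng_recv() returns NNG_ECLOSED.
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>
#include <pthread.h>
#include <stdio.h>

static void *
closer(void *arg)
{
	nng_socket *sock = arg;
	printf("CLIENT: CLOSING SOCKET\n");
	nng_close(*sock); // the hang, when it reproduces, is here
	return (NULL);
}

static int
client_once(const char *url)
{
	nng_socket sock;
	pthread_t  thr;
	char      *buf = NULL;
	size_t     sz;
	int        rv;

	if ((rv = nng_req0_open(&sock)) != 0) {
		return (rv);
	}
	if ((rv = nng_dial(sock, url, NULL, 0)) != 0) {
		return (rv);
	}
	printf("CLIENT: SENDING DATE REQUEST\n");
	if ((rv = nng_send(sock, "DATE", 5, 0)) != 0) {
		return (rv);
	}
	printf("CLIENT: WAITING FOR REPLY\n");
	pthread_create(&thr, NULL, closer, &sock);
	if ((rv = nng_recv(sock, &buf, &sz, NNG_FLAG_ALLOC)) != 0) {
		fprintf(stderr, "nng_recv error - %s\n", nng_strerror(rv));
	} else {
		nng_free(buf, sz);
	}
	pthread_join(thr, NULL);
	return (0);
}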

Environment Details

  • NNG version: 1.7.3
  • Operating system and version: MacOS Sonoma 14.4.1 (but also happens on Windows and Linux)
  • Compiler and language used: C/C++ clang
  • Shared or static library - static

Additional context

Here is a snapshot of the threads in the client app during the deadlock reproduction:
[Screenshot: thread stacks in the client app during the deadlock]

I recall that the initial suspect was an application callback that does not complete:

Being blocked on nni_aio_wait (as above) and especially task_wait is an indication that your application has a callback executing that is not completing.

So in this example code there is no application callback at all and only blocking APIs are called.
The expected result (as I understand) is that nng_recv will always return ECLOSED.

@mikisch81
Author

@gdamore
I updated the modified reqrep example code which reproduces the issue: https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

@KnappSas

KnappSas commented Apr 23, 2024

Apologies for the thread hijack, but I wanted to bring up an issue that aligns with this discussion, even though it seems to be exclusive to Windows on our side.

We've been experiencing similar deadlocks on our end when closing IPC sockets (reqrep and pair sockets), happening 1 to 2 times daily in our QA. We've been using nng 1.7.3, but I have now also tried 1.8.0 to utilize the new logging functions.

I've been trying to reproduce the issue within our system and have managed to narrow it down to the later part of the nni_sock_shutdown function, where it then hangs in nni_cv_wait. Despite efforts, I haven't been successful in reducing it to a simple NNG-only example yet.

Could anyone offer insights into potential debugging steps or suggest additional logging that might help us narrow down the root cause?

Any assistance would be greatly appreciated. Thank you!
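
(As a possible starting point, and assuming the logging API introduced in nng 1.8.0, nng_log_set_logger() and nng_log_set_level(), a minimal sketch to surface nng's internal messages on stderr would be:)

#include <nng/nng.h>

// Route nng's internal log output to stderr and raise the verbosity so
// that socket shutdown activity becomes visible. Call once at startup.
static void
enable_debug_logging(void)
{
	nng_log_set_logger(nng_stderr_logger);
	nng_log_set_level(NNG_LOG_DEBUG);
}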

@leowang552

leowang552 commented Apr 23, 2024

Environment Details:
git hash SHA-1: "2b967f48a669f7199c40bc730f800e53d6d4fb72"

vs 2022
SDK 10.0.22621.0
Reproduce:
aio_test.c

void
test_traffic(void)
{
	nng_socket s1;
	nng_socket s2;
	nng_aio   *tx_aio;
	nng_aio   *rx_aio;
	int        tx_done = 0;
	int        rx_done = 0;
	nng_msg   *m;
	char      *addr = "inproc://traffic";

	NUTS_PASS(nng_pair1_open(&s1));
	NUTS_PASS(nng_listen(s1, addr, NULL, 0));
	NUTS_PASS(nng_close(s1)); // **freeze**
}
NUTS_TESTS = {
	{ "sleep", test_sleep },
	//{ "sleep timeout", test_sleep_timeout },
	//{ "insane nio", test_insane_nio },
	//{ "provider cancel", test_provider_cancel },
	//{ "consumer cancel", test_consumer_cancel },
	{ "traffic", test_traffic },
	{ "explicit timeout", test_explicit_timeout },
	{ "explicit expire", test_explicit_expiration },
	{ "inherited timeout", test_inherited_timeout },
	{ "zero timeout", test_zero_timeout },
	{ "aio reap", test_aio_reap },
	{ "sleep loop", test_sleep_loop },
	{ "sleep cancel", test_sleep_cancel },
	{ "aio busy", test_aio_busy },
	{ NULL, NULL },
};

@gdamore
Contributor

gdamore commented Apr 23, 2024

Does this occur when using synchronous APIs or only when using aio? If your callbacks are hanging that could cause this behavior.

If you have callback functions on the socket can you share them?

@KnappSas

KnappSas commented Apr 23, 2024

On my side I'm using the synchronous API. I forgot to mention that I recently switched from nanomsg to nng, and I'm using the compatibility layer.

@leowang552

leowang552 commented Apr 23, 2024

Does this occur when using synchronous APIs or only when using aio? If your callbacks are hanging that could cause this behavior.

If you have callback functions on the socket can you share them?

The test file is aio_test.c, at nng/src/core/aio_test.c.

Firstly, apologies for my English. I'm very bullish on nng.
This issue occurs in the x64 debug build; I also tried the x64 release build and it happens there as well.
You can view this: https://github.com/nanomsg/nng/issues?q=nng_close.
Thanks in advance.

@KnappSas

I'm also seeing this when running the tests in pair0_test.c.

@gdamore
Contributor

gdamore commented Apr 24, 2024

Wait, these are occurring with the included test suites, not your own code?

Can I get details about the system you are using, including OS, CPU, and compiler? Is this running in a cloud or virtual environment?

@gdamore
Contributor

gdamore commented Apr 24, 2024

Ok I see the details there. Will update as soon as I have a diagnosis.

@leowang552

I have fixed the bug; the change is in nni_reap_sys_init() (presumably reap_exit is left set to true by an earlier fini, so the reap worker exits immediately after re-initialization):

int
nni_reap_sys_init(void)
{
	int rv;

	reap_exit = false;  // added: reset the flag
	// If this fails, we don't fail init, instead we will try to
	// start up at reap time.
	if ((rv = nni_thr_init(&reap_thr, reap_worker, NULL)) != 0) {
		return (rv);
	}
	nni_thr_run(&reap_thr);
	return (0);
}

@KnappSas

I can confirm that the above from leowang fixes the problem I see in the test suite. Of course I cannot tell yet if it fixes the issue within our system too.

@mikisch81
Author

@gdamore can you please address the original issue I opened? Everything is detailed there; I don't want to lose context due to other examples from other users, which should be in separate issues.
I use sync APIs, it happens on multiple platforms, and I gave a link to a simple example which easily reproduces this issue.

@leowang552

@mikisch81 can you share your code? I tried the code, but could not reproduce the issue.

@alzix
Contributor

alzix commented Apr 24, 2024

the issue reproduces fairly easily on both macOS and Windows based on the modified demo/reqrep example

@leowang552

Please post your modified code

@alzix
Contributor

alzix commented Apr 24, 2024

diff --git a/demo/reqrep/reqrep.c b/demo/reqrep/reqrep.c
index 62ff29a4..a9ab2cd8 100644
--- a/demo/reqrep/reqrep.c
+++ b/demo/reqrep/reqrep.c
@@ -72,6 +72,7 @@ showdate(time_t now)
 int
 server(const char *url)
 {
+        for (;;) {
 	nng_socket sock;
 	nng_listener listener;
 	int        rv;
@@ -100,7 +101,6 @@ server(const char *url)
 	}
 	nng_listener_start(listener, 0);
 
-	for (;;) {
 		char *   buf = NULL;
 		size_t   sz;
 		uint64_t val;
@@ -127,10 +127,22 @@ server(const char *url)
 			if (rv != 0) {
 				fatal("nng_send", rv);
 			}
+                        rv  = nng_listener_close(listener);
+                        if (rv != 0) {
+                                fatal("nng_listener_close", rv);
+                        }
+                        rv  = nng_close(sock);
+                        if (rv != 0) {
+                                fatal("nng_close", rv);
+                        }
 			continue;
 		}
 		// Unrecognized command, so toss the buffer.
 		nng_free(buf, sz);
+                rv  = nng_close(sock);
+                if (rv != 0) {
+                        fatal("nng_close", rv);
+                }
 	}
 }
 
@@ -189,11 +201,11 @@ client(const char *url)
 		} else {
 			printf("CLIENT: GOT WRONG SIZE!\n");
 		}
-		nng_msleep(sleep);
-		sleep++;
-		if (sleep == 4) {
-			sleep = 4000;
-		}
+//		nng_msleep(sleep);
+//		sleep++;
+//		if (sleep == 4) {
+//			sleep = 4000;
+//		}
 	}
 
 	// This assumes that buf is ASCIIZ (zero terminated).
@@ -213,4 +225,4 @@ main(const int argc, const char **argv)
 
 	fprintf(stderr, "Usage: reqrep %s|%s <URL> ...\n", CLIENT, SERVER);
 	return (1);
-}
+}
\ No newline at end of file

@alzix
Contributor

alzix commented Apr 24, 2024

the changes to make the issue more easily reproducible:

  1. tweak the client code to bombard the server with requests without sleeping
  2. tweak the server code to close the socket after each reply

When the server closes the socket while there is a busy receive aio task, the system goes into a deadlock.
Based on my observations the same issue occurs when the client is in a blocking nng_recv and the server responds just before the client socket is closed.

@leowang552

[quoted: the diff from alzix's comment above]

I tested the modified code.
The server side prints "nng_recv: Object closed" and quits. The client keeps waiting to receive.
The client prints RECEIVED DATA if I restart the server.
[Screenshot of the output]

@alzix
Contributor

alzix commented Apr 24, 2024

the server never exits in the normal flow; in my case it hangs after 20-30 iterations.
what version/hash of nng are you using?
did you move the for loop as shown in the diff? In your screenshots I see the server exits after every receive; you will not be able to reproduce it easily in this way.

@alzix
Contributor

alzix commented Apr 24, 2024

consider saving the diff to a text file and applying it as a git diff, or just take the whole reqrep.c from the gist

@leowang552

Sorry! Please forgive me for ignoring the for loop in the diff.
Environment:
git hash SHA-1: "1941fdae7eca1b92a957dfffa6a1ec7ef549dc60"

vs 2022
SDK 10.0.22621.0
win 10 64bit

Result after running for 10 minutes:
[Screenshot of the output]

my code file is from https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

@alzix
Contributor

alzix commented Apr 24, 2024

IIUC the issue is caused by the fact that the aio we are waiting on has no cancel function.
[Screenshot: debugger showing the waited-on aio with a NULL cancel function]

So the cancel step in nni_aio_stop does not actually do anything, and it keeps waiting forever.

void
nni_aio_stop(nni_aio *aio)
{
	if (aio != NULL) {
		nni_aio_cancel_fn fn;
		void             *arg;
		nni_aio_expire_q *eq = aio->a_expire_q;

		nni_mtx_lock(&eq->eq_mtx);
		nni_aio_expire_rm(aio);
		fn                = aio->a_cancel_fn;
		arg               = aio->a_cancel_arg;
		aio->a_cancel_fn  = NULL;
		aio->a_cancel_arg = NULL;
		aio->a_stop       = true;
		nni_mtx_unlock(&eq->eq_mtx);

		if (fn != NULL) {
			fn(aio, arg, NNG_ECANCELED);
		}
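		// Note (reasoning from the discussion above): with no cancel
		// function registered, fn is NULL here, so nothing aborts the
		// in-flight operation, and the nni_aio_wait() below can block
		// forever.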

		nni_aio_wait(aio);
	}
} 

@alzix
Contributor

alzix commented Apr 24, 2024

a potential fix is proposed in #1824; with it, the modified demo has already been running for quite a long time on my end
