Exceptions in Socket _recv_loop may cause thread to die silently #28
We do pass a timeout to paramiko's …
The code you linked there seems to do the opposite; it sets the timeout very low. One point of confusion here: where you're expecting to see something time out, is it because the TCP session dies, or something else? I would assume that setting timeouts on the socket only affects TCP-level stuckness. But if the SSH server or NETCONF server itself took a long time to reply, that wouldn't get caught by socket-level timeouts. I could be wrong there; I haven't done socket programming in a while, but that's how I'd expect socket-level timeouts to work.
To elaborate further: if it's e.g. the NETCONF server getting stuck at the NC protocol level, the TCP stream will still get ACK'd by the kernel, and as far as TCP is concerned everything is fine. Eventually, if the underlying service isn't pulling data off the socket, the socket's buffer may fill up and the kernel will stop ACK'ing, but that might never happen on a low-volume connection.
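For concreteness: a plain socket-level timeout does interrupt a blocking `recv` whenever no bytes arrive within the window, whichever layer is actually wedged. A minimal, self-contained sketch (the server here is a toy stand-in for a wedged NETCONF server, not anything from the library):

```python
import socket
import threading

def start_silent_server():
    """Accept a TCP connection but never send a byte, simulating a NETCONF
    server stuck at the protocol level while TCP itself stays healthy."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def run():
        conn, _ = srv.accept()
        threading.Event().wait()  # hold the connection open forever

    threading.Thread(target=run, daemon=True).start()
    return srv.getsockname()[1]

port = start_silent_server()
client = socket.create_connection(("127.0.0.1", port))
client.settimeout(0.5)  # socket-level timeout

try:
    client.recv(1024)  # blocks: no data will ever arrive
    result = "data"
except socket.timeout:
    result = "timed out"

print(result)  # → timed out
```

The timeout fires because the blocking read got no data, regardless of whether the stall was TCP-level or application-level; what matters downstream is whether that exception is allowed to propagate.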
@JennToo the operative bit that caught my eye in the linked code is that the socket timeout exception is never propagated. What I'm seeing in my testing is that whether I kill the NETCONF server while keeping the TCP socket open or take the device down entirely, subsequent NETCONF RPCs hang indefinitely. It looks like this is because (and I'm sort of live-blogging my troubleshooting in this thread) …
Oh, that's interesting. Are you using the ncclient-adapter manager from https://github.com/ADTRAN/netconf_client/blob/main/netconf_client/ncclient.py? We set a timeout there on the futures returned by this library's session handler, and that should actually be catching this too; it should catch a timeout regardless of which protocol layer things get stuck at. There might be a bug in the timeout logic in this library, though.
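The futures-with-timeout approach described above reduces to the standard `concurrent.futures` pattern (the names here are illustrative, not the manager's actual code):

```python
from concurrent.futures import Future, TimeoutError

# A future that nothing ever completes stands in for an RPC whose reply
# never arrives, at whatever protocol layer things got stuck.
pending_rpc = Future()

try:
    pending_rpc.result(timeout=0.1)
    outcome = "completed"
except TimeoutError:
    outcome = "timed out"

print(outcome)  # → timed out
```

Because the wait happens on the future rather than on the socket, it is indifferent to why the reply is missing, which is what makes the observed indefinite hang surprising.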
Yes, we are using the manager, but it's not catching it. I believe the reason is related to the fact that …
I did a re-read of how we do the timeout logic in the manager object and didn't spot any obvious bugs. It is a little complicated, though, so there could certainly be something wrong with it.
We actually spawn a thread too; all the NC-level protocol work happens on that thread, and the future is what the client thread waits on. I suppose it is possible, though, that something (possibly paramiko) is making a blocking call into some C code with the GIL held. That would prevent the other Python threads from running.
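The worker-thread-plus-future arrangement described here can be sketched as follows (a toy stand-in, assuming the session thread completes a `Future` that the client thread blocks on):

```python
import threading
import time
from concurrent.futures import Future

def session_thread(reply: Future):
    # Stands in for the thread doing the NC-level protocol work.
    time.sleep(0.05)  # pretend to parse a reply off the wire
    reply.set_result("<rpc-reply/>")

reply = Future()
threading.Thread(target=session_thread, args=(reply,), daemon=True).start()

# The client thread just waits on the future. This only returns promptly if
# the session thread can actually run, i.e. nothing is holding the GIL
# inside a long blocking C call.
result = reply.result(timeout=2)
print(result)  # → <rpc-reply/>
```

If the session thread is starved (or has died), the `result(timeout=...)` call is the only backstop, and only if its timeout logic is sound.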
My hypothesis is that there's a simple solution of calling …
Quick update: the thing I ran into is described below, though it is not -this- issue and may not be a …

Weird error details below the cut

I had an … I found that when the device was dead, the following were true: …
Thanks to #5, I got my code to work, but it bugs me that I don't understand in the slightest why 3-5 happen the way they do.
Wow! Yeah, that's very strange indeed. Some day I'd like this library to be properly AIO-aware and compatible. It was written well before that stuff got standardized, but long-term it would be good to make it async-native. I guess we'd also need (or at least want) an async-friendly SSH library too, or it'd be a bit moot. Realistically it'd be nearly a rewrite, though, and at least for the way our company uses this library (mainly for integration testing) it probably won't get priority any time soon. I'll change this bug's title to reflect what you found and leave it open, just in case anyone else stumbles into this. But it sounds like there's not much we can do to fix it within this library.
Okay, this came back up for me earlier this month, and after a lot of digging I believe I have found the ur-source of the problem at https://github.com/ADTRAN/netconf_client/blob/280d9d6e19828ae7c96d359ee3e2729b44e63a48/netconf_client/session.py#L107C1-L115C1. Things I have learned: …
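The failure mode in the issue title, a receive loop whose exception kills its thread silently while waiters hang, boils down to a few lines. The loop bodies below are stand-ins for illustration, not the library's actual code:

```python
import threading
from concurrent.futures import Future

pending = [Future()]  # futures handed out to callers awaiting replies

def fragile_recv_loop():
    try:
        raise OSError("socket closed")  # stand-in for a recv failure
    except Exception:
        return  # thread exits quietly; nobody waiting on `pending` finds out

def robust_recv_loop():
    try:
        raise OSError("socket closed")
    except Exception as exc:
        # Propagate the failure to every waiter instead of dying silently.
        for fut in pending:
            fut.set_exception(exc)

t = threading.Thread(target=fragile_recv_loop, daemon=True)
t.start()
t.join()
print(pending[0].done())  # → False: a caller blocked on this future hangs

robust_recv_loop()
print(pending[0].done())  # → True: callers see the error immediately
```

The fix direction is the second shape: any exception that terminates the loop must be handed to the outstanding futures before the thread exits.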
In my case, I found that an adequate workaround was to call …
This issue has vexed me once again, and I believe (again) that I have finally put it to bed. Pull request forthcoming.

This time I created a misbehaving NETCONF server for testing purposes. Replicating what we are seeing at a customer, I added a 2-hour hang in the middle of …. This allowed me to locate …. The solution, implemented in the PR I will file shortly after this comment, is a separate call to ….
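The specific call was lost from this transcript, but one detail worth knowing if the fix involves re-applying a timeout: socket timeouts are last-writer-wins, so a later, separate `settimeout` call governs all subsequent blocking reads regardless of what an earlier layer installed. A tiny demonstration:

```python
import socket

a, b = socket.socketpair()

# An SSH layer may install its own very short timeout on the shared socket
# (paramiko's Transport does something like this for its periodic checks).
a.settimeout(0.1)

# A later settimeout wins: re-applying the caller's timeout restores sane
# behaviour for long RPC waits.
a.settimeout(5.0)
print(a.gettimeout())  # → 5.0
```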
See Paramiko Transport constructor at:
https://github.com/paramiko/paramiko/blob/main/paramiko/transport.py#LL454-L457
The result of this is that netconf_client SSH connections that should time out are susceptible to hanging forever instead.
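The hang follows from combining that constructor behaviour with a reader that swallows the timeout exception. In sketch form (bounded to a few attempts so the demo terminates, unlike the real loop; the function name is made up for illustration):

```python
import socket

def recv_swallowing_timeouts(sock, attempts):
    """A reader that treats socket.timeout as 'check a flag and keep going'.
    Because the exception never propagates out of the loop, callers above
    it never observe a timeout, however long they wait."""
    for _ in range(attempts):
        try:
            return sock.recv(1024)
        except socket.timeout:
            continue  # swallowed
    return None

a, b = socket.socketpair()
a.settimeout(0.05)
print(recv_swallowing_timeouts(a, 3))  # → None: three timeouts, all swallowed
```

With an unbounded loop and no data ever arriving, the call simply never returns, which matches the observed forever-hang.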
Working on further investigation/resolution, putting this here for situational awareness.