Stomp connections get terminated by server but not cleaned up #7
Comments
Thanks for the really detailed issue report!
@martink2 please send me the dump you have, it would help uncover what may be tripping up the parser (that answers your 1st open question). As for the 2nd one, this may be a bug, and now that we have an error message that correlates with connections being open, it should be much easier to reproduce. @martink2 do you run a single node or multiple ones?
@michaelklishin We can see the error on both single nodes and clusters. Thanks, Martin
@martink2 can you post the output of What does
More questions: do your clients use NAT?
On average we have 11,000 connections.
For a failed connection from 10.97.168.128:54194:
netstat -n -tcp | grep 10.97.168.128:54194 : No output
For what sysctl key would you like me to grep?
As to the NAT, a few of our clients are behind a NAT but not the majority of them and it
But our clients use a lot of short-lived connections so the ephemeral port range rolls
Thanks, Martin
Yes, the ephemeral port range rollover is what I'm investigating. I'm after the You can send me the output privately,
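To put the rollover concern in numbers, here is a rough sketch. It assumes the two-integer "low high" format of the Linux `net.ipv4.ip_local_port_range` setting and a made-up connection rate; neither the sample range nor the rate was measured on the systems in this report, and the function names are only illustrative.

```python
from typing import Tuple


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse a 'low high' pair, the format used by ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    if low > high:
        raise ValueError("low port must not exceed high port")
    return low, high


def seconds_until_rollover(low: int, high: int, connections_per_second: float) -> float:
    """Time for one host to cycle through its whole ephemeral range once."""
    ports = high - low + 1  # the range is inclusive on both ends
    return ports / connections_per_second


if __name__ == "__main__":
    # Sample values for illustration only, not taken from the report.
    low, high = parse_port_range("32768\t61000\n")
    print(seconds_until_rollover(low, high, connections_per_second=50.0))
```

The range applies per client host (or per NAT device), so a client opening many short-lived connections to the same broker address starts reusing source ports after roughly this long.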
Some findings from today:
Will try with clients doing the same thing concurrently.
OK, I believe I have a reliable way to reproduce this. The issue only has so much to do with high TCP connection churn and all that: the reader does not tell the processor to terminate in every case. Phew.
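The failure mode can be sketched outside Erlang. What follows is a minimal Python illustration, not the plugin's code: `reader`, `processor`, `run_connection` and `TERMINATE` are hypothetical names, and a thread with a queue stands in for Erlang processes and messages. The point is that the terminate signal is sent from a `finally` block, so it goes out on every exit path, including a parse error.

```python
import queue
import threading

TERMINATE = object()  # sentinel the reader sends when the connection ends


def processor(inbox, handled, done):
    """Handle frames until told to terminate, then run cleanup."""
    while True:
        item = inbox.get()
        if item is TERMINATE:
            break
        handled.append(item)
    done.set()  # stands in for removing the connection from the table


def reader(frames, inbox):
    """Forward frames; always notify the processor, even on failure."""
    try:
        for frame in frames:
            if frame is None:  # stands in for a frame the parser rejects
                raise ValueError("unparseable frame")
            inbox.put(frame)
    finally:
        inbox.put(TERMINATE)  # sent on normal end and on error alike


def run_connection(frames, timeout=2.0):
    inbox = queue.Queue()
    handled = []
    done = threading.Event()
    worker = threading.Thread(
        target=processor, args=(inbox, handled, done), daemon=True
    )
    worker.start()
    try:
        reader(frames, inbox)
    except ValueError:
        pass  # the finally block has already notified the processor
    worker.join(timeout)  # give the processor time to finish its work
    return handled, done.is_set()


if __name__ == "__main__":
    print(run_connection(["SEND", None, "SEND"]))
```

If the `inbox.put(TERMINATE)` call sat only at the end of the loop instead of in `finally`, the error path would leave the processor waiting forever, which is the kind of leftover state described in this issue.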
I deployed the fixed code to our production node and it looks very good, thanks a lot. Martin
We try to release a bugfix release once a month. 3.5.2 should be released around the first half of May.
That said, we also have nightly builds.
@martink2 thank you very much for giving it a try so quickly!
Why not just link the two? Then any network-related issue would produce a scary crash report in the SASL log, and the processor would get no time to finish what it may still be doing. Fixes #7.
Background
We have been running a large MCollective installation and are having problems with client connections getting terminated but never disappearing from the RabbitMQ connection table,
resulting in a lot of "zombie" connections which sooner or later bring down the RabbitMQ server
due to resource exhaustion. This is to follow up on the conversation started at:
Google Groups
After a longer investigation we figured out that the connections are actually not lost due to WAN link problems; they are terminated by the server but never cleaned up.
Current Setup
OS: RHEL 6.6
Erlang: 17.5-1.el6.x86_64
RabbitMQ: rabbitmq-server-3.5.1-1
Problematic Behaviour
When a STOMP message to the server results in the following:
The connection is closed by the server but never cleaned out of the connection table,
and the subscriptions from that client connection are still active.
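One way to spot such connections, sketched here under assumptions: collect the peer `host:port` strings the broker still lists and the ones the OS still has sockets for, then take the difference. How the two lists are gathered (for example from `rabbitmqctl report` and `netstat -an` output) is left out; the function name is hypothetical and the second sample address is a documentation address, not real data.

```python
from typing import Iterable, Set


def find_zombies(broker_peers: Iterable[str], os_peers: Iterable[str]) -> Set[str]:
    """Return peers the broker still tracks but the OS has no socket for."""
    return set(broker_peers) - set(os_peers)


if __name__ == "__main__":
    # Illustrative input: the first peer is listed by the broker only.
    broker_peers = ["10.97.168.128:54194", "192.0.2.10:40000"]
    os_peers = ["192.0.2.10:40000"]
    print(find_zombies(broker_peers, os_peers))
```

A peer that shows up in the result matches the symptom here: the TCP connection is gone at the OS level while the broker still holds the connection and its subscriptions.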
Recorded Case
I tried to provide as much detail as I could from an example connection, so I performed a tcpdump on the running system from connection setup until the error:
State after Connection
Log:
netstat -an:
rabbitmqctl report:
erlang process state:
Client Server Conversation:
Payload:
TCP Packets
State after Connection Termination
Log:
netstat -an:
rabbitmqctl report:
erlang process state:
Open questions
I can provide a raw tcpdump pcap file; please let me know if I should produce more debug output.
From what we can see, the server sends a proper FIN to the client, which gets ACK'd and should
result in a connection cleanup.