This repository has been archived by the owner on Nov 18, 2020. It is now read-only.

Stomp connections get terminated by server but not cleaned up #7

Closed
martink2 opened this issue Apr 5, 2015 · 13 comments

@martink2

martink2 commented Apr 5, 2015

Background

We have been running a large MCollective installation and are having problems with client connections getting terminated but never disappearing from the RabbitMQ connection table. This results in a lot of "zombie" connections which sooner or later bring down the RabbitMQ server due to resource exhaustion. This is a follow-up to the conversation started at:

Google Groups

After a longer investigation we figured out that the connections are actually not lost due to WAN link problems; they are terminated by the server but never cleaned out.

Current Setup

OS: RHEL 6.6

Erlang: 17.5-1.el6.x86_64

Rabbitmq: rabbitmq-server-3.5.1-1

Problematic Behaviour

When a STOMP message to the server results in the following error:

STOMP detected network error on <0.11074.0> (10.97.6.171:56107 -> 10.67.76.191:61613):
{badmatch,<<"\n">>}

The connection is closed by the server but never cleaned out of the connection table,
and the subscriptions from that client connection remain active.
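
One way to spot such leftover connections in bulk (a quick diagnostic sketch, not part of the original report; it assumes it runs on the broker host, that rabbitmqctl and ss are on the PATH, and that STOMP listens on 61613) is to compare the peers RabbitMQ still reports against the sockets the kernel actually has:

#!/usr/bin/env python
# Diagnostic sketch (hypothetical helper): list peers that RabbitMQ still
# reports but the kernel no longer has a socket for.
import subprocess

def rabbit_peers():
    out = subprocess.check_output(
        ["rabbitmqctl", "-q", "list_connections", "peer_host", "peer_port"])
    peers = set()
    for line in out.decode().splitlines():
        parts = line.split()
        if len(parts) >= 2:
            peers.add((parts[0], parts[1]))
    return peers

def kernel_peers(port="61613"):
    # "ss -tn sport = :61613" lists TCP sockets on the STOMP listener;
    # column layout may differ slightly between iproute versions.
    out = subprocess.check_output(["ss", "-tn", "sport", "=", ":" + port])
    peers = set()
    for line in out.decode().splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 5:
            host, _, peer_port = parts[4].rpartition(":")
            # normalise IPv6-mapped IPv4 addresses such as ::ffff:10.97.6.171
            peers.add((host.replace("::ffff:", ""), peer_port))
    return peers

if __name__ == "__main__":
    for host, port in sorted(rabbit_peers() - kernel_peers()):
        print("zombie connection: %s:%s" % (host, port))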

Recorded Case

I tried to provide as much detail as I could from an example connection, so I performed a tcpdump on the running system from connection setup until the error:

State after Connection

Log:

[email protected]:accepting STOMP connection <0.11074.0> (10.97.6.171:56107 -> 10.67.76.191:61613)

netstat -an:

tcp        0      0 ::ffff:10.67.76.191:61613   ::ffff:10.97.6.171:56107    ESTABLISHED 

rabbitmqctl report:

<[email protected]> 10.97.6.171:56107 -> 10.67.76.191:61613 (1) <[email protected]> 1   guest   /mcollective    false   false   15  0   0   0   0   0   0   running
<[email protected]> stomp-subscription-13Ls1dCVpk0y5QTBuf9DxQ   false   true    []  <[email protected]>             0   0   0   1   1.0 22200               running 0   0   0   0   0   0   0   0   0   0   0   [{q1,0}, {q2,0}, {delta,{delta,undefined,0,undefined}}, {q3,0}, {q4,0}, {len,0}, {target_ram_count,infinity}, {next_seq_id,0}, {avg_ingress_rate,0.0}, {avg_egress_rate,0.0}, {avg_ack_ingress_rate,0.0}, {avg_ack_egress_rate,0.0}]

erlang process state:

<8873.12155.0> amqp_gen_connection:       0     891    7088       0 gen_server:loop/6

Client Server Conversation:

Payload:

CONNECT
accept-version:1.1,1.0
host:/mcollective
heart-beat:60500,59500
login:guest
passcode:guest
content-length:0
content-type:text/plain; charset=UTF-8

CONNECTED
session:session-0meFIa1iLkL54Y9pV8VSag
heart-beat:59000,60000
server:RabbitMQ/3.5.1
version:1.1

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_terminal
destination:/exchange/mcollective_broadcast/terminal
id:mcollective_broadcast_terminal
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_package
destination:/exchange/mcollective_broadcast/package
id:mcollective_broadcast_package
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_service
destination:/exchange/mcollective_broadcast/service
id:mcollective_broadcast_service
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_firewall
destination:/exchange/mcollective_broadcast/firewall
id:mcollective_broadcast_firewall
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_agent
destination:/exchange/mcollective_broadcast/agent
id:mcollective_broadcast_agent
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_monsoonfact
destination:/exchange/mcollective_broadcast/monsoonfact
id:mcollective_broadcast_monsoonfact
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_chef
destination:/exchange/mcollective_broadcast/chef
id:mcollective_broadcast_chef
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_provision
destination:/exchange/mcollective_broadcast/provision
id:mcollective_broadcast_provision
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_user
destination:/exchange/mcollective_broadcast/user
id:mcollective_broadcast_user
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_git
destination:/exchange/mcollective_broadcast/git
id:mcollective_broadcast_git
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_sleep
destination:/exchange/mcollective_broadcast/sleep
id:mcollective_broadcast_sleep
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_rpcutil
destination:/exchange/mcollective_broadcast/rpcutil
id:mcollective_broadcast_rpcutil
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_discovery
destination:/exchange/mcollective_broadcast/discovery
id:mcollective_broadcast_discovery
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
reply-to:/temp-queue/mcollective_reply_mcollective
destination:/exchange/mcollective_broadcast/mcollective
id:mcollective_broadcast_mcollective
content-length:0
content-type:text/plain; charset=UTF-8

SUBSCRIBE
destination:/exchange/mcollective_directed/mo-9c6b47c77
id:mo-9c6b47c77_directed_to_identity
content-length:0
content-type:text/plain; charset=UTF-8

... SNIP Empty lines ....

SEND
reply-to:/temp-queue/mcollective_reply_registration
expiration:70000
destination:/exchange/mcollective_broadcast/registration
content-length:79754
content-type:text/plain; charset=UTF-8

{:  body"6-----BEGIN PKCS7-----
MILkyQYJKoZIhvcNAQcCoILkujCC5LYCAQExCzAJBgUrDgMCGgUAMILdsgYJKoZI
hvcNAQcBoILdowSC3Z8ECHsIOghtc2d7DDoOYWdlbnRsaXN0WxJJIg10ZXJtaW5h

... SNIP ...

VNViH7eHj8HiNqdnjC5TCUJ+LvKHJOd3Z+MRfGc6xO2gIv8AfQzb2KN3OXkxH4sX
6YXyJ3kWPXlcvyvk6g==
-----END PKCS7-----
senderidI"mo-9c6b47c77:ET:requestidI"%71fd062060b859838b7bafedb8a10ff0;F:
                                                                         filter{
cf_class;F[I"F[I"
compound;F[:collectiveI"mcollective;T:
callerid"g/CN=mo-9c6b47c77.zone1.mo.xxx.corp/O=xxx_it_hyperic/OU=monitoring_zone1_prod/L=internal-zone1/C=DEttliA:
                                                                                                                  msgtimel+�P!U

TCP Packets

1   0.000000    10.97.6.171 10.67.76.191    TCP 74  56107→61613 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=301787849 TSecr=0 WS=128

2   0.000038    10.67.76.191    10.97.6.171 TCP 74  61613→56107 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=2867615141 TSecr=301787849 WS=128

...

211 1249.492797 10.67.76.191    10.97.6.171 TCP 66  61613→56107 [FIN, ACK] Seq=130 Ack=83092 Win=64128 Len=0 TSval=2868864634 TSecr=303037342

212 1249.546370 10.97.6.171 10.67.76.191    TCP 66  56107→61613 [ACK] Seq=83092 Ack=131 Win=14720 Len=0 TSval=303037396 TSecr=2868864634

State after Connection Termination

Log:

=ERROR REPORT==== 5-Apr-2015::15:12:20 ===
STOMP detected network error on <0.11074.0> (10.97.6.171:56107 -> 10.67.76.191:61613):
{badmatch,<<"\n">>} 

netstat -an:

(empty: the socket is no longer known to the kernel)

rabbitmqctl report:

<[email protected]> 10.97.6.171:56107 -> 10.67.76.191:61613 (1) <[email protected]> 1   guest   /mcollective    false   false   15  0   0   0   0   0   0   running
<[email protected]> stomp-subscription-13Ls1dCVpk0y5QTBuf9DxQ   false   true    []  <[email protected]>             0   0   0   1   1.0 14304               running 0   0   0   0   0   0   0   0   0   0   0   [{q1,0}, {q2,0}, {delta,{delta,undefined,0,undefined}}, {q3,0}, {q4,0}, {len,0}, {target_ram_count,infinity}, {next_seq_id,0}, {avg_ingress_rate,0.0}, {avg_egress_rate,0.0}, {avg_ack_ingress_rate,0.0}, {avg_ack_egress_rate,0.0}]

erlang process state:

<8873.12155.0> amqp_gen_connection:       0       0    7088       0 gen_server:loop/6

Open questions

  • Why is RabbitMQ terminating the connection after the client published the message, and why did it result in an error?
  • Why is the connection not properly cleaned out of rabbitmq's client list?

I can provide a raw tcpdump pcap file; please let me know if I should produce more debug output.
From what we can see the server sends a proper FIN to the client, which gets ACK'd and should
result in a connection cleanup.
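
For context on the heart-beat headers recorded above: under STOMP 1.1 each side advertises what it can send and what it wants to receive, the negotiated interval is the maximum of the two sides' values, and a heart-beat itself is a single end-of-line (newline) byte on the wire. A small worked example with the values from the CONNECT/CONNECTED frames (an illustration of the spec rules, not taken from the plugin's code):

# STOMP 1.1 heart-beat negotiation, illustrated with the recorded values.
def negotiate(client_cx, client_cy, server_sx, server_sy):
    # returns (client->server, server->client) intervals in ms; 0 means none
    c2s = 0 if client_cx == 0 or server_sy == 0 else max(client_cx, server_sy)
    s2c = 0 if server_sx == 0 or client_cy == 0 else max(server_sx, client_cy)
    return c2s, s2c

# CONNECT   heart-beat:60500,59500  ->  cx=60500, cy=59500
# CONNECTED heart-beat:59000,60000  ->  sx=59000, sy=60000
print(negotiate(60500, 59500, 59000, 60000))   # (60500, 59500)

So in this session the client is expected to show signs of life at least every 60.5 s and the server to send a heart-beat at least every 59.5 s.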

@michaelklishin
Member

Thanks for the really detailed issue report!

@michaelklishin michaelklishin self-assigned this Apr 5, 2015
@michaelklishin
Member

@martink2 please send me the dump you have; it would help uncover what may be tripping up the parser (that addresses your 1st open question).

As for the 2nd one, this may be a bug, and now that we have an error message that correlates with connections being left open, it should be much easier to reproduce.

@martink2 do you run a single node or multiple ones?

@martink2
Author

martink2 commented Apr 6, 2015

@michaelklishin We can see the error on both single nodes and clusters.
The dump has been sent via email.

Thanks

Martin

@michaelklishin
Member

@martink2 can you post the output of netstat -tcp when that happens? How many connections do you have on average?

What does sudo sysctl -a | grep output on the server?

@michaelklishin
Member

More questions: do your clients use NAT?

@martink2
Author

martink2 commented Apr 7, 2015

On average we have 11,000 connections.

For a failed connection from 10.97.168.128:54194:

netstat -n -tcp | grep 10.97.168.128:54194 : No output

Which sysctl key would you like me to grep for?

As to NAT: a few of our clients are behind a NAT, but not the majority of them, and the problem
does not show a higher chance of happening for NAT'ed vs direct connections.

However, our clients use a lot of short-lived connections, so the ephemeral port range rolls
over very quickly.

Thanks

Martin

@michaelklishin
Member

Yes, the ephemeral port range rollover is what I'm investigating.

I'm after the net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse sysctl keys. In fact, anything under net.ipv4 may be useful (assuming your clients use IPv4).

You can send me the output privately, [email protected].
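
For reference, those keys can also be read straight from /proc instead of going through sysctl (a small sketch, assuming a Linux host with the usual /proc/sys layout):

# Read the TCP tuning keys discussed above directly from /proc/sys.
KEYS = [
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.ip_local_port_range",
]

for key in KEYS:
    path = "/proc/sys/" + key.replace(".", "/")
    try:
        with open(path) as f:
            print("%s = %s" % (key, f.read().strip()))
    except IOError:
        print("%s is not present on this kernel" % key)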

@michaelklishin
Member

Some findings from today:

  • clients rolling over ip_local_port_range and attempting to re-connect constantly (sequentially) do not leave connections behind

Will try with clients doing the same thing concurrently.
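
A sketch of that kind of churn test (host, port, credentials and vhost below are placeholders and the frames are plain STOMP 1.1; the real test setup may differ): open a connection, finish the handshake, disconnect, repeat, either from one loop or from several threads at once.

#!/usr/bin/env python
# Churn-test sketch: repeatedly connect/disconnect to roll through the
# ephemeral port range, sequentially or concurrently.
import socket
import threading

HOST, PORT = "127.0.0.1", 61613          # placeholder broker address
CONNECT = (b"CONNECT\naccept-version:1.1\nhost:/mcollective\n"
           b"login:guest\npasscode:guest\n\n\x00")
DISCONNECT = b"DISCONNECT\n\n\x00"

def churn(iterations=1000):
    for _ in range(iterations):
        s = socket.create_connection((HOST, PORT))
        s.sendall(CONNECT)
        s.recv(4096)                      # wait for the CONNECTED frame
        s.sendall(DISCONNECT)
        s.close()                         # client-side close -> TIME_WAIT

if __name__ == "__main__":
    # a single worker reproduces the sequential case described above
    workers = [threading.Thread(target=churn) for _ in range(10)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()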

@michaelklishin
Member

OK, I believe I have a reliable way to reproduce this. The issue has only so much to do with high TCP connection churn: the reader does not tell the processor to terminate in every case. Phew.
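
To illustrate that failure mode (a language-agnostic analogy only, not the plugin's actual Erlang code): if the part reading from the socket exits on a network error without explicitly telling the part that processes frames to stop, the latter, together with the state it holds such as subscriptions, stays alive indefinitely.

# Analogy: a "reader" that dies without signalling its "processor".
import threading
import time

stop = threading.Event()

def processor():
    while not stop.is_set():              # holds "subscriptions" until told to stop
        time.sleep(0.1)

def reader(signal_on_error):
    try:
        raise IOError("network error")    # stand-in for the STOMP parse error
    finally:
        if signal_on_error:
            stop.set()                    # the step that is skipped in some code paths

t = threading.Thread(target=processor)
t.daemon = True
t.start()
try:
    reader(signal_on_error=False)
except IOError:
    pass
time.sleep(0.5)
print("processor still alive: %s" % t.is_alive())   # True -> leaked state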

@martink2
Author

martink2 commented Apr 8, 2015

@michaelklishin

I deployed the fixed code to our production node and it looks very good. Thanks a lot
for the quick help. Any timeline for the 3.5.2 release of RabbitMQ yet?

Martin

@martink2 martink2 closed this as completed Apr 8, 2015
@dumbbell
Member

dumbbell commented Apr 8, 2015

We try to put out a bugfix release once a month. 3.5.2 should be released around the first half of May.

@michaelklishin
Member

That said, we also have nightly builds.

@michaelklishin
Member

@martink2 thank you very much for giving it a try so quickly!

michaelklishin added a commit that referenced this issue Apr 8, 2015
Why not just link the two?

Then any network-related issue would
produce a scary crash report in the SASL log and
we won't give the processor some time to finish
what it may still be doing.

Fixes #7.
@dumbbell dumbbell added this to the 3.5.2 milestone Apr 8, 2015