Memory use of a node spikes around the time of mass client disconnect #3226
-
We have a Rocky OpenStack deployment with 3 controllers and 500 computes. At one point, nova-compute detected that its RabbitMQ connection was broken and then reconnected. Within 15 minutes, memory consumption on rabbitmq-server increased abruptly from the original 3 GB to 150 GB, reaching the 40% memory watermark.

rabbitmq.log:

```
2021-07-05 15:58:28.633 8 ERROR oslo.messaging._drivers.impl_rabbit [req-a09d4a8b-c24b-4b30-b433-64fe4f6bace5 - - - - -] [8ed1f425-ad67-4b98-874c-e4516aaf3134] AMQP server on 145.247.103.16:5671 is unreachable: . Trying again in 1 seconds.: timeout
```

RabbitMQ then reported that a huge number of connections had been closed by clients:

```
=WARNING REPORT==== 5-Jul-2021::15:57:59 ===
```

After 10 minutes, the cluster was blocked by the 0.4 memory watermark:

```
=INFO REPORT==== 5-Jul-2021::16:19:29 ===
*** Publishers will be blocked until this alarm clears ***
```

However, even after publishers were blocked, the rabbitmq pod's memory kept growing; in the end the node hit OOM and the system forced the pod to restart.

rabbitmq-management: on
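For reference, the "40% watermark" above is RabbitMQ's `vm_memory_high_watermark`, which defaults to 0.4 of detected RAM; crossing it raises the memory alarm that blocks publishers. A minimal sketch of where this lives in the classic config format used by 3.6.x (the value shown is the default, not a recommendation):

```erlang
%% rabbitmq.config, classic Erlang-term format used by RabbitMQ 3.6.x.
%% 0.4 means the node raises a memory alarm (and blocks publishers)
%% once it uses 40% of the RAM it detects on the host.
[
  {rabbit, [
    {vm_memory_high_watermark, 0.4}
  ]}
].
```

Note that releases as old as 3.6 compute available RAM from the host, not from cgroup/pod limits, so in Kubernetes the alarm can fire far above the pod's actual memory limit.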
Replies: 3 comments
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this, but this is how the current GitHub conversion mechanism makes it seem for the users :(
-
I'm afraid that's not a whole lot of evidence of a leak. Messages are not the only thing that consumes resources. Connections do, too, in particular in the case of high connection churn, which you have provided evidence of (mass client disconnections).

There are tools and metrics that would help you understand what exactly uses the memory.

RabbitMQ 3.6.16 has been out of support for over three years. Erlang memory allocators and GC have changed since Erlang 19 as well (latest releases are 24.x).

I'm afraid the only piece of advice we have is `rabbitmq-diagnostics memory_breakdown` (the equivalent information is available in `rabbitmqctl status` on 3.6).
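For example, a hypothetical invocation (the `--unit` flag applies to 3.7+ `rabbitmq-diagnostics`; on 3.6.x the same figures appear in the memory section of `rabbitmqctl status`):

```sh
# Per-category memory breakdown (connections, channels, queues, binaries,
# ETS tables, and so on) on a 3.7+ node:
rabbitmq-diagnostics memory_breakdown --unit "MB"

# Rough 3.6.x equivalent: inspect the {memory, [...]} section of:
rabbitmqctl status
```

If connection-related categories dominate right after the disconnect storm, that points at connection churn rather than a leak.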
-
I now recall that around the Erlang 18-19 series, difficult-to-explain massive heap allocations were relatively common. Heap fragmentation can still be observed with Erlang 23 and 24, but there were quite a few potentially relevant changes starting with Erlang 21, including around memory allocator behavior and the metrics available.
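One hedged way to tell live data from allocator overhead is to compare the runtime's own accounting with what its allocators have reserved. This sketch assumes `rabbitmqctl` can reach the node and that the `recon` library is on the code path (it ships with modern RabbitMQ distributions; on 3.6.x only the first command is a safe bet):

```sh
# The Erlang VM's own view of memory: total, processes, binary, ets, ...
rabbitmqctl eval 'erlang:memory().'

# With recon available: fraction of allocated memory actually in use.
# A low ratio while RSS stays high suggests fragmentation, not a leak.
rabbitmqctl eval 'recon_alloc:memory(usage).'
rabbitmqctl eval 'recon_alloc:memory(allocated).'
```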