
Node can use significant (e.g. 80GB) amounts of RAM during startup #2254

Closed
gerhard opened this issue Feb 24, 2020 · 6 comments

@gerhard
Contributor

gerhard commented Feb 24, 2020

Part of TGIR S01E02: Help! RabbitMQ ate my RAM! I came across the following unexpected behaviour when the rabbit app gets restarted via rabbitmqctl stop_app ; rabbitmqctl start_app:

Screenshot 2020-02-24 at 13 57 31

Screenshot 2020-02-24 at 13 45 18

A worker process is using 56GB of physical memory while reading through all queue segments, across all queues. This operation takes ~30 minutes to complete (13:44:16 - 14:15:46) with the disk reads starting at 410MiB/s and maxing out at 965MiB/s for 15 minutes straight:

image

This is what the system physical memory breakdown looks like:

image

Notice:

  1. Slow 1.8GiB -> 7.7GiB memory growth spanning ~13 minutes (13:28:00 - 13:41:00)
  2. Large spike to 42GiB the next minute (13:42:00)
  3. Another spike to 60GiB for which cached memory is made available (13:43:00)
  4. Last spike to 80GiB (13:43:45) 🤯
  5. A drop to 60GiB for 30 minutes during which the disk gets hammered (13:44:15)
  6. Drop to 8GiB (14:16:00) followed by a slow ramp-up during which publishers/consumers get reconnected
  7. Regular 11.5GiB -> 17.5GiB spikes, as expected for this workload.

I have captured the runtime thread stats, which show the regular schedulers 88% idle (most of the non-idle time is spent in send) and the dirty_io schedulers 94% idle. All other thread types are 98% idle or above.

rabbitmq-diagnostics runtime_thread_stats --sample-interval 60 | grep -v async
Will collect runtime thread stats on rabbit@tgir-s01e02-gerhard-rabbitmq-durable for 60 seconds...
Average thread real-time    :  60152930 us
Accumulated system run-time : 162534659 us
Average scheduler run-time  :   7169362 us

        Thread    alloc      aux      bif busy_wait check_io emulator      ets       gc  gc_full      nif    other     port     send    sleep   timers

Stats per thread:
       aux( 1)    0.59%    0.64%    0.00%    0.00%    0.12%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   98.65%    0.00%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 5)    0.57%    0.00%    0.00%    2.62%    0.00%    0.02%    0.00%    1.75%    0.77%    0.00%    0.01%    0.00%    0.00%   94.26%    0.00%
dirty_cpu_( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 7)    0.56%    0.00%    0.00%    2.61%    0.00%    0.02%    0.00%    1.68%    0.72%    0.00%    0.01%    0.00%    0.00%   94.40%    0.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(11)    0.66%    0.00%    0.00%    2.96%    0.00%    0.02%    0.00%    1.92%    0.83%    0.00%    0.01%    0.00%    0.00%   93.60%    0.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(13)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(14)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(15)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(16)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 2)    0.03%    0.00%    0.00%    1.22%    0.00%    0.02%    0.00%    0.00%    0.00%    6.10%    0.01%    0.00%    0.00%   92.63%    0.00%
dirty_io_s( 3)    0.01%    0.00%    0.00%    0.49%    0.00%    0.01%    0.00%    0.00%    0.00%    2.62%    0.00%    0.00%    0.00%   96.87%    0.00%
dirty_io_s( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 5)    0.05%    0.00%    0.00%    2.14%    0.00%    0.03%    0.00%    0.00%    0.00%   10.72%    0.01%    0.00%    0.00%   87.05%    0.00%
dirty_io_s( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 7)    0.06%    0.00%    0.00%    1.67%    0.00%    0.02%    0.00%    0.00%    0.00%    8.26%    0.01%    0.00%    0.00%   89.98%    0.00%
dirty_io_s( 8)    0.06%    0.00%    0.00%    2.36%    0.00%    0.03%    0.00%    0.00%    0.00%   11.10%    0.01%    0.00%    0.00%   86.44%    0.00%
dirty_io_s( 9)    0.06%    0.00%    0.00%    2.20%    0.00%    0.03%    0.00%    0.00%    0.00%   11.07%    0.01%    0.00%    0.00%   86.63%    0.00%
dirty_io_s(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
      poll( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
 scheduler( 1)    3.07%    0.36%    0.07%    4.51%    0.01%   13.89%    0.69%    0.12%    0.00%    0.01%    0.08%    0.00%   21.67%   55.52%    0.01%
 scheduler( 2)    4.03%    0.29%    0.10%    4.62%    0.00%   18.43%    0.89%    0.16%    0.00%    0.01%    0.10%    0.00%   27.04%   44.34%    0.00%
 scheduler( 3)    1.93%    0.25%    0.07%    2.81%    0.00%    8.98%    0.48%    0.08%    0.00%    0.01%    0.05%    0.00%   23.87%   61.46%    0.00%
 scheduler( 4)    2.84%    0.27%    0.08%    3.69%    0.00%   13.01%    0.67%    0.12%    0.00%    0.01%    0.07%    0.00%   26.82%   52.40%    0.00%
 scheduler( 5)    0.00%    0.00%    0.00%    0.36%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.63%    0.00%
 scheduler( 6)    0.00%    0.01%    0.00%    0.43%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.56%    0.00%
 scheduler( 7)    0.00%    0.00%    0.00%    0.31%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.68%    0.00%
 scheduler( 8)    0.00%    0.00%    0.00%    0.31%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.68%    0.00%
 scheduler( 9)    0.00%    0.00%    0.00%    0.32%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.67%    0.00%
 scheduler(10)    0.00%    0.00%    0.00%    0.32%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.38%    0.00%    0.00%   99.30%    0.00%
 scheduler(11)    0.00%    0.00%    0.00%    0.30%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.70%    0.00%
 scheduler(12)    0.00%    0.00%    0.00%    0.40%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.59%    0.00%
 scheduler(13)    0.00%    0.00%    0.00%    0.33%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.67%    0.00%
 scheduler(14)    0.00%    0.00%    0.00%    0.32%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.67%    0.00%
 scheduler(15)    0.00%    0.00%    0.00%    0.31%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.68%    0.00%
 scheduler(16)    0.00%    0.00%    0.00%    0.28%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.71%    0.00%

Stats per type:
           aux    0.59%    0.64%    0.00%    0.00%    0.12%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   98.65%    0.00%
dirty_cpu_sche    0.11%    0.00%    0.00%    0.51%    0.00%    0.00%    0.00%    0.33%    0.14%    0.00%    0.00%    0.00%    0.00%   98.89%    0.00%
dirty_io_sched    0.03%    0.00%    0.00%    1.01%    0.00%    0.01%    0.00%    0.00%    0.00%    4.99%    0.01%    0.00%    0.00%   93.96%    0.00%
          poll    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
     scheduler    0.74%    0.08%    0.02%    1.23%    0.00%    3.39%    0.17%    0.03%    0.00%    0.00%    0.04%    0.00%    6.21%   88.08%    0.00%

I am attaching the netdata export: netdata-c8a3200a7c2f-20200224-130158-6846.snapshot.zip

There is no Prometheus metrics snapshot for the Erlang VM since this behaviour happens before the Prometheus app is started:

image

@gerhard gerhard changed the title from "[WIP] Should a node with 50 mil durable messages @ 26KB each spread across 50 durable queues require 60GB of memory to boot?" to "Should a node with 50 mil persistent messages @ 26KB each spread across 50 durable queues require 60GB of memory to boot?" Feb 24, 2020
@gerhard gerhard changed the title from "Should a node with 50 mil persistent messages @ 26KB each spread across 50 durable queues require 60GB of memory to boot?" to "Should a node with 50mil persistent messages of size 26KB and spread across 50 durable queues (1mil msg per queue) require 60GB of memory to boot?" Feb 24, 2020
@gerhard gerhard changed the title from "Should a node with 50mil persistent messages of size 26KB and spread across 50 durable queues (1mil msg per queue) require 60GB of memory to boot?" to "Should a node with 50mil persistent messages of size 26KB require 60GB of RAM during startup?" Feb 24, 2020
@gerhard gerhard changed the title from "Should a node with 50mil persistent messages of size 26KB require 60GB of RAM during startup?" to "Should a node with 50mil persistent messages of size 26KB require 80GB of RAM during startup?" Feb 24, 2020
gerhard added a commit to rabbitmq/tgir that referenced this issue Feb 24, 2020
Since everything is going to disk, I believe that it will make a big
difference for memory usage, and I also suspect that it will help with
rabbitmq/rabbitmq-server#2254. Let's try it
out!

Signed-off-by: Gerhard Lazu <[email protected]>
gerhard added a commit to rabbitmq/tgir that referenced this issue Feb 26, 2020
Since everything is going to disk, I believe that it will make a big
difference for memory usage, and I also suspect that it will help with
rabbitmq/rabbitmq-server#2254. Let's try it
out!

Signed-off-by: Gerhard Lazu <[email protected]>
@HoloRin
Contributor

HoloRin commented Mar 11, 2020

After doing some investigation, I believe this is caused by the rebuilding of message refcounts by rabbit_msg_store. We scan the disk asynchronously, and in a few limited tests I saw that we scan and enqueue faster than we consume, resulting in a very large backlog of messages in memory.
The [Erlang] queues in question are unbounded, so my initial thought is to bound them in such a way that puts block, which would throttle the disk reads and limit how much data is held transiently in memory.
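For illustration only, here is a minimal sketch of that idea: a buffer whose puts block once a limit is reached, so producers are throttled to the consumer's pace. This is not RabbitMQ code; the module, record, and function names are all invented for this example.

```erlang
%% Illustrative sketch, not RabbitMQ code: a bounded buffer whose in/2
%% blocks the caller once Limit items are queued. Blocked producers are
%% released one at a time as items are taken out with out/1.
-module(bounded_buffer).
-behaviour(gen_server).

-export([start_link/1, in/2, out/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {limit, items, waiting}).

start_link(Limit) -> gen_server:start_link(?MODULE, Limit, []).

%% The gen_server call does not return until the item has been accepted,
%% so a full buffer naturally throttles the producer.
in(Pid, Item) -> gen_server:call(Pid, {in, Item}, infinity).
out(Pid)      -> gen_server:call(Pid, out, infinity).

init(Limit) ->
    {ok, #state{limit = Limit, items = queue:new(), waiting = queue:new()}}.

handle_call({in, Item}, From, State = #state{limit = Limit, items = Items,
                                             waiting = Waiting}) ->
    case queue:len(Items) >= Limit of
        true  ->
            %% Buffer full: park the caller, reply later from the 'out' clause.
            {noreply, State#state{waiting = queue:in({From, Item}, Waiting)}};
        false ->
            {reply, ok, State#state{items = queue:in(Item, Items)}}
    end;
handle_call(out, _From, State = #state{items = Items, waiting = Waiting}) ->
    case queue:out(Items) of
        {{value, Item}, Items1} ->
            %% Space freed up: admit one parked producer, if any.
            case queue:out(Waiting) of
                {{value, {Blocked, BItem}}, Waiting1} ->
                    gen_server:reply(Blocked, ok),
                    {reply, {ok, Item}, State#state{items   = queue:in(BItem, Items1),
                                                    waiting = Waiting1}};
                {empty, _} ->
                    {reply, {ok, Item}, State#state{items = Items1}}
            end;
        {empty, _} ->
            {reply, empty, State}
    end.

handle_cast(_Msg, State) -> {noreply, State}.
```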

@michaelklishin
Member

@pjk25 we already have a mechanism for the producer-consumer problem and unbounded buffers in RabbitMQ: it's called credit flow. It is a producer-side solution, so it cannot be used if we don't control the "producer" side of this process interaction, but it sounds like we do. No need for blocking puts; producers can self-limit using credit_flow.
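As a rough, self-contained illustration of the credit-flow idea (a toy example, not RabbitMQ's credit_flow module; every name and number below is invented): the producer starts with a fixed amount of credit, spends one unit per message, stops sending when it runs out, and the consumer grants more credit only as it finishes processing messages.

```erlang
%% Toy illustration of producer-side credit flow; not RabbitMQ's credit_flow.
-module(credit_demo).
-export([run/0]).

-define(INITIAL_CREDIT, 4).  %% credit the producer starts with
-define(GRANT_AFTER, 2).     %% credit the consumer grants per batch handled

run() ->
    Consumer = spawn(fun() -> consumer(0) end),
    producer(Consumer, ?INITIAL_CREDIT, 1).

%% Producer self-limits: it sends only while it has credit and otherwise
%% waits for a {bump, N} grant from the consumer.
producer(Consumer, _Credit, N) when N > 10 ->
    Consumer ! stop,
    ok;
producer(Consumer, 0, N) ->
    receive {bump, More} -> producer(Consumer, More, N) end;
producer(Consumer, Credit, N) ->
    Consumer ! {msg, self(), N},
    producer(Consumer, Credit - 1, N + 1).

%% Consumer grants more credit every ?GRANT_AFTER messages it handles, so at
%% most ?INITIAL_CREDIT unprocessed messages ever sit in its mailbox.
consumer(Handled) ->
    receive
        stop ->
            ok;
        {msg, From, _Payload} ->
            case (Handled + 1) rem ?GRANT_AFTER of
                0 -> From ! {bump, ?GRANT_AFTER};
                _ -> ok
            end,
            consumer(Handled + 1)
    end.
```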

@michaelklishin michaelklishin changed the title Should a node with 50mil persistent messages of size 26KB require 80GB of RAM during startup? Node can use significant (e.g. 80GB) amounts of RAM during startup Mar 11, 2020
@HoloRin
Contributor

HoloRin commented Mar 11, 2020

After further exploration, I have learned that the gatherer managing the above-mentioned queues blocks its publishers until messages are consumed, invalidating the above theory.

@HoloRin
Contributor

HoloRin commented Mar 12, 2020

@michaelklishin thanks for the insight. The queues I was looking at were these:

{ok, #gstate { forks = 0, values = queue:new(), blocked = queue:new() },

so not "rabbit" queues, although I'm sure I could have made my comment clearer.

@HoloRin
Contributor

HoloRin commented Mar 18, 2020

It's worth noting that even though we shut down the broker with rabbitmqctl stop_app first, we still follow this code path, which I believe corresponds to an unclean shutdown:

build_index(false, {MsgRefDeltaGen, MsgRefDeltaGenInit},

Also, the call to

ok = count_msg_refs(MsgRefDeltaGen, MsgRefDeltaGenInit, State),
takes about 10 minutes to complete, before we get to the point where memory is consumed rapidly.
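To make the refcount-rebuild step more concrete, here is a minimal, self-contained sketch of the general shape of that work: scan each store file, collect the message ids it references, and accumulate a reference count per id in an ETS table. This is not the rabbit_msg_store implementation; the module name, function name, and input format are invented, and the real count_msg_refs/build_index code deals with file formats, deletions, and flow control that are omitted here.

```erlang
%% Illustrative sketch only, not rabbit_msg_store code: accumulate message
%% reference counts in an ETS table from the ids recovered by scanning each
%% message-store file. Input is a list of lists of message ids, one per file.
-module(refcount_sketch).
-export([count_refs/1]).

count_refs(MsgIdsPerFile) ->
    Tid = ets:new(msg_refs, [set, private]),
    lists:foreach(
      fun(MsgIdsInFile) ->
              lists:foreach(
                fun(MsgId) ->
                        %% Increment the refcount, inserting {MsgId, 1} on first sight.
                        ets:update_counter(Tid, MsgId, {2, 1}, {MsgId, 0})
                end, MsgIdsInFile)
      end, MsgIdsPerFile),
    Tid.
```

For example, refcount_sketch:count_refs([[a, b], [b, c]]) produces a table containing {a,1}, {b,2}, {c,1}.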

@gerhard gerhard added this to the 3.8.4 milestone Mar 25, 2020
@gerhard
Contributor Author

gerhard commented Mar 26, 2020

I've just tested this in the context of TGIR S01E03 and can confirm that #2274 fixes this.

Memory usage doesn't go above 8GiB. Even though the recovery time didn't improve (it still took ~45 minutes for the node to restart), the memory usage was perfect: the ETS data structures used the expected amount of memory, and all other components used only what they needed.

Watch the video if you're interested in seeing how this was confirmed 🙌🏻

TL;DW

[screenshots]
