-
Describe the bug

After upgrading from 3.12.13 to 3.13.0, used disk space keeps growing, while the average message totals (count, bytes, publish and delivery rates) remained the same. After a restart of RabbitMQ the folder size drops to 500 GB (which is still too much) and then begins to grow again.

Reproduction steps

Expected behavior

Directory size =~ total size of all queues.

Additional context
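A minimal sketch of that comparison, assuming the management plugin is enabled and reachable with default credentials (the URL and credentials are placeholders): it sums `message_bytes` across all queues so the result can be compared with the on-disk size of the node data directory.

```python
import requests

MGMT_URL = "http://localhost:15672"   # assumed management listener
AUTH = ("guest", "guest")             # assumed credentials

# Sum the logical size of every queue as reported by the management API.
queues = requests.get(f"{MGMT_URL}/api/queues", auth=AUTH, timeout=10).json()
total_bytes = sum(q.get("message_bytes", 0) for q in queues)
print(f"{len(queues)} queues, {total_bytes / 1e9:.1f} GB of message bytes")

# Compare this figure with the on-disk size of the node data directory
# (for example `du -sh` of that directory); as the replies below explain,
# the two are not expected to match.
```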
-
@urusha thanks for using RabbitMQ. You have not provided enough information and are basically asking us to spend time guessing how to reproduce your issue.
Ideally, you would provide a docker compose project that represents your workload and starts with RabbitMQ 3.12.
-
@urusha I'm afraid the assumption that the directory size should roughly equal the total size of all queues does not hold. Quorum queues and streams can store a large amount of data on disk, and there is nothing new in 3.13 compared to 3.12 in that regard. Not only does every message carry protocol-level metadata and internal format metadata (which did change in 3.13, as the release notes state), but that data is not deleted the moment a message is consumed.

Specifically, quorum queues store the entire Raft log, including messages and other state machine transitions, until the oldest unacknowledged message in the log is consumed and acknowledged. Stuck or slow consumers can affect this a great deal, which is why a consumer delivery timeout exists: to make sure a stuck consumer does not prevent a quorum queue from reclaiming disk space. New clusters can use a different Raft WAL segment size; changing this on existing clusters is dangerous and must not be attempted.

Streams retain as much data as you configure.
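To illustrate the two points above, here is a hedged pika sketch (the broker address and queue names are placeholders, not from this thread): quorum queue consumers should acknowledge promptly so log segments can be reclaimed, and streams should be given explicit retention limits such as `x-max-age` or `x-max-length-bytes`. The node-level `consumer_timeout` setting in rabbitmq.conf bounds how long a delivery may stay unacknowledged.

```python
import pika

# Placeholder broker address and queue names; adjust to your environment.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Quorum queue: the Raft log (message bodies included) is only truncated up to
# the oldest delivered-but-unacknowledged message, so slow or stuck consumers
# keep disk space pinned.
channel.queue_declare(
    queue="orders",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)

# Stream: retains as much data as configured, regardless of consumption.
channel.queue_declare(
    queue="events",
    durable=True,
    arguments={
        "x-queue-type": "stream",
        "x-max-age": "7D",                     # keep roughly a week of data
        "x-max-length-bytes": 20_000_000_000,  # and cap the total size on disk
    },
)

def handle(ch, method, properties, body):
    # ... process the message ...
    # Acknowledge quickly so the quorum queue can reclaim log segments.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="orders", on_message_callback=handle)
channel.start_consuming()
```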
-
The seesaw pattern on the chart above is largely what I'd expect from quorum queues with many workloads, and streams as well. You have peaks and you have troughs; few workloads result in very even disk space use over time. It is hard to say more without inspecting the node data directory. Using CQv2 may result in higher peaks, but we do not see this in practice, or somehow I've missed those cases. @lhoguin and @mkuratczyk would know better than I do.
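If it helps to see which queue type dominates, one rough way to inspect the node data directory is to rank its top-level subdirectories by size. A generic sketch, with the data directory path as an assumption (use the value reported by `rabbitmq-diagnostics status`):

```python
import os

DATA_DIR = "/var/lib/rabbitmq/mnesia"  # assumed node data directory

def dir_size(path: str) -> int:
    """Recursive on-disk size of a directory, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files can be deleted concurrently by the broker
    return total

# Rank top-level subdirectories (quorum, stream, classic message store, ...)
# to see where the disk space actually goes.
sizes = sorted(
    ((entry.name, dir_size(entry.path))
     for entry in os.scandir(DATA_DIR) if entry.is_dir()),
    key=lambda item: item[1],
    reverse=True,
)
for name, size in sizes:
    print(f"{size / 1e9:8.2f} GB  {name}")
```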
-
@urusha to confirm -
-
Hello, since space is not being reclaimed, it is possible there is an issue with GCing messages, which changed in 3.13. Please set the log levels to
-
The original issue reported here was resolved, and I do not recall it being reported elsewhere since March. @urusha please use the recommendations above (from today) and start a new one if you have something else to report, with a detailed set of steps to reproduce. Most likely we will need a data directory from the node that fails to start.
-
As I said above, a node that has run out of disk space won't always be able to recover. Overprovision free disk space and put adequate guardrails, such as queue length limits and a maximum message size, in place. The issue with CQ compaction falling behind, reported in this thread in March, has been addressed and has not been reported again. A recovering node cannot know what was not written to disk, so there will always be scenarios where it won't be able to safely recover. If your node has run out of disk space, consider it ready to be replaced.
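A sketch of such guardrails applied per queue through optional arguments (queue name, limits and broker address are placeholders); the same limits can be applied cluster-wide with policies via `rabbitmqctl set_policy`, and the maximum message size is the node-level `max_message_size` setting in rabbitmq.conf.

```python
import pika

# Placeholder connection details and limits; adjust to your workload.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.queue_declare(
    queue="orders",
    durable=True,
    arguments={
        "x-queue-type": "quorum",
        "x-max-length": 1_000_000,       # cap the backlog at one million messages
        "x-overflow": "reject-publish",  # push back on publishers instead of growing
    },
)
connection.close()
```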
Thank you. This is consistent with what we are observing. I am working on a fix. For what it's worth, I do not think it is related to hardware anymore: we have found a small piece of code that is very inefficient when there are many messages, which makes compactions much slower than they should be. So for the time being we will put back a limit on the number of compactions in flight.