
Changing queue-mode to 'lazy' sometimes locks the queue up #850

Closed
cenekzach opened this issue Jun 20, 2016 · 5 comments

@cenekzach

I'm testing version 3.6.2. I use a 3-node cluster of identical virtual servers: 8 CPUs, 30 GB RAM, 500 GB storage each. There is no HA policy; clients connect to the first node.

I have 20 clients publishing a mix of 1 kB / 10 kB persistent messages to 20 durable queues (single vhost, single exchange). When the queues are around 50k messages long I stop the publishers. Then I add a policy making all the queues in the vhost lazy (everything else is default). Most of the queues behave OK, but sometimes some of them do not.
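For reference, a minimal sketch of this kind of publisher (pika; the host, queue names, and use of the default exchange are placeholders, not the exact test script):

    # Hypothetical repro publisher (placeholder names; the real setup uses its own
    # vhost and exchange). Publishes a mix of 1 kB / 10 kB persistent messages
    # to 20 durable queues.
    import random
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="node1"))
    ch = conn.channel()

    queues = ["test_queue_%d" % i for i in range(20)]
    for q in queues:
        ch.queue_declare(queue=q, durable=True)

    bodies = [b"x" * 1024, b"x" * 10240]  # 1 kB and 10 kB payloads
    while True:
        ch.basic_publish(
            exchange="",                       # default exchange for simplicity
            routing_key=random.choice(queues),
            body=random.choice(bodies),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent
        )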

For example, in one test one of the queues has 17,555 messages in memory (54,920 total) and cannot be purged or deleted; both the UI and a pika client hang. The node consumes 100% of one CPU (1/8), but there is nothing else going on. The only way I found to fix this is to restart the cluster; the node hosting the queue does not stop cleanly and needs to be killed. After the node starts, the 'bad' queue is marked as lazy and works fine.
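For illustration, the purge attempt from pika is just this (placeholder names); against a queue in this state the call never returns:

    # Hypothetical purge/delete attempt against a stuck queue (placeholder names).
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="node1"))
    ch = conn.channel()
    ch.queue_purge(queue="test_queue_7")     # hangs indefinitely for a 'bad' queue
    # ch.queue_delete(queue="test_queue_7")  # deleting hangs the same way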

I tried it again just now: 6/20 queues are in this bad state, with the node using 600% CPU (6/8). Nothing suspicious in the logs.

OS: CentOS 6.7
Erlang: 17.4

@michaelklishin
Member

Please post full log files.

@michaelklishin
Member

Also, have you considered that moving up to 0.5 GB of messages to disk takes some time, and that while that happens queues do not perform any other operations? Do queues come out of this state eventually?

@cenekzach
Author

The messages are published as persistent, so they should already be stored on disk. All the other queues switch to 'lazy' mode pretty fast (a few seconds at most); the 'bad' ones never seem to finish (I think I waited about half an hour).

I tried it again today, but was only able to reproduce the behavior with the publishers still running. Attached is an archive containing logs, configuration, rabbitmqctl status and report output (report hung, so it is incomplete), /api/queues/ output, and a 10 s strace of the 'beam' process. To simplify the configuration I removed the node from the cluster and turned off all the monitoring agents we have in place.
report.tar.gz

It has now been 20 minutes since I applied the policy. 14/20 queues are in the 'bad' state, and the server is using 800% CPU (all of the CPUs available).
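For completeness, a sketch of one way to watch the affected queues via the management HTTP API (the /api/queues/ output in the archive comes from this endpoint; host, credentials, and vhost here are placeholders):

    # Hypothetical check via the management HTTP API (placeholder host/credentials).
    # A 'bad' queue keeps a large messages_ram value and never settles after the
    # lazy policy is applied.
    import requests

    resp = requests.get("http://node1:15672/api/queues/%2F",  # %2F = default vhost
                        auth=("guest", "guest"))
    for q in resp.json():
        print(q["name"], q.get("state"), q.get("messages"), q.get("messages_ram"))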

@michaelklishin
Member

There are no errors in the logs. We'd have to investigate after the 3.6.3 release.

@Gsantomaggio
Member

Hi @cenekzach

I can reproduce the issue in this way:

  1. Publish messages at a high rate, until the queues go into flow state:

    rabbitmqctl list_queues name arguments state
    test_850_3  []  flow
    test_850_11 []  flow
    test_850_15 []  flow
    test_850_27 []  flow
    test_850_14 []  flow
    test_850_18 []  flow
    test_850_28 []  flow
    test_850_1  []  flow
    test_850_7  []  flow
    test_850_10 []  flow
    
  2. Apply the policy: rabbitmqctl set_policy Lazy "" '{"queue-mode":"lazy"}' --apply-to queues

  3. Stop publishing.

What happens:

The queues enter wait_for_msg_store_credit.

credit_flow:blocked() should return true, but it always returns false:

(rabbit@mac)6>
(rabbit@mac)6> dbg:start().
{ok,<0.3806.0>}
(rabbit@mac)7> dbg:tracer().
{ok,<0.3806.0>}
(rabbit@mac)8> dbg:tpl(credit_flow, x).
{ok,[{matched,rabbit@mac,14},{saved,x}]}
(rabbit@mac)9>
(rabbit@mac)9> dbg:p(all, c).
(<0.3540.0>) call credit_flow:blocked()
(<0.3540.0>) returned from credit_flow:blocked/0 -> false
(<0.3446.0>) call credit_flow:blocked()
{ok,[{matched,rabbit@mac,310}]}
(<0.3446.0>) returned from credit_flow:blocked/0 -> false
(<0.3540.0>) call credit_flow:blocked()
(rabbit@mac)10> (<0.3471.0>) call credit_flow:blocked()
(<0.3522.0>) call credit_flow:blocked()
(<0.3446.0>) call credit_flow:blocked()
(<0.3540.0>) returned from credit_flow:blocked/0 -> false
(<0.3471.0>) returned from credit_flow:blocked/0 -> false

It enters a loop.
