
Changing queue-mode to 'lazy' sometimes locks the queue up #850

Closed
cenekzach opened this issue Jun 20, 2016 · 5 comments

@cenekzach

I'm testing version 3.6.2. I use a 3-node cluster of identical virtual servers: 8 CPUs, 30 GB RAM, 500 GB storage each. There is no HA policy; clients connect to the first node.

I have 20 clients publishing a mix of 1 kB / 10 kB persistent messages to 20 durable queues (single vhost, single exchange). When the queues are around 50k messages long I stop the publishers. Then I add a policy making all the queues in the vhost lazy (everything else is default). Most of the queues behave OK, but sometimes some of them do not.
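For reference, a minimal sketch of this kind of publisher (pika; the host, queue names, and use of the default exchange are placeholders, not the exact test script):

    # Hypothetical repro publisher (placeholder names; the real setup uses its own
    # vhost and exchange). Publishes a mix of 1 kB / 10 kB persistent messages
    # to 20 durable queues.
    import random
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="node1"))
    ch = conn.channel()

    queues = ["test_queue_%d" % i for i in range(20)]
    for q in queues:
        ch.queue_declare(queue=q, durable=True)

    bodies = [b"x" * 1024, b"x" * 10240]  # 1 kB and 10 kB payloads
    while True:
        ch.basic_publish(
            exchange="",                       # default exchange for simplicity
            routing_key=random.choice(queues),
            body=random.choice(bodies),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent
        )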

For example, in one test one of the queues has 17,555 messages in memory (54,920 total) and cannot be purged or deleted; both the UI and a pika client hang. The node consumes 100% of one CPU (1/8), but there is nothing else going on. The only way I found to fix this is to restart the cluster; the node hosting the queue does not stop cleanly and needs to be killed. After the node starts, the 'bad' queue is marked as lazy and works fine.
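For illustration, the purge attempt from pika is just this (placeholder names); against a queue in this state the call never returns:

    # Hypothetical purge/delete attempt against a stuck queue (placeholder names).
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="node1"))
    ch = conn.channel()
    ch.queue_purge(queue="test_queue_7")     # hangs indefinitely for a 'bad' queue
    # ch.queue_delete(queue="test_queue_7")  # deleting hangs the same way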

I tried it again just now: 6/20 queues are in this bad state, with the node using 600% CPU (6/8). Nothing suspicious in the logs.

OS: CentOS 6.7
Erlang: 17.4

@michaelklishin
Member

Please post full log files.

@michaelklishin
Member

Also, have you considered that moving up to 0.5 GB of messages to disk takes some time, and that while that happens queues do not perform any other operations? Do queues come out of this state eventually?

@cenekzach
Author

The messages are published as persistent, so they should already be stored on disk. All the other queues switch to 'lazy' mode pretty fast (a few seconds at most); the 'bad' ones never seem to finish (I think I waited about half an hour).

I tried it again today, but was only able to reproduce the behavior with the publishers still running. Attached is an archive containing logs, configuration, rabbitmqctl status and report output (report hung, so it is incomplete), /api/queues/ output, and a 10 s strace of the 'beam' process. To simplify the configuration I removed the node from the cluster and turned off all the monitoring agents we have in place.
report.tar.gz

It has now been 20 minutes since I applied the policy. 14/20 queues are in the 'bad' state, and the server is using 800% CPU (all of the CPUs available).
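For completeness, a sketch of one way to watch the affected queues via the management HTTP API (the /api/queues/ output in the archive comes from this endpoint; host, credentials, and vhost here are placeholders):

    # Hypothetical check via the management HTTP API (placeholder host/credentials).
    # A 'bad' queue keeps a large messages_ram value and never settles after the
    # lazy policy is applied.
    import requests

    resp = requests.get("http://node1:15672/api/queues/%2F",  # %2F = default vhost
                        auth=("guest", "guest"))
    for q in resp.json():
        print(q["name"], q.get("state"), q.get("messages"), q.get("messages_ram"))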

@michaelklishin
Member

There are no errors in the logs. We'd have to investigate after the 3.6.3 release.

@Gsantomaggio
Member

Hi @cenekzach

I can reproduce the issue in this way:

  1. Publish messages at a high rate, until the queues go into flow state:

    rabbitmqctl list_queues name arguments state
    test_850_3  []  flow
    test_850_11 []  flow
    test_850_15 []  flow
    test_850_27 []  flow
    test_850_14 []  flow
    test_850_18 []  flow
    test_850_28 []  flow
    test_850_1  []  flow
    test_850_7  []  flow
    test_850_10 []  flow
    
  2. Apply the policy: rabbitmqctl set_policy Lazy "" '{"queue-mode":"lazy"}' --apply-to queues

  3. Stop publishing.

What happens:

The queues enter wait_for_msg_store_credit.

credit_flow:blocked() should return true, but it always returns false:

(rabbit@mac)6>
(rabbit@mac)6> dbg:start().
{ok,<0.3806.0>}
(rabbit@mac)7> dbg:tracer().
{ok,<0.3806.0>}
(rabbit@mac)8> dbg:tpl(credit_flow, x).
{ok,[{matched,rabbit@mac,14},{saved,x}]}
(rabbit@mac)9>
(rabbit@mac)9> dbg:p(all, c).
(<0.3540.0>) call credit_flow:blocked()
(<0.3540.0>) returned from credit_flow:blocked/0 -> false
(<0.3446.0>) call credit_flow:blocked()
{ok,[{matched,rabbit@mac,310}]}
(<0.3446.0>) returned from credit_flow:blocked/0 -> false
(<0.3540.0>) call credit_flow:blocked()
(rabbit@mac)10> (<0.3471.0>) call credit_flow:blocked()
(<0.3522.0>) call credit_flow:blocked()
(<0.3446.0>) call credit_flow:blocked()
(<0.3540.0>) returned from credit_flow:blocked/0 -> false
(<0.3471.0>) returned from credit_flow:blocked/0 -> false

It enters a loop.
