Inconsistent QQ state (between mnesia & the Ra state machine) when removing a node under high CPU load #11029
-
Description

When a cluster broker is under high CPU load, removing a node can leave the quorum queue membership recorded in mnesia inconsistent with the membership in the Ra state machine. Once this happens, subsequent membership operations on the affected queue fail (see the example outputs below). Please let me know if I can provide more information; I'm able to reproduce this quite easily in my test stack.

Reproduction
Example logs and command outputs

Here is a sample queue:

Failure to remove member:

< NOTE: there is a time gap between the previous quorum_status command and the following 2 commands >

But:

Cannot run:
-
The guide on Upgrades explicitly recommends upgrading when the system is not under stress. QQ or stream membership changes do update two places, and under close to peak CPU (or disk I/O) load, one of them can hit a timeout. Khepri won't change things dramatically either: while it is much closer to quorum queues and streams in terms of the algorithm used, you still run the risk of hitting a timeout, and that risk grows with load. Ra supports timeouts for specific state machine operations, and many CLI tool commands accept a `--timeout` flag.
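As a rough illustration of the kind of per-operation timeout Ra supports (this is a hedged sketch: the module name, server id, and 30-second value are invented for the example):

```erlang
%% Hedged sketch only: Ra operations can take an explicit timeout in
%% milliseconds. The server id and the 30-second value are made up.
-module(qq_timeout_example).
-export([check_members/0]).

check_members() ->
    ServerId = {'%2F_my-qq', 'rabbit@node-1'},
    %% Query membership, waiting up to 30 seconds instead of the default.
    case ra:members(ServerId, 30000) of
        {ok, Members, Leader} ->
            {members, Members, leader, Leader};
        {timeout, _} = Timeout ->
            %% Under heavy CPU or disk load this is the outcome to expect.
            Timeout;
        {error, _} = Err ->
            Err
    end.
```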
But even with higher timeouts, with two places where the list of members is stored, you will always run this risk. This is why #8218 introduces a periodic repair operation. Still, the recommendation not to upgrade clusters under close to peak load stands and always will. This simply doesn't come up often enough, as about six years of real-world quorum queue experience suggests. Perhaps that's because most RabbitMQ clusters are upgraded outside of their peak load periods.
-
@michaelklishin Thank you for the quick response! I strongly agree with the recommendation not to upgrade / terminate instances under peak load. This report comes out of a series of tests we're doing to validate the stability of QQs under different edge cases. Regarding mitigation, as a user, I would expect either …
-
I think I can see a problem here. If we look at the error that was returned, we can see that it is an aggregate error. When we try to remove a member, we try a list of members (obtained from mnesia), and if any of them errors, we try the next one, and so on. We can see here that one of the members returned an error.

I think we need to make a change where we evaluate the error properly after trying each member, so that certain replies are handled directly rather than just falling through to the next member. I do think this particular error must have occurred after we'd already failed to update mnesia in another test, and it would be good to see the errors from the first failed run as well.

It appears we have a function for making the amqqueue record match the truth (ra:members/1), but it isn't called automatically from anywhere. Perhaps we should do this periodically, like we do for the leader pid.

Also related: #7863
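A minimal sketch of the kind of per-member evaluation being suggested; the module, the error classification, and the stop-on-definitive-reply policy are illustrative assumptions, not the actual rabbit_quorum_queue logic:

```erlang
%% Hypothetical sketch: try each known server id in turn, but stop as soon
%% as a live member gives a definitive reply instead of blindly collecting
%% every failure into one aggregate error.
-module(qq_remove_member_example).
-export([try_remove/2]).

try_remove([], _MemberToRemove) ->
    {error, no_servers_reachable};
try_remove([ServerId | Rest], MemberToRemove) ->
    case ra:remove_member(ServerId, MemberToRemove) of
        {ok, _, _} ->
            ok;
        {error, Reason} = Err when Reason =/= noproc ->
            %% A reply from a reachable member is worth acting on directly
            %% rather than moving on to the next server (assumed policy).
            Err;
        _Other ->
            %% Unreachable/dead member or timeout: try the next server id.
            try_remove(Rest, MemberToRemove)
    end.
```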
-
@kjnilsson We could add it to auto reconcile, but I feel it should perhaps be done by some other periodic process (as auto reconcile might not be turned on by most users).
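As a sketch of what such a separate periodic process could look like (everything here is hypothetical: the module, the interval, and where the record comparison would happen):

```erlang
%% Hypothetical sketch of a standalone periodic membership check,
%% independent of the auto-reconcile feature mentioned above.
-module(qq_periodic_repair_example).
-export([start/1, loop/1]).

-define(INTERVAL, 60000). %% arbitrary interval chosen for the sketch

%% ServerIds is a list of {RaName, Node} tuples for local quorum queues.
start(ServerIds) ->
    spawn_link(?MODULE, loop, [ServerIds]).

loop(ServerIds) ->
    [reconcile(S) || S <- ServerIds],
    timer:sleep(?INTERVAL),
    loop(ServerIds).

reconcile(ServerId) ->
    case ra:members(ServerId) of
        {ok, Members, _Leader} ->
            %% Placeholder: this is where the stored amqqueue record would
            %% be compared against Members and repaired if they disagree.
            {ok, Members};
        Other ->
            %% Timeout or error: leave the record alone, retry next tick.
            Other
    end.
```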
-
I have created a PR to Ra which I think could be part of improving the behaviour here: rabbitmq/ra#433. Before that change, all errors were handled the same way. Combined with this change, we could improve the code in rabbitmq-server/deps/rabbit/src/rabbit_quorum_queue.erl (lines 1281 to 1282 in a8bcf4a) such that we still update the mnesia record when the removal returns that specific error.
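A minimal sketch of the shape of that change; `not_a_member` below is only a placeholder for whatever specific error rabbitmq/ra#433 actually surfaces, and `update_record/2` stands in for the real metadata update:

```erlang
%% Hypothetical sketch: treat "the Ra cluster has already dropped this
%% member" as a reason to update the local record anyway, instead of
%% lumping it in with every other failure.
-module(qq_remove_and_update_example).
-export([remove_and_update/3]).

remove_and_update(ServerId, MemberToRemove, Q) ->
    case ra:remove_member(ServerId, MemberToRemove) of
        {ok, _, _} ->
            update_record(Q, MemberToRemove);
        {error, not_a_member} ->
            %% Placeholder error atom: the Ra side no longer knows this
            %% member, so the local record is the stale one; fix it anyway.
            update_record(Q, MemberToRemove);
        Other ->
            Other
    end.

update_record(_Q, _Member) ->
    %% Placeholder for shrinking the members list stored in mnesia/Khepri.
    ok.
```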
-
@michaelklishin with #11065 and #11278 I think we can close this discussion as solved!