Inconsistent QQ state (between mnesia & the Ra state machine) when removing a node under high CPU load #11029
-
Description

When a cluster broker is under high CPU load, removing a node can leave the quorum queue membership recorded in mnesia inconsistent with the membership in the Ra state machine. Once this happens, subsequent membership operations on the affected queue fail (see the example outputs below). Please let me know if I can provide more information; I'm able to reproduce this quite easily in my test stack.

Reproduction
Example logs and command outputs

Here is a sample queue:

Failure to remove member:

< NOTE: there is a time gap between the previous quorum_status command and the following 2 commands >

But:

Cannot run:
-
The guide on Upgrades explicitly recommends upgrading when the system is not under stress. QQ or stream membership changes do update two places, and under close to peak CPU (or disk I/O) load, one of them can hit a timeout. Khepri won't change things dramatically either: while it is much closer to quorum queues and streams in terms of the algorithm used, you still run the risk of hitting a timeout, and that risk grows with load. Ra supports timeouts for specific state machine operations, and many CLI tool commands accept a `--timeout` flag.
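As a rough illustration of the kind of per-operation timeout Ra supports (this is a hedged sketch: the module name, server id, and 30-second value are invented for the example):

```erlang
%% Hedged sketch only: Ra operations can take an explicit timeout in
%% milliseconds. The server id and the 30-second value are made up.
-module(qq_timeout_example).
-export([check_members/0]).

check_members() ->
    ServerId = {'%2F_my-qq', 'rabbit@node-1'},
    %% Query membership, waiting up to 30 seconds instead of the default.
    case ra:members(ServerId, 30000) of
        {ok, Members, Leader} ->
            {members, Members, leader, Leader};
        {timeout, _} = Timeout ->
            %% Under heavy CPU or disk load this is the outcome to expect.
            Timeout;
        {error, _} = Err ->
            Err
    end.
```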
But even with higher timeouts, with two places where the list of members is stored, you will always run this risk. This is why #8218 introduces a periodic repair operation. Still, the recommendation not to upgrade clusters under close to peak load stands and always will. This simply doesn't come up often enough, as about six years of real-world quorum queue experience suggests. Perhaps that's because most RabbitMQ clusters are upgraded outside of their peak load periods.
-
@michaelklishin Thank you for the quick response! I strongly agree with the recommendation not to upgrade / terminate instances under peak load. This report comes out of a series of tests we're doing to validate the stability of QQs under different edge cases. Regarding mitigation, as a user, I would expect either …
-
I think I can see a problem here. If we look at the error that was returned, we can see that it is an aggregate error. When we try to remove a member, we try a list of members (obtained from mnesia), and if any of them errors, we try the next one, and so on. We can see here that one of the members returned an error.

I think we need to make a change where we evaluate the error properly after trying each member, so that certain replies are handled directly rather than just falling through to the next member. I do think this particular error must have occurred after we'd already failed to update mnesia in another test, and it would be good to see the errors from the first failed run as well.

It appears we have a function for making the amqqueue record match the truth (ra:members/1), but it isn't called automatically from anywhere. Perhaps we should do this periodically, like we do for the leader pid.

Also related: #7863
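A minimal sketch of the kind of per-member evaluation being suggested; the module, the error classification, and the stop-on-definitive-reply policy are illustrative assumptions, not the actual rabbit_quorum_queue logic:

```erlang
%% Hypothetical sketch: try each known server id in turn, but stop as soon
%% as a live member gives a definitive reply instead of blindly collecting
%% every failure into one aggregate error.
-module(qq_remove_member_example).
-export([try_remove/2]).

try_remove([], _MemberToRemove) ->
    {error, no_servers_reachable};
try_remove([ServerId | Rest], MemberToRemove) ->
    case ra:remove_member(ServerId, MemberToRemove) of
        {ok, _, _} ->
            ok;
        {error, Reason} = Err when Reason =/= noproc ->
            %% A reply from a reachable member is worth acting on directly
            %% rather than moving on to the next server (assumed policy).
            Err;
        _Other ->
            %% Unreachable/dead member or timeout: try the next server id.
            try_remove(Rest, MemberToRemove)
    end.
```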
-
@kjnilsson We could add it to auto reconcile, but I feel it should perhaps be done by some other periodic process (as auto reconcile might not be turned on by most users).
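As a sketch of what such a separate periodic process could look like (everything here is hypothetical: the module, the interval, and where the record comparison would happen):

```erlang
%% Hypothetical sketch of a standalone periodic membership check,
%% independent of the auto-reconcile feature mentioned above.
-module(qq_periodic_repair_example).
-export([start/1, loop/1]).

-define(INTERVAL, 60000). %% arbitrary interval chosen for the sketch

%% ServerIds is a list of {RaName, Node} tuples for local quorum queues.
start(ServerIds) ->
    spawn_link(?MODULE, loop, [ServerIds]).

loop(ServerIds) ->
    [reconcile(S) || S <- ServerIds],
    timer:sleep(?INTERVAL),
    loop(ServerIds).

reconcile(ServerId) ->
    case ra:members(ServerId) of
        {ok, Members, _Leader} ->
            %% Placeholder: this is where the stored amqqueue record would
            %% be compared against Members and repaired if they disagree.
            {ok, Members};
        Other ->
            %% Timeout or error: leave the record alone, retry next tick.
            Other
    end.
```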
-
I have created a PR to Ra which I think could be part of improving the behaviour here: rabbitmq/ra#433. Before that change, all errors were handled the same way. Combined with this change, we could improve the code in rabbitmq-server/deps/rabbit/src/rabbit_quorum_queue.erl (lines 1281 to 1282 in a8bcf4a) such that we still update the mnesia record when the removal returns that specific error.
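A minimal sketch of the shape of that change; `not_a_member` below is only a placeholder for whatever specific error rabbitmq/ra#433 actually surfaces, and `update_record/2` stands in for the real metadata update:

```erlang
%% Hypothetical sketch: treat "the Ra cluster has already dropped this
%% member" as a reason to update the local record anyway, instead of
%% lumping it in with every other failure.
-module(qq_remove_and_update_example).
-export([remove_and_update/3]).

remove_and_update(ServerId, MemberToRemove, Q) ->
    case ra:remove_member(ServerId, MemberToRemove) of
        {ok, _, _} ->
            update_record(Q, MemberToRemove);
        {error, not_a_member} ->
            %% Placeholder error atom: the Ra side no longer knows this
            %% member, so the local record is the stale one; fix it anyway.
            update_record(Q, MemberToRemove);
        Other ->
            Other
    end.

update_record(_Q, _Member) ->
    %% Placeholder for shrinking the members list stored in mnesia/Khepri.
    ok.
```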
-
@michaelklishin with #11065 and #11278 I think we can close this discussion as solved!