Nodes fail to form a cluster, so classic mirrored queue leader election cannot proceed #3837
-
We recently encountered an issue where the RabbitMQ cluster fails to continue operations after a sequence of failover events on the cluster servers. Below is the sequence of steps to reproduce it:
The expectation at this point is that, since the ha-promote-on-failure flag defaults to always, server1 will become the leader of the queue. But server1 fails to come up and continuously terminates with the below logs on the server.
When we looked into the on-disk snapshot of the queue, the idx files are missing from disk; only metadata files are present at this point. The RabbitMQ cluster remains in this state and never succeeds in booting up from this point.
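For anyone trying to reproduce this, queue leader placement and mirror synchronisation can be inspected with something along these lines (the fields listed are illustrative, not the exact ones from our setup):

```bash
# Sketch: show where each queue's leader lives (the pid column includes the node name)
# and which mirrors are currently synchronised.
rabbitmqctl list_queues name policy pid slave_pids synchronised_slave_pids
```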
-
Since it appears it is relatively easy for you to reproduce this issue, could you please re-run your test using the current version of RabbitMQ? https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.11 Also, please note that two-node RabbitMQ clusters are not supported. Finally, it would be extremely helpful if you could provide a script we can run to set up a k8s environment in exactly the same manner as you have. It takes a significant amount of time that we frankly don't have to try to reproduce an environment based on a description of how you have set yours up.
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this but this is how the current GitHub conversion mechanism makes it seem for the users :(
-
This behavior has nothing to do with queues. Your nodes fail to form a cluster after a restart. See Restarting Cluster Nodes to learn what the assumptions are. The key one is: all previously existing cluster nodes must come online within a 5 minute window of time by default. This is evident from the logs where the node times out after 10 attempts to contact its peer:
On Kubernetes, a poorly picked readiness probe will lead to a deadlock: it expects a node to report as fully booted, which it won't, because the node waits for its peers to come online, which they never will, because Kubernetes can be instructed to deploy pods one by one. This can be solved either by forming the cluster in parallel (our own Kubernetes Operator does that with modern RabbitMQ versions) or by using a very basic health check as the readiness probe, as the docs demonstrate. This interplay between readiness probes and the assumptions RabbitMQ cluster nodes make upon restart has been discussed many times before.

Until all of that happens and the nodes re-form the cluster, none of your classic mirrored queue settings will have any effect. I should also mention that quorum queues would not magically work around the problem either, if a readiness probe or another deployment feature prevents the nodes from rejoining their peers.
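For illustration, a very basic readiness probe of the kind mentioned above could look roughly like this; the command and timing values are an assumption for the sketch, not something taken from the reporter's manifests:

```yaml
# Sketch: a minimal readiness probe that only checks the node responds,
# instead of waiting for it to report as fully booted and clustered.
readinessProbe:
  exec:
    command: ["rabbitmq-diagnostics", "ping", "--quiet"]
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 10
```

The point is only that the probe must not require the node to be fully booted, because a restarted node deliberately waits for its peers before finishing boot.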
-
Hi @michaelklishin, @lukebakken, we tried the additional tests below, in line with your previous responses.
We also tried the same sequence on a queue with "ha-promote-on-failure" set to "when-synced", "ha-sync-mode" set to "automatic", and "ha-mode" set to "all". The results were the same: the cluster fails to recover. On the request for a script to deploy a k8s cluster: we used a StatefulSet and pods only to run the RabbitMQ servers, and none of the k8s behavior affected the test flow above, so the same steps can be reproduced with a plain RPM deployment. This issue should not be k8s specific.
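For reference, a policy with the settings described above would typically be declared along these lines (the policy name and queue name pattern are illustrative, not the ones used in the test):

```bash
# Sketch: classic queue mirroring policy matching the settings described above.
rabbitmqctl set_policy ha-all "^" \
  '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-failure":"when-synced"}' \
  --apply-to queues
```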
You've demonstrated that the cluster can be restored when nodes come up within 5 minutes, as is documented.
We don't know anything about your liveness and readiness checks. All I can say is that it is imperative all nodes are up and running well within 5 minutes so that cluster re-formation does not time out.
#3837 (comment)
https://blog.rabbitmq.com/posts/2020/08/deploying-rabbitmq-to-kubernetes-whats-involved/
https://rabbitmq.com/clustering.html#restarting-readiness-probes
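For reference, the 5 minute window mentioned above maps to two rabbitmq.conf settings, which can be raised if nodes legitimately need longer to come back; the values below are the documented defaults:

```ini
# How many attempts a restarted node makes to sync tables with its peers (default 10)
mnesia_table_loading_retry_limit = 10
# How long each attempt waits, in milliseconds (default 30000, i.e. 10 x 30 s = 5 minutes)
mnesia_table_loading_retry_timeout = 30000
```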
We strongly suggest using our official k8s operator -
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
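As a rough sketch of what that looks like, a three-node cluster can be declared with a short manifest such as the one below (the name and replica count are placeholders):

```yaml
# Sketch: a minimal RabbitmqCluster resource for the official Kubernetes Operator.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: example-rabbitmq
spec:
  replicas: 3
```

The Operator also takes care of cluster formation and probe configuration, the two aspects discussed earlier in this thread.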