Queue mirror process terminates on updating delta #944

Closed
dcorbacho opened this issue Sep 7, 2016 · 5 comments
@dcorbacho (Contributor)

Similar to #687 but with a different root cause (under investigation), as no priority queues are used here.
The 3-node cluster uses persistent queues, autoheal, an "ha-mode: all" policy and automatic synchronisation.

Partial partitions are simulated between the nodes while messages are being published and consumed.
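For reference, a policy equivalent to the one visible in the state dump below (name 'ha-all', pattern '.*', applied to all) can be declared with rabbitmqctl; the exact command used in this test environment is an assumption:

rabbitmqctl set_policy -p / --apply-to all ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'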

=ERROR REPORT==== 7-Sep-2016::11:46:59 ===
** Generic server <0.22142.0> terminating
** Last message in was {'$gen_cast',{gm,{depth,600}}}
** When Server state == {state,
                         {amqqueue,
                          {resource,<<"/">>,queue,<<"test_4">>},
                          true,false,none,[],<32924.28899.0>,[],[],[],
                          [{vhost,<<"/">>},
                           {name,<<"ha-all">>},
                           {pattern,<<".*">>},
                           {'apply-to',<<"all">>},
                           {definition,
                            [{<<"ha-mode">>,<<"all">>},
                             {<<"ha-sync-mode">>,<<"automatic">>}]},
                           {priority,0}],
                          [{<32924.28900.0>,<32924.28899.0>}],
                          [],live},

...

** Reason for termination ==
** {{badmatch,600},
    [{rabbit_mirror_queue_slave,update_delta,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,989}]},
     {rabbit_mirror_queue_slave,process_instruction,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,945}]},
     {rabbit_mirror_queue_slave,handle_cast,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,260}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
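Both traces in this issue fail inside rabbit_mirror_queue_slave:update_delta/2, which appears to enforce its synchronisation invariant with Erlang pattern-match assertions, so an unexpected depth change terminates the gen_server2 process with a badmatch. A minimal, simplified sketch of that style (illustrative only, not the exact source referenced by the line numbers above):

-module(update_delta_sketch).
-export([update_delta/2]).

%% Simplified sketch of the assertion style in rabbit_mirror_queue_slave;
%% illustrative only, not the exact RabbitMQ source.
-record(state, {depth_delta}).

update_delta(_DeltaChange, State = #state{depth_delta = undefined}) ->
    %% Not yet synchronised with a master: depth changes carry no meaning.
    State;
update_delta(DeltaChange, State = #state{depth_delta = 0}) ->
    %% Fully synced mirror: the delta must stay at zero. A depth broadcast
    %% from a different master makes DeltaChange non-zero, so this match
    %% fails and the process dies with {badmatch, DeltaChange}, which would
    %% fit both the {badmatch,600} here and the {badmatch,-2322} in the
    %% later trace.
    0 = DeltaChange,
    State;
update_delta(DeltaChange, State = #state{depth_delta = Delta}) ->
    %% Catching up: the gap to the master may only shrink, never grow.
    true = DeltaChange =< 0,
    State#state{depth_delta = Delta + DeltaChange}.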
@amulyas commented Sep 7, 2016

My understanding is that only metadata about the queues gets stored in the Mnesia DB, not the queue contents. So which process actually synchronises the queues? Is it rabbitmq-server?

@michaelklishin (Member)

Please post questions to rabbitmq-users or Stack Overflow. RabbitMQ uses GitHub issues for specific actionable items engineers can work on, not questions. Thank you.

@michaelklishin (Member)

See rabbitmq/internals.

@michaelklishin michaelklishin changed the title Slave crash on updating delta Mirror process terminates on updating delta Sep 7, 2016
@michaelklishin michaelklishin changed the title Mirror process terminates on updating delta Queue mirror process terminates on updating delta Sep 7, 2016
@dcorbacho (Contributor, Author)

One of the causes of this crash is the coexistence of several slaves for the same queue on the same node. This is probably triggered during partial partitions and the restarts performed by autoheal, when several masters can be alive on different nodes at the same time while those nodes are disconnected. Mnesia updates can propagate views of the cluster in which the first slave has disappeared, so a second one is allowed to start. See the logs below.

(note: the warning reports are debug messages added for testing only)

=INFO REPORT==== 16-Sep-2016::10:31:58 ===
Mirrored queue 'test_71' in vhost '/': Adding mirror on node 'rabbit@ubuntu-c1': <0.2078.2>

=INFO REPORT==== 16-Sep-2016::10:32:06 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:06 ===
Mirrored queue 'test_71' in vhost '/': Adding mirror on node 'rabbit@ubuntu-c1': <0.2998.2>

=INFO REPORT==== 16-Sep-2016::10:32:10 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:11 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:30 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:47 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:48 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Promoting slave <[email protected]> to master

=WARNING REPORT==== 16-Sep-2016::10:32:49 ===
(<0.2998.2>) MASTER on promote_backing_queue_state broadcasting 0 for {resource,
                                                                       <<"/">>,
                                                                       queue,
                                                                       <<"test_71">>}
=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Synchronising: 0 messages to synchronise

=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Synchronising: batch size: 4096

=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Synchronising: all slaves already synced

=ERROR REPORT==== 16-Sep-2016::10:32:49 ===
** Generic server <0.2078.2> terminating
** Last message in was {'$gen_cast',{gm,{depth,0}}}
** When Server state == {state,
                         {amqqueue,
                          {resource,<<"/">>,queue,<<"test_71">>},
                          true,false,none,[],<0.2998.2>,[],[],[],
...

** Reason for termination ==
** {{badmatch,-2322},
    [{rabbit_mirror_queue_slave,update_delta,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,991}]},
     {rabbit_mirror_queue_slave,process_instruction,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,945}]},
     {rabbit_mirror_queue_slave,handle_cast,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,260}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
(<0.2998.2>) MASTER terminate(shutdown) {resource,<<"/">>,queue,<<"test_71">>}

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
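For reference, the mirror pids that the cluster metadata currently records for each queue can be inspected with rabbitmqctl; the listing reflects the Mnesia view, which, as described above, can lag behind the processes actually alive during a partial partition:

rabbitmqctl list_queues -p / name pid slave_pids synchronised_slave_pids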

@dcorbacho (Contributor, Author)

When two slaves are alive on the same node, one of them can be promoted to master, and the depth notification from this new master reaches the other local slave. As that slave is synced with another master on another node, it crashes. The presence of multiple masters in the cluster during a partial partition can also cause slaves to synchronise with one master and receive depth notifications from a different one. This is probably also causing #959.

Slaves should detect that they have been removed from the slave list and perform a clean stop. This should avoid at least the majority of these crashes and allow the queue to eventually reach a consistent state. Note that messages may be lost in this situation.
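A hypothetical sketch of such a check, assuming a helper called from the instruction-handling path; the helper name and its placement are assumptions, while rabbit_amqqueue:lookup/1 and the amqqueue record's slave_pids field are the existing API:

%% Hypothetical sketch of the proposed guard, not an actual patch.
-include("rabbit.hrl").  %% amqqueue record definition, from the rabbit source tree

maybe_stop_if_removed(QName, State) ->
    Self = self(),
    case rabbit_amqqueue:lookup(QName) of
        {ok, #amqqueue{slave_pids = SPids}} ->
            case lists:member(Self, SPids) of
                true  -> {noreply, State};
                false ->
                    %% No longer listed as a slave: stop cleanly instead of
                    %% crashing later on a depth broadcast from the wrong master.
                    {stop, normal, State}
            end;
        {error, not_found} ->
            {stop, normal, State}
    end.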
