Queue mirror process terminates on updating delta #944

Closed
dcorbacho opened this issue Sep 7, 2016 · 5 comments
@dcorbacho (Contributor)

Similar to #687 but with a different root cause (under investigation), as no priority queues are used here.
The 3-node cluster uses persistent queues, autoheal, an "ha-mode: all" policy and automatic synchronisation.

Partial partitions are simulated between the nodes while messages are being published and consumed.
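For reference, a policy equivalent to the one visible in the state dump below (name 'ha-all', pattern '.*', applied to all) can be declared with rabbitmqctl; the exact command used in this test environment is an assumption:

rabbitmqctl set_policy -p / --apply-to all ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'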

=ERROR REPORT==== 7-Sep-2016::11:46:59 ===
** Generic server <0.22142.0> terminating
** Last message in was {'$gen_cast',{gm,{depth,600}}}
** When Server state == {state,
                         {amqqueue,
                          {resource,<<"/">>,queue,<<"test_4">>},
                          true,false,none,[],<32924.28899.0>,[],[],[],
                          [{vhost,<<"/">>},
                           {name,<<"ha-all">>},
                           {pattern,<<".*">>},
                           {'apply-to',<<"all">>},
                           {definition,
                            [{<<"ha-mode">>,<<"all">>},
                             {<<"ha-sync-mode">>,<<"automatic">>}]},
                           {priority,0}],
                          [{<32924.28900.0>,<32924.28899.0>}],
                          [],live},

...

** Reason for termination ==
** {{badmatch,600},
    [{rabbit_mirror_queue_slave,update_delta,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,989}]},
     {rabbit_mirror_queue_slave,process_instruction,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,945}]},
     {rabbit_mirror_queue_slave,handle_cast,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,260}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
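Both traces in this issue fail inside rabbit_mirror_queue_slave:update_delta/2, which appears to enforce its synchronisation invariant with Erlang pattern-match assertions, so an unexpected depth change terminates the gen_server2 process with a badmatch. A minimal, simplified sketch of that style (illustrative only, not the exact source referenced by the line numbers above):

-module(update_delta_sketch).
-export([update_delta/2]).

%% Simplified sketch of the assertion style in rabbit_mirror_queue_slave;
%% illustrative only, not the exact RabbitMQ source.
-record(state, {depth_delta}).

update_delta(_DeltaChange, State = #state{depth_delta = undefined}) ->
    %% Not yet synchronised with a master: depth changes carry no meaning.
    State;
update_delta(DeltaChange, State = #state{depth_delta = 0}) ->
    %% Fully synced mirror: the delta must stay at zero. A depth broadcast
    %% from a different master makes DeltaChange non-zero, so this match
    %% fails and the process dies with {badmatch, DeltaChange}, which would
    %% fit both the {badmatch,600} here and the {badmatch,-2322} in the
    %% later trace.
    0 = DeltaChange,
    State;
update_delta(DeltaChange, State = #state{depth_delta = Delta}) ->
    %% Catching up: the gap to the master may only shrink, never grow.
    true = DeltaChange =< 0,
    State#state{depth_delta = Delta + DeltaChange}.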
@amulyas commented Sep 7, 2016

My understanding is that only metadata about the queues gets stored in the Mnesia DB, not the queue contents. So which process actually synchronises the queues? Is it rabbitmq-server?

@michaelklishin (Member)

Please post questions to rabbitmq-users or Stack Overflow. RabbitMQ uses GitHub issues for specific actionable items engineers can work on, not questions. Thank you.

@michaelklishin (Member)

See rabbitmq/internals.

@michaelklishin michaelklishin changed the title Slave crash on updating delta Mirror process terminates on updating delta Sep 7, 2016
@michaelklishin michaelklishin changed the title Mirror process terminates on updating delta Queue mirror process terminates on updating delta Sep 7, 2016
@dcorbacho (Contributor, Author)

One of the causes of this crash is the coexistence of several slaves for the same queue on the same node. This is probably triggered during partial partitions and the restarts performed by autoheal, when several masters can be alive on different nodes at the same time while those nodes are disconnected. Mnesia updates can propagate views of the cluster in which the first slave has disappeared, so a second one is allowed to start. See the logs below.

(note: the warning reports are debug messages added for testing only)

=INFO REPORT==== 16-Sep-2016::10:31:58 ===
Mirrored queue 'test_71' in vhost '/': Adding mirror on node 'rabbit@ubuntu-c1': <0.2078.2>

=INFO REPORT==== 16-Sep-2016::10:32:06 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:06 ===
Mirrored queue 'test_71' in vhost '/': Adding mirror on node 'rabbit@ubuntu-c1': <0.2998.2>

=INFO REPORT==== 16-Sep-2016::10:32:10 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:11 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:30 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:47 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:48 ===
Mirrored queue 'test_71' in vhost '/': Slave <[email protected]> saw deaths of mirrors <[email protected]>

=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Promoting slave <[email protected]> to master

=WARNING REPORT==== 16-Sep-2016::10:32:49 ===
(<0.2998.2>) MASTER on promote_backing_queue_state broadcasting 0 for {resource,
                                                                       <<"/">>,
                                                                       queue,
                                                                       <<"test_71">>}
=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Synchronising: 0 messages to synchronise

=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Synchronising: batch size: 4096

=INFO REPORT==== 16-Sep-2016::10:32:49 ===
Mirrored queue 'test_71' in vhost '/': Synchronising: all slaves already synced

=ERROR REPORT==== 16-Sep-2016::10:32:49 ===
** Generic server <0.2078.2> terminating
** Last message in was {'$gen_cast',{gm,{depth,0}}}
** When Server state == {state,
                         {amqqueue,
                          {resource,<<"/">>,queue,<<"test_71">>},
                          true,false,none,[],<0.2998.2>,[],[],[],
...

** Reason for termination ==
** {{badmatch,-2322},
    [{rabbit_mirror_queue_slave,update_delta,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,991}]},
     {rabbit_mirror_queue_slave,process_instruction,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,945}]},
     {rabbit_mirror_queue_slave,handle_cast,2,
                                [{file,"src/rabbit_mirror_queue_slave.erl"},
                                 {line,260}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
(<0.2998.2>) MASTER terminate(shutdown) {resource,<<"/">>,queue,<<"test_71">>}

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
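For reference, the mirror pids that the cluster metadata currently records for each queue can be inspected with rabbitmqctl; the listing reflects the Mnesia view, which, as described above, can lag behind the processes actually alive during a partial partition:

rabbitmqctl list_queues -p / name pid slave_pids synchronised_slave_pids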

@dcorbacho (Contributor, Author)

When two slaves are alive on the same node, one of them can be promoted to master, and the depth notification from this new master reaches the other local slave. As that slave is synced with another master on another node, it crashes. The presence of multiple masters in the cluster during a partial partition can also cause slaves to synchronise with one master and receive depth notifications from a different one. This is probably also causing #959.

Slaves should detect that they have been removed from the slave list and perform a clean stop. This should avoid at least the majority of these crashes and allow the queue to eventually reach a consistent state. Note that messages may be lost in this situation.
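A hypothetical sketch of such a check, assuming a helper called from the instruction-handling path; the helper name and its placement are assumptions, while rabbit_amqqueue:lookup/1 and the amqqueue record's slave_pids field are the existing API:

%% Hypothetical sketch of the proposed guard, not an actual patch.
-include("rabbit.hrl").  %% amqqueue record definition, from the rabbit source tree

maybe_stop_if_removed(QName, State) ->
    Self = self(),
    case rabbit_amqqueue:lookup(QName) of
        {ok, #amqqueue{slave_pids = SPids}} ->
            case lists:member(Self, SPids) of
                true  -> {noreply, State};
                false ->
                    %% No longer listed as a slave: stop cleanly instead of
                    %% crashing later on a depth broadcast from the wrong master.
                    {stop, normal, State}
            end;
        {error, not_found} ->
            {stop, normal, State}
    end.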
