Quorum queue crash during rolling upgrade to 3.13 (install_snapshot_rpc) #11442
During a rolling upgrade from 3.12.13 to 3.13.2 (Erlang 26.2.5) of a 3-node cluster, a quorum queue follower crashed with the following, when

I suspect the reason is that PR ra#375 changed the

Is this really a bug in rabbitmq/ra? Is there any workaround to push through an upgrade? Maybe using

Tagging @illotum as you have a deep understanding of this part of the code (btw, thanks for the great feature of non-voters).

** State machine 'vhost1_qq1' terminating
** Last event = {{call,
{<27487.7215.0>,
[alias|
#Ref<27487.0.923523.1478895759.3268214785.253450>]}},
{install_snapshot_rpc,13,
{'vhost1_qq1',
'rabbit@host-01'},
#{index => 7511,term => 13,machine_version => 3,
cluster =>
#{{'vhost1_qq1',
'rabbit@host-01'} =>
#{},
{'vhost1_qq1',
'rabbit@host-02'} =>
#{},
{'vhost1_qq1',
'rabbit@host-03'} =>
#{}}},
{1,last},
<<...>>}}
...
** Reason for termination = error:function_clause
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
** [{ra_server,'-make_cluster/2-lists^foldl/2-0-',
[#Fun<ra_server.18.7275718>,#{},
#{{'vhost1_qq1','rabbit@host-01'} =>
#{},
{'vhost1_qq1','rabbit@host-02'} =>
#{},
{'vhost1_qq1','rabbit@host-03'} =>
#{}}],
[{file,"src/ra_server.erl"},{line,2210}]},
{ra_server,make_cluster,2,[{file,"src/ra_server.erl"},{line,2210}]},
{ra_server,handle_receive_snapshot,2,
[{file,"src/ra_server.erl"},{line,1270}]},
{ra_server_proc,handle_receive_snapshot,2,
[{file,"src/ra_server_proc.erl"},{line,1064}]},
{ra_server_proc,receive_snapshot,3,
[{file,"src/ra_server_proc.erl"},{line,817}]},
{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
Replies: 3 comments
😮 Indeed, there is nothing prohibiting leaders from sending out snapshots in the new format. The opposite works: followers will accept both old and new. Perhaps the best fix is to make the leader send both values under different keys.
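The "send both values under different keys" idea can be sketched roughly as below. This is an illustration in Python, not ra's actual code or API. It assumes the old snapshot format carried `cluster` as a flat list of server ids (the stack trace shows `lists:foldl` receiving a map, which is consistent with that, but the exact old shape is an assumption here). The key name `cluster_members` and both function names are hypothetical.

```python
from typing import Any, Dict, Tuple

ServerId = Tuple[str, str]            # (queue name, node name), mirroring the log
Members = Dict[ServerId, Dict[str, Any]]

# Hypothetical key for the new, map-shaped membership data.
NEW_KEY = "cluster_members"


def build_snapshot_meta(index: int, term: int, machine_version: int,
                        members: Members) -> Dict[str, Any]:
    """Leader side: emit the membership in both shapes.

    Assumption: not-yet-upgraded followers read `cluster` and expect a
    plain list of server ids, so that key keeps the legacy shape.
    """
    return {
        "index": index,
        "term": term,
        "machine_version": machine_version,
        "cluster": list(members),      # legacy shape (assumed): list of ids
        NEW_KEY: dict(members),        # new shape: id -> per-member info
    }


def read_cluster(meta: Dict[str, Any]) -> Members:
    """Upgraded follower side: accept either shape."""
    if NEW_KEY in meta:
        return dict(meta[NEW_KEY])
    cluster = meta["cluster"]
    if isinstance(cluster, dict):
        # Map sent under the old key, as in the crash report above.
        return dict(cluster)
    # Legacy list: no per-member info is available, so use empty maps.
    return {server_id: {} for server_id in cluster}
```

With this shape, a follower that only knows the old key still finds a list there, while an upgraded follower prefers the new key and falls back to the old one.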
We've seen that there were some leader abdications during this rolling upgrade. We were wondering if and how they could be related to the snapshot being generated at such an unfortunate moment and, more precisely, why 01 jumped to term 13. The logs below are a combination of the logs from the 3 nodes. They've been simplified, and each line is marked after the timestamp with the corresponding node number.
For what it's worth, I've revisited this discussion several times and have tried to reproduce this issue using this project: https://github.com/lukebakken/rabbitmq-server-11441
No luck! I think I'll close this discussion, but if someone is able to reproduce it, please follow up.