Update durable queues outside of Khepri transactions #10742

the-mikedavis · 2024-03-14T17:40:26Z

@mkuratczyk found a bug that happens when the khepri_db feature flag is enabled and a vhost fails to recover. Specifically this block:

rabbitmq-server/deps/rabbit/src/rabbit_vhost_process.erl

Lines 49 to 53 in e8e50c8

    rabbit_amqqueue:mark_local_durable_queues_stopped(VHost),
    rabbit_log:error("Unable to recover vhost ~tp data. Reason ~tp~n"
                     " Stacktrace ~tp",
                     [VHost, Reason, Stacktrace]),
    {stop, Reason}

The rabbit_amqqueue:mark_local_durable_queues_stopped/1 call fails when Khepri is enabled because it tries to create a transaction function that Khepri disallows: the function finds queues which are not alive (via an RPC call) and marks them as stopped. That is unsafe to do in a Khepri transaction because the function would be executed on each node, so the value of node() would change and side effects like RPC calls would be repeated by each Khepri cluster member. So that function currently errors. (For example, start a 3.13.0 broker, enable khepri_db and execute rabbit_amqqueue:mark_local_durable_queues_stopped(<<"/">>); this will error out.)

We can work around this by using Khepri's advanced API to read each queue together with its version number in the database, filtering and updating the queues outside of a transaction, and then using a transaction to apply the updates. This way we get the transaction-like behavior of atomically updating all queues or none, without the restrictions that Khepri transaction functions would place on FilterFun and UpdateFun.

`rabbit_db_queue:update_durable/2`'s caller (`rabbit_amqqueue:mark_local_durable_queues_stopped/1`) passes a filter function that performs operations which aren't allowed within Khepri transactions, such as looking up and using the current node and executing an RPC. Calling `rabbit_amqqueue:mark_local_durable_queues_stopped/1` on a RabbitMQ node with the `khepri_db` feature flag enabled results in an error. We can safely update a number of queues by using Khepri's `khepri_adv:get_many/3` advanced API, which returns the internal version number of each queue. We can filter and update the queues outside of a transaction function and then perform all updates at once, failing if any queue has changed since the `khepri_adv:get_many/3` query. So we get the main benefits of a transaction but can still execute any update or filter function.
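The read-versions, update-outside, apply-all-or-nothing idea described above can be sketched with a plain Erlang map standing in for the database. This is only an illustrative model under stated assumptions: the module name, `plan_updates/3`, `apply_updates/2` and the `{Version, Queue}` store layout are hypothetical and are not Khepri's API or the code in this pull request. In the real change the versions come from `khepri_adv:get_many/3` and the final step runs as a Khepri transaction.

```erlang
-module(versioned_update_sketch).
-export([plan_updates/3, apply_updates/2, demo/0]).

%% Hypothetical model: Store is #{Name => {Version, Queue}}, where Queue is a
%% map standing in for a queue record. This is NOT the Khepri API.

%% Steps 1 and 2: read every entry with its version, then filter and update
%% outside of any transaction. FilterFun and UpdateFun are unrestricted here,
%% so they could call node() or perform an RPC.
plan_updates(Store, FilterFun, UpdateFun) ->
    maps:fold(
      fun(Name, {Vsn, Q}, Acc) ->
              case FilterFun(Q) of
                  true  -> [{Name, Vsn, UpdateFun(Q)} | Acc];
                  false -> Acc
              end
      end, [], Store).

%% Step 3: apply every planned write, or none of them if any entry's version
%% differs from the version seen when the plan was made.
apply_updates(Store, Updates) ->
    Unchanged = lists:all(
                  fun({Name, Vsn, _Q}) ->
                          case Store of
                              #{Name := {Vsn, _}} -> true;
                              _ -> false
                          end
                  end, Updates),
    case Unchanged of
        true ->
            {ok, lists:foldl(
                   fun({Name, Vsn, Q}, Acc) -> Acc#{Name => {Vsn + 1, Q}} end,
                   Store, Updates)};
        false ->
            {error, conflict}
    end.

demo() ->
    Store0 = #{<<"q1">> => {1, #{durable => true, state => live}},
               <<"q2">> => {1, #{durable => true, state => live}}},
    Filter = fun(#{durable := Durable}) -> Durable end,
    Update = fun(Q) -> Q#{state => stopped} end,
    Updates = plan_updates(Store0, Filter, Update),
    %% Nothing changed since the read, so all updates are applied.
    {ok, Store1} = apply_updates(Store0, Updates),
    #{<<"q1">> := {2, #{state := stopped}}} = Store1,
    %% The same plan is now stale (versions moved on), so nothing is applied.
    {error, conflict} = apply_updates(Store1, Updates),
    ok.
```

Calling `versioned_update_sketch:demo()` exercises both outcomes: a plan applied against an unchanged store succeeds as a whole, and a stale plan is rejected as a whole. The caller decides what to do with a conflict, for example reading again and retrying.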

mkuratczyk · 2024-03-15T09:21:44Z

Thanks, I can no longer reproduce the problem.

Update durable queues outside of Khepri transactions (backport #10742)

the-mikedavis added the backport-v3.13.x label Mar 14, 2024

the-mikedavis marked this pull request as draft March 14, 2024 18:21

Add a unit test for rabbit_amqqueue:mark_local_durable_queues_stopped/1

8a03b28

the-mikedavis force-pushed the md-fix-vhost-recovery-queue-update branch from 0ade88e to 8a03b28 Compare March 14, 2024 18:24

the-mikedavis marked this pull request as ready for review March 14, 2024 18:37

michaelklishin added this to the 3.13.1 milestone Mar 14, 2024

michaelklishin merged commit a04b092 into main Mar 20, 2024
16 checks passed

michaelklishin deleted the md-fix-vhost-recovery-queue-update branch March 20, 2024 14:06

michaelklishin modified the milestones: 3.13.1, 4.0.0 Mar 20, 2024

mergify bot mentioned this pull request Mar 20, 2024

Update durable queues outside of Khepri transactions (backport #10742) #10807

Merged

the-mikedavis added a commit that referenced this pull request Mar 20, 2024

Merge pull request #10807 from rabbitmq/mergify/bp/v3.13.x/pr-10742

289b110

Update durable queues outside of Khepri transactions (backport #10742)
