Update durable queues outside of Khepri transactions #10742

Merged: 2 commits, Mar 20, 2024


the-mikedavis
Member

@mkuratczyk found a bug that happens when the khepri_db feature flag is enabled and a vhost fails to recover. Specifically this block:

rabbit_amqqueue:mark_local_durable_queues_stopped(VHost),
rabbit_log:error("Unable to recover vhost ~tp data. Reason ~tp~n"
" Stacktrace ~tp",
[VHost, Reason, Stacktrace]),
{stop, Reason}

The rabbit_amqqueue:mark_local_durable_queues_stopped/1 call fails when Khepri is enabled because it tries to create a transaction function that Khepri disallows: the function finds queues which are not alive (via an RPC call) and marks them as stopped. That is unsafe in a Khepri transaction because the transaction function is executed on every Khepri cluster member, so the value of node() would differ from member to member and side effects like RPC calls would be repeated by each one. So that function currently errors. For example, start a 3.13.0 broker, enable khepri_db and execute rabbit_amqqueue:mark_local_durable_queues_stopped(<<"/">>): this will error out.

We can work around this by using Khepri's advanced API to read each queue together with its version number in the database, filtering and updating the queues outside of a transaction, and then using a transaction to apply the updates only if no queue has changed in the meantime. This way we get the transaction-like behavior of atomically updating all queues or none, without subjecting FilterFun and UpdateFun to the restrictions of Khepri transaction functions.

`rabbit_db_queue:update_durable/2`'s caller
(`rabbit_amqqueue:mark_local_durable_queues_stopped/1`) passes a filter
function that performs some operations that aren't allowed within
Khepri transactions like looking up and using the current node and
executing an RPC. Calling
`rabbit_amqqueue:mark_local_durable_queues_stopped/1` on a Rabbit with
the `khepri_db` feature flag enabled will result in an error.

We can safely update a number of queues by using Khepri's
`khepri_adv:get_many/3` advanced API which returns the internal version
number of each queue. We can filter and update the queues outside of
a transaction function and then perform all updates at once, failing if
any queue has changed since the `khepri_adv:get_many/3` query. So we
get the main benefits of a transaction but we can still execute any
update or filter function.
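
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the read-versions, update-outside, apply-all-or-nothing idea. It deliberately does not use Khepri: the store is a plain map of `Key => {Version, Value}`, and the module name and the `prepare/3` and `commit/2` functions are hypothetical names made up for illustration, not the code in this PR.

```erlang
-module(versioned_update_sketch).
-export([prepare/3, commit/2]).

%% ASSUMPTION: the "database" is modelled as a map Key => {Version, Value}.
%% This is an illustrative stand-in, not Khepri's API.

%% Runs outside of any transaction, so FilterFun and UpdateFun are free to
%% call node(), perform RPCs, and so on. Returns the pending writes, each
%% tagged with the version that was read from the snapshot.
prepare(Snapshot, FilterFun, UpdateFun) ->
    maps:fold(
      fun(Key, {Vsn, Value}, Acc) ->
              case FilterFun(Value) of
                  true  -> [{Key, Vsn, UpdateFun(Value)} | Acc];
                  false -> Acc
              end
      end, [], Snapshot).

%% Applies every pending write, or none of them: if any key no longer has
%% the version recorded in prepare/3, the store is left untouched.
commit(Store, Updates) ->
    Unchanged = fun({Key, Vsn, _New}) ->
                        case Store of
                            #{Key := {Vsn, _Old}} -> true;
                            _ -> false
                        end
                end,
    case lists:all(Unchanged, Updates) of
        true ->
            {ok, lists:foldl(
                   fun({Key, Vsn, New}, Acc) ->
                           Acc#{Key => {Vsn + 1, New}}
                   end, Store, Updates)};
        false ->
            {error, version_mismatch}
    end.
```

In this model, committing against the same store that was snapshotted succeeds and bumps each version, while committing against a store in which any affected entry was modified in between returns `{error, version_mismatch}` and changes nothing. What the caller does on a mismatch (retry or surface the error) is left open here.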
@the-mikedavis the-mikedavis marked this pull request as draft March 14, 2024 18:21
@the-mikedavis the-mikedavis force-pushed the md-fix-vhost-recovery-queue-update branch from 0ade88e to 8a03b28 Compare March 14, 2024 18:24
@the-mikedavis the-mikedavis marked this pull request as ready for review March 14, 2024 18:37
@michaelklishin michaelklishin added this to the 3.13.1 milestone Mar 14, 2024
@mkuratczyk
Contributor

Thanks, I can no longer reproduce the problem.

@michaelklishin michaelklishin merged commit a04b092 into main Mar 20, 2024
16 checks passed
@michaelklishin michaelklishin deleted the md-fix-vhost-recovery-queue-update branch March 20, 2024 14:06
@michaelklishin michaelklishin modified the milestones: 3.13.1, 4.0.0 Mar 20, 2024
the-mikedavis added a commit that referenced this pull request Mar 20, 2024
Update durable queues outside of Khepri transactions (backport #10742)