CQ: Fix entry missing from cache leading to crash on read (backport #11288) #11292

mergify · 2024-05-21T17:26:33Z

The issue comes from a mechanism that allows us to avoid writing to disk when a message has already been consumed. It works fine in normal circumstances, but fan-out makes things trickier.

When multiple queues write and read the same message, we could get a crash. Let's say queues A and B both handle message Msg.

Queue A asks store to write Msg
Queue B asks store to write Msg
Queue B asks store to delete Msg (message was immediately consumed)
Store processes Msg write from queue A
- Store writes Msg to current file
Store processes Msg write from queue B
- Store notices queue B doesn't need Msg anymore; doesn't write
- Store clears Msg from the cache
Queue A tries to read Msg
- Msg is missing from the cache
- Queue A tries to read from disk
- Msg is in the current write file and may not be on disk yet
- Crash

The problem is that the store clears Msg from the cache. Every message written to the current file must remain in the cache until we roll over to the next file, because we can't guarantee the data is on disk when the time comes to read it.
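
The sequence above can be reproduced with a small toy model. This is only a sketch in Python, not the actual Erlang message store: the class `ToyStore`, its methods and the `keep_current_file_entries` flag are all hypothetical names chosen for illustration, and "crash" is modelled as a `LookupError`.

```python
class ToyStore:
    """Toy model of a shared message store (hypothetical, not RabbitMQ's API)."""

    def __init__(self, keep_current_file_entries):
        # When True, cache entries for messages in the current file are kept
        # until roll-over (the behaviour the fix restores).
        self.keep_current_file_entries = keep_current_file_entries
        self.cache = {}               # msg_id -> payload, readable by queues
        self.current_file = {}        # written, but not guaranteed to be on disk
        self.flushed = {}             # older files, safe to read from disk
        self.inbox = []               # write requests not yet processed
        self.pending_removes = set()  # removes that arrived before their write

    def request_write(self, queue, msg_id, payload):
        self.cache[msg_id] = payload
        self.inbox.append((queue, msg_id, payload))

    def request_remove(self, queue, msg_id):
        self.pending_removes.add((queue, msg_id))

    def process_next_write(self):
        queue, msg_id, payload = self.inbox.pop(0)
        if (queue, msg_id) in self.pending_removes:
            # The message was already consumed: skip the disk write.
            self.pending_removes.discard((queue, msg_id))
            in_current = msg_id in self.current_file
            if not (self.keep_current_file_entries and in_current):
                self.cache.pop(msg_id, None)
            return
        self.current_file[msg_id] = payload

    def roll_over(self):
        # Only now is the data guaranteed to be readable from disk.
        self.flushed.update(self.current_file)
        for msg_id in self.current_file:
            self.cache.pop(msg_id, None)
        self.current_file.clear()

    def read(self, msg_id):
        if msg_id in self.cache:
            return self.cache[msg_id]
        if msg_id in self.flushed:
            return self.flushed[msg_id]
        raise LookupError(f"{msg_id} is neither cached nor readable from disk")


def run_scenario(keep_current_file_entries):
    store = ToyStore(keep_current_file_entries)
    store.request_write("A", "Msg", b"payload")
    store.request_write("B", "Msg", b"payload")
    store.request_remove("B", "Msg")   # B consumed Msg immediately
    store.process_next_write()         # write from A goes to the current file
    store.process_next_write()         # write from B is eliminated
    return store.read("Msg")           # queue A reads Msg
```

Calling `run_scenario(False)` raises `LookupError`, mirroring the crash, while `run_scenario(True)` returns the payload from the cache because the entry survives until roll-over.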

The root cause was an incorrect match: instead of matching a single location from the index, the code was matching against a list. The error had been present in the code for almost 13 years, since commit 2ef30dc.
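
To illustrate why such a mismatch fails silently, here is a minimal Python sketch. It assumes the index lookup returns one location value (or nothing) rather than a list; `MsgLocation`, `index_lookup` and both `should_keep_*` helpers are hypothetical names, and the real code is Erlang pattern matching, not `isinstance` checks.

```python
from dataclasses import dataclass


@dataclass
class MsgLocation:
    file: int  # number of the file that holds the message


def index_lookup(index, msg_id):
    """Return a single MsgLocation, or None if the message is unknown."""
    return index.get(msg_id)


def should_keep_buggy(index, msg_id, current_file):
    loc = index_lookup(index, msg_id)
    # Expects a one-element list. The lookup never returns a list, so this
    # condition is never true and the cache entry is always cleared.
    return isinstance(loc, list) and len(loc) == 1 and loc[0].file == current_file


def should_keep_fixed(index, msg_id, current_file):
    loc = index_lookup(index, msg_id)
    # Matches the single location the lookup actually returns.
    return isinstance(loc, MsgLocation) and loc.file == current_file
```

With an index of `{"Msg": MsgLocation(file=7)}` and a current file of 7, the buggy check still decides to clear the cache entry, whereas the fixed check keeps it. Because the wrong shape simply falls through to the default branch, nothing fails at the point of the mistake, which may explain how it went unnoticed for so long.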

Types of Changes

Bug fix (non-breaking change which fixes discussion comment Restarting crashed queue with function_clause reason after update to RabbitMQ 3.13 #10902 (reply in thread))
This is an automatic backport of pull request CQ: Fix entry missing from cache leading to crash on read #11288 done by Mergify.

(cherry picked from commit 93e4e88)

mergify bot assigned lhoguin May 21, 2024

michaelklishin added this to the 3.13.3 milestone May 21, 2024

michaelklishin merged commit 6dbe4de into v3.13.x May 21, 2024
8 of 10 checks passed

michaelklishin deleted the mergify/bp/v3.13.x/pr-11288 branch May 21, 2024 17:27
