database: Potential data race for mutual exclusion #9193
Comments
In case you use |
@hinto-janai I believe you mis-read the code:
while (creation_gate.test_and_set());
num_active_txns++;
creation_gate.clear();
Thread1 is blocked from advancing until
The patch by @0xFFFC0000 is likely overkill, and not needed (that's just my knee-jerk reaction at this point). |
@vtnerd this isn't related to this specific issue but sometimes when there is heavy RPC traffic it's possible that the node can't add new blocks and falls behind due to the current locking mechanism. moneromooo suggested a while ago that a rw lock would help here. |
And to describe why the example doesn't work as @hinto-janai suggested - |
Yes, this may have some resource starvation issues since it's a dirty spin lock instead of some queued lock. But there is no data race afaik. I'm not certain that #9181 fixes this though, I think the original spin lock code would have to be removed as well. |
I believe if you read the PR carefully, you will realize that PR is addressing another issue. As a side-effect, it does solve the data race problem on If you want to go into more technical depth, I am happy to do it :) |
@0xFFFC0000 well there is no data race where @hinto-janai describes, so you are saying there is yet another one? I'm doubting this as well, but sure why not enlighten me. |
I did specifically talk about rw access from About the data race, directly from Keep in mind, if you are interacting with Though I am tracking you and @hinto-janai discussion, to understand the situation. |
And sorry for being a bit rude, I'm just crabby that I have to review this DB change :/ |
No no, I didn't even realize or notice anything rude from you :) We were just discussing technical matters as always. Don't mention it at all. Looking forward to continuing our discussion. Because (possibility of) data race is a multi-level issue, we have to address it. |
@vtnerd is correct in his first reply #9193 (comment). Because of the first |
@0xFFFC0000 is this about the potential starvation issue that @selsta mentioned or some other data race? I still don't know why #9181 was created (perhaps the discussion should move to that PR). It sounds like #9181 was not created due to the specific issue referenced by @hinto-janai in this issue. |
Exactly, the 9181 is for a different issue. But a side-effect of that locking is, as I mentioned in my first comment, if there was a data race in I am available to discuss it on the PR page. |
@vtnerd Ah you are correct. I was looking at There's |
What
There may be a data race when needing to acquire mutual exclusive access to monerod's database, e.g. when resizing.
Invariant
When resizing LMDB's memory map, the caller must ensure they have mutual exclusive access to it.
As per mdb_env_set_mapsize() docs:
An error is returned if there are other write transactions and presumably UB will occur if read transaction(s) exist.
Implementation
The solution to this was implemented in #289, and is still used today. My understanding of this code is:
There are 2 atomic values used to achieve mutual exclusive access to the database:
monero/src/blockchain_db/lmdb/db_lmdb.cpp
Lines 354 to 355 in 059028a
The atomic bool is used to indicate "do not enter, we are resizing". Before starting a transaction, this bool will be spun on until it is false. It is set to true when LMDB is resizing:
monero/src/blockchain_db/lmdb/db_lmdb.cpp
Lines 448 to 451 in 059028a
When resizing, this prevents new transactions from starting.
To handle currently active transactions, each transaction will num_active_txns++ after successfully passing the atomic bool, and will num_active_txns-- when done. The thread resizing will spin until num_active_txns is 0, indicating there are no more transactions:
monero/src/blockchain_db/lmdb/db_lmdb.cpp
Lines 453 to 456 in 059028a
Now, with no new transactions allowed, and all current ones gone, we should have mutual exclusive access to the database and resizing should be OK.
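As a rough illustration of the scheme described above, here is a minimal sketch. It is not the monerod code. Only creation_gate and num_active_txns are names taken from this thread; the TxnGate wrapper, its member functions, and the count_overlaps stress helper are hypothetical. It assumes the gate is an std::atomic_flag that a transaction holds while it increments the counter, as in the snippet quoted in the comments.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical wrapper around the two atomics discussed in this issue.
struct TxnGate {
    std::atomic_flag creation_gate = ATOMIC_FLAG_INIT;
    std::atomic<std::uint64_t> num_active_txns{0};

    // A transaction takes the gate, registers itself, then releases the gate.
    void begin_txn() {
        while (creation_gate.test_and_set()) { std::this_thread::yield(); }
        num_active_txns++;
        creation_gate.clear();
    }
    void end_txn() { num_active_txns--; }

    // The resizer takes the gate and keeps it, so no new transaction can
    // register, then waits for the already registered ones to finish.
    void begin_resize() {
        while (creation_gate.test_and_set()) { std::this_thread::yield(); }
        while (num_active_txns.load() > 0) { std::this_thread::yield(); }
    }
    void end_resize() { creation_gate.clear(); }
};

// Stress helper (hypothetical): counts how often a "resize" observes a
// transaction that is still inside its critical region.
inline int count_overlaps(int readers, int iterations) {
    TxnGate gate;
    std::atomic<int> in_txn{0};
    std::atomic<int> overlaps{0};
    std::atomic<bool> stop{false};

    std::vector<std::thread> workers;
    for (int i = 0; i < readers; ++i) {
        workers.emplace_back([&] {
            while (!stop.load()) {
                gate.begin_txn();
                in_txn++;   // stands in for work done inside a transaction
                in_txn--;
                gate.end_txn();
            }
        });
    }
    for (int i = 0; i < iterations; ++i) {
        gate.begin_resize();
        if (in_txn.load() != 0) { overlaps++; }
        gate.end_resize();
    }
    stop = true;
    for (auto& t : workers) { t.join(); }
    return overlaps.load();
}
```

Calling count_overlaps(2, 100) from a test program exercises the gate with two transaction threads and one resizer. Because the gate is held across the increment in this sketch, a resizer that has taken the gate and seen a zero counter cannot overlap a registered transaction. The spin loops can still starve a thread, since nothing here is queued or fair.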
Problem
2 atomic operations back-to-back are not atomic. Other threads are free to execute in-between 2 atomic operations. What may occur in the above implementation is such:
There is space in-between Thread 2 successfully entering a transaction and updating num_active_txns.
If Thread 2 is scheduled out by the OS after lmdb_txn_begin() but before num_active_txns++ (unlikely but non-zero chance), Thread 1 will incorrectly assume it has mutual exclusive access and start the resize.
monero/src/blockchain_db/lmdb/db_lmdb.cpp
Lines 606 to 608 in 059028a
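The window described here can be replayed deterministically in a single thread. The sketch below is a hypothetical model, not monerod code: NaiveGate, pass_gate, register_txn, try_start_resize and replay_interleaving are made-up names, and the model assumes the gate check and the counter increment are two separate steps with nothing held in between. Whether the real code has such a window is exactly what this issue is asking.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of a check-then-increment gate with a gap between
// the two steps. Only num_active_txns is a name from this issue.
struct NaiveGate {
    std::atomic<bool> resizing{false};
    std::atomic<std::uint64_t> num_active_txns{0};

    // Step 1 of starting a transaction: the "do not enter" flag is clear.
    bool pass_gate() const { return !resizing.load(); }
    // Step 2 of starting a transaction: register as active.
    void register_txn() { num_active_txns++; }

    // Resizer: close the gate, then report whether the counter reads zero.
    bool try_start_resize() {
        resizing.store(true);
        return num_active_txns.load() == 0;
    }
};

// Replays the interleaving from the issue text, one step at a time.
// Returns true if the resizer proceeded while a transaction ended up active.
inline bool replay_interleaving() {
    NaiveGate gate;
    bool t2_passed = gate.pass_gate();          // Thread 2: gate is open
    // Thread 2 is scheduled out here, before it registers itself.
    bool t1_resizes = gate.try_start_resize();  // Thread 1: sees 0 active txns
    gate.register_txn();                        // Thread 2 resumes and registers
    return t2_passed && t1_resizes && gate.num_active_txns.load() == 1;
}
```

Each individual operation in this model is atomic, yet the pair of steps in Thread 2 is not, which is the general point being made. Holding a lock or flag across both the check and the increment closes the gap.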
Some crashes occurring near a resize that could be explained by this: