cryptonote_core: introducing rwlock mechanism #9181
base: master
Conversation
I think you answered in IRC already, but why is this needed? The LMDB class already has proper locking, so what is this doing? And if not, maybe the LMDB class needs to be updated instead of adding yet another layer of locking on top of the existing locks.
I believe if you look at the code carefully, you will realize it does not add another layer. It replaces an old locking mechanism, which does not support read-write, with a newer one.
Yes, I skimmed the code way too quickly. But still, same question: any idea what these mutexes are protecting?
It provides reader-writer locking for the underlying object.
That object shouldn't need protection unless the pointer itself changes. It looks like a reader lock isn't needed at all for these other functions; I don't think this is helping anything, and it is just further cluttering the code.
No, we still need locking. For example, take a value like this [1].
Thanks to @moneromooo-monero for his detailed explanation. Update: the link was fixed and should now point to the right place.
As an extra optimization: there are a few methods we can remove locking from, which I still have to do. Update: Done. Details at [1].
Force-pushed from 70ce45a to 21b68cd.
Force-pushed from 3f86cfb to 494d6be.
After carefully reading through your discussion with @vtnerd, I unfortunately still could not form a "high level" understanding of the purpose and aim of this PR. I feel that such an understanding is necessary to judge the "value" of this PR, and for going into any review with correct expectations. I suspect the following: with the current code, there are no critical problems with locking. Things are always locked when they should be locked. There are no known races and/or deadlocks; basically, nothing known that you could call a "bug". But because all locks are currently exclusive, even for operations that could run in parallel without any problems, performance of DB access is lower than it could be, as all access gets strictly serialized. By introducing a distinction between non-exclusive read locks and exclusive read/write locks, performance gains can be achieved. For introducing this distinction, no new layers or other drastic complications of the code are needed. Is my understanding correct?
Important: please keep in mind that in the following comment, when I say serialization I am talking in the context of synchronization [1]. @rbrunner7 Yes, you are correct.
Let me go through a complete example; keep in mind this is for illustration purposes and might omit some details. Imagine 2 threads are listening to RPC:
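A minimal sketch of the scenario (illustrative names and a std::shared_mutex stand-in, not the PR's code): one RPC thread serving a read-only query and one applying a new block. With the old exclusive mutex the two always serialize; with a rwlock, any number of readers overlap and only the writer is exclusive.

```cpp
#include <cstdint>
#include <shared_mutex>
#include <thread>

std::shared_mutex chain_lock;  // stands in for the blockchain-wide lock
uint64_t height = 0;

void rpc_get_height()          // read-only RPC handler (e.g. a height query)
{
  std::shared_lock<std::shared_mutex> l(chain_lock);  // shared: readers overlap
  (void)height;                // ... serve the query ...
}

void rpc_submit_block()        // mutating RPC handler (e.g. a new block)
{
  std::unique_lock<std::shared_mutex> l(chain_lock);  // exclusive
  ++height;                    // ... apply the block ...
}

int main()
{
  // With an exclusive mutex all three calls run one after another; with the
  // rwlock, the two rpc_get_height() calls may run at the same time.
  std::thread r1(rpc_get_height), r2(rpc_get_height), w(rpc_submit_block);
  r1.join(); r2.join(); w.join();
}
```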
Ref:
@0xFFFC0000 is correct in that a reader lock is necessary to access the db, since another function could modify the pointer, etc. Although that doesn't seem to happen often, as this would've needed changing a long time ago.
It is not about the pointer. Take a look at the example I provided above.
contrib/epee/include/syncobj.h (outdated):
```cpp
    write_mode(false),
    read_mode(0) {};

  bool start_read() noexcept {
```
Why not use boost::shared_mutex with boost::shared_lock and boost::unique_lock directly? Why the extra logic? It already handles all the waiting logic, etc. Is it so the thread can re-enter the write lock? This doesn't appear possible in the two current use cases.
And if you want the extra logic, why not use the same naming scheme, so that boost::unique_lock and boost::shared_lock can be used instead of the custom RAII object/macro (which allocates memory, whereas the boost lock templates do not)?
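For reference, a minimal sketch of the direct-Boost pattern being described here (variable names illustrative):

```cpp
#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>

boost::shared_mutex m;
int value = 0;

void reader()
{
  boost::shared_lock<boost::shared_mutex> l(m);  // shared: many readers at once
  (void)value;                                   // ... read-only access ...
}

void writer()
{
  boost::unique_lock<boost::shared_mutex> l(m);  // exclusive: one writer
  ++value;                                       // ... mutating access ...
}

int main() { writer(); reader(); }
```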
That is the core problem of what makes this complex. Blockchain APIs in blockchain.cpp can call each other recursively. You could redesign all of them and remove recursive locking, but that would be much more work than this, and it needs a lot of logical/semantic changes.
For example, take A -> B -> C: A acquires the write-lock, B acquires a read-lock, C acquires a read-lock.
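A sketch of why a plain shared mutex cannot express that call chain (hypothetical functions; recursive acquisition of std::shared_mutex on the same thread is undefined behaviour and in practice self-deadlocks):

```cpp
#include <shared_mutex>

std::shared_mutex m;

void C_() { std::shared_lock<std::shared_mutex> l(m); /* read */ }
void B_() { std::shared_lock<std::shared_mutex> l(m); C_(); }
void A_()
{
  std::unique_lock<std::shared_mutex> l(m);  // write lock held by this thread...
  B_();  // ...so B_'s shared_lock can never be granted: self-deadlock.
         // A recursion-aware rwlock is needed to allow this nesting.
}

int main() { /* calling A_() would hang, which is the problem being described */ }
```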
Then follow the existing naming scheme for boost::scoped_lock and boost::shared_lock.
What naming scheme?
And keep in mind, these four methods — start_read, end_read, start_write, end_write — should not be used directly. The main interface to this lock is RLOCK and RWLOCK. The only reason we need those 4 methods is in case the user wants to do something custom with the lock. For example, in this specific case [1] we have to use these methods instead of RLOCK and RWLOCK, since we are locking and unlocking the lock inside the scope [2], keeping the logic exactly like the original implementation [3].
- https://github.com/0xFFFC0000/monero/blob/494d6be27e9de576f291428d42fc401d4196327f/src/cryptonote_core/blockchain.cpp#L5045
- https://github.com/0xFFFC0000/monero/blob/494d6be27e9de576f291428d42fc401d4196327f/src/cryptonote_core/blockchain.cpp#L5065
monero/src/cryptonote_core/blockchain.cpp, line 5074 (7b7958b): m_blockchain_lock.unlock();
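A sketch of the mid-scope-unlock pattern being described (the stub and caller are illustrative; only the start_write/end_write names come from this PR):

```cpp
// Stub with the PR's low-level method names so the sketch compiles standalone;
// the real class lives in contrib/epee/include/syncobj.h.
struct reader_writer_lock { void start_write() {} void end_write() {} };

// Hypothetical caller: the lock must be released mid-scope, so the RAII
// RWLOCK() macro (which unlocks only at end of scope) does not fit.
void flush_example(reader_writer_lock& m_blockchain_lock)
{
  m_blockchain_lock.start_write();   // manual acquire
  // ... mutate state ...
  m_blockchain_lock.end_write();     // release early, matching the original
                                     // code's explicit m_blockchain_lock.unlock()
  // ... long-running work that must not hold the lock ...
}
```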
What naming scheme?
If you used the names that shared_mutex uses, then unique_lock and shared_lock could be used instead of the RLOCK and RWLOCK macros. This does not prevent direct calls into those functions. So:
- start_read -> lock_shared
- end_read -> unlock_shared
- start_write -> lock
- end_write -> unlock
Then RLOCK becomes boost::shared_lock<reader_writer_lock> and RWLOCK becomes boost::unique_lock<reader_writer_lock>. I think this is preferable; no allocation is performed.
However, using those classes may require other functions (such as try_shared_lock)?
I see your point. A few points to consider:
- RLOCK/RWLOCK was @moneromooo-monero's suggestion and appeared to be a good solution.
- What you are saying makes sense to me too, but we would have to go one level deeper, and possibly introduce a lot of complexity exactly where we don't want any. For example, we would have to evaluate the exact requirements of boost::shared_lock and boost::unique_lock [1], including support for all the operations they provide. On top of that, that code would have to be debugged to make sure 100% of those features (99% of which we don't want) have no bugs or unintended side effects. That looks to me like a slippery slope.
Keep in mind that we eventually want to remove recursive locking. This PR is a mid-term solution until we clean up all the APIs inside Blockchain. Once all of those APIs are cleaned, we can simply use std::shared_mutex and std::mutex.
Cleaning those APIs will introduce a lot of semantic/logical changes to the code, so it is better to go slowly, IMHO.
contrib/epee/include/syncobj.h (outdated):
```cpp
      rw_mutex.unlock_shared();
      read_mode--;
      if (rw_mutex.try_lock()) {
        CHECK_AND_ASSERT_MES2(!read_mode, "Reader is not zero but goes to read_mode = 0 by " << boost::this_thread::get_id());
```
The function returns here without calling unlock()?
I might be missing something, but I specifically used CHECK_AND_ASSERT_MES2 so that it does not return, and only fails the assert and reports it [1]. The unlock for rw_mutex is at line 220.
Yes, I looked up the definition of the macro, and it doesn't return (whereas I thought it did).
contrib/epee/include/syncobj.h (outdated):
```cpp
  }

  void lock_reader() noexcept {
    CHECK_AND_ASSERT_MES2(read_mode < UINT32_MAX, "Maximum number of readers reached.");
```
This breaks a postcondition of lock_reader...? And doesn't the check need to be made inside of the scoped_lock?
Again, I am using CHECK_AND_ASSERT_MES2, and I believe it does not throw. So even if the assert fails, it only logs and moves ahead [1].
One thing to keep in mind is that the check for the UINT32_MAX number of readers is auxiliary.
For example, the Linux kernel uses this function to calculate the maximum number of threads allowed [2], and its value is an int [3]. I cannot think of any scenario in which we could surpass UINT32_MAX.
- https://github.com/monero-project/monero/blob/7b7958bbd9d76375c47dc418b4adabba0f0b1785/contrib/epee/include/misc_log_ex.h#L205C64-L205C73
- https://github.com/torvalds/linux/blob/04b8076df2534f08bb4190f90a24e0f7f8930aca/kernel/fork.c#L1014
- https://github.com/torvalds/linux/blob/04b8076df2534f08bb4190f90a24e0f7f8930aca/kernel/fork.c#L132
Wait, CHECK_AND_ASSERT_MES2 doesn't return immediately, whereas CHECK_AND_ASSERT_MES_NO_RET does? Confusing.
And I still think you have to perform the check after acquiring the mutex.
> For example, the Linux kernel uses this function to calculate the maximum number of threads allowed [2] and its value is int [3]. I cannot think of any scenario in which we can surpass UINT32_MAX.
What's the purpose of the check then? Just to inform the user? The only other option is to spin-loop block.
> And I still think you have to perform the check after acquiring the mutex.
The reason is that the mutex is blocking, and for debugging I wanted to see the checks before it blocks. But this is one of the cases where I am open to either way. I will put it on my TODO list; the next push will move the check inside the lock.
> Just to inform the user?
Yes, exactly. Since debugging locks is hard and tricky, I left it there so that if, in an extremely rare scenario in 15 years, a user hits this problem, at least we can see it in the log files. I am open to removing that check too, since it is not directly related to the core logic of the lock.
Even if this check is auxiliary, technically read_mode either needs to be atomic, or the check needs to happen while internal_mutex is owned, if you want the loaded value of read_mode to be well-defined.
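A sketch of the two options being suggested (member names read_mode/internal_mutex come from the PR; everything else is illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <boost/thread/mutex.hpp>

// Option 1: make the counter atomic, so a load outside the mutex is defined.
std::atomic<uint32_t> read_mode{0};

// Option 2: keep a plain counter, but only inspect it under internal_mutex.
boost::mutex internal_mutex;
uint32_t read_mode_plain = 0;

void check_reader_limit()
{
  // Option 1: well-defined even without holding any lock.
  if (read_mode.load(std::memory_order_relaxed) == UINT32_MAX) { /* log */ }

  // Option 2: take the mutex first, then read the plain counter.
  boost::mutex::scoped_lock lock(internal_mutex);
  if (read_mode_plain == UINT32_MAX) { /* log */ }
}
```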
Thanks @vtnerd @jeffro256. Fixed in the new push.
You added locks where none were present before; that's why I keep going back to that example. But yes, it also changed to reader locks where a full mutex was present before.
Or at least so I thought - I can't find any now.
Sorry for spamming -
I've written this before, but I'll also add a comment here. There was (or is) an issue where the daemon would fall behind if there's really heavy RPC traffic. I've had my node fall behind 5+ blocks; after some time it would catch up again. The assumption was that it's due to the current lock mechanism. moneromooo suggested a while ago that a read/write lock might help. That's why I suggested @0xFFFC0000 implement this. This daemon-falling-behind issue happens rarely and is difficult to reproduce, so I don't know if this PR improves things, but I just wanted to add context on why this was implemented.
Those 8 methods are different; I have put them here [1]. The design we had was to not use any locks there and put the burden of locking/consistency on the caller. That is understandable, but IMHO we have to use locking there. Overall, that is an extremely tiny part of the PR, maybe 0.5% or less.
This PR might (and might not) directly improve that falling-behind issue. But this is necessary work that we have to take care of eventually. The first step is to introduce a reader-writer lock.
No worries. We are evaluating a serious PR, and these discussions are normal. Yes.
Please let me know, because this would be a serious mistake in the code and completely against my design goals. One of my main goals was to not introduce any logical/semantic changes to the code while changing the underlying locking behaviour.
contrib/epee/include/syncobj.h (outdated):
```cpp
    read_mode(0) {};

  bool start_read() noexcept {
    if (have_write()) {
```
Is this the behavior that we want: that we can't call a read function inside of a write function, even inside the same thread?
Nevermind: this returns whether the lock was just now acquired, not whether the overall acquisition succeeded.
Yes. have_write is for cases where you do have the write-lock and recursively want to acquire a read-lock too.
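A sketch of that rule (shape inferred from the quoted snippet and the return-value discussion above; the bookkeeping is reduced to stubs):

```cpp
// Illustrative stand-in for the PR's class; only the start_read() logic
// follows the described behaviour.
struct rwlock_sketch
{
  bool owns_write = false;                       // does this thread hold the write lock?
  bool have_write() const { return owns_write; }
  void lock_reader() { /* acquire a real shared lock */ }

  bool start_read() noexcept
  {
    if (have_write())   // already inside our own write transaction:
      return false;     // no new lock taken; the caller skips end_read()
    lock_reader();      // otherwise take a real shared lock
    return true;        // the caller pairs this with end_read()
  }
};
```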
contrib/epee/include/syncobj.h (outdated):
```cpp
  do {
    boost::mutex::scoped_lock lock(internal_mutex);
    if (!read_mode && rw_mutex.try_lock()) {
      writer_id = new_owner;
      write_mode = true;
      read_mode = 0;
      return;
    }
    condition.wait(lock);
  } while(true);
}
```
Unfortunately, this locking/waiting pattern may cause more issues than the current code, because we are now much more likely to experience writer starvation. Before, if there were 1000 readers and one writer, the writer was on equal footing with the rest of the read threads. The current code is still not fair, but the writer thread has the same chance of acquiring exclusive access as all the other threads, which are attempting to acquire exclusively as well. Now, however, the write lock will continually thrash and refuse to even attempt the lock if there is even one active reader, indefinitely. As long as new readers keep coming in, the writer is never even guaranteed the chance to ask for the lock. In practice, this may cause the chain to never move forward with enough readers.
Starvation might happen in extremely rare cases. That is why I am insisting this is an intermediary step until most of the APIs inside Blockchain are fixed.
But your analysis is not 100% accurate. In the current locking mechanism, we lock the entire blockchain for every read. Let me explain a hypothetical scenario:
There are multiple read transactions, R1 through R8, and a single write transaction, W1.
In the current locking mechanism:
```
Core1: R1 -> R2 -> R3 -> R4 -> R5 -> R6 -> R7 -> R8 -> W1
Core2..Core8: (idle)
```
But in the reader-writer locking mechanism:
```
Core1: R1 W1
Core2: R2
Core3: R3
Core4: R4
Core5: R5
Core6: R6
Core7: R7
Core8: R8
```
Now, what you are saying is that a stream of read transactions could come in and starve the write. It is a theoretical possibility. Basically, an attacker would have to brute-force you with read transactions to the point that all of your cores are at 100%.
McKenney, Paul E., Silas Boyd-Wickizer, and Jonathan Walpole. "RCU usage in the Linux kernel: One decade later." Technical report (2013). It discusses a similar issue in section 5.3, "Retry Readers".
One other good discussion is: openssl/openssl#20715 (comment)
It is fixed in a new push. The reader_writer lock now has an internal queue; the queue is for keeping track of work.
Let me give you an example of how it works now:
There are multiple read transactions, R1 through R11, and a single write transaction, W1. The order of arrival is: R1, R2, R3, R4, R5, R6, R7, R8, W1, R9, R10, R11.
In the current locking mechanism:
```
Core1: R1 -> R2 -> R3 -> R4 -> R5 -> R6 -> R7 -> R8 -> W1 -> R9 -> R10 -> R11
Core2..Core8: (idle)
```
But in the new reader-writer locking mechanism (R9-R11 arrived after W1, so they wait for it, preserving arrival order):
```
Core1: R1 W1 R9
Core2: R2 __ R10
Core3: R3 __ R11
Core4: R4
Core5: R5
Core6: R6
Core7: R7
Core8: R8
```
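A minimal sketch of the anti-starvation idea described above (not the PR's implementation): a writer-preference rwlock where readers arriving while a writer waits are held back, so a stream of readers cannot starve the writer.

```cpp
#include <condition_variable>
#include <mutex>

class fair_rwlock
{
  std::mutex m;
  std::condition_variable cv;
  unsigned readers = 0;          // currently active readers
  unsigned writers_waiting = 0;  // writers queued for the lock
  bool writing = false;

public:
  void lock_shared()
  {
    std::unique_lock<std::mutex> l(m);
    // queue behind any waiting writer instead of overtaking it
    cv.wait(l, [&]{ return !writing && writers_waiting == 0; });
    ++readers;
  }
  void unlock_shared()
  {
    std::unique_lock<std::mutex> l(m);
    if (--readers == 0) cv.notify_all();  // last reader lets the writer in
  }
  void lock()
  {
    std::unique_lock<std::mutex> l(m);
    ++writers_waiting;                    // from now on, new readers wait
    cv.wait(l, [&]{ return !writing && readers == 0; });
    --writers_waiting;
    writing = true;
  }
  void unlock()
  {
    std::unique_lock<std::mutex> l(m);
    writing = false;
    cv.notify_all();                      // wake queued readers and writers
  }
};
```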
Force-pushed from 494d6be to 88673fb.
Updated the implementation of the lock.
I took a quick look at the code and I don't see any issues (yet). I'll check more thoroughly later. What would be really nice is if the unit test was compiled with thread sanitizer enabled.
Thanks, great suggestion. A few things to consider.
I think it's better to separate it into another executable. But I'm not sure such a mix would work properly. P.S. After a bit more thinking, it's still better to make it a separate binary and run the test with a timeout in CI builds, because it can potentially deadlock.
A separate binary is preferable. Sadly, GTest doesn't provide any kind of timeout, so we have to make it separate. About the CMake hack: I am not sure, since all of those source files are going to be combined into an object file and optimized by the linker, so passing those flags on a per-source-file basis might not work. Worth trying though; I will try it.
GTest doesn't provide a timeout, but you can set the timeout in the .yml files for the CI in https://github.com/monero-project/monero/tree/master/.github/workflows
Yes, I agree with you. A separate binary is preferable here.
Looks like this implementation still suffers from reproducible writer starvation (though I haven't debugged the reason this time). Here's a test case I wrote for it: jeffro256@3ae55c6. I wrote an alternative implementation of the recursive read/write lock in this commit: jeffro256@98cb62a. It only needs 1
Just to double-check: are you sure you are using the new version? The lock I see in your test is the older one.
I think your description is not accurate. I tried your test with the new implementation but was not able to reproduce the starvation. Here is a modified version of your test [1], which includes serialized prints to debug the order of requests and acquires. It perfectly demonstrates that the order is preserved.
I have to debug this lock. As we discussed in our private conversation, the problem is not writing a lock, as that is a very simple task; the problem is supporting every use case of the recursive locking/unlocking we have in the APIs, while not changing much of the code inside Blockchain.
This repo contains performance benchmarks for the two lock versions. Mine is not optimized, and the overall queue inefficiency does show itself in the numbers. Although it is much simpler, this is not an accurate benchmark overall, but it can give us a clue. For testing purposes, I will use @jeffro256's lock in the next push to see if it passes all the tests.
I tried this PR on a live node with heavy traffic and it causes one core to always be at 100%, which made the node almost unusable. I'll wait for @0xFFFC0000 to compare the two implementations and wait for more optimizations.
Force-pushed from 88673fb to 6284c1f.
Changelog: a. As the numbers from [1] show, @jeffro256's implementation of the core lock was more performant than the original queue-based lock. In this push, the lock is completely replaced by the new one.
Force-pushed from 6284c1f to 859fb32.
Changing the title to reflect the PR naming rules we have [1].
This will fix existing locking issues and provide a foundation for implementing more fine-grained locking mechanisms in future work. reader_writer_lock includes a wait_queue to prevent writer starvation. In this design, the order between read(s) and write(s) is preserved. Co-authored-by: Jeffro <[email protected]>
With this PR, the performance improvements from the ReadWriteLock should be obvious.
Force-pushed from 859fb32 to dea2e4b.
I have an intuition that the spinning contributes to the 100% core usage. AFAIK, all active reader threads will be spinning while the writer writes, and the writer thread will be spinning until all readers exit. Even if only for a brief moment, the CPU waste gets multiplied by the thread count.
Which spinlock?
There's a struct acting as an atomic spinlock for multiple reads or one write: monero/src/blockchain_db/lmdb/db_lmdb.h, line 128 (c821478), which gets constructed in monero/src/blockchain_db/lmdb/db_lmdb.cpp, lines 2916-2920 (c821478).
@0xFFFC0000 I'm realizing that changing this too would add a lot more work; I didn't mean to add pressure. Hopefully it can be changed in a different PR.
The writer case should only trigger when the mmap is being resized, so the readers are typically only stalled while an atomic counter is being incremented; the wait should be fairly short. Other projects in this situation will try to acquire the lock ~1000 times, then yield CPU time. This might help in situations where the locking thread is suspended during the atomic increment. I doubt this happens often, but it could help a bit. I wouldn't recommend anything more than a yield in this specific situation, unless I missed something else. And the difference in throughput is unlikely to be much, unless the OS frequently pauses execution during the short lock period.
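A sketch of the bounded-spin-then-yield pattern being described (the ~1000 iteration count and the atomic flag are illustrative, not the LMDB code):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> locked{false};

void spin_lock_with_yield()
{
  for (;;)
  {
    // Spin a bounded number of times first: cheap when the holder is only
    // mid-way through a short critical section (e.g. an atomic increment).
    for (int i = 0; i < 1000; ++i)
      if (!locked.exchange(true, std::memory_order_acquire))
        return;
    // The holder is probably suspended; give the OS a chance to run it.
    std::this_thread::yield();
  }
}

void spin_unlock()
{
  locked.store(false, std::memory_order_release);
}
```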
No pressure at all; thanks for bringing it up, actually. I am investigating it right now, and will talk about it with @vtnerd. I believe he is the reference person for this part of the code.
src/cryptonote_core/blockchain.cpp:
```diff
@@ -4662,7 +4646,7 @@ bool Blockchain::add_new_block(const block& bl, block_verification_context& bvc)
 {
   LOG_PRINT_L3("Blockchain::" << __func__);
-  crypto::hash id = get_block_hash(bl);
+  crypto::hash id = get_block_hash(bl);
```
Please revert whitespace change
Thanks. Will be fixed in the new push.
```cpp
  return handle_block_to_main_chain(bl, id, bvc);
```
Please revert whitespace change
Thanks. Will be fixed in the new push.
src/cryptonote_core/blockchain.cpp:
```diff
@@ -2202,7 +2200,7 @@ bool Blockchain::get_blocks(uint64_t start_offset, size_t count, std::vector<std
 bool Blockchain::handle_get_objects(NOTIFY_REQUEST_GET_OBJECTS::request& arg, NOTIFY_RESPONSE_GET_OBJECTS::request& rsp)
```
If we're going to use RLOCK here, I'd recommend marking this method as const.
That is a great suggestion. At first I was hesitant, but after a bit of recollecting, it might introduce some tiny (but important) optimizations and simplify the code (which we need). I will add the const qualifier to the methods that need it, and run a benchmark that way.
It's not just an optimization. It is to prevent further bugs in the code when someone tries to modify something inside a read-only lock.
@SChernykh exactly. I am adding these const qualifiers to the final version of this PR.
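A sketch of the point being made (hypothetical class and accessor; a std::shared_mutex plays the role of this PR's lock): once the method is const, the compiler rejects accidental writes under a read lock.

```cpp
#include <cstdint>
#include <shared_mutex>

class Blockchain
{
  mutable std::shared_mutex m_lock;  // mutable, so const methods can still lock
  uint64_t m_height = 0;

public:
  uint64_t get_height() const        // const: read-only contract
  {
    std::shared_lock<std::shared_mutex> l(m_lock);  // shared/read lock
    // ++m_height;  // would not compile in a const method: exactly the bug
    //              // class this qualifier lets the compiler catch
    return m_height;
  }
};
```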
Draft: 9173
This is a low-level explanation of the design decisions. For simple use, just use the RWLOCK and RLOCK macros, which take care of all the details.
RecursiveReadWriteLock Design
Since we are recursively calling read/write operations, we needed a new locking mechanism to take care of that. All the design decisions are exactly the same as for canonical ReadWrite locks, with a few extra decisions to fit our other needs. Here are the general ideas about this lock:
- The maximum number of readers is UINT32_MAX.
- Inside a write lock you can start read operation(s) too, meaning a write transaction can contain read transaction(s). But read transactions cannot contain a write unless all reads have released the lock.
- RWLOCK and RLOCK macros are provided which abstract all the details. Generally, you only need to use these macros and should not need to touch the low-level locking API.
- RWLOCK and RLOCK are valid for the enclosing scope.
API Design
As I mentioned, a write transaction can include a read transaction too. For example, this is a low-level example of acquiring and releasing the lock:
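A sketch, assuming the start_read/end_read/start_write/end_write API described above (illustrative, following the rule that start_read returns whether a new lock was actually taken):

```cpp
reader_writer_lock lock;  // the class from contrib/epee/include/syncobj.h

void write_with_nested_read()
{
  lock.start_write();                  // exclusive
  // ... write work ...
  bool acquired = lock.start_read();   // read inside our own write: allowed
  // ... read work ...
  if (acquired)                        // false here, since we already hold the
    lock.end_read();                   // write lock: no extra release needed
  lock.end_write();
}
```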
Or if we use macros to take care of this:
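A sketch with the scope-based macros (macro names from this PR; bodies illustrative):

```cpp
void write_with_nested_read_macros()
{
  RWLOCK(lock);    // write lock, held for the rest of this scope
  // ... write work ...
  {
    RLOCK(lock);   // read inside the write: supported
    // ... read work ...
  }                // nested read released here
}                  // write released here
```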
Another example of recursive reads:
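A sketch (illustrative functions):

```cpp
void read_inner() { RLOCK(lock); /* read */ }
void read_outer() { RLOCK(lock); read_inner(); }  // read inside read: fine
```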
Another example of recursive writes:
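A sketch (illustrative functions):

```cpp
void write_inner() { RWLOCK(lock); /* write */ }
void write_outer() { RWLOCK(lock); write_inner(); }  // the same thread may
                                                     // re-enter its own write lock
```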
Testing
This PR contains a testing suite for the lock. The test runs 4 different cases with 10, 50, 100, and 1000 threads, each with two iterations. In each test, a random number of writers and readers start, and each reader and writer runs a random number of cycles. Each cycle waits, then randomly (~20%) decides whether to recurse.