-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race caused by design of native Neo validators/committee cache #3110
Conversation
And I'd add this PR into 0.102.0 milestone because of the last commit. We don't want this bug to be present in our next release. |
Codecov Report
@@ Coverage Diff @@
## master #3110 +/- ##
==========================================
- Coverage 84.91% 84.87% -0.04%
==========================================
Files 330 330
Lines 44337 44365 +28
==========================================
+ Hits 37648 37657 +9
- Misses 5172 5198 +26
+ Partials 1517 1510 -7
... and 6 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Inspired by #3110. Signed-off-by: Roman Khimov <[email protected]>
5df29aa
to
7db8a5a
Compare
@roman-khimov, I've fixed the latter problem, let's wait until the PR tests will finish. Unfortunately, I can't reproduce these tests locally even with |
OK, now we have race problem with config.Version between different tests runs, so the first fix doesn't work:
|
a908829
to
d27cc55
Compare
9b83c52
to
4184e40
Compare
pkg/consensus/consensus.go
Outdated
@@ -725,7 +725,7 @@ func (s *service) newBlockFromContext(ctx *dbft.Context) block.Block { | |||
var err error | |||
cfg := s.Chain.GetConfig().ProtocolConfiguration | |||
if cfg.ShouldUpdateCommitteeAt(ctx.BlockIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one is definitely not the optimisation as far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here as well.
555e3e8
to
8103179
Compare
0592801
to
e067f73
Compare
OK, @roman-khimov, something is wrong with the last commit that optimizes cache usage. The rest of commits work fine, although I haven't tested them in hybrid network scenario. Could you take a look, please? |
@roman-khimov, the overall solution that is shown in the |
e3fadde has some commented out code that is removed afterwards. But otherwise it seems to be OK. |
c8d3fe1
to
b2a69d8
Compare
OK, I've rechecked this code and the logic seems to be correct to me. Hope, that there's no bugs in the updated logic. |
b2a69d8
to
beba0f0
Compare
Blockchain passes his own pure unwrapped DAO to (*Blockchain).ComputeNextBlockValidators which means that native RW NEO cache structure stored inside this DAO can be modified by anyone who uses exported ComputeNextBlockValidators Blockchain API, and technically it's valid, and we should allow this, because it's the only purpose of `validators` caching. However, at the same time some RPC server is allowed to request a subsequent wrapped DAO for some test invocation. It means that descendant wrapped DAO eventually will request RW NEO cache and try to `Copy()` the underlying's DAO cache which is in direct use of ComputeNextBlockValidators. Here's the race: ComputeNextBlockValidators called by Consensus service tries to update cached `validators` value, and descendant wrapped DAO created by the RPC server tries to copy DAO's native cache and read the cached `validators` value. So the problem is that native cache not designated to handle concurrent access between parent DAO layer and derived (wrapped) DAO layer. I've carefully reviewed all the usages of native cache, and turns out that the described situation is the only place where parent DAO is used directly to modify its cache concurrently with some descendant DAO that is trying to access the cache. All other usages of native cache (not only NEO, but also all other native contrcts) strictly rely on the hierarchical DAO structure and don't try to perform these concurrent operations between DAO layers. There's also persist operation, but it keeps cache RW lock taken, so it doesn't have this problem as far. Thus, in this commit we rework NEO's `validators` cache value so that it always contain the relevant list for upper Blockchain's DAO and is updated every PostPersist (if needed). Note: we must be very careful extending our native cache in the future, every usage of native cache must be checked against the described problem. Close #2989. Signed-off-by: Anna Shaleva <[email protected]>
We have two similar blockchain APIs: GetNextBlockValidators and GetValidators. It's hard to distinguish them, thus renaming it to match the meaning, so what we have now is: GetNextBlockValidators literally just returns the top of the committee that was elected in the start of batch of CommitteeSize blocks batch. It doesn't change its valie every block. ComputeNextBlockValidators literally computes the list of validators based on the most fresh committee members information got from the NeoToken's storage and based on the latest register/unregister/vote events. The list returned by this method may be updated every block. Signed-off-by: Anna Shaleva <[email protected]>
Recalculate them once per epoch. Consensus is aware of it and must call CalculateNextValidators exactly when needed. Signed-off-by: Anna Shaleva <[email protected]>
…ors` Adjust all comments, make the field name match its meaning. Signed-off-by: Anna Shaleva <[email protected]>
No funcional changes, just refactoring. It doesn't need the whole cache, only the set of committee keys with votes. Signed-off-by: Anna Shaleva <[email protected]>
Do not recalculate new committee/validators value in the start of every subsequent epoch. Use values that was calculated in the PostPersist method of the previously processed block in the end of the previous epoch. Signed-off-by: Anna Shaleva <[email protected]>
If it's the end of epoch, then it contains the updated validators list recalculated during the last block's PostPersist. If it's middle of the epoch, then it contains previously calculated value (value for the previous completed epoch) that is equal to the current nextValidators cache value. Signed-off-by: Anna Shaleva <[email protected]>
beba0f0
to
d964420
Compare
Tests are checked, |
Refactored native NeoToken cache scheme introduced in #3110 sometimes requires validators list recalculation during native cache initialization process (when initializing with the existing storage from the block that preceeds each N-th block). To recalculate validators from candidates, native NeoToken needs an access to cached native Policy blocked accounts. By the moment of native Neo initialization, the cache of native Policy is not yet initialized, thus we need a direct DAO access for Policy to handle blocked account check. Close #3181. Signed-off-by: Anna Shaleva <[email protected]>
Refactored native NeoToken cache scheme introduced in #3110 sometimes requires validators list recalculation during native cache initialization process (when initializing with the existing storage from the block that is preceded each N-th block). To recalculate validators from candidates, native NeoToken needs an access to cached native Policy blocked accounts. By the moment of native Neo initialization, the cache of native Policy is not yet initialized, thus we need a direct DAO access for Policy to handle blocked account check. Close #3181. Signed-off-by: Anna Shaleva <[email protected]>
Fix validators/committee cache of native Neo contract. See the bug description in the commit messages.
Close #2989 along the way.