-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Remove potential P2P deadlock #1056
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #1056 +/- ##
===========================================
+ Coverage 68.10% 68.19% +0.08%
===========================================
Files 172 173 +1
Lines 16246 16330 +84
===========================================
+ Hits 11065 11136 +71
- Misses 4249 4262 +13
Partials 932 932
|
net/dag.go
Outdated
job.session.Done() | ||
continue | ||
} | ||
go func(j *dagJob) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: All the slow operations (I/O stuff on the txn) appear to be single threaded (due to the mutex on txn), do you see any benefit in doing this in a go routine, or can we simplify things a bit and call handleChildBlocks
synchronously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be called synchronously, I tried it while working on this PR, but I reverted because it was out of scope in my opinion. One benefit of having it as a go routine is that the DAG sync can happening concurrently (at least the networking side of it). I like the idea of simplifying things though so we can certainly open up the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that is useful here is that using a go routine at this stage allows the worker to be released and useable by children within handleChildBlocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those children share the same transaction though no? So releasing the worker for children quite possibly has no performance benefit at all, or even a negative one once stuff like thread context switching comes into play?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DB interactions are quite fast compared to the network calls. So allowing the network calls to happen concurrently might actually be a performance benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the two can be dissociated. I would be interested in your suggestion if you have one though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
for each doc/log-item/whatever {
doLocalStuffSync()
go doRemoteStuffAsync()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because local stuff needs to happen with the result fo the remote stuff both before and after. What we have is this:
for each doc/log-item/whatever {
go doStuffAsync()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for each doc/log-item/whatever {
doLocalStuffSync1()
chan1 := make{...}
go doRemoteStuffAsync1() {
chan1 <- result1
}()
r1 <- chan
doLocalStuffSync2()
chan2 := make{...}
go doRemoteStuffAsync2() {
chan2 <- result2
}()
r2 <- chan2
}
(syntax being very approximate lol) the pattern is quite common in .Net and Rust, I'd be surprised if it was new to Go? It allows any async stuff to be much more limited in scope and can remove a lot of wide-scoped (and hard to debug) synchronization stuff (like hidden mutexes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the pattern in common in Go as well. I'm not sure if we will be gaining much for this particular case but I'm willing to look into in at some point in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I follow this, but I would like some more documentation before I can be sure enough to approve :) It is a complicated part of the codebase.
net/dag.go
Outdated
@@ -57,6 +58,8 @@ type dagJob struct { | |||
dockey core.DataStoreKey // dockey of our document | |||
fieldName string // field of the subgraph our node belongs to | |||
|
|||
txn datastore.Txn // transaction common to a pushlog event |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: I think it would be good to document why we want a txn common to a pushlog event here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the dagWorker
struct is the right place to document that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a bit more information. Hopefully it makes it more clear.
datastore/badger/v3/datastore.go
Outdated
@@ -57,6 +57,8 @@ type txn struct { | |||
// Whether this transaction has been implicitly created as a result of a direct Datastore | |||
// method invocation. | |||
implicit bool | |||
|
|||
mu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: Given that the badger devs didnt feel the need for this mutex I think it would be good to document why we feel the need to add it in here. Even within the context of this PR it was not obvious to me (although I was groggy having just woken up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Let me know if it satisfactory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - I do understand it now. I do think this is the wrong solution though.
- This datastore is a general solution, it is used everywhere throughout Defra, and potentially outside of Defra.
- Database transactions are generally in my experience not designed to be built up concurrently, and I'm not sure if I have ever come across such an option. I assume this is part of why the badger one doesnt bother with it.
- Wrapping every function in a shared mutex does not make it concurrent. It IMO represents a hidden bottleneck where stuff looks concurrent, but in reality multiple threads are being forced into a single.
I really have to suggest if the P2P system needs to operate on the same transaction, then it is the P2P code that needs to handle any thread synchronisation - not a widely scoped datastore implementation used by many other systems.
It should also be made really really obvious that this bottleneck exists. Such a thing can be far more damaging to performance than any mess one can write in a single-threaded system, and much harder to spot and correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Database transactions are generally in my experience not designed to be built up concurrently, and I'm not sure if I have ever come across such an option. I assume this is part of why the badger one doesnt bother with it.
I'm not changing the datastore itself. Just the wrapper which can add any functionality we find useful. Adding a mutex there has very minor implications and in most cases can be completely ignored.
Wrapping every function in a shared mutex does not make it concurrent. It IMO represents a hidden bottleneck where stuff looks concurrent, but in reality multiple threads are being forced into a single.
It doesn't make it concurrent. It add concurrency safety. Go routines move freely until the point where they might hit a DB interaction at the same time as another routine. So it's not forcing anything into a single thread. The is a bottleneck point though.
I really have to suggest if the P2P system needs to operate on the same transaction, then it is the P2P code that needs to handle any thread synchronisation - not a widely scoped datastore implementation used by many other systems.
I would very much avoid that approach. The only bottleneck is at the datastore level. The datastore wrapper is a pretty good place to handle that.
It should also be made really really obvious that this bottleneck exists. Such a thing can be far more damaging to performance than any mess one can write in a single-threaded system, and much harder to spot and correct.
The bottleneck is explained in the sendJobWorker function. Not sure I see the possible increased damage to performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the wrapper which can add any functionality we find useful
The wrapper that is used across Defra, where the P2P issue has no impact.
Adding a mutex there has very minor implications and in most cases can be completely ignored.
It really isn't a minor implication - it bottles X threads into a one at a time model. It can have an absolutely massive impact on performance.
So it's not forcing anything into a single thread.
It blocks all but one thread/routine from executing at any given time (for the transaction). Threads that prior to this change could potentially be running concurrently (depending on where and how they may clash in the badger code ofc).
The only bottleneck is at the datastore level.
When used in the P2P code. And that is a pretty large location, and one of the most common places to want to make async. And despite the fact that this is a P2P only issue at the moment, you have introduced that potential bottleneck to anywhere that uses the badger datastore (i.e. most of Defra). If you want P2P specific wrapper then it is fine (but needs to be really obvious in the bottleneck it represents), but there seems no reason to introduce it to all of Defra.
The bottleneck is explained in the sendJobWorker function.
Again, the code in the file containing the bottleneck is used in far more places than just the sendJobWorker function. I dont want to have to look at sendJobWorker when debugging weird performance issues in the version fetcher for example.
Not sure I see the possible increased damage to performance.
I'm repeating myself a little bit here, but the current code blocks threads from concurrently executing calls to the same transaction. These calls (particularly reads), could quite possibly succeed in develop - X threads that could be quite happily executing concurrently are now forced to run one at a time. That is quite a significant performance loss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this, but prior to this PR any other sub-system wishing to do so was free to make the most optimal use of concurrent ops within a badger txn, so far as the badger txn could handle.
This simply isn't true. Any use of concurrent transactions prior to this PR would've caused the race detector to flag it, and never would be able to merge. No sub-system could've done this.
so far as the badger txn could handle
Not sure how many ways we can explain this - but Badger Transactions can't handle this. Again, they'll trigger the race detector.
With the mutex in, if another sub-system wishes to operate concurrently on a single txn it can't, regardless of how safe badger can otherwise handle that.
See above :)
I think this is the fundamentally misunderstanding at the moment.
badgers existing concurrent capabilities
None :) See above :)
We are introducing a very significant limitation into the code in a shared location for the sake of a single sub-system.
We are introducing no limitation, and infact making what your describing actually possible without triggering the race detector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, but it doesn't even need to be that complex depending on where it needs to be called - could just curry it:
Not sure how this would make things simpler tbh, actually feels more complex lol. Especially since fundamentally its doing the same thing - lock, do action, unlock.
But now this curry function call needs to be sprinked everywhere the specific transaction is used that needs concurrent safety....also generics ;) Especially since the transaction gets passed to subsystems outside our repo (ie: ipfs Blockstore) - which we wouldn't be able to sprinkle in this func.
My suggested wrapper is exceedingly simple to understand / implement, is called a single time when the txn is created, and no downstream use of the txn needs to ever be aware of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any use of concurrent transactions prior to this PR would've caused the race detector to flag it, and never would be able to merge.
🤣 not at all? Seems odd for the badger code to be like that, but I trust you guys ran in to this lol 😁
If that's the case, I have no reason to care about the mutex, other than the fact that it only solves it for badger and we'd probably be better off wrapping all datastore implementations (for at least P2P - I also think I like that it makes it clear that P2P requires this, instead of blanket applying it everywhere) as you suggest a few comments ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤣 not at all? Seems odd for the badger code to be like that, but I trust you guys ran in to this lol 😁
From the Badger code doc
// Running transactions concurrently is OK. However, a transaction itself isn't thread safe, and
// should only be run serially. It doesn't matter if a transaction is created by one goroutine and
// passed down to other, as long as the Txn APIs are called serially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is just a thin wrapper around the existing txn, implementing the same API, but with a shared mutex. Which would scope things only to when needed.
Done. Let me know what you think.
595f3b6
to
36bdd52
Compare
p.jobQueue <- j | ||
|
||
case newJob := <-p.sendJobs: | ||
jobs, ok := docWorkerQueue[newJob.dockey.DocKey] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: Having a separate thread-pool per document seems quite odd, and seems unlikely to have a positive impact on performance (even if the hard-coding size = 5 is made configurable or dynamic). It seems improbable that the host machine will have so many cores idling (5 per document?).
I really wish to emphasise that, especially with the single threaded transaction shared across them, that it looks very likely to me that this (and the inner routines) will have a negative impact on the performance of the P2P system (context switching/overheads/etc), and will quite probably leak it's performance problems into the rest of Defra (by consuming far too many resources and starving Defra-core of cores/memory).
I think the maintenance cost is also very high here, and the gains extremely dubious. And I would suggest we simplify this quite a bit.
This comment is marked as a thought, as I am unsure if it is in scope (this PR does force a lot into a single thread, which makes the above even worse).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate thread-pool per document is a great way to increase throughput as transactions are scoped per document. As to the number of workers, we can have less per document if you're worried about resources.
Can you elaborate more on the maintenance cost you are worried about? Also, what is extremely dubious about the gains?
It's already quite simple. What would you like to simplify further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate thread-pool per document is a great way to increase throughput as transactions are scoped per document.
Only if there are free cores. Otherwise the routines will share cores - they will interleave around each other with a core only able to execute a single execution at a time (a simplification, but you get the point hopefully) - this interleaving is really quite expensive and can very significantly degrade system performance (I was surprised how much when someone first showed me). Plus the relatively minor overhead of the routines and mem copying etc.
Can you elaborate more on the maintenance cost you are worried about?
I find it far easier to read linear code executing in a single thread, than concurrent code sprinkled with a bunch of synchronisation stuff shared between various contexts.
It is both far more performant and easier to read to break things up at sensible level, where stuff can be run as synchronously as possible whilst still making optimal use of machine resources, than to write interleaving code full of shared synchronisation stuff that optimises consumption of machine resources that forgets that the host machine has finite set of cores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your concerns. I'm willing to look at reducing the number of go routine started within the P2P system but I'd rather not do it in this PR. I'll open a new issue and get to it once we've covered v0.5 priorities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised how much when someone first showed me
I'd love some context here - what language / concurrency primitives were being used.
There are a few fundemantal things that Go tries to do correctly, and lightweight concurrency is exceedingly one of them.
Couple of reasons why concurrency in the p2p system is very favourable.
A significant portion of CPU time in the p2p code is IO bound (waiting on network requests), of which the scheduler is very good at recognizing when one goroutine is blocked and immediately swithing to a non-blocked one.
Each "job" here in the p2p code is reponsible for fetching a block (locally or more likely from the network), parsing the block, and finding the dependant children in the DAG, of which each get their own jobs, and the process continues.
The reason why its scoped to a per document worker pool, is that its possible for all the workers of a single doc to block waiting on the lock for the shared transaction.
So having a pool of workers isolated per document means that even if one document is blocked (all workers) the other incoming updates from other documents can continue.
Moreover, it won't make a significant change in the number of running goroutines as there is likely several dozen floating around, some from Defra code and some from library code (especially the libp2p stuff has a bunch of goroutines internal). Go's design is such that until it becomes meaninfully necessary (via explicit measurments), goroutines are pretty damn safe to use many of.
We fundementally want the design to be well designed in its concurrency, regardless of the number of underlying threads/hypthreads/vCPUs/cores/etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...its somewhat moot.
:) I do disagree with that sentiment - it can always get worse lol
If there is a deeper debate, lets focus on that (linear code being easier to read/follow etc).
Sounds sensible, and we don't have to do it here if people don't want to continue this conversation in this PR. It is also the more important aspect IMO at the moment - the current setup is in my mind a premature optimization that complicates the code without having measured any perceived gains from doing so - I think there are decent enough odds that it makes things slower (and worse, impacts other sub-systems).
I also want to make it clear that I consider the mutex in the badger txn to be a separate issue, and one that does need to be addressed within this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested the net code integration tests, theres hundreds if not thousand+ goroutines running at various points in time (note: theres also 2-3 "nodes" running in the same process)
Lol - I didnt say it would stop/crash 😁 Just be slower. Our integration tests also deal with a tiny amount of data (even when all added up I would expect at most a thousand or two docs total).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldve added: That measurement was on develop
not this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds sensible, and we don't have to do it here if people don't want to continue this conversation in this PR
Moved to discord 👍
https://discord.com/channels/427944769851752448/1068278072912126092/1068278075252551730
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in cased it was missed, the ongoing convo regarding the mutex is in another thread on this PR:
#1056 (comment)
Added some thoughts/suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, a few minor todos, and questions/thoughts:
The addition of the mutex on the datastore txn means that everywhere we use it even outside P2P net stuff will now be forced to be thread safe (potential performance issue in places we don't that?).
Have you thought about perhaps making thread safety optional? so only the use cases that need it can utilize it and rest work like before?
I'm not sure why you and Andy thinks this will affect performance and operations elsewhere. Maybe I looked at this for too long and I'm missing something obvious. But as far as I can tell, everything else works the same way it did before regardless of the added mutex. Mutex are involved on a per transaction basis and the only time it really does anything is when there is a need for concurrency safety such as the P2P situation. The datastore as a whole is not affected by that mutex. Only the isolated transaction is, if db operations are done concurrently. |
e975f91
to
c6fe397
Compare
Moving our discord DM conversation here (for indexability): Fred:
13ns is basically irrelevant This answered the first question I had, the locking and unlocking itself isn't that costly (13ns), and it only makes a difference for when there is need for concurrency safety, which is only in P2P stuff right now. This is a good transition into my second discussion point: Shahzad:
I would suggest to make this explicit to indicate that P2P usage requires this lock safety, but others don't. This way we can also avoid the 13ns cost we saw earlier for non-P2P stuff (or stuff that doesn't need the locking mechanic). Could achieve this in several ways, the cleanest might be to have the transaction logic be the same as before, but instead, wrap the transaction into another struct that contains the lock safety mechanic (and introduce the wrapper functions to handle the locking/unlocking logic). This way implementations (like P2P) that need to use a concurrent transaction can make use of that structure and all other implementations can use the existing stuff (i.e. make concurrent transactions optional and more explicit.) EDIT: Just saw @jsimnz has already suggested something similar in #1056 (comment):
|
2d0053a
to
e31d512
Compare
@@ -31,6 +31,7 @@ type DB interface { | |||
Blockstore() blockstore.Blockstore | |||
|
|||
NewTxn(context.Context, bool) (datastore.Txn, error) | |||
NewConcurrentTxn(context.Context, bool) (datastore.Txn, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought/praise: In the standup this sounded worse than it looks - I think it fits very nicely in here, and makes it clear to users of this interface that we provide no guarantees around the concurrency-safety of the existing NewTxn
. As users are free to construct and work with txns outside of the defra codebase is is quite a nice addition IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Fred! Couple of minor points on the txn type, but looks good - thanks for taking the time :)
datastore/concurrent_txn.go
Outdated
type concurrentTxn struct { | ||
ds.Txn | ||
|
||
// A given Badger transaction isn't made to be concurrently operated on. It is therefore not thread safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I am not sure badger should be mentioned in this fashion here - this is general purpose and may be needed by many other store implementations. Badger could be listed as an example, but not the explicit target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There from copy pasting. Will remove.
} | ||
|
||
// NewConcurrentTxnFrom creates a new Txn from rootstore that supports concurrent API calls | ||
func NewConcurrentTxnFrom(ctx context.Context, rootstore ds.TxnDatastore, readonly bool) (Txn, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: In the future this function could be modified to return the underlying 'normal' txn if the rootstore is threadsafe, e.g. :
switch rootstore.(type) {
case pebble:
return rootstore.NewTxn()
default:
// wrapping stuff that this PR/func does now
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a way to know, it could be done yes.
var err error | ||
|
||
// check if our datastore natively supports iterable transaction, transactions or batching | ||
if iterableTxnStore, ok := rootstore.(iterable.IterableTxnDatastore); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: This is a bit painful (sorry :)), hopefully we can remove the need for this func to know this at somepoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you say it's painful. I'd be interested in knowing though so we can plan for a change in the future. The NewTxnFrom
function used this exact pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterable vs non-iterable is irrelevant noise here, we should be able to just call a NewTransaction
func and it should return a txn (iterable if it can)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iterable if it can
Isn't that exactly what's happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes here and every other location that wants a new transaction. but instead of calling code having to know/distinguish between the two this could be wrapped up in a func that hides this noise and removes the duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 locations where this happens. So basically what you are saying is not that it's irrelevant noise (because it's not) but that maybe we could have a function dedicated to that task instead of duplicating the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it's not
How so? This func just needs a rootTxn, it doesnt do anything differently with it depending on the type. (also, reminder - I'm absolutely not asking you to change this now)
rootConcurentTxn := &concurrentTxn{Txn: rootTxn} | ||
root := AsDSReaderWriter(rootConcurentTxn) | ||
multistore := MultiStoreFrom(root) | ||
return &txn{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: This looks a bit odd, why are you not returning a concurrentTxn
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering the same thing. Guessing some interface embedding pattern or need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because txn
is not what is being operated on. The transaction API functions are called from the multistore entities. The rootTxn
is the one being operated on. concurrentTxn
is just a wrapper on the rootTxn
that enables the concurrency safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing the suggested change. Would be nice to see answers to @AndrewSisley 's comments as i have some of the same ones.
This changes makes the purpose of the function clearer. This function doesn’t push a log. It handles the push log.
e31d512
to
4e60ca2
Compare
Relevant issue(s) Resolves #1028 Resolves #470 Resolves #1053 Description The main purpose of this PR is to resolve the potential deadlock. The deadlock can happen if a PushLog cycle doesn't receive the whole missing block history before an error occurs. This leaves the DAG incomplete but there is no clear way, at the moment, for Defra to be aware of that and for it to try to fill in the gap at some point. To solve the deadlock situation, John and I had discussed implementing a transaction that would cover the whole PushLog cycle. This means that any error occurring during the cycle will discard the transaction and thus leave the DAG unaffected. As side effects, we add thread safety to the badger transactions and we manage DAG workers on a per PushLog cycle basis.
Relevant issue(s) Resolves sourcenetwork#1028 Resolves sourcenetwork#470 Resolves sourcenetwork#1053 Description The main purpose of this PR is to resolve the potential deadlock. The deadlock can happen if a PushLog cycle doesn't receive the whole missing block history before an error occurs. This leaves the DAG incomplete but there is no clear way, at the moment, for Defra to be aware of that and for it to try to fill in the gap at some point. To solve the deadlock situation, John and I had discussed implementing a transaction that would cover the whole PushLog cycle. This means that any error occurring during the cycle will discard the transaction and thus leave the DAG unaffected. As side effects, we add thread safety to the badger transactions and we manage DAG workers on a per PushLog cycle basis.
Relevant issue(s)
Resolves #1028
Resolves #470
Resolves #1053
Description
The main purpose of this PR is to resolve the potential deadlock. The deadlock can happen if a
PushLog
cycle doesn't receive the whole missing block history before an error occurs. This leaves the DAG incomplete but there is no clear way, at the moment, for Defra to be aware of that and for it to try to fill in the gap at some point.To solve the deadlock situation, John and I had discussed implementing a transaction that would cover the whole
PushLog
cycle. This means that any error occurring during the cycle will discard the transaction and thus leave the DAG unaffected.As side effects, we add thread safety to the badger transactions and we manage DAG workers on a per
PushLog
cycle basis.Tasks
How has this been tested?
integration tests
Specify the platform(s) on which this was tested: