chainntnfs: handle historical confs and spends asynchronously #1628
Conversation
Looks to have PR-specific failures on the itests rn.

}
}

return fmt.Errorf("spending transaction not found within block range "+
Doesn't seem like this should be printed as an error? Can potentially make this a concrete error type we can filter out.
Fixed.
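For reference, a minimal sketch of what such a concrete error could look like (the type name and fields here are illustrative, not the ones ultimately used in the PR), so callers can filter it with a type assertion instead of matching on the formatted string:

```go
package chainntnfs

import "fmt"

// ErrTxNotFoundInRange is a hypothetical concrete error type for the
// "spending transaction not found" case above.
type ErrTxNotFoundInRange struct {
	StartHeight, EndHeight uint32
}

func (e ErrTxNotFoundInRange) Error() string {
	return fmt.Sprintf("spending transaction not found within block range "+
		"%d-%d", e.StartHeight, e.EndHeight)
}
```

A caller could then check `if _, ok := err.(ErrTxNotFoundInRange); ok { ... }` and log it at a lower level rather than as an error.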
@@ -867,69 +867,14 @@ func testSpendBeforeNtfnRegistration(miner *rpctest.Harness,
// concrete implementations.
//
// To do so, we first create a new output to our test target address.
txid, err := getTestTxId(miner)
👍
msg.heightHint)
// Look up whether the transaction is already
// included in the active chain.
confDetails, err := n.historicalConfDetails(
why not async?
Fixed.
// the confirmation details must be provided with the UpdateConfDetails method,
// otherwise we will wait for the transaction to confirm even though it already
// has.
func (tcn *TxConfNotifier) Register(ntfn *ConfNtfn) error {
Commit message says "registrations for unconfirmed transactions only", but seems like this will also be called for already confirmed transactions?
Clarified things a bit, let me know what you think.
@@ -245,24 +245,37 @@ out:
_, currentHeight, err := b.chainConn.GetBestBlock()
seems like this could be moved inside the go routine
chainntnfs/txconfnotifier.go
Outdated
// indicating that the transaction has already been
// confirmed.
select {
case ntfn.Event.Updates <- 0:
Can we risk the channel having a buffer one element too small in this case? (since it is created with numConfs capacity)
It's safe to do so as we don't send an update indicating the required number of confirmations. Now that I think about this though, it should do that in place of sending an update indicating there aren't any confirmations left, as that's already handled by the Confirmed channel. Will address it in a follow-up PR.
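To make the buffering argument concrete, here is a small, self-contained walk-through (illustrative only, not lnd code): once the initial "numConfs confirmations required" update is never sent, the only values pushed are numConfs-1 down to 0, i.e. exactly numConfs sends, so a buffer of numConfs never overflows.

```go
package main

import "fmt"

func main() {
	// With numConfs = 3, the Updates channel has capacity 3. The only
	// values ever pushed are 2, 1 and 0 -- exactly three sends -- so the
	// non-blocking sends below never hit the default case.
	numConfs := uint32(3)
	updates := make(chan uint32, numConfs)

	for confsLeft := numConfs - 1; ; confsLeft-- {
		select {
		case updates <- confsLeft:
		default:
			fmt.Println("buffer full, update dropped")
		}
		if confsLeft == 0 {
			break
		}
	}

	close(updates)
	for u := range updates {
		fmt.Println("confirmations left:", u)
	}
}
```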
chainntnfs/txconfnotifier.go
Outdated
if details.BlockHeight <= tcn.currentHeight {
numConfsLeft := confHeight - tcn.currentHeight
select {
case ntfn.Event.Updates <- numConfsLeft:
Seems like there might be an issue here if multiple callers register for confirmation of the same txid. Each time UpdateConfDetails is called as a result of this, all clients will be sent updates? This will lead to each client receiving the updates multiple times, and the channel might even get full.
From what I can gather, we shouldn't need to iterate the ntfn clients in this method, but pass in the one client that requested the historical dispatch, and only notify that one.
Yeah, I spent quite some time debugging this. Ended up adding an id so that we can uniquely track each notification. This is something that will be needed anyway once we support canceling confirmation notifications.
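Roughly, the shape of that change (a sketch only; the field and map names approximate the approach rather than quote the final code):

```go
package chainntnfs

import "github.com/btcsuite/btcd/chaincfg/chainhash"

// ConfNtfn gains a client-unique ID so a historical rescan can update exactly
// the registration that requested it, rather than fanning updates out to
// every client watching the same txid.
type ConfNtfn struct {
	// ConfID uniquely identifies this registration (name assumed; the
	// initial diff below uses ID).
	ConfID uint64

	TxID             *chainhash.Hash
	NumConfirmations uint32
	Event            *ConfirmationEvent

	// details is filled in either by ConnectTip or by UpdateConfDetails
	// once a historical rescan completes.
	details *TxConfirmation
}

// Registrations can then be keyed by txid first and ConfID second, e.g.
//
//	confNotifications map[chainhash.Hash]map[uint64]*ConfNtfn
//
// which lets UpdateConfDetails address a single client.
```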
chainntnfs/btcdnotify/btcd.go
Outdated
@@ -800,6 +801,7 @@ func (b *BtcdNotifier) RegisterConfirmationsNtfn(txid *chainhash.Hash,

ntfn := &confirmationNotification{
ConfNtfn: chainntnfs.ConfNtfn{
ID: atomic.AddUint64(&b.confClientCounter, 1),
confID for consistency
Fixed.
chainntnfs/txconfnotifier.go
Outdated
return nil
}

// UpdateConfDetails attempts to mark an existing unconfirmed transaction as
I can't make this godoc match what the method is actually doing.
Forgot to update it. Fixed.
// included in the active chain. We'll do this
// in a goroutine to prevent blocking
// potentially long rescans.
go func() {
maybe introduce a waitgroup to wait for during shutdowns?
Not sure if it's really needed here 🤔
Yeah I don't think it's needed here. Otherwise a very distant rescan can cause the shutdown process to be held up.
we could thread a quit chan down to the rescan if we want. thinking it might be a good idea actually, to prevent the rescan from causing an accidental panic after the chain conns are deinitialized
I think it's a nice pattern to follow; test failures also tend to be harder to trace if there are stray goroutines remaining.
Fixed.
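A minimal, self-contained illustration of the pattern being settled on here — track the rescan goroutine with a WaitGroup and give it a quit channel — with illustrative names rather than lnd's actual fields:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// notifier is a stand-in for the chain notifier: wg lets Stop wait for the
// historical-rescan goroutine, and quit lets the goroutine bail out before
// touching torn-down chain connections.
type notifier struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

func (n *notifier) dispatchHistorical() {
	n.wg.Add(1)
	go func() {
		defer n.wg.Done()

		// Stand-in for a potentially long rescan.
		select {
		case <-time.After(50 * time.Millisecond):
		case <-n.quit:
			// Shutdown requested; exit without touching the
			// chain connection any further.
			return
		}
		fmt.Println("rescan finished, delivering conf details")
	}()
}

func (n *notifier) stop() {
	close(n.quit)
	n.wg.Wait() // no stray goroutines left behind
}

func main() {
	n := &notifier{quit: make(chan struct{})}
	n.dispatchHistorical()
	n.stop()
}
```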
}
if confDetails != nil {
err := b.txConfNotifier.UpdateConfDetails(
*msg.TxID, msg.ID,
just pass the msg.ConfNtfn itself.
I think it's better to keep it as is to make sure the notification has been registered with the notifier first.
if we move the Register call right before, that would make those calls more tightly coupled, and I think even the whole ConfID can be removed?
Sure, but then it's expected for the caller to know this, rather than just providing an interface that ensures the notification has been registered first. Having an ID will also come in handy when adding support for cancelling existing notifications (there's an existing TODO for this).
rescanUpdate := []neutrino.UpdateOption{
neutrino.AddTxIDs(*msg.TxID),
neutrino.Rewind(currentHeight),
}
if err := n.chainView.Update(rescanUpdate...); err != nil {
chainntnfs.Log.Errorf("unable to update rescan: %v", err)
err = n.chainView.Update(rescanUpdate...)
just a reminder to make sure we cannot potentially miss a confirmation if a block comes in while this goroutine is running.
What do you mean by this @halseth ?
We are passing in currentHeight, which might be outdated at this point (because of a potentially long-running historicalConfDetails). I think this should be okay, but just meant that we should check that it indeed is.
Good point, it's likely we'd rescan all of the manual dispatch done before. Since current height is behind a mutex, should be safe to read again?
currentHeight is a copy so it should remain unchanged even if the actual height changes.
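For illustration, a self-contained sketch of the trade-off being weighed (hypothetical heightMtx/bestHeight fields, not lnd's actual layout): a height captured by value before the goroutine starts never reflects blocks that arrive during a long rescan, whereas re-reading the mutex-guarded height inside the goroutine does:

```go
package main

import (
	"fmt"
	"sync"
)

type notifier struct {
	heightMtx  sync.Mutex
	bestHeight uint32
}

func (n *notifier) currentBestHeight() uint32 {
	n.heightMtx.Lock()
	defer n.heightMtx.Unlock()
	return n.bestHeight
}

func main() {
	n := &notifier{bestHeight: 104}

	captured := n.currentBestHeight() // copy taken before the rescan

	// A block comes in while the historical rescan is running.
	n.heightMtx.Lock()
	n.bestHeight = 105
	n.heightMtx.Unlock()

	fmt.Println("captured copy:", captured)                // still 104
	fmt.Println("re-read height:", n.currentBestHeight()) // 105
}
```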
chainntnfs/btcdnotify/btcd.go
Outdated
// included in the active chain. We'll do this
// in a goroutine to prevent blocking
// potentially long rescans.
go func() {
I think there might be a potential race here:
Imagine tx A is confirmed in block 101, and our currentHeight is 104. We want a notification on 5 confirmations.
[100] <- [101: A] <- [102] <- [103] <- [104]
We start a rescan from heightHint = 100 to 104. While we do this, block 105 comes in. In ConnectTip we check ntfn.details == nil and skip it. The rescan finishes, ntfn.details is set, and the notification for block 105 is never sent.
I'm still thinking about how to best solve this without essentially ending back up at our sync rescan behaviour. One potential approach would be to always do a rescan from heightHint to bestHeight without registering the notification client first (that way new blocks can come in without altering our client), and then keep doing this until you know for sure that you have scanned all the way to the current best height (can be enforced with a mutex). Then you register the client (within the same scope of the mutex), making sure no block can come in in the meantime.
I was probably wrong, looks like the tcn.Lock acquired in UpdateConfDetails makes sure that no new block gets processed before we are done with the notifying, so we should be good 👍
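For completeness, a compact sketch (not a verbatim excerpt of txconfnotifier.go) of why that locking rules out the race: UpdateConfDetails and ConnectTip serialize on the same mutex, so no block can be connected while rescan results are being applied, and any block connected afterwards sees the populated details.

```go
package txconfsketch

import "sync"

// TxConfNotifier here is a pared-down stand-in for chainntnfs.TxConfNotifier,
// just enough to show the locking relationship described above.
type TxConfNotifier struct {
	sync.Mutex
	currentHeight uint32
}

// UpdateConfDetails applies the result of a historical rescan under the
// notifier's mutex.
func (tcn *TxConfNotifier) UpdateConfDetails(confHeight uint32) {
	tcn.Lock()
	defer tcn.Unlock()
	// Record the details; if confHeight <= tcn.currentHeight the
	// notification can be dispatched right away.
}

// ConnectTip processes a new block under the same mutex, so it runs either
// strictly before or strictly after UpdateConfDetails.
func (tcn *TxConfNotifier) ConnectTip(height uint32) {
	tcn.Lock()
	defer tcn.Unlock()
	tcn.currentHeight = height
	// Any registration whose details were set by UpdateConfDetails before
	// this point is seen here and dispatched as usual.
}
```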
TxID: txid,
NumConfirmations: numConfs,
Event: chainntnfs.NewConfirmationEvent(numConfs),
},
heightHint: heightHint,
}

if err := n.txConfNotifier.Register(&ntfn.ConfNtfn); err != nil {
Can we just move this call to where we are handling the confirmationsNotification, before we start the historical dispatch?
We can, but I figured it made more sense having it here like we do with NotifySpent in RegisterSpendNtfn.
For readability I think it would make sense to move it there, as they always need to be called in that order, right after one another, right?
Register must now be called first. Since it's possible that it fails, we should return the error to the caller rather than just logging it, as otherwise the caller will be expecting to receive the confirmation details for a failed registration. At the moment, it's unlikely for it to return an error other than when shutting down, but it could prevent an obscure bug later down the road.
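Sketched out, the ordering being described (approximate, not the final diff; notificationRegistry and quit are assumed to be the notifier's existing registration and quit channels, and the shutdown error is assumed to be the ErrTxConfNotifierExiting defined elsewhere in the package):

```go
// Register first and surface its error to the caller, so a failed
// registration never leaves someone waiting on confirmation details.
if err := n.txConfNotifier.Register(&ntfn.ConfNtfn); err != nil {
	return nil, err
}

// Only then hand the request off for (possibly historical) dispatch.
select {
case n.notificationRegistry <- ntfn:
	return ntfn.Event, nil
case <-n.quit:
	// Assumed shutdown error; the real code may use a different one.
	return nil, chainntnfs.ErrTxConfNotifierExiting
}
```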
if confDetails != nil {
err := n.txConfNotifier.UpdateConfDetails(
*msg.TxID, msg.ID,
Not sure if we actually need the new ID if we can just pass the ConfNtfn in here instead.
Same comment as above w.r.t. ensuring the notification has been registered first.
chainntnfs/txconfnotifier.go
Outdated
return nil
}

// The notifier has yet to reach the height at which the transaction was
This is a bit confusing, how would we ever know in advance if the conf intent is already satisfied? Seems like it should read DispatchBlockHeight here instead of just a block height.
If it's the block height that the transaction was initially confirmed at, why can we determine this independent of the notifier itself advancing? Do we have a test to exercise this case?
The reason behind the check is due to the lag between the TxConfNotifier and the backend. For example, let's say that the notifier starts at height 100 and the transaction confirms at 102. The backend moves up to 102, but the TxConfNotifier is still processing height 101. In the event that UpdateConfDetails is called before ConnectTip(..., 102, ...), the notifier isn't yet aware of this height, so we defer handling the notification dispatch until then.

Not sure what you mean by DispatchBlockHeight. Are you referring to renaming the method UpdateConfDetails to that?
@@ -237,7 +241,7 @@ out:
b.spendNotifications[op] = make(map[uint64]*spendNotification)
}
b.spendNotifications[op][msg.spendID] = msg
b.chainConn.NotifySpent([]*wire.OutPoint{&op})
Looks like this was a duplicate call even? We already do this at the notification registration site.
Tested this on my testnet nodes with 50+ channels. Before this commit, the startup time was measured in minutes (time from startup to responding to getinfo). After this commit, the startup time is measured in seconds. The one lingering issue, which we've discussed offline, is the fact that the Start() method is blocking, which can cause contention with the main server mutex. Once that final issue is patched, restarts will be much, much speedier.
LGTM 🦑
Nice job @wpaulino, looks pretty complete to me!
started int32 // To be used atomically.
stopped int32 // To be used atomically.

confClientCounter uint64 // To be used atomically.
recommend putting 64-bit atomic vars before 32-bit ones, as this pattern always produces valid alignments on 32-bit machines
Fixed.
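The recommended layout, sketched against only the fields quoted above: the Go sync/atomic docs guarantee 64-bit alignment for the first word of an allocated struct, so putting the uint64 counter first keeps atomic.AddUint64 safe on 32-bit platforms.

```go
package btcdnotify

// Sketch of the suggested field ordering: 64-bit atomically-accessed fields
// first, 32-bit flags after, so the uint64 stays 8-byte aligned on 32-bit
// machines.
type BtcdNotifier struct {
	confClientCounter uint64 // To be used atomically.

	started int32 // To be used atomically.
	stopped int32 // To be used atomically.

	// remaining fields elided ...
}
```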
chainntnfs/btcdnotify/btcd.go
Outdated
started int32 // To be used atomically.
stopped int32 // To be used atomically.

confClientCounter uint64 // To be used aotmically.
same comment here about moving 64-bit vals before 32's, s/aotmically/atomically
Fixed.
chainntnfs/txconfnotifier.go
Outdated
// The notifier has yet to reach the height at which the transaction was
// included in a block, so we should defer until handling it then within
// ConnectTip.
if details.BlockHeight > tcn.currentHeight {
this line will panic if details is nil, which is possible as written. Might want to add a test for this
Fixed.
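The guard being described, roughly (approximate, not the exact diff): return before dereferencing details when the historical rescan found nothing, and fall through to the existing height check otherwise.

```go
// UpdateConfDetails is also called when the historical rescan turns up
// nothing, so guard before touching details; ConnectTip will dispatch once
// the transaction actually confirms.
if details == nil {
	return nil
}

if details.BlockHeight > tcn.currentHeight {
	// ... existing deferral logic quoted above ...
}
```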
var (
// ErrTxConfNotifierExiting is an error returned when attempting to
// interact with the TxConfNotifier but it been shut down.
ErrTxConfNotifierExiting = errors.New("TxConfNotifier is exiting")
👍
chainntnfs/txconfnotifier.go
Outdated
return nil
}

// UpdateConfDetails attempts to update the confirmation details for the an
nit: the
Fixed.
Mostly LGTM now (after addressing Conner's comments). I still think the …
… async

In this commit, we modify our TxConfNotifier struct to allow handling notification registrations asynchronously. The Register method has been refactored into two: Register and UpdateConfDetails. In the case that a transaction we registered for notifications on has already confirmed, we'll need to determine its confirmation details on our own. Once done, this can be provided within UpdateConfDetails. This change will pave the way for our different chain notifiers to handle potentially long rescans asynchronously to prevent blocking the caller.
LGTM 💣
In this PR, we modify our bitcoind and btcd chain notifiers to handle historical confirmations and spends asynchronously. This was motivated by users experiencing slow startups due to blocking during long rescans.

Fixes #1569.
Fixes #1616.
Fixes #1648.