kv: cleanup txn more eagerly #34387

andreimatei · 2019-01-30T03:40:06Z

Before this patch, when a txn got a non-retriable error, its heartbeat
loop (if any) was left running until the client sent a rollback.
This patch makes the txn cleanup more eager - we do it immediately on
receiving the error.
Besides seeming sane (why wait for the client when we know what must
happen), this also makes the TxnCoordSender state more internally
consistent: if a non-retriable error contains an Aborted txn, we now
stop the hb loop before that loop has the opportunity to freakout about
running with an Aborted txn. It's unclear if non-retriable errors could
contain Aborted txns, but see below.

This patch also refactors the state update code in an attempt to make it
more readable.

In #34337 we see a crash due to the fact that a heartbeat is running for
a transaction whose proto status is no longer PENDING. It's not entirely
clear to me how that can happen since we "clean up the txn" - i.e. stop
the hb loop - after commits and roll backs as well on
TransactionAbortedErrors, but it's also not very convincing that it
can't happen. The thing is that the protocol between the "client" and
the "server" wrt communicating txn updates from the server is lax and
it's not very clear what kind of responses can carry an Aborted or
Committed proto in them.

This patch also makes leaf TxnCoordSender nimbler by not using
interceptors needed only by roots.

Release note: None

cockroach-teamcity · 2019-01-30T03:40:12Z

This change is

nvanbenschoten

Reviewed 2 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)

pkg/kv/txn_coord_sender.go, line 678 at r1 (raw file):

	}

	assertConsistentState := func() {

Is the goal to keep this forever? Adding in these assertions with no immediate plan to remove them doesn't seem great to me. They end up sitting around, making code harder to read, and preventing other cleanup.

pkg/kv/txn_coord_sender.go, line 687 at r1 (raw file):

	// Move to the error state on non-retriable errors.
	if pErr != nil {

Why isn't this logic in updateStateLocked?

pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

// eliding EndTransaction requests on read-only transactions.
//
// txnHeartbeat should only be used for root transactions; leafs don't perform

Why remove this?

pkg/kv/txn_interceptor_heartbeat.go, line 488 at r1 (raw file):

		// We need to be prepared here to handle the case of a
		// TransactionAbortedError with no transaction proto in it.

Could you leave a TODO for me to make this the only case where we get back an ABORTED error?

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/kv/txn_coord_sender.go, line 678 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is the goal to keep this forever? Adding in these assertions with no immediate plan to remove them doesn't seem great to me. They end up sitting around, making code harder to read, and preventing other cleanup.

I've removed the assertion. Let's hope that the new cleanup makes the crash go away.

pkg/kv/txn_coord_sender.go, line 687 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why isn't this logic in updateStateLocked?

yeah... I put some makeup on the pig; see if you like it now.

pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why remove this?

cause it wasn't true. The leaf TCS still has all the interceptors.

pkg/kv/txn_interceptor_heartbeat.go, line 488 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you leave a TODO for me to make this the only case where we get back an ABORTED error?

done

andreimatei

friendly ping :)

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

nvanbenschoten

Reviewed 4 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

pkg/kv/txn_coord_sender.go, line 873 at r2 (raw file):

// starting tracking the transaction - i.e. this is the case for DistSQL, which
// just does reads and passes 0.
func (tc *TxnCoordSender) updateStateLocked(

This logic is much cleaner!

pkg/kv/txn_coord_sender.go, line 900 at r2 (raw file):

	if pErr == nil {
		if br.Txn != nil {

I was wondering about these checks a few days ago. What do you think about them? Now that we have the CrossRangeTxnWrapperSender, should anything in the TxnCoordSender have to deal with non-transactional batches or responses? If we're worried about malformed responses, maybe we just perform a single validation in the txnLockGatekeeper?

Or for now, just get rid of the check here, because Update is already a no-op for nil txns.

pkg/kv/txn_coord_sender.go, line 911 at r2 (raw file):

			// transaction is not supposed to be used any more after a retriable
			// error.
			// Separately, the error needs to make its way back to the root.

Re-wrap?

pkg/kv/txn_coord_sender.go, line 922 at r2 (raw file):

			cp := pErr.GetTxn().Clone()
			tc.mu.txn.Update(&cp)
			tc.cleanupTxnLocked(ctx)

This is ok even on leafs? Hmm, I wonder if that could cause issues like counting metrics for the same transaction on a leaf node and a root node. All the more reason to have a restricted set of interceptors on leaf nodes.

pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

cause it wasn't true. The leaf TCS still has all the interceptors.

Let's fix that then! Maybe leave a TODO when we're constructing the interceptor stack that we can skip certain interceptors on leaf nodes. That's part of what makes the design appealing.

Are we currently relying on the leaf txns to not perform any writes? If one accidentally did, could we have two heartbeat loops?

Before this patch, when a txn got a non-retriable error, its heartbeat loop (if any) was left running until the client sent a rollback. This patch makes the txn cleanup more eager - we do it immediately on receiving the error. Besides seeming sane (why wait for the client when we know what must happen), this also makes the TxnCoordSender state more internally consistent: if a non-retriable error contains an Aborted txn, we now stop the hb loop before that loop has the opportunity to freakout about running with an Aborted txn. It's unclear if non-retriable errors could contain Aborted txns, but see below. This patch also refactors the state update code in an attempt to make it more readable. In cockroachdb#34337 we see a crash due to the fact that a heartbeat is running for a transaction whose proto status is no longer PENDING. It's not entirely clear to me how that can happen since we "clean up the txn" - i.e. stop the hb loop - after commits and roll backs as well on TransactionAbortedErrors, but it's also not very convincing that it can't happen. The thing is that the protocol between the "client" and the "server" wrt communicating txn updates from the server is lax and it's not very clear what kind of responses can carry an Aborted or Committed proto in them. This patch also makes leaf TxnCoordSender nimbler by not using interceptors needed only by roots. Release note: None

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/kv/txn_coord_sender.go, line 900 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I was wondering about these checks a few days ago. What do you think about them? Now that we have the CrossRangeTxnWrapperSender, should anything in the TxnCoordSender have to deal with non-transactional batches or responses? If we're worried about malformed responses, maybe we just perform a single validation in the txnLockGatekeeper?

Or for now, just get rid of the check here, because Update is already a no-op for nil txns.

done the latter

pkg/kv/txn_coord_sender.go, line 911 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Re-wrap?

done

pkg/kv/txn_coord_sender.go, line 922 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is ok even on leafs? Hmm, I wonder if that could cause issues like counting metrics for the same transaction on a leaf node and a root node. All the more reason to have a restricted set of interceptors on leaf nodes.

good catch. I removed the metrics and other interceptor from leaves.

pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's fix that then! Maybe leave a TODO when we're constructing the interceptor stack that we can skip certain interceptors on leaf nodes. That's part of what makes the design appealing.

Are we currently relying on the leaf txns to not perform any writes? If one accidentally did, could we have two heartbeat loops?

fixed!

nvanbenschoten

nice cleanup!

Reviewed 2 of 2 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

pkg/kv/txn_coord_sender.go, line 459 at r3 (raw file):

		mu:      &tcs.mu,
	}
	if typ == client.RootTxn {

👍 let's see if this breaks anything.

andreimatei

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

34387: kv: cleanup txn more eagerly r=andreimatei a=andreimatei Before this patch, when a txn got a non-retriable error, its heartbeat loop (if any) was left running until the client sent a rollback. This patch makes the txn cleanup more eager - we do it immediately on receiving the error. Besides seeming sane (why wait for the client when we know what must happen), this also makes the TxnCoordSender state more internally consistent: if a non-retriable error contains an Aborted txn, we now stop the hb loop before that loop has the opportunity to freakout about running with an Aborted txn. It's unclear if non-retriable errors could contain Aborted txns, but see below. This patch also refactors the state update code in an attempt to make it more readable. In #34337 we see a crash due to the fact that a heartbeat is running for a transaction whose proto status is no longer PENDING. It's not entirely clear to me how that can happen since we "clean up the txn" - i.e. stop the hb loop - after commits and roll backs as well on TransactionAbortedErrors, but it's also not very convincing that it can't happen. The thing is that the protocol between the "client" and the "server" wrt communicating txn updates from the server is lax and it's not very clear what kind of responses can carry an Aborted or Committed proto in them. This patch also makes leaf TxnCoordSender nimbler by not using interceptors needed only by roots. Release note: None 34828: roachtest: disable follower_reads test for versions prior to v2.2.0 r=ajwerner a=ajwerner Fixed #34814 Release note: None Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Andrew Werner <[email protected]>

craig · 2019-02-12T16:36:58Z

Build succeeded

GitHub CI (Cockroach)

andreimatei requested review from nvanbenschoten and a team January 30, 2019 03:40

andreimatei force-pushed the small.assert-inconsistent-TCS branch from d1af373 to 8c9e0d6 Compare January 30, 2019 03:42

nvanbenschoten reviewed Jan 30, 2019

View reviewed changes

andreimatei force-pushed the small.assert-inconsistent-TCS branch from 8c9e0d6 to da2ae00 Compare January 31, 2019 07:52

andreimatei commented Jan 31, 2019

View reviewed changes

andreimatei commented Feb 6, 2019

View reviewed changes

nvanbenschoten reviewed Feb 6, 2019

View reviewed changes

andreimatei force-pushed the small.assert-inconsistent-TCS branch from da2ae00 to 07e46d7 Compare February 8, 2019 00:19

andreimatei force-pushed the small.assert-inconsistent-TCS branch from 07e46d7 to 1088016 Compare February 8, 2019 00:21

andreimatei commented Feb 8, 2019

View reviewed changes

nvanbenschoten approved these changes Feb 11, 2019

View reviewed changes

andreimatei commented Feb 12, 2019

View reviewed changes

craig bot merged commit 1088016 into cockroachdb:master Feb 12, 2019

andreimatei deleted the small.assert-inconsistent-TCS branch February 13, 2019 18:57

andreimatei mentioned this pull request Feb 13, 2019

roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: cleanup txn more eagerly #34387

kv: cleanup txn more eagerly #34387

andreimatei commented Jan 30, 2019 •

edited

Loading

cockroach-teamcity commented Jan 30, 2019

nvanbenschoten left a comment

andreimatei left a comment

andreimatei left a comment

nvanbenschoten left a comment

andreimatei left a comment

nvanbenschoten left a comment

andreimatei left a comment

craig bot commented Feb 12, 2019

kv: cleanup txn more eagerly #34387

kv: cleanup txn more eagerly #34387

Conversation

andreimatei commented Jan 30, 2019 • edited Loading

cockroach-teamcity commented Jan 30, 2019

nvanbenschoten left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

craig bot commented Feb 12, 2019

Build succeeded

andreimatei commented Jan 30, 2019 •

edited

Loading