Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: cleanup txn more eagerly #34387

Merged
merged 1 commit into from
Feb 12, 2019

Conversation

andreimatei
Copy link
Contributor

@andreimatei andreimatei commented Jan 30, 2019

Before this patch, when a txn got a non-retriable error, its heartbeat
loop (if any) was left running until the client sent a rollback.
This patch makes the txn cleanup more eager - we do it immediately on
receiving the error.
Besides seeming sane (why wait for the client when we know what must
happen), this also makes the TxnCoordSender state more internally
consistent: if a non-retriable error contains an Aborted txn, we now
stop the hb loop before that loop has the opportunity to freakout about
running with an Aborted txn. It's unclear if non-retriable errors could
contain Aborted txns, but see below.

This patch also refactors the state update code in an attempt to make it
more readable.

In #34337 we see a crash due to the fact that a heartbeat is running for
a transaction whose proto status is no longer PENDING. It's not entirely
clear to me how that can happen since we "clean up the txn" - i.e. stop
the hb loop - after commits and roll backs as well on
TransactionAbortedErrors, but it's also not very convincing that it
can't happen. The thing is that the protocol between the "client" and
the "server" wrt communicating txn updates from the server is lax and
it's not very clear what kind of responses can carry an Aborted or
Committed proto in them.

This patch also makes leaf TxnCoordSender nimbler by not using
interceptors needed only by roots.

Release note: None

@andreimatei andreimatei requested review from nvanbenschoten and a team January 30, 2019 03:40
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the small.assert-inconsistent-TCS branch from d1af373 to 8c9e0d6 Compare January 30, 2019 03:42
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/txn_coord_sender.go, line 678 at r1 (raw file):

	}

	assertConsistentState := func() {

Is the goal to keep this forever? Adding in these assertions with no immediate plan to remove them doesn't seem great to me. They end up sitting around, making code harder to read, and preventing other cleanup.


pkg/kv/txn_coord_sender.go, line 687 at r1 (raw file):

	// Move to the error state on non-retriable errors.
	if pErr != nil {

Why isn't this logic in updateStateLocked?


pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

// eliding EndTransaction requests on read-only transactions.
//
// txnHeartbeat should only be used for root transactions; leafs don't perform

Why remove this?


pkg/kv/txn_interceptor_heartbeat.go, line 488 at r1 (raw file):

		// We need to be prepared here to handle the case of a
		// TransactionAbortedError with no transaction proto in it.

Could you leave a TODO for me to make this the only case where we get back an ABORTED error?

@andreimatei andreimatei force-pushed the small.assert-inconsistent-TCS branch from 8c9e0d6 to da2ae00 Compare January 31, 2019 07:52
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/txn_coord_sender.go, line 678 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is the goal to keep this forever? Adding in these assertions with no immediate plan to remove them doesn't seem great to me. They end up sitting around, making code harder to read, and preventing other cleanup.

I've removed the assertion. Let's hope that the new cleanup makes the crash go away.


pkg/kv/txn_coord_sender.go, line 687 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why isn't this logic in updateStateLocked?

yeah... I put some makeup on the pig; see if you like it now.


pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why remove this?

cause it wasn't true. The leaf TCS still has all the interceptors.


pkg/kv/txn_interceptor_heartbeat.go, line 488 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you leave a TODO for me to make this the only case where we get back an ABORTED error?

done

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

friendly ping :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/txn_coord_sender.go, line 873 at r2 (raw file):

// starting tracking the transaction - i.e. this is the case for DistSQL, which
// just does reads and passes 0.
func (tc *TxnCoordSender) updateStateLocked(

This logic is much cleaner!


pkg/kv/txn_coord_sender.go, line 900 at r2 (raw file):

	if pErr == nil {
		if br.Txn != nil {

I was wondering about these checks a few days ago. What do you think about them? Now that we have the CrossRangeTxnWrapperSender, should anything in the TxnCoordSender have to deal with non-transactional batches or responses? If we're worried about malformed responses, maybe we just perform a single validation in the txnLockGatekeeper?

Or for now, just get rid of the check here, because Update is already a no-op for nil txns.


pkg/kv/txn_coord_sender.go, line 911 at r2 (raw file):

			// transaction is not supposed to be used any more after a retriable
			// error.
			// Separately, the error needs to make its way back to the root.

Re-wrap?


pkg/kv/txn_coord_sender.go, line 922 at r2 (raw file):

			cp := pErr.GetTxn().Clone()
			tc.mu.txn.Update(&cp)
			tc.cleanupTxnLocked(ctx)

This is ok even on leafs? Hmm, I wonder if that could cause issues like counting metrics for the same transaction on a leaf node and a root node. All the more reason to have a restricted set of interceptors on leaf nodes.


pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

cause it wasn't true. The leaf TCS still has all the interceptors.

Let's fix that then! Maybe leave a TODO when we're constructing the interceptor stack that we can skip certain interceptors on leaf nodes. That's part of what makes the design appealing.

Are we currently relying on the leaf txns to not perform any writes? If one accidentally did, could we have two heartbeat loops?

@andreimatei andreimatei force-pushed the small.assert-inconsistent-TCS branch from da2ae00 to 07e46d7 Compare February 8, 2019 00:19
Before this patch, when a txn got a non-retriable error, its heartbeat
loop (if any) was left running until the client sent a rollback.
This patch makes the txn cleanup more eager - we do it immediately on
receiving the error.
Besides seeming sane (why wait for the client when we know what must
happen), this also makes the TxnCoordSender state more internally
consistent: if a non-retriable error contains an Aborted txn, we now
stop the hb loop before that loop has the opportunity to freakout about
running with an Aborted txn. It's unclear if non-retriable errors could
contain Aborted txns, but see below.

This patch also refactors the state update code in an attempt to make it
more readable.

In cockroachdb#34337 we see a crash due to the fact that a heartbeat is running for
a transaction whose proto status is no longer PENDING. It's not entirely
clear to me how that can happen since we "clean up the txn" - i.e. stop
the hb loop - after commits and roll backs as well on
TransactionAbortedErrors, but it's also not very convincing that it
can't happen. The thing is that the protocol between the "client" and
the "server" wrt communicating txn updates from the server is lax and
it's not very clear what kind of responses can carry an Aborted or
Committed proto in them.

This patch also makes leaf TxnCoordSender nimbler by not using
interceptors needed only by roots.

Release note: None
@andreimatei andreimatei force-pushed the small.assert-inconsistent-TCS branch from 07e46d7 to 1088016 Compare February 8, 2019 00:21
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/txn_coord_sender.go, line 900 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I was wondering about these checks a few days ago. What do you think about them? Now that we have the CrossRangeTxnWrapperSender, should anything in the TxnCoordSender have to deal with non-transactional batches or responses? If we're worried about malformed responses, maybe we just perform a single validation in the txnLockGatekeeper?

Or for now, just get rid of the check here, because Update is already a no-op for nil txns.

done the latter


pkg/kv/txn_coord_sender.go, line 911 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Re-wrap?

done


pkg/kv/txn_coord_sender.go, line 922 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is ok even on leafs? Hmm, I wonder if that could cause issues like counting metrics for the same transaction on a leaf node and a root node. All the more reason to have a restricted set of interceptors on leaf nodes.

good catch. I removed the metrics and other interceptor from leaves.


pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's fix that then! Maybe leave a TODO when we're constructing the interceptor stack that we can skip certain interceptors on leaf nodes. That's part of what makes the design appealing.

Are we currently relying on the leaf txns to not perform any writes? If one accidentally did, could we have two heartbeat loops?

fixed!

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice cleanup!

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/txn_coord_sender.go, line 459 at r3 (raw file):

		mu:      &tcs.mu,
	}
	if typ == client.RootTxn {

👍 let's see if this breaks anything.

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

craig bot pushed a commit that referenced this pull request Feb 12, 2019
34387: kv: cleanup txn more eagerly r=andreimatei a=andreimatei

Before this patch, when a txn got a non-retriable error, its heartbeat
loop (if any) was left running until the client sent a rollback.
This patch makes the txn cleanup more eager - we do it immediately on
receiving the error.
Besides seeming sane (why wait for the client when we know what must
happen), this also makes the TxnCoordSender state more internally
consistent: if a non-retriable error contains an Aborted txn, we now
stop the hb loop before that loop has the opportunity to freakout about
running with an Aborted txn. It's unclear if non-retriable errors could
contain Aborted txns, but see below.

This patch also refactors the state update code in an attempt to make it
more readable.

In #34337 we see a crash due to the fact that a heartbeat is running for
a transaction whose proto status is no longer PENDING. It's not entirely
clear to me how that can happen since we "clean up the txn" - i.e. stop
the hb loop - after commits and roll backs as well on
TransactionAbortedErrors, but it's also not very convincing that it
can't happen. The thing is that the protocol between the "client" and
the "server" wrt communicating txn updates from the server is lax and
it's not very clear what kind of responses can carry an Aborted or
Committed proto in them.

This patch also makes leaf TxnCoordSender nimbler by not using
interceptors needed only by roots.

Release note: None

34828: roachtest: disable follower_reads test for versions prior to v2.2.0 r=ajwerner a=ajwerner

Fixed #34814

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 12, 2019

Build succeeded

@craig craig bot merged commit 1088016 into cockroachdb:master Feb 12, 2019
@andreimatei andreimatei deleted the small.assert-inconsistent-TCS branch February 13, 2019 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants