-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: cleanup txn more eagerly #34387
kv: cleanup txn more eagerly #34387
Conversation
d1af373
to
8c9e0d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/txn_coord_sender.go, line 678 at r1 (raw file):
} assertConsistentState := func() {
Is the goal to keep this forever? Adding in these assertions with no immediate plan to remove them doesn't seem great to me. They end up sitting around, making code harder to read, and preventing other cleanup.
pkg/kv/txn_coord_sender.go, line 687 at r1 (raw file):
// Move to the error state on non-retriable errors. if pErr != nil {
Why isn't this logic in updateStateLocked
?
pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):
// eliding EndTransaction requests on read-only transactions. // // txnHeartbeat should only be used for root transactions; leafs don't perform
Why remove this?
pkg/kv/txn_interceptor_heartbeat.go, line 488 at r1 (raw file):
// We need to be prepared here to handle the case of a // TransactionAbortedError with no transaction proto in it.
Could you leave a TODO for me to make this the only case where we get back an ABORTED error?
8c9e0d6
to
da2ae00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/txn_coord_sender.go, line 678 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the goal to keep this forever? Adding in these assertions with no immediate plan to remove them doesn't seem great to me. They end up sitting around, making code harder to read, and preventing other cleanup.
I've removed the assertion. Let's hope that the new cleanup makes the crash go away.
pkg/kv/txn_coord_sender.go, line 687 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why isn't this logic in
updateStateLocked
?
yeah... I put some makeup on the pig; see if you like it now.
pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why remove this?
cause it wasn't true. The leaf TCS still has all the interceptors.
pkg/kv/txn_interceptor_heartbeat.go, line 488 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you leave a TODO for me to make this the only case where we get back an ABORTED error?
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
friendly ping :)
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 5 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/txn_coord_sender.go, line 873 at r2 (raw file):
// starting tracking the transaction - i.e. this is the case for DistSQL, which // just does reads and passes 0. func (tc *TxnCoordSender) updateStateLocked(
This logic is much cleaner!
pkg/kv/txn_coord_sender.go, line 900 at r2 (raw file):
if pErr == nil { if br.Txn != nil {
I was wondering about these checks a few days ago. What do you think about them? Now that we have the CrossRangeTxnWrapperSender
, should anything in the TxnCoordSender have to deal with non-transactional batches or responses? If we're worried about malformed responses, maybe we just perform a single validation in the txnLockGatekeeper
?
Or for now, just get rid of the check here, because Update
is already a no-op for nil txns.
pkg/kv/txn_coord_sender.go, line 911 at r2 (raw file):
// transaction is not supposed to be used any more after a retriable // error. // Separately, the error needs to make its way back to the root.
Re-wrap?
pkg/kv/txn_coord_sender.go, line 922 at r2 (raw file):
cp := pErr.GetTxn().Clone() tc.mu.txn.Update(&cp) tc.cleanupTxnLocked(ctx)
This is ok even on leafs? Hmm, I wonder if that could cause issues like counting metrics for the same transaction on a leaf node and a root node. All the more reason to have a restricted set of interceptors on leaf nodes.
pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
cause it wasn't true. The leaf TCS still has all the interceptors.
Let's fix that then! Maybe leave a TODO when we're constructing the interceptor stack that we can skip certain interceptors on leaf nodes. That's part of what makes the design appealing.
Are we currently relying on the leaf txns to not perform any writes? If one accidentally did, could we have two heartbeat loops?
da2ae00
to
07e46d7
Compare
Before this patch, when a txn got a non-retriable error, its heartbeat loop (if any) was left running until the client sent a rollback. This patch makes the txn cleanup more eager - we do it immediately on receiving the error. Besides seeming sane (why wait for the client when we know what must happen), this also makes the TxnCoordSender state more internally consistent: if a non-retriable error contains an Aborted txn, we now stop the hb loop before that loop has the opportunity to freakout about running with an Aborted txn. It's unclear if non-retriable errors could contain Aborted txns, but see below. This patch also refactors the state update code in an attempt to make it more readable. In cockroachdb#34337 we see a crash due to the fact that a heartbeat is running for a transaction whose proto status is no longer PENDING. It's not entirely clear to me how that can happen since we "clean up the txn" - i.e. stop the hb loop - after commits and roll backs as well on TransactionAbortedErrors, but it's also not very convincing that it can't happen. The thing is that the protocol between the "client" and the "server" wrt communicating txn updates from the server is lax and it's not very clear what kind of responses can carry an Aborted or Committed proto in them. This patch also makes leaf TxnCoordSender nimbler by not using interceptors needed only by roots. Release note: None
07e46d7
to
1088016
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/txn_coord_sender.go, line 900 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I was wondering about these checks a few days ago. What do you think about them? Now that we have the
CrossRangeTxnWrapperSender
, should anything in the TxnCoordSender have to deal with non-transactional batches or responses? If we're worried about malformed responses, maybe we just perform a single validation in thetxnLockGatekeeper
?Or for now, just get rid of the check here, because
Update
is already a no-op for nil txns.
done the latter
pkg/kv/txn_coord_sender.go, line 911 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Re-wrap?
done
pkg/kv/txn_coord_sender.go, line 922 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is ok even on leafs? Hmm, I wonder if that could cause issues like counting metrics for the same transaction on a leaf node and a root node. All the more reason to have a restricted set of interceptors on leaf nodes.
good catch. I removed the metrics and other interceptor from leaves.
pkg/kv/txn_interceptor_heartbeat.go, line 39 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's fix that then! Maybe leave a TODO when we're constructing the interceptor stack that we can skip certain interceptors on leaf nodes. That's part of what makes the design appealing.
Are we currently relying on the leaf txns to not perform any writes? If one accidentally did, could we have two heartbeat loops?
fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r3.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/txn_coord_sender.go, line 459 at r3 (raw file):
mu: &tcs.mu, } if typ == client.RootTxn {
👍 let's see if this breaks anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
34387: kv: cleanup txn more eagerly r=andreimatei a=andreimatei Before this patch, when a txn got a non-retriable error, its heartbeat loop (if any) was left running until the client sent a rollback. This patch makes the txn cleanup more eager - we do it immediately on receiving the error. Besides seeming sane (why wait for the client when we know what must happen), this also makes the TxnCoordSender state more internally consistent: if a non-retriable error contains an Aborted txn, we now stop the hb loop before that loop has the opportunity to freakout about running with an Aborted txn. It's unclear if non-retriable errors could contain Aborted txns, but see below. This patch also refactors the state update code in an attempt to make it more readable. In #34337 we see a crash due to the fact that a heartbeat is running for a transaction whose proto status is no longer PENDING. It's not entirely clear to me how that can happen since we "clean up the txn" - i.e. stop the hb loop - after commits and roll backs as well on TransactionAbortedErrors, but it's also not very convincing that it can't happen. The thing is that the protocol between the "client" and the "server" wrt communicating txn updates from the server is lax and it's not very clear what kind of responses can carry an Aborted or Committed proto in them. This patch also makes leaf TxnCoordSender nimbler by not using interceptors needed only by roots. Release note: None 34828: roachtest: disable follower_reads test for versions prior to v2.2.0 r=ajwerner a=ajwerner Fixed #34814 Release note: None Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
Build succeeded |
Before this patch, when a txn got a non-retriable error, its heartbeat
loop (if any) was left running until the client sent a rollback.
This patch makes the txn cleanup more eager - we do it immediately on
receiving the error.
Besides seeming sane (why wait for the client when we know what must
happen), this also makes the TxnCoordSender state more internally
consistent: if a non-retriable error contains an Aborted txn, we now
stop the hb loop before that loop has the opportunity to freakout about
running with an Aborted txn. It's unclear if non-retriable errors could
contain Aborted txns, but see below.
This patch also refactors the state update code in an attempt to make it
more readable.
In #34337 we see a crash due to the fact that a heartbeat is running for
a transaction whose proto status is no longer PENDING. It's not entirely
clear to me how that can happen since we "clean up the txn" - i.e. stop
the hb loop - after commits and roll backs as well on
TransactionAbortedErrors, but it's also not very convincing that it
can't happen. The thing is that the protocol between the "client" and
the "server" wrt communicating txn updates from the server is lax and
it's not very clear what kind of responses can carry an Aborted or
Committed proto in them.
This patch also makes leaf TxnCoordSender nimbler by not using
interceptors needed only by roots.
Release note: None