-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvcoord: avoid concurrent rollbacks when making parallel commits explicit #69966
kvcoord: avoid concurrent rollbacks when making parallel commits explicit #69966
Conversation
Opened #69965 to fix the invalid error reason. As outlined on the issue, I think we should consider combining all of these related status errors into a single status error that contains the actual txn status. Wdyt? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice find! Do you intend to backport this to previous release branches as well?
As outlined on the issue, I think we should consider combining all of these related status errors into a single status error that contains the actual txn status. Wdyt?
Cleaning up the treatment of TransactionStatusError
would be lovely. It seems to be used in a handful of different situations, and only a few actually expect the error to be structured In many other places, it's hardly more than a transaction-related errors.AssertionFailedf
. The place to start might be to switch some of the uses over to errors.AssertionFailedf
and then get rid of REASON_UNKNOWN
in the remaining ones by introducing new reasons.
I'm less sure about merging TransactionAbortedError
into TransactionStatusError
. TransactionAbortedError
has a more well-defined meaning and is an important part of the transaction retry story. The Reason
field it carries is also important to clients. So I don't imagine it being particularly easy to switch over, especially given migration concerns. It's also not clear to me that the finished state would be cleaner than what we have today if it meant that clients would need to peek inside TransactionStatusError
s to detect aborts.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you intend to backport this to previous release branches as well?
Yes, to 21.2. It does not apply to 21.1 and earlier, since they explicitly allow concurrent EndTxn(commit=false)
requests -- the bug fix that serialized them was not backported.
Cleaning up the treatment of
TransactionStatusError
would be lovely. It seems to be used in a handful of different situations, and only a few actually expect the error to be structured In many other places, it's hardly more than a transaction-relatederrors.AssertionFailedf
. The place to start might be to switch some of the uses over toerrors.AssertionFailedf
and then get rid ofREASON_UNKNOWN
in the remaining ones by introducing new reasons.
Good idea, I'll add this to the issue and do a pass.
I'm less sure about merging
TransactionAbortedError
intoTransactionStatusError
.TransactionAbortedError
has a more well-defined meaning and is an important part of the transaction retry story. TheReason
field it carries is also important to clients. So I don't imagine it being particularly easy to switch over, especially given migration concerns. It's also not clear to me that the finished state would be cleaner than what we have today if it meant that clients would need to peek insideTransactionStatusErrors
to detect aborts.
Sure, might be more trouble than it's worth.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
306cb79
to
75a94c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
bors r=nvanbenschoten TFTR! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read #68643 (comment), but do you mind explaining to me how the rollback is sent if the makeTxnCommitExplicitAsync()
point was reached? It seems to me that the ConnExecutor
should consider the txn to be committed, so I'm surprised that a rollback is sent. I understand that there's a ctx cancellation at play, I'd expect that if makeTxnCommitExplicitAsync()
is reached then cancellation should not play a role.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too familiar with connExecutor
and haven't looked closely at the failure modes there. However, it doesn't seem entirely implausible that a context cancellation while the response is in flight could cause a race where a rollback is sent. It's also the only explanation I could find for the log messages in the original issue, but it's absolutely possible that there are other failure modes that I haven't found yet. If you think it's worthwhile I can spend some time to dig through the connExecutor
code and see if I find an explanation.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
bors r- |
Canceled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 440 at r2 (raw file):
<-readyC // From the TxnCoordSender's point of view, the txn is implicitly committed,
I'd remove the talk of the connExecutor
here since it's not clear that the connExecutor
behaves as this implies.
1aed891
to
f578804
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canceled to add stack traces for the unexpected EndTxn(commit=false)
requests as well, at @andreimatei's request.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 440 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'd remove the talk of the
connExecutor
here since it's not clear that theconnExecutor
behaves as this implies.
Done.
I see that Txn.Rollback()
can be called indirectly in a handful of other places as well, mainly during COPY
operations and range merges -- it's worth giving these a closer look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
bors r=nvanbenschoten,andreimatei TFTRs! |
Build failed (retrying...): |
Build failed: |
…icit `TxnCoordSender` allows `EndTxn(commit=false)` rollback requests even if the transaction state is finalized, since clients can send multiple rollbacks (e.g. due to context cancellation). However, it allowed this even when the transaction was committed. This could pass the request through while the `txnCommitter` was asynchronously making an implicit commit explicit, which would violate the `txnLockGatekeeper` requirement that transaction requests are synchronous (non-concurrent) which would return an unexpected error for the rollback. This patch rejects additional `EndTxn(commit=false)` requests if the finalized transaction is known to be committed, to prevent this race condition. If rejected, the returned error is of the same type that would be returned by `EndTxn` evaluation, although with a different message string. Note that even though the returned error should really have `REASON_TXN_COMMITTED` in this case, which is also what `txn.Rollback()` expects in order to omit logging, the current `EndTxn` code incorrectly returns `REASON_TXN_UNKNOWN` in this case. This behavior is retained to minimize the change, but should be corrected separately. Release justification: fixes for high-priority or high-severity bugs in existing functionality Release note: None
f578804
to
40ab220
Compare
Failure seems unrelated, even though the same test failed for both bors runs ( bors r=nvanbenschoten,andreimatei |
Build failed (retrying...): |
Build succeeded: |
TxnCoordSender
allowsEndTxn(commit=false)
rollback requests even ifthe transaction state is finalized, since clients can send multiple
rollbacks (e.g. due to context cancellation). However, it allowed this
even when the transaction was committed. This could pass the request
through while the
txnCommitter
was asynchronously making an implicitcommit explicit, which would violate the
txnLockGatekeeper
requirementthat transaction requests are synchronous (non-concurrent) which would
return an unexpected error for the rollback.
This patch rejects additional
EndTxn(commit=false)
requests if thefinalized transaction is known to be committed, to prevent this race
condition. If rejected, the returned error is of the same type that
would be returned by
EndTxn
evaluation, although with a differentmessage string.
Note that even though the returned error should really have
REASON_TXN_COMMITTED
in this case, which is also whattxn.Rollback()
expects in order to omit logging, the current
EndTxn
code incorrectlyreturns
REASON_TXN_UNKNOWN
in this case. This behavior is retained tominimize the change, but should be corrected separately.
Resolves #68643.
Informs #69965.
Release justification: fixes for high-priority or high-severity bugs in existing functionality
Release note: None