Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: avoid concurrent rollbacks when making parallel commits explicit #69966

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Sep 9, 2021

TxnCoordSender allows EndTxn(commit=false) rollback requests even if
the transaction state is finalized, since clients can send multiple
rollbacks (e.g. due to context cancellation). However, it allowed this
even when the transaction was committed. This could pass the request
through while the txnCommitter was asynchronously making an implicit
commit explicit, which would violate the txnLockGatekeeper requirement
that transaction requests are synchronous (non-concurrent) which would
return an unexpected error for the rollback.

This patch rejects additional EndTxn(commit=false) requests if the
finalized transaction is known to be committed, to prevent this race
condition. If rejected, the returned error is of the same type that
would be returned by EndTxn evaluation, although with a different
message string.

Note that even though the returned error should really have
REASON_TXN_COMMITTED in this case, which is also what txn.Rollback()
expects in order to omit logging, the current EndTxn code incorrectly
returns REASON_TXN_UNKNOWN in this case. This behavior is retained to
minimize the change, but should be corrected separately.

Resolves #68643.
Informs #69965.

Release justification: fixes for high-priority or high-severity bugs in existing functionality
Release note: None

@erikgrinaker erikgrinaker self-assigned this Sep 9, 2021
@erikgrinaker erikgrinaker requested a review from a team as a code owner September 9, 2021 16:18
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Sep 9, 2021

Opened #69965 to fix the invalid error reason. As outlined on the issue, I think we should consider combining all of these related status errors into a single status error that contains the actual txn status. Wdyt?

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice find! Do you intend to backport this to previous release branches as well?

As outlined on the issue, I think we should consider combining all of these related status errors into a single status error that contains the actual txn status. Wdyt?

Cleaning up the treatment of TransactionStatusError would be lovely. It seems to be used in a handful of different situations, and only a few actually expect the error to be structured In many other places, it's hardly more than a transaction-related errors.AssertionFailedf. The place to start might be to switch some of the uses over to errors.AssertionFailedf and then get rid of REASON_UNKNOWN in the remaining ones by introducing new reasons.

I'm less sure about merging TransactionAbortedError into TransactionStatusError. TransactionAbortedError has a more well-defined meaning and is an important part of the transaction retry story. The Reason field it carries is also important to clients. So I don't imagine it being particularly easy to switch over, especially given migration concerns. It's also not clear to me that the finished state would be cleaner than what we have today if it meant that clients would need to peek inside TransactionStatusErrors to detect aborts.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you intend to backport this to previous release branches as well?

Yes, to 21.2. It does not apply to 21.1 and earlier, since they explicitly allow concurrent EndTxn(commit=false) requests -- the bug fix that serialized them was not backported.

Cleaning up the treatment of TransactionStatusError would be lovely. It seems to be used in a handful of different situations, and only a few actually expect the error to be structured In many other places, it's hardly more than a transaction-related errors.AssertionFailedf. The place to start might be to switch some of the uses over to errors.AssertionFailedf and then get rid of REASON_UNKNOWN in the remaining ones by introducing new reasons.

Good idea, I'll add this to the issue and do a pass.

I'm less sure about merging TransactionAbortedError into TransactionStatusError. TransactionAbortedError has a more well-defined meaning and is an important part of the transaction retry story. The Reason field it carries is also important to clients. So I don't imagine it being particularly easy to switch over, especially given migration concerns. It's also not clear to me that the finished state would be cleaner than what we have today if it meant that clients would need to peek inside TransactionStatusErrors to detect aborts.

Sure, might be more trouble than it's worth.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

@erikgrinaker
Copy link
Contributor Author

bors r=nvanbenschoten

TFTR!

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read #68643 (comment), but do you mind explaining to me how the rollback is sent if the makeTxnCommitExplicitAsync() point was reached? It seems to me that the ConnExecutor should consider the txn to be committed, so I'm surprised that a rollback is sent. I understand that there's a ctx cancellation at play, I'd expect that if makeTxnCommitExplicitAsync() is reached then cancellation should not play a role.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with connExecutor and haven't looked closely at the failure modes there. However, it doesn't seem entirely implausible that a context cancellation while the response is in flight could cause a race where a rollback is sent. It's also the only explanation I could find for the log messages in the original issue, but it's absolutely possible that there are other failure modes that I haven't found yet. If you think it's worthwhile I can spend some time to dig through the connExecutor code and see if I find an explanation.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker
Copy link
Contributor Author

bors r-

@craig
Copy link
Contributor

craig bot commented Sep 13, 2021

Canceled.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 440 at r2 (raw file):

	<-readyC

	// From the TxnCoordSender's point of view, the txn is implicitly committed,

I'd remove the talk of the connExecutor here since it's not clear that the connExecutor behaves as this implies.

@erikgrinaker erikgrinaker force-pushed the txncoordsender-explicit-commit-race branch 2 times, most recently from 1aed891 to f578804 Compare September 13, 2021 17:37
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canceled to add stack traces for the unexpected EndTxn(commit=false) requests as well, at @andreimatei's request.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 440 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd remove the talk of the connExecutor here since it's not clear that the connExecutor behaves as this implies.

Done.

I see that Txn.Rollback() can be called indirectly in a handful of other places as well, mainly during COPY operations and range merges -- it's worth giving these a closer look.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)

@erikgrinaker
Copy link
Contributor Author

bors r=nvanbenschoten,andreimatei

TFTRs!

@craig
Copy link
Contributor

craig bot commented Sep 14, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 14, 2021

Build failed:

…icit

`TxnCoordSender` allows `EndTxn(commit=false)` rollback requests even if
the transaction state is finalized, since clients can send multiple
rollbacks (e.g. due to context cancellation). However, it allowed this
even when the transaction was committed. This could pass the request
through while the `txnCommitter` was asynchronously making an implicit
commit explicit, which would violate the `txnLockGatekeeper` requirement
that transaction requests are synchronous (non-concurrent) which would
return an unexpected error for the rollback.

This patch rejects additional `EndTxn(commit=false)` requests if the
finalized transaction is known to be committed, to prevent this race
condition. If rejected, the returned error is of the same type that
would be returned by `EndTxn` evaluation, although with a different
message string.

Note that even though the returned error should really have
`REASON_TXN_COMMITTED` in this case, which is also what `txn.Rollback()`
expects in order to omit logging, the current `EndTxn` code incorrectly
returns `REASON_TXN_UNKNOWN` in this case. This behavior is retained to
minimize the change, but should be corrected separately.

Release justification: fixes for high-priority or high-severity bugs in existing functionality
Release note: None
@erikgrinaker erikgrinaker force-pushed the txncoordsender-explicit-commit-race branch from f578804 to 40ab220 Compare September 14, 2021 18:46
@erikgrinaker
Copy link
Contributor Author

Failure seems unrelated, even though the same test failed for both bors runs (acceptance/cli/node-status). It passes locally. It's failing on master as well, see #70169.

bors r=nvanbenschoten,andreimatei

@craig
Copy link
Contributor

craig bot commented Sep 14, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 14, 2021

Build succeeded:

@craig craig bot merged commit b367897 into cockroachdb:master Sep 14, 2021
@erikgrinaker erikgrinaker deleted the txncoordsender-explicit-commit-race branch September 22, 2021 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: async rollback failed: concurrent txn use detected
4 participants