
sql,kv: bubble up retry errors when creating leaf transactions #98713

Merged: 1 commit merged into cockroachdb:master from the investigate-97141 branch on Mar 16, 2023

Conversation

arulajmani (Collaborator)

Previously, if we detected that the transaction was aborted when trying to construct leaf transaction state, we would handle the retry error instead of bubbling it up to the caller. When a transaction is aborted, the TransactionRetryWithProtoRefreshError carries with it a new transaction that should be used for subsequent attempts. Handling the retry error entailed swapping out the old TxnCoordSender with a new one -- one that is associated with this new transaction.

This is bug-prone when trying to create multiple leaf transactions in parallel if the root has been aborted. We would expect the first leaf transaction to handle the error and all subsequent leaf transactions to point to the new transaction, as the TxnCoordSender has been swapped out. This wasn't an issue before, as we never really created multiple leaf transactions in parallel. This recently changed in 0f4b431, which started parallelizing FK and uniqueness checks. With that change, we could see FK or uniqueness violations when in fact the transaction needed to be retried.

This patch fixes the issue described above by not handling the retry error when creating leaf transactions. Instead, we expect the ConnExecutor to retry the entire transaction and prepare it for another iteration.
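For illustration, a minimal, self-contained Go sketch of the behavioral change; the txn, retryErr, and getLeafInputState names below are stand-ins for this example, not the actual kv.Txn / TxnCoordSender API:

	// Illustrative stand-ins only; not the CockroachDB API.
	package main

	import (
		"errors"
		"fmt"
	)

	// retryErr stands in for TransactionRetryWithProtoRefreshError.
	type retryErr struct{ newTxnID string }

	func (e *retryErr) Error() string { return "txn aborted; retry as " + e.newTxnID }

	type txn struct {
		id      string
		aborted bool
	}

	// getLeafInputState sketches the fixed behavior: if the root has been
	// aborted, return the retry error to the caller instead of swapping the
	// underlying sender out for a new transaction here.
	func (t *txn) getLeafInputState() (string, error) {
		if t.aborted {
			return "", &retryErr{newTxnID: t.id + "-retry"}
		}
		return "leaf state for " + t.id, nil
	}

	func main() {
		root := &txn{id: "txn-1", aborted: true}
		// Creating several leaves in parallel is now well-defined: every call
		// observes the same error rather than racing on a sender swap.
		for i := 0; i < 2; i++ {
			if _, err := root.getLeafInputState(); err != nil {
				var re *retryErr
				if errors.As(err, &re) {
					fmt.Println("bubble up to the ConnExecutor:", re)
				}
			}
		}
	}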

Fixes #97141

Epic: none

Release note: None

@arulajmani arulajmani requested review from a team as code owners March 15, 2023 20:34

@yuzefovich yuzefovich left a comment (Member)

Nice job tracking this down and a nice test!

I see the 22.1 backport label, but I don't plan to backport the parallelization of checks to 22.1, so unless you think this change is beneficial on its own to warrant a backport there, we can skip that. Up to you.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/sql/distsql_running_test.go line 248 at r1 (raw file):

	createPlannerAndRunQuery := func(ctx context.Context, txn *kv.Txn, query string) error {
		execCfg := s.ExecutorConfig().(ExecutorConfig)
		// Plan the statemeent.

nit: s/statemeent/statement/g.


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

		evalCtx := p.ExtendedEvalContext()
		// We need distribute = true so that executing the plan involves marshaling

I'm confused by this - why do we need the query to be distributable?


pkg/sql/distsql_running_test.go line 284 at r1 (raw file):

		// We need distribute = true so that executing the plan involves marshaling
		// the root txn meta to leaf txns.
		planCtx := execCfg.DistSQLPlanner.NewPlanningCtx(ctx, evalCtx, p, nil,

nit: add inlined comment nil /* txn */, or maybe just pass txn that we have in scope.


pkg/sql/distsql_running_test.go line 285 at r1 (raw file):

		// the root txn meta to leaf txns.
		planCtx := execCfg.DistSQLPlanner.NewPlanningCtx(ctx, evalCtx, p, nil,
			DistributionTypeSystemTenantOnly)

Why not run this in multi-tenant setup?


pkg/sql/distsql_running_test.go line 347 at r1 (raw file):

		if iter == 1 {
			// On the first iteration, abort the txn by setting the abortTxn function.
			abortTxn = func(txnID uuid.UUID) {

Do we need to synchronize access to abortTxn? IIUC any query (including internal ones) can read this function from the testing knob. Even if we're unlikely to run checks or cascades internally, it still seems like a good idea to add synchronization.


pkg/sql/distsql_running_test.go line 370 at r1 (raw file):

		err := createPlannerAndRunQuery(ctx, txn, query)
		if iter == 1 {
			// clear the abortTxn function before returning.

nit: it might be cleaner to defer this right after we set abortTxn to a non-nil function above.

@nvanbenschoten nvanbenschoten left a comment (Member)

Agreed, nice debugging and nice test!

@yuzefovich I think we should consider this as a backport target. Even if we didn't backport parallelization, weren't we still subject to bugs like this when using the leaf txn API? Or was it never possible to call either GetLeafTxnInputStateOrRejectClient or UpdateStateOnRemoteRetryableErr concurrently with another call to GetLeafTxnInputState{OrRejectClient} because of the separation between DistSQL flow setup and execution?

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/txn.go line 996 at r1 (raw file):

	log.VEventf(ctx, 2, "retrying transaction: %s because of a retryable error: %s",
		txn.debugNameLocked(), retryErr)
	txn.handleRetryableErrLocked(ctx, retryErr)

Now that we only have one caller of handleRetryableErrLocked, can we inline it here?


pkg/kv/txn.go line 1256 at r1 (raw file):

responsible for handling the error before another attempt

To me, this seems to imply that bad things will happen if the caller doesn't do this. That's not quite true though. The caller simply won't be able to use the poisoned transaction until it has handled the error. Consider mentioning this. Something like "the caller is responsible for handling the error by calling PrepareForRetry. Use of the transaction before this will be rejected".
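For context, a toy sketch of the caller-side contract being described; poisonedTxn, run, and prepareForRetry are illustrative stand-ins for this example, not the real kv.Txn API:

	// Toy model: a txn is "poisoned" by a retryable error and rejects use
	// until the caller handles the error, mirroring PrepareForRetry.
	package main

	import (
		"errors"
		"fmt"
	)

	var errRetryable = errors.New("retryable error")

	type poisonedTxn struct {
		poisoned bool
		attempt  int
	}

	// run fails on the first attempt and rejects any use while poisoned.
	func (t *poisonedTxn) run() error {
		if t.poisoned {
			return errors.New("rejected: handle the retry error first")
		}
		t.attempt++
		if t.attempt == 1 {
			t.poisoned = true
			return errRetryable
		}
		return nil
	}

	// prepareForRetry clears the poisoned state so the next attempt can proceed.
	func (t *poisonedTxn) prepareForRetry() { t.poisoned = false }

	func main() {
		txn := &poisonedTxn{}
		for {
			err := txn.run()
			if err == nil {
				fmt.Println("succeeded on attempt", txn.attempt)
				return
			}
			if errors.Is(err, errRetryable) {
				// The caller handles the error before another attempt.
				txn.prepareForRetry()
				continue
			}
			fmt.Println("unexpected:", err)
			return
		}
	}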


pkg/kv/txn.go line 1337 at r1 (raw file):

	pErr = txn.mu.sender.UpdateStateOnRemoteRetryableErr(ctx, pErr)

nit: stray line


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):

}

// TestDistSQLRunningParallelFKChecksAfterAbort simulates a SQL transaction

This is a great integration test! Do you think we would also benefit from some unit tests around kv.Txn? For example, should we add a test that calls (*kv.Txn).UpdateStateOnRemoteRetryableErr with different forms of errors and ensures that the txn behaves correctly?

@arulajmani arulajmani left a comment (Collaborator, Author)

Thanks for the quick reviews. Addressed all comments, RFAL.

Dismissed @yuzefovich from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @yuzefovich)


pkg/kv/txn.go line 996 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Now that we only have one caller of handleRetryableErrLocked, can we inline it here?

Good call, done.


pkg/kv/txn.go line 1256 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

responsible for handling the error before another attempt

To me, this seems to imply that bad things will happen if the caller doesn't do this. That's not quite true though. The caller simply won't be able to use the poisoned transaction until it has handled the error. Consider mentioning this. Something like "the caller is responsible for handling the error by calling PrepareForRetry. Use of the transaction before this will be rejected".

Makes sense, done.


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a great integration test! Do you think we would also benefit from some unit tests around kv.Txn? For example, should we add a test that calls (*kv.Txn).UpdateStateOnRemoteRetryableErr with different forms of errors and ensures that the txn behaves correctly?

Good call, I added one. You're more familiar with this code than I am, so if there are other areas that deserve better testing, let me know -- happy to add them.

Separately, while working on this test, I realized we aren't correctly accounting for IntentMissingError in PrepareTransactionForRetry. For now that test case is commented out, but I plan to come back and fix this. I'm not sure why we haven't seen this in the wild, though.


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm confused by this - why do we need the query to be distributable?

I saw another test say this above, but I think that comment is rotten now. Seems like it doesn't really matter what we use here, so I'll change this to DistributionTypeNone. Let me know if you prefer something else.


pkg/sql/distsql_running_test.go line 285 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why not run this in multi-tenant setup?

Superseded by the comment thread above.


pkg/sql/distsql_running_test.go line 347 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do we need to synchronize access to abortTxn? IIUC any query (including internal ones) can read this function from the testing knob. It might be the case that we're unlikely to run checks or cascades internally, it still seems like a good idea to add synchronization.

You're right, added a mutex around it. Should've realized we need something when I changed the knob to take in a transaction ID. Thanks!
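As an aside, a minimal sketch of this kind of synchronization; the knobs type and names here are illustrative, not the actual testing-knob struct:

	// Guarding a callback that concurrent queries may read through a testing knob.
	package main

	import (
		"fmt"
		"sync"
	)

	type knobs struct {
		mu       sync.Mutex
		abortTxn func(txnID string)
	}

	// setAbortTxn installs (or clears) the callback under the mutex.
	func (k *knobs) setAbortTxn(f func(string)) {
		k.mu.Lock()
		defer k.mu.Unlock()
		k.abortTxn = f
	}

	// maybeAbort is what a query would invoke through the knob; the read of the
	// callback happens under the same mutex as the write above.
	func (k *knobs) maybeAbort(txnID string) {
		k.mu.Lock()
		f := k.abortTxn
		k.mu.Unlock()
		if f != nil {
			f(txnID)
		}
	}

	func main() {
		k := &knobs{}
		k.setAbortTxn(func(id string) { fmt.Println("aborting", id) })
		var wg sync.WaitGroup
		for i := 0; i < 3; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				k.maybeAbort("txn-1")
			}()
		}
		wg.Wait()
	}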


pkg/sql/distsql_running_test.go line 370 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it might be cleaner to defer this right after we set abortTxn to a non-nil function above.

Done.

@arulajmani arulajmani force-pushed the investigate-97141 branch 2 times, most recently from 36ae3d6 to dabc4d8, on March 16, 2023 03:06
@yuzefovich yuzefovich left a comment (Member)
LGTM, but I'll defer to Nathan for approval.


Or was it never possible to call either GetLeafTxnInputStateOrRejectClient or UpdateStateOnRemoteRetryableErr concurrently with another call to GetLeafTxnInputState{OrRejectClient} because of the separation between DistSQL flow setup and execution?

Yes, that was the case.

In 22.2 and before, we only called UpdateStateOnRemoteRetryableErr in DistSQLReceiver.pushMeta, which means that the retryable error must have occurred while the distributed query was running, after we had successfully performed the setup of the flow on the gateway.

In 22.2 and before, we always waited for the SetupFlow RPCs to come back. In 23.1 (#89649) we made it so that we start running the gateway flow before the RPCs come back, so we now call UpdateStateOnRemoteRetryableErr in DistSQLReceiver.setErrorWithoutStatusUpdate, which is called in two places, one of which handles the RPC response. However, regardless of whether that setup RPC response error can be retryable, we have already performed the setup of the gateway flow successfully, so it wasn't rejected.

Reviewed 1 of 3 files at r2, 1 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/kv/txn_external_test.go line 707 at r4 (raw file):

}

// TestUpdateStateOnRemoteRetryErr ensures transaction state is updated and a

nit: s/TestUpdateStateOnRemoteRetryErr/TestUpdateStateOnRemoteRetryableErr/g.


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I saw another test say this above, but I think that comment is rotten now. Seems like it doesn't really matter what we use here, so I'll change this to DistributionTypeNone. Let me know if you prefer something else.

Yeah, SGTM. The comment is now stale, consider just dropping it altogether.


pkg/sql/distsql_running_test.go line 295 at r4 (raw file):

		evalCtxFactory := func(bool) *extendedEvalContext {
			factoryEvalCtx := extendedEvalContext{Tracing: evalCtx.Tracing}
			//factoryEvalCtx.Context = evalCtx.Context

nit: seems like a leftover?

@arulajmani arulajmani left a comment (Collaborator, Author)

Dismissed @yuzefovich from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @yuzefovich)


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yeah, SGTM. The comment is now stale, consider just dropping it altogether.

Done.

@arulajmani arulajmani force-pushed the investigate-97141 branch 2 times, most recently from 88a0691 to e4f90d6, on March 16, 2023 04:14
@nvanbenschoten nvanbenschoten left a comment (Member)

:lgtm:

Reviewed 1 of 2 files at r4, 1 of 2 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/txn.go line 1343 at r6 (raw file):

// different from retryErr.Transaction - the latter might be a new transaction.
//
// TODO(arul): Now that we only expect this to happen on the PrepareForRetry

This method has such a specific and non-trivial task that it probably makes sense to keep it separate, if only to give a home for a nice big comment. But I'll defer to you.


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):

I realized we aren't correctly accounting for IntentMissingError in PrepareTransactionForRetry. For now that test case is commented out, but I plan to come back and fix this. I'm not sure why we haven't seen this in the wild, though.

I think this is because IntentMissingError will be translated to a TransactionRetryError(RETRY_ASYNC_WRITE_FAILURE) in the txnPipeliner, so it will never reach this point. That's true for Root and Leaf txns, which both use a txnPipeliner interceptor.

The relationship between concrete error types that implement transactionRestartError and variants of TransactionRetryError could probably use a fresh pair of eyes after years of not changing. I can't say that I recall ever understanding the rationale behind the split, but I'm sure there are good things to learn from a bit of archeology in this area.

@arulajmani arulajmani left a comment (Collaborator, Author)

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @yuzefovich)


pkg/kv/txn.go line 1343 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This method has such a specific and non-trivial task that it probably makes sense to keep it separate, if only to give a home for a nice big comment. But I'll defer to you.

I want to take a stab at trying to clean this up a bit, but given this is a release blocker, I'll do it in a followup patch. For now, I'll keep this TODO around.

Specifically, this part of this function seems to be a vestige from the past (cockroach/pkg/kv/txn.go, lines 1366 to 1371 at 81a9f1d):

	if !retryErr.PrevTxnAborted() {
		// We don't need a new transaction as a result of this error, but we may
		// have a retryable error that should be cleared.
		txn.mu.sender.ClearTxnRetryableErr(ctx)
		return
	}

It also feels out of place given what this function is called. If we pull everything above this function out into PrepareForRetry, maybe you'd be more amenable to inlining the actual handling of the TransactionAbortedError? If not, we can keep this function around and add some scary commentary.


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):
Yeah, I came up with this list by looking at the error types that implement transactionRestartError. I'll remove the commented-out test case I had for IntentMissingError. I've instead added a case (that fatals, but with a specific message) and a comment in PrepareTransactionForRetry.

I can't say that I recall ever understanding the rationale behind the split, but I'm sure there are good things to learn from a bit of archeology in this area.

+1. Good thing tomorrow is a Friday and we're merging the last release blocker ;)

craig bot commented Mar 16, 2023

Build succeeded:

@craig craig bot merged commit 898a32a into cockroachdb:master Mar 16, 2023
blathers-crl bot commented Mar 16, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0461b00 to blathers/backport-release-22.1-98713: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from 0461b00 to blathers/backport-release-22.2-98713: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich (Member)

Should we backport this to 22.2? I don't think we've seen fallout from this.

@arulajmani (Collaborator, Author)

@yuzefovich the test in this PR relies on #96123. It's also why the backport isn't clean. IIRC, you were planning on backporting that PR, right?

Would it make sense to create a single backport PR with commits from that patch and this one?

@yuzefovich (Member)

Sounds good to me, included this commit into #100520 (currently a draft, will ask for your review once I get a green CI).

Successfully merging this pull request may close these issues: roachtest: tpcc/headroom failed during release qualification.