
sql,kv: bubble up retry errors when creating leaf transactions #98713

Merged: 1 commit merged into cockroachdb:master from the investigate-97141 branch on Mar 16, 2023

Conversation

arulajmani (Collaborator)

Previously, if we detected that the transaction was aborted when trying to construct leaf transaction state, we would handle the retry error instead of bubbling it up to the caller. When a transaction is aborted, the TransactionRetryWithProtoRefreshError carries with it a new transaction that should be used for subsequent attempts. Handling the retry error entailed swapping out the old TxnCoordSender with a new one -- one that is associated with this new transaction.

This is bug-prone when trying to create multiple leaf transactions in parallel if the root has been aborted. We would expect the first leaf transaction to handle the error and all subsequent leaf transactions to point to the new transaction, as the TxnCoordSender has been swapped out. This wasn't an issue before, as we never really created multiple leaf transactions in parallel. This recently changed in 0f4b431, which started parallelizing FK and uniqueness checks. With that change, we could see FK or uniqueness violations when in fact the transaction needed to be retried.

This patch fixes the issue described above by not handling the retry error when creating leaf transactions. Instead, we expect the ConnExecutor to retry the entire transaction and prepare it for another iteration.
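For illustration, a minimal, self-contained Go sketch of the behavioral change; the txn, retryErr, and getLeafInputState names below are stand-ins for this example, not the actual kv.Txn / TxnCoordSender API:

	// Illustrative stand-ins only; not the CockroachDB API.
	package main

	import (
		"errors"
		"fmt"
	)

	// retryErr stands in for TransactionRetryWithProtoRefreshError.
	type retryErr struct{ newTxnID string }

	func (e *retryErr) Error() string { return "txn aborted; retry as " + e.newTxnID }

	type txn struct {
		id      string
		aborted bool
	}

	// getLeafInputState sketches the fixed behavior: if the root has been
	// aborted, return the retry error to the caller instead of swapping the
	// underlying sender out for a new transaction here.
	func (t *txn) getLeafInputState() (string, error) {
		if t.aborted {
			return "", &retryErr{newTxnID: t.id + "-retry"}
		}
		return "leaf state for " + t.id, nil
	}

	func main() {
		root := &txn{id: "txn-1", aborted: true}
		// Creating several leaves in parallel is now well-defined: every call
		// observes the same error rather than racing on a sender swap.
		for i := 0; i < 2; i++ {
			if _, err := root.getLeafInputState(); err != nil {
				var re *retryErr
				if errors.As(err, &re) {
					fmt.Println("bubble up to the ConnExecutor:", re)
				}
			}
		}
	}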

Fixes #97141

Epic: none

Release note: None

@arulajmani arulajmani requested review from a team as code owners March 15, 2023 20:34

@yuzefovich yuzefovich left a comment (Member)

Nice job tracking this down and a nice test!

I see the 22.1 backport label, but I don't plan to backport the parallelization of checks to 22.1, so unless you think this change is beneficial on its own to warrant a backport there, we can skip that. Up to you.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/sql/distsql_running_test.go line 248 at r1 (raw file):

	createPlannerAndRunQuery := func(ctx context.Context, txn *kv.Txn, query string) error {
		execCfg := s.ExecutorConfig().(ExecutorConfig)
		// Plan the statemeent.

nit: s/statemeent/statement/g.


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

		evalCtx := p.ExtendedEvalContext()
		// We need distribute = true so that executing the plan involves marshaling

I'm confused by this - why do we need the query to be distributable?


pkg/sql/distsql_running_test.go line 284 at r1 (raw file):

		// We need distribute = true so that executing the plan involves marshaling
		// the root txn meta to leaf txns.
		planCtx := execCfg.DistSQLPlanner.NewPlanningCtx(ctx, evalCtx, p, nil,

nit: add inlined comment nil /* txn */, or maybe just pass txn that we have in scope.


pkg/sql/distsql_running_test.go line 285 at r1 (raw file):

		// the root txn meta to leaf txns.
		planCtx := execCfg.DistSQLPlanner.NewPlanningCtx(ctx, evalCtx, p, nil,
			DistributionTypeSystemTenantOnly)

Why not run this in multi-tenant setup?


pkg/sql/distsql_running_test.go line 347 at r1 (raw file):

		if iter == 1 {
			// On the first iteration, abort the txn by setting the abortTxn function.
			abortTxn = func(txnID uuid.UUID) {

Do we need to synchronize access to abortTxn? IIUC any query (including internal ones) can read this function from the testing knob. Even if we're unlikely to run checks or cascades internally, it still seems like a good idea to add synchronization.


pkg/sql/distsql_running_test.go line 370 at r1 (raw file):

		err := createPlannerAndRunQuery(ctx, txn, query)
		if iter == 1 {
			// clear the abortTxn function before returning.

nit: it might be cleaner to defer this right after we set abortTxn to a non-nil function above.

@nvanbenschoten nvanbenschoten left a comment (Member)

Agreed, nice debugging and nice test!

@yuzefovich I think we should consider this as a backport target. Even if we didn't backport parallelization, weren't we still subject to bugs like this when using the leaf txn API? Or was it never possible to call either GetLeafTxnInputStateOrRejectClient or UpdateStateOnRemoteRetryableErr concurrently with another call to GetLeafTxnInputState{OrRejectClient} because of the separation between DistSQL flow setup and execution?

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/txn.go line 996 at r1 (raw file):

	log.VEventf(ctx, 2, "retrying transaction: %s because of a retryable error: %s",
		txn.debugNameLocked(), retryErr)
	txn.handleRetryableErrLocked(ctx, retryErr)

Now that we only have one caller of handleRetryableErrLocked, can we inline it here?


pkg/kv/txn.go line 1256 at r1 (raw file):

responsible for handling the error before another attempt

To me, this seems to imply that bad things will happen if the caller doesn't do this. That's not quite true though. The caller simply won't be able to use the poisoned transaction until it has handled the error. Consider mentioning this. Something like "the caller is responsible for handling the error by calling PrepareForRetry. Use of the transaction before this will be rejected".
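For context, a toy sketch of the caller-side contract being described; poisonedTxn, run, and prepareForRetry are illustrative stand-ins for this example, not the real kv.Txn API:

	// Toy model: a txn is "poisoned" by a retryable error and rejects use
	// until the caller handles the error, mirroring PrepareForRetry.
	package main

	import (
		"errors"
		"fmt"
	)

	var errRetryable = errors.New("retryable error")

	type poisonedTxn struct {
		poisoned bool
		attempt  int
	}

	// run fails on the first attempt and rejects any use while poisoned.
	func (t *poisonedTxn) run() error {
		if t.poisoned {
			return errors.New("rejected: handle the retry error first")
		}
		t.attempt++
		if t.attempt == 1 {
			t.poisoned = true
			return errRetryable
		}
		return nil
	}

	// prepareForRetry clears the poisoned state so the next attempt can proceed.
	func (t *poisonedTxn) prepareForRetry() { t.poisoned = false }

	func main() {
		txn := &poisonedTxn{}
		for {
			err := txn.run()
			if err == nil {
				fmt.Println("succeeded on attempt", txn.attempt)
				return
			}
			if errors.Is(err, errRetryable) {
				// The caller handles the error before another attempt.
				txn.prepareForRetry()
				continue
			}
			fmt.Println("unexpected:", err)
			return
		}
	}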


pkg/kv/txn.go line 1337 at r1 (raw file):

	pErr = txn.mu.sender.UpdateStateOnRemoteRetryableErr(ctx, pErr)

nit: stray line


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):

}

// TestDistSQLRunningParallelFKChecksAfterAbort simulates a SQL transaction

This is a great integration test! Do you think we would also benefit from some unit tests around kv.Txn? For example, should we add a test that calls (*kv.Txn).UpdateStateOnRemoteRetryableErr with different forms of errors and ensures that the txn behaves correctly?

@arulajmani arulajmani left a comment (Collaborator, Author)

Thanks for the quick reviews. Addressed all comments, RFAL.

Dismissed @yuzefovich from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @yuzefovich)


pkg/kv/txn.go line 996 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Now that we only have one caller of handleRetryableErrLocked, can we inline it here?

Good call, done.


pkg/kv/txn.go line 1256 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

responsible for handling the error before another attempt

To me, this seems to imply that bad things will happen if the caller doesn't do this. That's not quite true though. The caller simply won't be able to use the poisoned transaction until it has handled the error. Consider mentioning this. Something like "the caller is responsible for handling the error by calling PrepareForRetry. Use of the transaction before this will be rejected".

Makes sense, done.


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a great integration test! Do you think we would also benefit from some unit tests around kv.Txn? For example, should we add a test that calls (*kv.Txn).UpdateStateOnRemoteRetryableErr with different forms of errors and ensures that the txn behaves correctly?

Good call, I added one. You're more familiar with this code than I am, so if there are other areas that deserve better testing, let me know -- happy to add them.

Separately, while working on this test, I realized we aren't correctly accounting for IntentMissingError in PrepareTransactionForRetry. For now that test case is commented out, but I plan to come back and fix this. I'm not sure why we haven't seen this in the wild, though.


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm confused by this - why do we need the query to be distributable?

I saw another test say this above, but I think that comment is rotten now. Seems like it doesn't really matter what we use here, so I'll change this to DistributionTypeNone. Let me know if you prefer something else.


pkg/sql/distsql_running_test.go line 285 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why not run this in multi-tenant setup?

Superseded by the comment thread above.


pkg/sql/distsql_running_test.go line 347 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do we need to synchronize access to abortTxn? IIUC any query (including internal ones) can read this function from the testing knob. It might be the case that we're unlikely to run checks or cascades internally, it still seems like a good idea to add synchronization.

You're right, added a mutex around it. Should've realized we need something when I changed the knob to take in a transaction ID. Thanks!
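As an aside, a minimal sketch of this kind of synchronization; the knobs type and names here are illustrative, not the actual testing-knob struct:

	// Guarding a callback that concurrent queries may read through a testing knob.
	package main

	import (
		"fmt"
		"sync"
	)

	type knobs struct {
		mu       sync.Mutex
		abortTxn func(txnID string)
	}

	// setAbortTxn installs (or clears) the callback under the mutex.
	func (k *knobs) setAbortTxn(f func(string)) {
		k.mu.Lock()
		defer k.mu.Unlock()
		k.abortTxn = f
	}

	// maybeAbort is what a query would invoke through the knob; the read of the
	// callback happens under the same mutex as the write above.
	func (k *knobs) maybeAbort(txnID string) {
		k.mu.Lock()
		f := k.abortTxn
		k.mu.Unlock()
		if f != nil {
			f(txnID)
		}
	}

	func main() {
		k := &knobs{}
		k.setAbortTxn(func(id string) { fmt.Println("aborting", id) })
		var wg sync.WaitGroup
		for i := 0; i < 3; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				k.maybeAbort("txn-1")
			}()
		}
		wg.Wait()
	}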


pkg/sql/distsql_running_test.go line 370 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it might be cleaner to defer this right after we set abortTxn to a non-nil function above.

Done.

@arulajmani arulajmani force-pushed the investigate-97141 branch 2 times, most recently from 36ae3d6 to dabc4d8, on March 16, 2023 03:06
@yuzefovich yuzefovich left a comment (Member)
LGTM, but I'll defer to Nathan for approval.


Or was it never possible to call either GetLeafTxnInputStateOrRejectClient or UpdateStateOnRemoteRetryableErr concurrently with another call to GetLeafTxnInputState{OrRejectClient} because of the separation between DistSQL flow setup and execution?

Yes, that was the case.

In 22.2 and before, we only called UpdateStateOnRemoteRetryableErr in DistSQLReceiver.pushMeta, which means that the retryable error must have occurred while the distributed query was running, after we had successfully performed the setup of the flow on the gateway.

In 22.2 and before, we always waited for the SetupFlow RPCs to come back. In 23.1 (#89649) we made it so that we start running the gateway flow before the RPCs come back, so we now call UpdateStateOnRemoteRetryableErr in DistSQLReceiver.setErrorWithoutStatusUpdate, which is called in two places, one of which handles the RPC response. However, regardless of whether that setup RPC response error can be retryable, we have already performed the setup of the gateway flow successfully, so it wasn't rejected.

Reviewed 1 of 3 files at r2, 1 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/kv/txn_external_test.go line 707 at r4 (raw file):

}

// TestUpdateStateOnRemoteRetryErr ensures transaction state is updated and a

nit: s/TestUpdateStateOnRemoteRetryErr/TestUpdateStateOnRemoteRetryableErr/g.


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I saw another test say this above, but I think that comment is rotten now. Seems like it doesn't really matter what we use here, so I'll change this to DistributionTypeNone. Let me know if you prefer something else.

Yeah, SGTM. The comment is now stale, consider just dropping it altogether.


pkg/sql/distsql_running_test.go line 295 at r4 (raw file):

		evalCtxFactory := func(bool) *extendedEvalContext {
			factoryEvalCtx := extendedEvalContext{Tracing: evalCtx.Tracing}
			//factoryEvalCtx.Context = evalCtx.Context

nit: seems like a leftover?

@arulajmani arulajmani left a comment (Collaborator, Author)

Dismissed @yuzefovich from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @yuzefovich)


pkg/sql/distsql_running_test.go line 282 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yeah, SGTM. The comment is now stale, consider just dropping it altogether.

Done.

@arulajmani arulajmani force-pushed the investigate-97141 branch 2 times, most recently from 88a0691 to e4f90d6, on March 16, 2023 04:14
@nvanbenschoten nvanbenschoten left a comment (Member)

:lgtm:

Reviewed 1 of 2 files at r4, 1 of 2 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/txn.go line 1343 at r6 (raw file):

// different from retryErr.Transaction - the latter might be a new transaction.
//
// TODO(arul): Now that we only expect this to happen on the PrepareForRetry

This method has such a specific and non-trivial task that it probably makes sense to keep it separate, if only to give a home for a nice big comment. But I'll defer to you.


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):

I realized we aren't correctly accounting for IntentMissingError in PrepareTransactionForRetry. For now that test case is commented out, but I plan to come back and fix this. I'm not sure why we haven't seen this in the wild, though.

I think this is because IntentMissingError will be translated to a TransactionRetryError(RETRY_ASYNC_WRITE_FAILURE) in the txnPipeliner, so it will never reach this point. That's true for Root and Leaf txns, which both use a txnPipeliner interceptor.

The relationship between concrete error types that implement transactionRestartError and variants of TransactionRetryError could probably use a fresh pair of eyes after years of not changing. I can't say that I recall ever understanding the rationale behind the split, but I'm sure there are good things to learn from a bit of archeology in this area.

@arulajmani arulajmani left a comment (Collaborator, Author)

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @yuzefovich)


pkg/kv/txn.go line 1343 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This method has such a specific and non-trivial task that it probably makes sense to keep it separate, if only to give a home for a nice big comment. But I'll defer to you.

I want to take a stab at trying to clean this up a bit, but given this is a release blocker, I'll do it in a followup patch. For now, I'll keep this TODO around.

Specifically, this part of this function seems to be a vestige from the past (cockroach/pkg/kv/txn.go, lines 1366 to 1371 at 81a9f1d):

	if !retryErr.PrevTxnAborted() {
		// We don't need a new transaction as a result of this error, but we may
		// have a retryable error that should be cleared.
		txn.mu.sender.ClearTxnRetryableErr(ctx)
		return
	}

It also feels out of place given what this function is called. If we pull everything above this function out into PrepareForRetry, maybe you'd be more amenable to inlining the actual handling of the TransactionAbortedError? If not, we can keep this function around and add some scary commentary.


pkg/sql/distsql_running_test.go line 200 at r1 (raw file):
Yeah, I came up with this list by looking at the error types that implement transactionRestartError. I'll remove the commented-out test case I had for IntentMissingError. I've instead added a case (that fatals, but with a specific message) and a comment in PrepareTransactionForRetry.

I can't say that I recall ever understanding the rationale behind the split, but I'm sure there are good things to learn from a bit of archeology in this area.

+1. Good thing tomorrow is a Friday and we're merging the last release blocker ;)

craig bot commented Mar 16, 2023

Build succeeded:

@craig craig bot merged commit 898a32a into cockroachdb:master Mar 16, 2023
blathers-crl bot commented Mar 16, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0461b00 to blathers/backport-release-22.1-98713: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from 0461b00 to blathers/backport-release-22.2-98713: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich (Member)

Should we backport this to 22.2? I don't think we've seen fallout from this.

@arulajmani (Collaborator, Author)

@yuzefovich the test in this PR relies on #96123. It's also why the backport isn't clean. IIRC, you were planning on backporting that PR, right?

Would it make sense to create a single backport PR with commits from that patch and this one?

@yuzefovich (Member)

Sounds good to me, included this commit into #100520 (currently a draft, will ask for your review once I get a green CI).

Successfully merging this pull request may close these issues: roachtest: tpcc/headroom failed during release qualification.