Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: automatically retry the first batch after a BEGIN #16719

Merged
merged 2 commits into from
Jul 5, 2017

Conversation

andreimatei
Copy link
Contributor

Before this patch, in case of retryable errors, the server (i.e. the
Executor) would automatically retry a batch of statements if the batch
was the prefix of a transaction (or if the batch contained the whole
txn). For example, if the following SELECT would get an error, it'd be
retried if all of the following statements arrived to the server in one
batch: BEGIN; ...; SELECT foo; [... COMMIT]. The rationale in retrying
these prefixes, but not otherwise, was that, in the case of a prefix
batch, we know that the client had no conditional logic based on reads
performed in the current txn.

This patch extends this reasoning to statements executed in the first
batch arriving after the batch with the BEGIN if the BEGIN had been
trailing a previous batch (more realistically, if the BEGIN is sent
alone as a batch). As a further optimization, the SAVEPOINT statement
doesn't change the retryable character of the next range). So, if you do
something like (different lines are different batches):
BEGIN
SELECT foo;

or

BEGIN; SAVEPOINT cockroach_restart;
SELECT foo

or

BEGIN
SAVEPOINT cockroach_restart
[...;] SELECT FOO

the SELECTs will be retried automatically.

Besides being generally a good idea to hide retryable errors more, this
change was motivated by ORMs getting retryable errors from a BEGIN;
CREATE TABLE ...; COMMIT; sequence (with the BEGIN being a separate
batch). This ORM code is not under our control and we can't teach it
about user-directed retries.

This is implemented by creating a new txnState.State - FirstBatch.
Auto-retry is enabled for batches executed in this state.

Fixes #16450
Fixes #16200
See also forum discussion about it:
https://forum.cockroachlabs.com/t/automatically-retrying-the-first-batch-of-statements-after-a-begin/759

cc @tristan-ohlson

@andreimatei andreimatei requested a review from knz June 26, 2017 19:36
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor

knz commented Jun 26, 2017

Looks good!

Also check the CLI shell and make sure that the FirstBatch status is reported as OPEN in the prompt.


Reviewed 11 of 11 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/session.go, line 744 at r1 (raw file):

	// batches sent by the client, but results being sent by the server to the
	// client (i.e. the client can send 100 batches but, if we haven't sent it any
	// results yet, we know that we can still retry them all).

Extend your comment here to explain that this is actually not as trivial as implementing the conditional, because we first need to research+determine whether the mere ACK of an operation (and/or e.g. the timing thereof) can cause clients to use different code paths in practice.


Comments from Reviewable

@bdarnell
Copy link
Contributor

LGTM


Reviewed 11 of 11 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/internal/client/txn.go, line 768 at r1 (raw file):

		return
	}

Spurious change to this file.


pkg/sql/executor.go, line 2042 at r1 (raw file):

// SAVEPOINT has this property.
//
// TODO(andrei): what statements, besides SAVEPOINT, can we accept?

SET TRANSACTION ISOLATION, SET TRANSACTION PRIORITY, and SET TIME ZONE seem like the most useful ones to consider here.


Comments from Reviewable

@knz
Copy link
Contributor

knz commented Jun 28, 2017

If you go down the road of allowing SET, I think that any SET statement is fair game.

@knz
Copy link
Contributor

knz commented Jun 29, 2017

Ok what about merging this now, with a TODO to suggest perhaps the various forms of SET are also fair game.

(Actually not only SET: PREPARE, DISCARD, RESET, SHOW (not TRACE), EXPLAIN are probably all fine.)

@andreimatei andreimatei force-pushed the retry-batch-after-begin branch from 5090a5f to 9043105 Compare June 29, 2017 17:35
@andreimatei
Copy link
Contributor Author

added SET and a TODO for all the others.
But since we're talking - should we ever retry PREPARE automatically at all? (regardless of the change in this PR)
Won't it fail the second time? Does it have any transactional semantics?


Review status: 6 of 10 files reviewed at latest revision, 3 unresolved discussions.


pkg/sql/executor.go, line 2042 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

SET TRANSACTION ISOLATION, SET TRANSACTION PRIORITY, and SET TIME ZONE seem like the most useful ones to consider here.

added SET


pkg/sql/session.go, line 744 at r1 (raw file):

Previously, knz (kena) wrote…

Extend your comment here to explain that this is actually not as trivial as implementing the conditional, because we first need to research+determine whether the mere ACK of an operation (and/or e.g. the timing thereof) can cause clients to use different code paths in practice.

Mmmm I don't know... I wouldn't add that. For one, I doubt that we can actually do such research (or at least I don't know how).
But more importantly because I don't quite see a point about the ack or timing. The ACKs , to the extent that they mean anything, maintain their meaning even if the server does some retries; the client hasn't observed the end of the previous statement, so I don't see what the retries change.
But perhaps you want to argue something I'm not seeing.


pkg/internal/client/txn.go, line 768 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Spurious change to this file.

Done.


Comments from Reviewable

@andreimatei andreimatei force-pushed the retry-batch-after-begin branch 3 times, most recently from ebe1edd to bf84edf Compare June 30, 2017 20:25
@andreimatei
Copy link
Contributor Author

I've removed SET (other that SET TRANSACTION ...) from the list of statements that allow the session to stay in FirstBatch because of concerns of how their non-transactional character interferes with retries. We already have a problem with them in the batches that are automatically retried, regardless of this patch. I'll start a thread about that.

I've added a separate commit adding RETURNING NOTHING to that list of statements.

I've added a first commit that makes the txn state an atomic. This is required because the transition from FirstBatch->Open can happen concurrently with the execution of statements in the parallelizeQueue. This is a problem regardless of the fact that these statements don't force this transition because they can be part of a batch that does force the transition because of other statements.
An option I've discussed with @knz is blocking at the end of the batch, before making said transition, for the paralellizeQueue to finish. But that would defeat the purpose of the RETURNING NOTHING statements.


Review status: 0 of 10 files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@andreimatei andreimatei force-pushed the retry-batch-after-begin branch from bf84edf to 62ade32 Compare July 1, 2017 19:46
@andreimatei
Copy link
Contributor Author

I've made some changes to the 2nd commit to support transitioning from RestartWait -> FirstBatch. The intention of the code was already to do this transition, but it wasn't working because, after RestartWait -> FirstBatch, we'd immediately transition to Open. This was because ROLLBACK TO SAVEPOINT was not considered a statement that can leave the txn in FirstBatch. Now it is, but other changes were also necessary: instead of looking at all the statements in a batch and seeing if any of them need to do the FB->Open transition, now we conditionally do the transition after each stmt has been executed. This is because statements in a batch before ROLLBACK TO SAVEPOINT should not disqualify the batch from staying in FB (e.g. SELECT; ROLLBACK TO SAVEPOINT should leave the txn in FB). And also probably doing the transition after every statement, in a place where we were already doing state transitions, was the right thing to do in the first place, instead of creating a new place where transitions happen.
PTAL.


Review status: 0 of 10 files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@andreimatei andreimatei force-pushed the retry-batch-after-begin branch from 62ade32 to 5dd7451 Compare July 3, 2017 18:41
@bdarnell
Copy link
Contributor

bdarnell commented Jul 3, 2017

Review status: 0 of 10 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/sql/executor.go, line 2019 at r4 (raw file):

		isSetTransaction(stmt) ||
		// Parallelized statements don't return results to clients, by definition.
		IsStmtParallelized(stmt) ||

Let's leave this out until we're able to test it.


pkg/sql/logictest/testdata/logic_test/txn, line 590 at r4 (raw file):


statement ok
SELECT CRDB_INTERNAL.FORCE_RETRY('100ms':::INTERVAL)

We need to test that after the retry the transaction has the priority that we set at the start of the transaction, and the retry process didn't forget about the first batch of the transaction.


pkg/sql/logictest/testdata/logic_test/txn, line 633 at r4 (raw file):

#
# statement ok
# SELECT 1; SELECT CRDB_INTERNAL.FORCE_RETRY('100ms':::INTERVAL)

What exactly gets retried here? Just this batch of SELECTs, or the entire list of statements since BEGIN? We need to retry the INSERT RETURNING NOTHING too, and the test needs to verify this (commit the transaction and make sure that the change was applied).


Comments from Reviewable

@knz
Copy link
Contributor

knz commented Jul 5, 2017

Ok for the first two commits but please remove the last commit before this can merge.


Reviewed 11 of 11 files at r2, 10 of 10 files at r3.
Review status: 8 of 10 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@knz
Copy link
Contributor

knz commented Jul 5, 2017

Also please rebase so I can have a last look with the rebased changes merged in.


Review status: 8 of 10 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@andreimatei andreimatei force-pushed the retry-batch-after-begin branch from 5dd7451 to 8c72366 Compare July 5, 2017 16:32
@andreimatei
Copy link
Contributor Author

Rebased.

@bdarnell you were right that the last commit - about the RETURNING NOTHING statements - was no good, because these statements needed to be replayed after a restart and this was not done. I've taken it out for now, and I'll work on adding replays.


Review status: 0 of 10 files reviewed at latest revision, 5 unresolved discussions.


pkg/sql/executor.go, line 2019 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Let's leave this out until we're able to test it.

Done.


pkg/sql/logictest/testdata/logic_test/txn, line 590 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We need to test that after the retry the transaction has the priority that we set at the start of the transaction, and the retry process didn't forget about the first batch of the transaction.

added test that the priority and isolation level stay the same.
This is not generally about "not forgetting" the first batch - we forget it; there's no replay for it. But now, for that "first batch" we only support statements that change the transaction status in such a way they'll be persisted after a restart without needed a replay.
I've removed the RETURNING NOTHING statements and added a comment about this to canStayInFirstBatchState().


pkg/sql/logictest/testdata/logic_test/txn, line 633 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What exactly gets retried here? Just this batch of SELECTs, or the entire list of statements since BEGIN? We need to retry the INSERT RETURNING NOTHING too, and the test needs to verify this (commit the transaction and make sure that the change was applied).

Ack.


Comments from Reviewable

@knz
Copy link
Contributor

knz commented Jul 5, 2017

:lgtm:


Reviewed 2 of 10 files at r5, 10 of 10 files at r6.
Review status: all files reviewed at latest revision, 5 unresolved discussions.


Comments from Reviewable

This is in anticipation of the next commit in which we add a state
transition that can happen concurrently with statement execution in the
parallizeQueue (which execution accesses the txn state). That state
transition is inconsequential for statement execution, so this atomic
field serves only to appease the race detector.
@bdarnell
Copy link
Contributor

bdarnell commented Jul 5, 2017

:lgtm:


Reviewed 10 of 10 files at r6.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

Before this patch, in case of retryable errors, the server (i.e. the
Executor) would automatically retry a batch of statements if the batch
was the prefix of a transaction (or if the batch contained the whole
txn). For example, if the following SELECT would get an error, it'd be
retried if all of the following statements arrived to the server in one
batch: BEGIN; ...; SELECT foo; [... COMMIT]. The rationale in retrying
these prefixes, but not otherwise, was that, in the case of a prefix
batch, we know that the client had no conditional logic based on reads
performed in the current txn.

This patch extends this reasoning to statements executed in the first
batch arriving after the batch with the BEGIN if the BEGIN had been
trailing a previous batch (more realistically, if the BEGIN is sent
alone as a batch). As a further optimization, the SAVEPOINT statement
doesn't change the retryable character of the next range). So, if you do
something like (different lines are different batches):
BEGIN
SELECT foo;

or

BEGIN; SAVEPOINT cockroach_restart;
SELECT foo

or

BEGIN
SAVEPOINT cockroach_restart
[...;] SELECT FOO

the SELECTs will be retried automatically.

Besides being generally a good idea to hide retryable errors more, this
change was motivated by ORMs getting retryable errors from a BEGIN;
CREATE TABLE ...; COMMIT; sequence (with the BEGIN being a separate
batch). This ORM code is not under our control and we can't teach it
about user-directed retries.

This is implemented by creating a new txnState.State - FirstBatch.
Auto-retry is enabled for batches executed in this state.

Fixes cockroachdb#16450
Fixes cockroachdb#16200
See also forum discussion about it:
https://forum.cockroachlabs.com/t/automatically-retrying-the-first-batch-of-statements-after-a-begin/759
@andreimatei andreimatei force-pushed the retry-batch-after-begin branch from 8c72366 to 3fc8e0b Compare July 5, 2017 16:48
@andreimatei andreimatei merged commit 5222ddc into cockroachdb:master Jul 5, 2017
@andreimatei andreimatei deleted the retry-batch-after-begin branch July 5, 2017 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment