stability: nil pointer panic in client.Txn.CleanupOnError #7881

Closed
bdarnell opened this issue Jul 18, 2016 · 21 comments


bdarnell commented Jul 18, 2016

On the register cluster running beta-20160629, two nodes have failed (ten minutes apart) with the following panic:

E160717 06:28:28.001279 internal/client/txn.go:364  failure aborting transaction: does not exist; abort caused by: does not exist
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x140 pc=0xb685a0]

goroutine 3094076 [running]:
panic(0x18ba060, 0xc82000e100)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/internal/client.(*Txn).sendEndTxnReq(0x0, 0xc830067400, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:440 +0x50
github.com/cockroachdb/cockroach/internal/client.(*Txn).Rollback(0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:433 +0x3b
github.com/cockroachdb/cockroach/internal/client.(*Txn).CleanupOnError(0x0, 0x7f09906bda70, 0xc835e66ba0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:363 +0x92
github.com/cockroachdb/cockroach/sql.(*Executor).execRequest(0xc8201a3860, 0x7f099136cac8, 0xc823552780, 0xc82351d000, 0xc820a519cb, 0x27, 0x0, 0x0, 0x0, 0xc820a51900)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:502 +0xd07
github.com/cockroachdb/cockroach/sql.(*Executor).ExecuteStatements(0xc8201a3860, 0x7f099136cac8, 0xc823552780, 0xc82351d000, 0xc820a519cb, 0x27, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:361 +0xf6
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).executeStatements(0xc826b4c000, 0x7f099136cac8, 0xc823552780, 0xc820a519cb, 0x27, 0x0, 0x0, 0x0, 0x0, 0x7f09913b7201, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:640 +0x98
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).handleSimpleQuery(0xc826b4c000, 0x7f099136cac8, 0xc823552780, 0xc826b4c028, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:320 +0xe8
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).serve(0xc826b4c000, 0xc820d190e0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:275 +0x100c
github.com/cockroachdb/cockroach/sql/pgwire.(*Server).ServeConn(0xc820389a70, 0x7f09913711b0, 0xc8202de000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/server.go:229 +0x98f
github.com/cockroachdb/cockroach/server.(*Server).Start.func8.1(0x7f09913bc9b0, 0xc820584210)
        /go/src/github.com/cockroachdb/cockroach/server/server.go:370 +0x42
github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith.func1(0xc8200d02b0, 0x7f09913bc9b0, 0xc820584210, 0xc82034e010)
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:131 +0x62
created by github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:133 +0x333
@bdarnell bdarnell added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Jul 18, 2016
@petermattis

Looks like we're calling Txn.CleanupOnError on a nil pointer. I'm still trying to trace through how this could happen. At the bottom we must have called commitSQLTransaction in order for txnState.txn to be set to nil. I'm not sure how to reconcile that with the "failure aborting transaction" error, which appears to have been logged from the line right after the one where the panic occurred. Looks like the panic line numbers are messed up; not sure what that's about.
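For context on why the trace bottoms out inside sendEndTxnReq rather than at the CleanupOnError call site: Go happily invokes a method on a nil pointer receiver, and the crash only fires once the method reads a field through that receiver; the 0x140 in the signal line is that field's offset from nil. A minimal stand-alone sketch (stub types and a hypothetical field, not CockroachDB code):

package main

// Stub sketch, not CockroachDB code: methods chain fine on a nil *Txn, and
// the panic only fires at the first field access, so the trace points at
// sendEndTxnReq even though the nil receiver came in via CleanupOnError.

type Txn struct {
	_  [0x140]byte // padding so the field sits at offset 0x140, matching addr=0x140
	db *int        // hypothetical field standing in for whatever sendEndTxnReq reads
}

func (t *Txn) CleanupOnError(err error) { t.Rollback() }      // no dereference: fine on nil
func (t *Txn) Rollback()                { t.sendEndTxnReq() } // still fine on nil
func (t *Txn) sendEndTxnReq()           { _ = *t.db }         // first field read: panics here

func main() {
	var txn *Txn            // nil, like txnState.txn after the txn was cleaned up
	txn.CleanupOnError(nil) // panic: invalid memory address or nil pointer dereference
}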

@petermattis

@andreimatei This is deep in code you've touched.

@petermattis

@dt, @paperstreet can one of you take a closer look at this while @andreimatei is out of the office?

@danhhz danhhz self-assigned this Jul 18, 2016

danhhz commented Jul 18, 2016

Yeah, I can take a look tomorrow morning if no one gets to it before me.


danhhz commented Jul 19, 2016

I'm still familiarizing myself with this code, but maybe something like the following could have caused that panic (a stub-code sketch follows the list):

  • the for loop in execStmtsInCurrentTxn hits an Aborted or RestartWait
  • in execStmtInAbortedTxn the next stmt read is a CommitTransaction or RollbackTransaction so it calls rollbackSQLTransaction (and prints the error Pete mentioned), then sets txnState.txn to nil
  • then in (*Txn).Exec it tries to autocommit but fails with a RetryableTxnError, which gets turned into an AutoCommitError
  • which causes execRequest to CleanupOnError
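To make that hypothesized sequence concrete, here is a self-contained toy sketch (stub types only; none of this is the actual Executor code) that reproduces its shape: the closure rolls back the txn and clears txnState.txn, Exec's autocommit then fails, and the cleanup path calls CleanupOnError on the now-nil pointer.

package main

import (
	"errors"
	"fmt"
)

// Toy types only; the real code lives in sql/executor.go and internal/client/txn.go.

type Txn struct{ finalized bool }

func (t *Txn) Rollback()                { t.finalized = true } // panics if t == nil
func (t *Txn) CleanupOnError(err error) { t.Rollback() }

// Exec mimics the shape of (*Txn).Exec: run the closure, then try to autocommit.
func (t *Txn) Exec(fn func(*Txn) error) error {
	if err := fn(t); err != nil {
		return err
	}
	if t.finalized { // autocommitting an already-finalized txn fails
		return errors.New("AutoCommitError: does not exist")
	}
	return nil
}

type txnState struct{ txn *Txn }

func main() {
	ts := &txnState{txn: &Txn{}}

	err := ts.txn.Exec(func(txn *Txn) error {
		// execStmtInAbortedTxn's hypothesized behavior: roll back the KV txn
		// and clear txnState.txn, without propagating an error to Exec.
		txn.Rollback()
		ts.txn = nil
		return nil
	})
	fmt.Println("Exec returned:", err) // the AutoCommitError

	if err != nil {
		// execRequest's cleanup uses txnState.txn, which is now nil.
		ts.txn.CleanupOnError(err) // nil pointer panic, matching the stacks above
	}
}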

@bdarnell

This happened again on gamma, running beta-20160714.

W160718 20:45:07.289427 sql/lease.go:713  error releasing lease "51(\"blocks\") ver=1:1468509533172667968": does not exist
E160718 20:45:07.289708 internal/client/txn.go:371  failure aborting transaction: does not exist; abort caused by: does not exist
E160718 20:45:07.290614 internal/client/txn.go:371  failure aborting transaction: does not exist; abort caused by: does not exist
W160718 20:45:07.290666 sql/lease.go:713  error releasing lease "51(\"blocks\") ver=1:1468558242089988743": does not exist
E160718 20:45:07.291070 internal/client/txn.go:371  failure aborting transaction: does not exist; abort caused by: does not exist
W160718 20:45:07.291108 sql/lease.go:713  error releasing lease "55(\"comments\") ver=1:1468511843555426737": does not exist
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x140 pc=0xb7b680]

goroutine 7150662383 [running]:
panic(0x18e5680, 0xc820012080)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/internal/client.(*Txn).sendEndTxnReq(0x0, 0xc8c443e400, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:447 +0x50
github.com/cockroachdb/cockroach/internal/client.(*Txn).Rollback(0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:440 +0x3b
github.com/cockroachdb/cockroach/internal/client.(*Txn).CleanupOnError(0x0, 0x7f0340e3ff20, 0xc88d64f9f0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:370 +0x92
github.com/cockroachdb/cockroach/sql.(*Executor).execRequest(0xc8201e5b30, 0x7f08915fb1d0, 0xc8c2ec9840, 0xc8545d6000, 0xc85901c2b0, 0x56, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:505 +0xd07
github.com/cockroachdb/cockroach/sql.(*Executor).ExecuteStatements(0xc8201e5b30, 0x7f08915fb1d0, 0xc8c2ec9840, 0xc8545d6000, 0xc85901c2b0, 0x56, 0xc84c8a7688, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:364 +0xf6
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).executeStatements(0xc89a03c300, 0x7f08915fb1d0, 0xc8c2ec9840, 0xc85901c2b0, 0x56, 0xc84c8a7688, 0x2a32498, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:640 +0x98
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).handleExecute(0xc89a03c300, 0x7f08915fb1d0, 0xc8c2ec9840, 0xc89a03c328, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:628 +0x429
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).serve(0xc89a03c300, 0xc87c16e1c0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:298 +0xf03
github.com/cockroachdb/cockroach/sql/pgwire.(*Server).ServeConn(0xc8203e5980, 0x7f08916902f8, 0xc86986e900, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/server.go:229 +0x98f
github.com/cockroachdb/cockroach/server.(*Server).Start.func8.1(0x7f08915be9b0, 0xc8c0162000)
        /go/src/github.com/cockroachdb/cockroach/server/server.go:370 +0x42
github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith.func1(0xc820158178, 0x7f08915be9b0, 0xc8c0162000, 0xc8200120b0)
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:131 +0x62
created by github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:133 +0x333

@petermattis petermattis added this to the Q3 milestone Jul 21, 2016

danhhz commented Jul 21, 2016

Adding some debug logging for this in #7962

knz pushed a commit to knz/cockroach that referenced this issue Jul 21, 2016
danhhz added a commit that referenced this issue Jul 21, 2016
sql: add more logging to troubleshoot #7881
@mberhault

Additional logging from the just relocated beta cluster:

E160723 17:36:40.065839 sql/executor.go:499  txnState not cleared while txn == nil: &{txn:<nil> State:Aborted retrying:false retryIntent:false autoRetry:true commitSeen:false schemaChangers:{curGroupNum:0 curStatementIdx:0 schemaChangers:[]} tr:0xc82195acc0 sqlTimestamp:{sec:63604889966 nsec:625089321 loc:0x26fb5e0}}, execOpt {AutoRetry:true AutoCommit:true MinInitialTimestamp:1469293166.625087851,0}, stmts INSERT INTO blocks(block_id, writer_id, block_num, raw_bytes) VALUES ($1, $2, $3, $4), remaining
E160723 17:36:40.065945 sql/executor.go:510  AutoCommitError on nil txn: does not exist, txnState &{txn:<nil> State:Aborted retrying:false retryIntent:false autoRetry:true commitSeen:false schemaChangers:{curGroupNum:0 curStatementIdx:0 schemaChangers:[]} tr:0xc82195acc0 sqlTimestamp:{sec:63604889966 nsec:625089321 loc:0x26fb5e0}}, execOpt {AutoRetry:true AutoCommit:true MinInitialTimestamp:1469293166.625087851,0}, stmts INSERT INTO blocks(block_id, writer_id, block_num, raw_bytes) VALUES ($1, $2, $3, $4), remaining
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x140 pc=0xba05e0]

goroutine 9259823 [running]:
panic(0x190b7a0, 0xc82000e090)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/util/stop.(*Stopper).Recover(0xc8203a2d80)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:175 +0x70
panic(0x190b7a0, 0xc82000e090)
        /usr/local/go/src/runtime/panic.go:443 +0x4e9
github.com/cockroachdb/cockroach/internal/client.(*Txn).sendEndTxnReq(0x0, 0x7fad6985d300, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:449 +0x50
github.com/cockroachdb/cockroach/internal/client.(*Txn).Rollback(0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:442 +0x3b
github.com/cockroachdb/cockroach/internal/client.(*Txn).CleanupOnError(0x0, 0x7fad63ec27e0, 0xc8428e0630)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:372 +0x92
github.com/cockroachdb/cockroach/sql.(*Executor).execRequest(0xc820330000, 0x7fad698e6fe8, 0xc8304609c0, 0xc827766800, 0xc838bd2cdf, 0x56, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:514 +0x12d6
github.com/cockroachdb/cockroach/sql.(*Executor).ExecuteStatements(0xc820330000, 0x7fad698e6fe8, 0xc8304609c0, 0xc827766800, 0xc838bd2cdf, 0x56, 0xc8430a5680, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:364 +0xf6
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).executeStatements(0xc82144b980, 0x7fad698e6fe8, 0xc8304609c0, 0xc838bd2cdf, 0x56, 0xc8430a5680, 0x2a74488, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:640 +0x98
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).handleExecute(0xc82144b980, 0x7fad698e6fe8, 0xc8304609c0, 0xc82144b9a8, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:628 +0x429
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).serve(0xc82144b980, 0xc8246ee620, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:298 +0xf03
github.com/cockroachdb/cockroach/sql/pgwire.(*Server).ServeConn(0xc82029e5d0, 0x7fad698e7300, 0xc826daa000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/server.go:229 +0x98f
github.com/cockroachdb/cockroach/server.(*Server).Start.func8.1(0x7fad6985f920, 0xc838ffc000)
        /go/src/github.com/cockroachdb/cockroach/server/server.go:370 +0x42
github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith.func1(0xc8203a2d80, 0xc820178138, 0x7fad6985f920, 0xc838ffc000, 0xc82038e000)
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:132 +0x8d
created by github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:134 +0x346

Can be found on: [email protected]


danhhz commented Jul 25, 2016

Thanks! Looking into this again...

danhhz added a commit to danhhz/cockroach that referenced this issue Jul 25, 2016
The autocommit is run on the receiver of (*Txn).Exec. At the beginning of the
call, this is the same as txnState.txn, but it's not guaranteed to be the same
after Exec returns.

For cockroachdb#7881.
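In terms of the toy sketch earlier in the thread, one plausible defensive shape matching this commit message would look roughly like the following (illustrative only, not the actual diff):

	if err != nil && ts.txn != nil { // re-check txnState.txn; the *Txn captured before Exec may be stale
		ts.txn.CleanupOnError(err)
	}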

tamird commented Jul 25, 2016

Did you mean to close this?

@tamird tamird reopened this Jul 25, 2016

danhhz commented Jul 25, 2016

I did not, thanks. I forgot to change the PR text, apparently.

@mberhault

Restarting beta with sha abcf0fd. Will ping this if I see those panics again.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 19, 2016
I've decided to go medieval and invest in tracing all SQL execution in
the hope that cockroachdb#7881 will be triggered again. This builds on the
previous commit, which added the ability to collect spans across the executor
and a client.Txn.
The tracing is gated by an env var. When turned on, every SQL txn
creates a new tracer that accumulates spans in the session's txnState
(this is similar to how "SQL tracing" currently works). When we detect
7881, we mark the root span of the current trace for
"sampling", which means that later, when that span (and hence the trace)
is closed, we dump the trace with all the log messages in it (note that
currently this trace only has one span, since we're not very good at
starting child spans yet).
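The gating-plus-sampling mechanism described there can be sketched generically like so (the env var name and all types here are illustrative assumptions; the real change uses the executor's span infrastructure):

package main

import (
	"fmt"
	"os"
)

// Illustrative only: buffer per-txn events cheaply, and dump them solely for
// traces that were marked for sampling (e.g. when the 7881 condition fired).

var traceSQL = os.Getenv("COCKROACH_TRACE_SQL") != "" // hypothetical env var name

type txnTrace struct {
	events []string
	sample bool
}

func (t *txnTrace) logf(format string, args ...interface{}) {
	if traceSQL {
		t.events = append(t.events, fmt.Sprintf(format, args...))
	}
}

func (t *txnTrace) close() {
	if t.sample { // only sampled traces are dumped when the trace is closed
		for _, ev := range t.events {
			fmt.Fprintln(os.Stderr, ev)
		}
	}
}

func main() { // run with COCKROACH_TRACE_SQL=1 to see the dump
	tr := &txnTrace{}
	tr.logf("executing: %s", "INSERT INTO blocks ...")
	tr.sample = true // the anomalous state was detected; keep this trace
	tr.close()
}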
@vivekmenezes

@andreimatei should we close this issue?

@andreimatei

No, there was some funky fix put in that we need to get rid of.

@andreimatei

This seems to still be an issue, as seen in #14560. We still somehow attempt to AutoCommit while txnState.txn is nil: somewhere we reset the KV transaction, but txn.Exec does not see the error that caused the reset.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 3, 2017
Fixes cockroachdb#14560

Because of cockroachdb#7881, we somehow get an AutoCommitError while txnState.txn
is nil. This shouldn't happen - we shouldn't be attempting to autocommit
if we're no longer in a KV transaction.
A logging statement that was added for cockroachdb#7881 failed to behave properly
for this supposedly-impossible case. This commit fixes the logging
statement.
@vivekmenezes

just adding a fact here that is unlikely to be related: we do not assign the txn to txnState.txn when we call prepare within a transaction.

@andreimatei

Note to self: I can trigger the assertions we put in to catch this bug at an old commit: 7fff0a0

make test PKG=./pkg/sql/logictest TESTS=TestLogic/default/prepare TESTFLAGS="-v -show-logs"

and the 7881: assertions in the Executor fire.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 6, 2017
Before this patch, the Executor was delegating the handling of
auto-retries (the loop handling retriable errors) to the lower-level
txn.Exec() interface. With all due respect for that interface, it is now
time for the Executor to take control over these retries for a number of
reasons:
- conceptually, the Executor is already dealing with client-directed
retries, so it's already in the "retry" business. It was confusing that
the auto-retries were handled elsewhere.
- the txn.Exec() interface forces the Executor into an unusual
closure-passing programming style. If there's no reason for it, it'd
better do without.
- the Executor already had special needs which required txn.Exec() to be
extended with control options in an ExecOpt structure. SQL is the only
user of that control, so burdening the more general txn.Exec() interface
was already unfortunate. Which leads us to the most important:
- a future commit fixes cockroachdb#17592 - the fix is yet another condition
controlling whether or not we can auto-retry: we can't if we've already
streamed results to the client. This is something that can't cleanly be
put into that ExecOpt structure, so it's really time for the Executor to
control the retry loop.

I've also made other improvements in the code close to auto-retries:
- The state transition AutoRetry->Open moves up from runTxnAttempt() to
execParsed(); the new place makes more sense - we're now dealing with
this transition after we're done dealing with auto-retries.
- The code was doing something seemingly nonsensical: it was holding on
to a reference to the KV transaction before running some statements, and
doing stuff to that reference afterwards, despite the fact that the KV
txn might have been cleaned up and gone by the time control returned to
that layer. This was done in relation to cockroachdb#7881, to paper over an
unexpected state assertion firing. This code cannot be maintained now;
it's time to try again without it.
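The structural change the commit describes, moving the retry loop out of the closure-passing txn.Exec() and into the Executor, can be contrasted in a toy sketch (assumed shapes only, not the actual patch):

package main

import (
	"errors"
	"fmt"
)

type retryableError struct{ error }

// Before: the txn layer owns the loop; the caller only supplies a closure and
// steers the loop indirectly via options (the ExecOpt structure mentioned above).
func execWithClosure(attempt func() error) error {
	for {
		err := attempt()
		var r retryableError
		if errors.As(err, &r) {
			continue // auto-retry is hidden inside the txn layer
		}
		return err
	}
}

// After: the Executor owns the loop and can consult its own state directly,
// e.g. refusing to auto-retry once results were already streamed to the
// client (the cockroachdb#17592 condition).
func execWithExplicitLoop(attempt func() error, resultsStreamed func() bool) error {
	for {
		err := attempt()
		var r retryableError
		if errors.As(err, &r) && !resultsStreamed() {
			continue
		}
		return err
	}
}

func main() {
	tries := 0
	err := execWithExplicitLoop(func() error {
		tries++
		if tries < 3 {
			return retryableError{errors.New("please retry")}
		}
		return nil
	}, func() bool { return false })
	fmt.Println(tries, err) // 3 <nil>
}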
@andreimatei

All the code involved here has been replaced, in #22277, by something that hopefully doesn't have this problem.


knz commented Feb 20, 2018

🎉


knz commented Feb 20, 2018

RIP 7881. You will be missed. (NOT!!)
