-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stability: nil pointer panic in client.Txn.CleanupOnError #7881
Comments
Looks like we're calling |
@andreimatei This is deep in code you've touched. |
@dt, @paperstreet can one of you take a closer look at this while @andreimatei is out of the office? |
Yeah, I can take a look tomorrow morning if no gets to it before me. |
I'm still familiarizing myself with this code, but maybe something like the following could have caused that panic:
|
This happened again on gamma, running beta-20160714.
|
Adding some debug logging for this in #7962 |
sql: add more logging to troubleshoot #7881
Additional logging from the just relocated beta cluster:
Can be found on: |
Thanks! Looking into this again... |
The autocommit is run on the receiver of (*Txn).Exec. At the beginning of the call, this is the same as txnState.txn, but it's not guarenteed to be the same after Exec returns. For cockroachdb#7881.
The autocommit is run on the receiver of (*Txn).Exec. At the beginning of the call, this is the same as txnState.txn, but it's not guarenteed to be the same after Exec returns. For cockroachdb#7881.
The autocommit is run on the receiver of (*Txn).Exec. At the beginning of the call, this is the same as txnState.txn, but it's not guarenteed to be the same after Exec returns. For cockroachdb#7881.
Did you mean to close this? |
I did not, thanks. I forgot to change the PR text, apparently. |
Restarting beta with sha abcf0fd. Will ping this if I see those panics again |
I've decided to go medieval and invest in tracing all SQL execution in the hope that cockroachdb#7881 will be triggered again. This builds on the prev commit that added the ability to collect spans across the executor and a client.Txn. The tracing is gated by an env var. When turned on, every SQL txn creates a new tracer that accumulates spans in the session's txnState (this is similar to how "SQL tracing" currently works). When we detect 7881, we mark the root span of the current one of these traces for "sampling", which means that later, when that span (and hence the trace) is closed, we dump the trace with all the log messages in it (note that currently this trace only has one span, since we're not very good at starting child spans yet).
@andreimatei should we close this issue? |
No, there was some funky fix put in that we need to get rid of. |
This seems to still be an issue, as seen in #14560. We somehow still attempt to AutoCommit while |
Fixes cockroachdb#14560 Because of cockroachdb#7881, we somehow get an AutoCommitError while txnState.txn is nil. This shouldn't happen - we shouldn't be attempting to autocommit if we're no longer in a KV transaction. A logging statement that was added for cockroachdb#7881 failed to behave properly for this supposedly-impossible case. This commit fixes the logging statement.
Fixes cockroachdb#14560 Because of cockroachdb#7881, we somehow get an AutoCommitError while txnState.txn is nil. This shouldn't happen - we shouldn't be attempting to autocommit if we're no longer in a KV transaction. A logging statement that was added for cockroachdb#7881 failed to behave properly for this supposedly-impossible case. This commit fixes the logging statement.
just adding a fact here that is unlikely to be related: we do not the txn to Txnstate.txn when we call prepare within a transaction. |
Note to self: I can trigger the assertions we put in to catch this bug at an old commit: 7fff0a0
and the |
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
Before this patch, the Executor was delegating the handling of auto-retries (the loop handling retriable errors) to the lower-level txn.Exec() interface. With all due respect for that interface, it is now time for the Executor to take control over these retries for a number of reasons: - conceptually, the Executor is already dealing with client-directed retries, so it's already in the "retry" business. It was confusing that the auto-retries were handled elsewhere. - the txn.Exec() interfaces forces the Executor into an unusual closure-passing programming style. If there's no reason for it, it'd better do without. - the Executor already had special needs which required txn.Exec() to be extended with control options in a ExecOpt structure. SQL is the only user of that control, so burdening the more general txn.Exec() interface was already unfortunate. Which leads us to the most important: - a future commit fixes cockroachdb#17592 - the fix is yet another condition controlling whether or not we can auto-retry: we can't if we've already streamed results to the client. This is something that can't cleanly be put into that ExecOpt structure, so it's really time for the Executor to control the retry loop. I've also made other improvements in the code close to auto-retries: - The state transition AutoRetry->Open moves up from runTxnAttempt() to execParsed(); the new place makes more sense - we're now dealing with this transition after we've done dealing with auto-retries. - The code was doing something seemingly non-sensical: it was holding on to a reference to the KV transaction before running some statements, and doing stuff to that reference afterwards, despite the fact that the KV txn might have been cleaned up and gone by the time control returned to that layer. This was done in relation to cockroachdb#7881, to paper over an unexpected state assertion firing. This code cannot be maintained now; it's time to try again without it.
All the code involved here has been replaced by something that hopefully doesn't have such problem in #22277. |
🎉 |
RIP 7881. You will be missed. (NOT!!) |
On the register cluster running beta-20160629, two nodes have failed (ten minutes apart) with the following panic:
The text was updated successfully, but these errors were encountered: