-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv,sql: transaction commit deadlines due to leases should be late-bound #63725
Conversation
Tobias suggested me to use this PR as an experiment to test our new notification system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delayed review. Two big things:
I think we need to have clearer principles regarding where we update the deadline in the conn_executor.
It'd also be nice to have a test that exercises the renewal by shortening the durations dramatically and showing that this patch has the intended impact on the ability of a transaction to observe a refreshed lease and then get pushed beyond an original expiration.
Reviewed 18 of 21 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)
pkg/kv/txn.go, line 688 at r1 (raw file):
// UpdateDeadline sets the transactions deadline to the passed deadline. // It may move the deadline further into the future.
it also may move it into the past. How about It may move the deadline to any timestamp above the current read timestamp.
Maybe also add a note that it's probably weird to set it to something below the current provisional commit timestamp but that that isn't enforced here.
pkg/kv/txn.go, line 702 at r1 (raw file):
if deadline.Less(readTimestamp) { log.Fatalf(ctx, "deadline below read timestamp is nonsensical; "+ "txn has would have no change to commit. Deadline: %s. Read timestamp: %s Previous Deadline: %s.",
nit: you didn't write it but can we spruce up this message. s/change/chance/
and then lose the upper-case word beginnings.
pkg/settings/setting.go, line 23 at r1 (raw file):
// MaxSettings is the maximum number of settings that the system supports. // Exported for tests. const MaxSettings = 259
meh, go to 512.
pkg/sql/conn_executor.go, line 1500 at r1 (raw file):
case ExecStmt: // Update the deadline on the transaction based on the collections if deadline, haveDeadline := ex.extraTxnState.descCollection.Deadline(); haveDeadline && ex.state.mu.txn != nil {
I think this should happen in the state transition itself rather than here. The idea is that the state of the transaction changes, generally, underneath the state machine. This code is really hard to get right and not as principled as it ought to be. It may take some iteration to get this part clean. I think I was a lot less aware of the design of this machine back then.
pkg/sql/conn_executor_exec.go, line 254 at r1 (raw file):
var stmt Statement queryID := ex.generateID() // Update the deadline on the transaction based on the collections
nit: period.
pkg/sql/conn_executor_prepare.go, line 229 at r1 (raw file):
return nil, err } // Prepare is an implicit transaction and we know it with any
I could see pulling this out into its own PR or commit. This is a very subtle correctness issue.
pkg/sql/catalog/lease/lease.go, line 66 at r1 (raw file):
func between0and1inclusive(f float64) error { if f < 0 || f > 1 { return errors.Errorf("must be between 0 and 1")
nit: you've got the f
there, may as well tell us the provided value.
439efec
to
785b36f
Compare
785b36f
to
784c187
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/settings/setting.go, line 23 at r1 (raw file):
Previously, ajwerner wrote…
meh, go to 512.
Done.
pkg/sql/conn_executor.go, line 1500 at r1 (raw file):
Previously, ajwerner wrote…
I think this should happen in the state transition itself rather than here. The idea is that the state of the transaction changes, generally, underneath the state machine. This code is really hard to get right and not as principled as it ought to be. It may take some iteration to get this part clean. I think I was a lot less aware of the design of this machine back then.
Done.
pkg/sql/conn_executor_prepare.go, line 229 at r1 (raw file):
Previously, ajwerner wrote…
I could see pulling this out into its own PR or commit. This is a very subtle correctness issue.
I'll make this its own commit, since it shouldn't get lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI looks unhappy, any clue why? A rebase is in order too.
Reviewed 2 of 7 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/sql/catalog/lease/lease_test.go, line 2879 at r2 (raw file):
SELECT * FROM t1; select pg_sleep(5); INSERT INTO t1 VALUES (1);
This works for an interesting reason. You don't get auto-pushed in cockroach to the present when writing. There is a "closed timestamp" that trails the present that will push writes. The default closed timestamp trails the present by roughly 3 seconds. You could shorten that. The problem with it is that it is a best-effort thing which means that this could get flakey. If you want to ensure that the transaction gets pushed, you'll need to use a separate transaction to read the key.
I could be convinced that relying on the closed timestamp is okay. One option to ensure that the closed timestamp has occurred is by creating a rangefeed over the span. That will send you a checkpoint telling you when a closed timestamp has passed.
Let me dig into the failures didn't notice these locally. Not clear from the logs, need to dig deeper, since the connection mysteriously dies. Have a theory if it is linked to the changes. |
Linked to the transaction start not having deadline assertion, let me dig into what the root cause. |
Previously, when an implicit transaction was used during the prepare phase a new transaction is constructed, which will acquire leases as needed. The descCollections will be shared between the execution and prepare phases, which could lead to these leases being held during the execute phase even though the prepare transaction is complete. To address this the current patch will release those leases right after the prepare phase. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fqazi , since you're doing work in the area, can I kindly ask you to look into #24020 and see if the weirdness with rolling back a txn that exceeded its deadline is still current? The issue has been closed, but for no good reason.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
b2de60e
to
a0869fa
Compare
a0869fa
to
6c09098
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 21 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/sql/conn_executor.go, line 2337 at r4 (raw file):
Quoted 8 lines of code…
if txnIsOpen { if deadline, haveDeadline := ex.extraTxnState.descCollection.Deadline(); haveDeadline && ex.state.mu.txn != nil { err := ex.state.mu.txn.UpdateDeadline(ex.Ctx(), deadline) if err != nil { return advanceInfo{}, err } } }
nit: pull this into a helper maybeUpdateDeadline
or something like that.
pkg/sql/catalog/lease/lease.go, line 79 at r4 (raw file):
// LeaseRenewalDuration controls the default time before a lease expires when //// acquisition to renew the lease begins.
nit: extra //
pkg/sql/catalog/lease/lease_test.go, line 2854 at r4 (raw file):
// Validates that the transaction deadline can be extended // past the original lease duration. Previously, we had a // a limitation if the transaction took longer then this
incomplete sentence.
pkg/sql/catalog/lease/lease_test.go, line 2874 at r4 (raw file):
// it to resume when the channel gets unblocked. if req.Txn != nil && req.Txn.ID.String() == txnID { filterMu.Unlock()
it's a little subtle that this doesn't need locking on blockedOnce
because it assumes there will be at most one inflight EndTxn
for the transaction. Comment?
pkg/sql/catalog/lease/lease_test.go, line 2904 at r4 (raw file):
BEGIN; SELECT * FROM t1; select pg_sleep(5);
the issue I have with this test is that it's slow. I think I'd love to see this work in a very similar way to the below test except without the schema change.
pkg/sql/catalog/lease/lease_test.go, line 2921 at r4 (raw file):
resumeChan := make(chan struct{}) //
missing comment body
pkg/sql/catalog/lease/lease_test.go, line 2935 at r4 (raw file):
} // Fetch the transaction ID, so that we can delay the commit txnIDResult, err := conn.QueryContext(ctx, `SELECT id FROM crdb_internal.node_transactions WHERE session_id IN (SELECT * FROM [SHOW session_id]);`)
This seems like a good use of QueryRowContext
pkg/sql/catalog/lease/lease_test.go, line 2948 at r4 (raw file):
return } require.NoError(t, err)
I don't think you want to call require
on a goroutine. Tests can deadlock if t.Fatal
gets called on a separate goroutine.
pkg/sql/catalog/lease/lease_test.go, line 2957 at r4 (raw file):
Quoted 4 lines of code…
// Execute an insert on the same connection and attempt // to commit, this operation will fail. _, err = conn.ExecContext(ctx, ` INSERT INTO t1 VALUES (1);
I think it's worth breaking this up. 100% of the time this should fail on the INSERT
. I say that because by the time we do that insert, our deadline should be set to some timestamp below we need to be pushed in order to not rewrite history. My guess is that in the case where we hit the request filter above, we're aborting the transaction and not committing it.
All that is to say, I'm not sure we need the request filter to exercise the case where the renewal does not happen.
What I would use the RequestFilter for is to ensure that indeed the deadline does move in the case where there is no schema change. I might do that by just pulling the deadline out of each and every require corresponding to the txn and putting it in some slice or something and then making sure it goes up.
pkg/sql/catalog/lease/lease_test.go, line 2974 at r4 (raw file):
require.NoError(t, err) // Give the lease time to expire time.Sleep(time.Second * 5)
I thought the leases had a 0 duration; does it need any time to expire?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/sql/catalog/lease/lease_test.go, line 2957 at r4 (raw file):
Previously, ajwerner wrote…
// Execute an insert on the same connection and attempt // to commit, this operation will fail. _, err = conn.ExecContext(ctx, ` INSERT INTO t1 VALUES (1);
I think it's worth breaking this up. 100% of the time this should fail on the
INSERT
. I say that because by the time we do that insert, our deadline should be set to some timestamp below we need to be pushed in order to not rewrite history. My guess is that in the case where we hit the request filter above, we're aborting the transaction and not committing it.All that is to say, I'm not sure we need the request filter to exercise the case where the renewal does not happen.
What I would use the RequestFilter for is to ensure that indeed the deadline does move in the case where there is no schema change. I might do that by just pulling the deadline out of each and every require corresponding to the txn and putting it in some slice or something and then making sure it goes up.
Ignore me about the insert failing. I was thinking the transaction was carrying its deadline everywhere with it. One interesting thing is that by the time the insert finishes, the new deadline will be below the write timestamp. It's not below the read timestamp only because we don't eagerly refresh. That's quite subtle. I'm going to file an issue.
6c09098
to
dbcda1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/sql/catalog/lease/lease_test.go, line 2854 at r4 (raw file):
Previously, ajwerner wrote…
incomplete sentence.
Done.
pkg/sql/catalog/lease/lease_test.go, line 2874 at r4 (raw file):
Previously, ajwerner wrote…
it's a little subtle that this doesn't need locking on
blockedOnce
because it assumes there will be at most one inflightEndTxn
for the transaction. Comment?
Done.
pkg/sql/catalog/lease/lease_test.go, line 2921 at r4 (raw file):
Previously, ajwerner wrote…
missing comment body
Done.
pkg/sql/catalog/lease/lease_test.go, line 2948 at r4 (raw file):
Previously, ajwerner wrote…
I don't think you want to call
require
on a goroutine. Tests can deadlock ift.Fatal
gets called on a separate goroutine.
Done.
pkg/sql/catalog/lease/lease_test.go, line 2957 at r4 (raw file):
Previously, ajwerner wrote…
Ignore me about the insert failing. I was thinking the transaction was carrying its deadline everywhere with it. One interesting thing is that by the time the insert finishes, the new deadline will be below the write timestamp. It's not below the read timestamp only because we don't eagerly refresh. That's quite subtle. I'm going to file an issue.
Done.
pkg/sql/catalog/lease/lease_test.go, line 2974 at r4 (raw file):
Previously, ajwerner wrote…
I thought the leases had a 0 duration; does it need any time to expire?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, just minor things left. I'd like to understand the nil
transaction cases better.
Reviewed 10 of 21 files at r4, 2 of 5 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/kv/txn.go, line 690 at r5 (raw file):
// It may move the deadline to any timestamp above the current read timestamp. // // The deadline cannot be lower than txn.ReadTimestamp.
This ends up being rather subtle so a bit more commentary would be appreciated. Note that if the deadline is below the current provisional commit timestamp (write timestamp), nothing happens but that that transaction is doomed. I might also add some commentary that we are relying on the read timestamp not changing during execution in order to uphold this invariant. That is valid today but it's fragile.
pkg/sql/conn_executor.go, line 2337 at r5 (raw file):
// Update the deadline on the transaction based on the collections, // if the transaction is currently open. if txnIsOpen && ex.state.mu.txn != nil {
When are we in the open state but txn
is nil?
pkg/sql/conn_executor.go, line 2337 at r5 (raw file):
// Update the deadline on the transaction based on the collections, // if the transaction is currently open. if txnIsOpen && ex.state.mu.txn != nil {
nit: Would it be crazy to add yet one more layer of indirection here by creating ex.maybeUpdateDeadline
which checks that the transaction is open and then updates the deadline?
pkg/sql/conn_executor_prepare.go, line 230 at r5 (raw file):
// Prepare is an implicit transaction and we know it with any // other descriptors. Once the implicit transaction is done
Something got garbled in this sentence.
Thanks for the quick review @ajwerner! Let me check when those cases occur |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/kv/txn.go, line 690 at r5 (raw file):
This ends up being rather subtle so a bit more commentary would be appreciated. Note that if the deadline is below the current provisional commit timestamp (write timestamp), nothing happens but that that transaction is doomed. I might also add some commentary that we are relying on the read timestamp not changing during execution in order to uphold this invariant. That is valid today but it's fragile.
Done changed to:
// UpdateDeadline sets the transactions deadline to the passed deadline.
// It may move the deadline to any timestamp above the current read timestamp.
// If the deadline is below the current provisional commit timestamp (write timestamp),
// then the transaction will fail with a deadline error during the commit.
// The deadline cannot be lower than txn.ReadTimestamp and we make the assumption
// the read timestamp will not change during execution, which is valid today.
pkg/sql/conn_executor_prepare.go, line 230 at r5 (raw file):
Previously, ajwerner wrote…
// Prepare is an implicit transaction and we know it with any // other descriptors. Once the implicit transaction is done
Something got garbled in this sentence.
Done:
// Prepare with an implicit transaction will end up creating
// a new transaction. Once this transaction is complete,
// we can safely release the leases, otherwise we will
// incorrectly hold leases for later operations.
dbcda1f
to
d5a1c95
Compare
Previously, SQL statements, during their execution, lease tables from the lease.Manager indirectly through the connExecutor's descs.Collection. Whenever a lease is used, the deadline of that version of that lease is applied to the transaction as a commit deadline. This had the unfortunate side effect that renewals would not be properly picked up, so if a transaction took longer then the 4 to 5 minute validity period, then we can get into problematic scenarios. To address, this is patch, adds lease tracking through the the table collections, so that we are aware when extension occurs, and uses that to figure out the latest time a given transaction can commit. Fixes cockroachdb#51042 Release note (bug fix): Fixed a bug that prevent transactions which lasted longer than 5 minutes and then performed writes from committing.
d5a1c95
to
df39050
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/sql/conn_executor.go, line 2337 at r5 (raw file):
Previously, ajwerner wrote…
When are we in the open state but
txn
is nil?
Removed and didn't run into any broken scenario
pkg/sql/conn_executor.go, line 2337 at r5 (raw file):
Previously, ajwerner wrote…
nit: Would it be crazy to add yet one more layer of indirection here by creating
ex.maybeUpdateDeadline
which checks that the transaction is open and then updates the deadline?
Discarding this one, since only one place checks the open state now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
bors r=ajwerner |
Build succeeded: |
See [here](https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_UnitTests&testNameId=6532733632782929224&tab=testDetails). This test was taking 10 minutes every time. This was happening because of the change to wait for the version's leases to expire in cockroachdb#63725. After: ``` --- PASS: TestSchemaChangeRetryOnVersionChange (1.70s) ``` Release note: None
65142: sql/*: remove catalog.ResolvedSchema, generalize SchemaDescriptor r=postamar a=ajwerner The `ResolvedSchema` was a blemish on the descriptor resolution APIs. It made it very hard to create a reasonable abstraction for descriptor resolution injection. This commit creates new implementations of `SchemaDescriptor` to represent virtual, public, and temporary schemas. There is likely more cleanup that could be done along the way to make the different kinds of schemas more uniform, but that is left for later or never. Relates to #64089. Release note: None 65780: sql: Add unit tests for planning inside the new schema changer r=ajwerner a=fqazi Previously, the new schema changer did not have any unit tests covering the planning capability. This was inadequate, because we had no way of detect if plans were regressing with changes or enhancements. To address this, this patch adds basic tests to see if the operators/dependencies for a given command are sane. Release note: None 66051: logcrash: update non-release Sentry URL r=mgartner a=mgartner The documentation for creating Sentry reports in non-release builds directed developers to set the COCKROACH_CRASH_REPORTS env var to an incorrect URL. Sentry reports were not created when using this URL. The documentation has been updated to refer to the correct URL. Release note: None 66070: sql: fix very slow TestSchemaChangeRetryOnVersionChange r=fqazi a=ajwerner See [here](https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_UnitTests&testNameId=6532733632782929224&tab=testDetails). This test was taking 10 minutes every time. This was happening because of the change to wait for the version's leases to expire in #63725. After: ``` --- PASS: TestSchemaChangeRetryOnVersionChange (1.70s) ``` Release note: None 66074: sql: fix statement diagnostics for EXECUTE r=RaduBerinde a=RaduBerinde Previously, prepared statements ran through the EXECUTE statements could not trigger collection of statement diagnostics. This is because we consider the fingerprint of the EXECUTE statement instead of the one from the prepared statement. The fix is to move up the special handling code in the executor, and replace the AST and fingerprint before setting up the instrumentation helper. Release note (bug fix): queries ran through the EXECUTE statement can now generate statement diagnostic bundles as expected. Fixes #66048. Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Faizan Qazi <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Radu Berinde <[email protected]>
Previously, SQL statements, during their execution, lease
tables from the lease.Manager indirectly through the
connExecutor's descs.Collection. Whenever a lease is used,
the deadline of that version of that lease is applied to
the transaction as a commit deadline. This had the unfortunate
side effect that renewals would not be properly picked up, so
if a transaction took longer then the 4 to 5 minute validity
period, then we can get into problematic scenarios.
To address, this is patch, adds lease tracking through the
the table collections, so that we are aware when extension
occurs, and uses that to figure out the latest time a given
transaction can commit.
Fixes #51042
Release note (bug fix): Fixed a bug that prevent transactions which lasted
longer than 5 minutes and then performed writes from committing.