RFC: Lazy Transaction Record Creation #32971
Conversation
There's a sizable amount of work here and two unknowns (delaying txn auto-gc and the 1PC optimization), but it looks like we can explore those ahead of time.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
docs/RFCS/20181209_lazy_txn_record_creation.md, line 34 at r1 (raw file):
4. Repeat steps 2 & 3 as many times as necessary
5. Commit the transaction record
6. Resolve all intent (asynchronously)
intents (here and below)
docs/RFCS/20181209_lazy_txn_record_creation.md, line 69 at r1 (raw file):
Lazily creating transaction records adds a complication here. Now that we're no longer writing a transaction record as early as possible [1] in the lifetime of a transaction, it's very easy for a concurrent transaction to observe an intent of an ongoing transaction, push its record, and find nothing there. In fact, we expect this to be the case for any push in the first `min(txn duration, DefaultHeartbeatInterval)` of a transaction's lifetime. Currently, a push that doesn't find a record writes it as `ABORTED` and moves on. That isn't workable if we want to increase the amount of time without a transaction record.

We get around this by using the existence of an intent at a given timestamp as further evidence of client-instructed activity. Instead of just considering the transaction record's original and heartbeat timestamp to determine if it is expired, we also take into consideration the intent's timestamp. We know that the client had to have been alive at the timestamp that the intent was written to issue the write. This means that it is now possible for a transaction to wait on another transaction without the pushee's transaction record existing yet. This works as expected - either the transaction record is eventually written by the transaction's first heartbeat and the concurrent transaction continues waiting, or the transaction record is never written and the "activity" attributed to the transaction due to the intent becomes stale enough that the transaction can be considered expired.
It's worth pointing out (or discussing somewhere) that transactions write at their OrigTimestamp and so you could end up with a transaction that's been pushed for a while (but is still writing) to use a timestamp that's fairly old. I don't think there's a real problem here now that we've removed SNAPSHOT, but we'll have to change the behavior (i.e. always write at Timestamp, not OrigTimestamp) to avoid surprises down the line.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 99 at r1 (raw file):
A similar check to that discussed above will be performed on the `TxnSpanGCThreshold`. _side note: would anyone mind if I `s/BeginTransaction/BeginTxn/g` and `s/EndTransaction/EndTxn/g` to make these request types consistent with all other transaction coordination requests?_
SGTM
docs/RFCS/20181209_lazy_txn_record_creation.md, line 109 at r1 (raw file):
`batcheval.PushTxn` handling of extant transaction records will not change. By synthesizing transaction records based on the last active timestamp of an intent, `batcheval.PushTxn` will allow transactions to continue executing without fear of being immediately aborted even without having written a transaction record. As long as the transaction commits or begins heartbeating its transaction record within `TxnLivenessThreshold`, it will be safe from rouge aborts. Since most transactions have a duration less than `TxnLivenessThreshold`, most transactions won't ever need to write a transaction record in the `PENDING` state at all.
rogue
docs/RFCS/20181209_lazy_txn_record_creation.md, line 123 at r1 (raw file):
This complicates PUSH_TIMESTAMP requests, who want to ensure that a transaction can only commit above a certain timestamp. Currently, the request does this by modifying the existing transaction record and moving its timestamp. However, with lazily creating transaction records, a PUSH_TIMESTAMP request may find no transaction record through which to convey this information. To facilitate this behavior without allowing other transactions from writing `PENDING` transaction records, this RFC proposes that PUSH_TIMESTAMP requests use the read timestamp cache to convey this information. A successful PUSH_TIMESTAMP request will bump the read timestamp cache for the local transaction key to the desired timestamp. `EndTransaction` requests will then check the read timestamp cache (in addition to the write timestamp cache, [maybe](#replica-side-writetooold-avoidance)) and will forward their txn timestamp accordingly. This draws clear parallels to the other uses of the read timestamp cache and is much less of an abuse than the current use of the write timestamp cache to store finalized transaction IDs.
👍
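The mechanism in the quoted paragraph reduces to a high-water-mark interaction that can be modeled in a few lines. This is a toy model, not the real timestamp cache: the map type, method names, and logical-integer timestamps are all invented for illustration:

```go
package main

// tsCache is a toy read timestamp cache: local txn key -> high-water
// timestamp (represented as a bare logical value for simplicity).
type tsCache map[string]int64

// bump is what a successful PUSH_TIMESTAMP would do: ratchet the entry
// for the pushee's local transaction key up to the push timestamp.
func (c tsCache) bump(key string, ts int64) {
	if ts > c[key] {
		c[key] = ts
	}
}

// endTxnTimestamp is what a committing EndTransaction would do: forward
// its provisional commit timestamp past any cache entry for its key.
func endTxnTimestamp(c tsCache, key string, provisional int64) int64 {
	if hw := c[key]; hw > provisional {
		return hw
	}
	return provisional
}
```

The pusher never touches a record; the pushee discovers the push only at commit time, which is exactly the "can only commit above a certain timestamp" guarantee PUSH_TIMESTAMP needs.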
docs/RFCS/20181209_lazy_txn_record_creation.md, line 132 at r1 (raw file):
- transaction records can be GCed by two sources: `EndTransaction` requests and the GC queue.
- transaction records can be GCed if they are `COMMITTED` or `ABORTED`. If a transaction record is `PENDING` and processed by the GC queue, it will first be aborted before being GCed.
- `EndTransaction` requests can only be issued by the transactions own coordinator, not by concurrent transactions. This property is held [even in the parallel commits RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20180324_parallel_commit.md#transaction-record-recreation). This means that any `EndTransaction` request that GCs a transaction record was known to the transaction's coordinator.
transaction's
docs/RFCS/20181209_lazy_txn_record_creation.md, line 143 at r1 (raw file):
The situation becomes a little worse now that `batcheval.PushTxn` will handle (often wait) for missing transaction records. In fact, a `PushTxn` request has no way of determining whether a transaction record has not yet been written or has already been finalized and GCed. This won't cause any correctness issues, but it means that naively a `PushTxn` may end up waiting for a transaction record that has already been GCed simply because it saw activity on one of the transaction's intents within the `TxnLivenessThreshold`. It is unclear how to avoid this issue entirely. This RFC's current thinking is that we should do away with the `TxnAutoGC` fast-path and enforce a `TxnLivenessThreshold` waiting period on eager transaction record GCs. If we made the eager GC of transaction records wait `TxnLivenessThreshold` then any `PushTxn` request in this period would find a non-GCed transaction record and any `PushTxn` request after this period would necessarily synthesize an expired transaction record. Thus, we would avoid the issue. This waiting period could be introduced in the `intentResolver`, and plays into some of the goals outlined in [#30780](https://github.com/cockroachdb/cockroach/issues/30780).
Hmm, this is a tricky problem. I agree that optimizing the fast path into batches can help naturally, but it creates a hard dependency on that project. Maybe that's OK, but we could also keep a `map[UUID] -> struct { TxnMeta; Expiration }` on the leaseholder that can be consulted by pushes and allows them to avoid the waiting period. (This looks like it'd be helpful for #31664 as well, but I don't really think that's true because that map would know only about transactions whose nearby intents are already gone.)
There could also be a perf impact if more txn records make it into L0 before being deleted.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 160 at r1 (raw file):
Now that we no longer have `BeginTransaction` requests, things are a little different. However, the same approach can be used. We can update the write timestamp cache on `EndTransaction` requests and consult it on committing `EndTransaction` requests. This will prevent more than a single `EndTransaction` request from executing successfully. However, we should be able to do more. We can use MVCC to enforce this replay protection for us. If the EndTransaction request is replayed alone then it is harmless to let it write a new `COMMITTED` transaction record. If any writes are replayed after a successful commit then they will necessarily hit WriteTooOld errors and will be pushed to a new timestamp. If those writes are in the same batch as the EndTransaction then they will prevent it from committing. If those writes are in different batches from the EndTransaction then they will not be resolvable.
Laying down new intents and a new committed txn record could tickle new bugs. Will ResolveIntent accidentally commit these doubled up writes at the new timestamp?
I think it should be called out that we're talking about EndTransaction which is side-effect free. If you have a commit trigger, you definitely don't want to apply it twice.
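The replay-protection argument in the quoted section rests on a basic MVCC property: once a write is committed at timestamp T, a replay of that write at a timestamp at or below T cannot silently succeed. A toy version of that check, with all names invented for illustration:

```go
package main

import "errors"

// errWriteTooOld stands in for the WriteTooOld error discussed above.
var errWriteTooOld = errors.New("WriteTooOld")

// mvccStore is a toy MVCC keyspace: key -> latest committed timestamp
// (a bare logical value for simplicity).
type mvccStore map[string]int64

// put rejects any write at or below the latest committed timestamp,
// which is where a replayed (already-committed) write lands.
func (s mvccStore) put(key string, ts int64) error {
	if latest, ok := s[key]; ok && ts <= latest {
		return errWriteTooOld
	}
	s[key] = ts
	return nil
}
```

If the replayed writes share a batch with the `EndTransaction`, this error blocks the commit; Tobias's point stands that this only protects the data writes, so side effects like commit triggers would additionally need to be idempotent.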
docs/RFCS/20181209_lazy_txn_record_creation.md, line 174 at r1 (raw file):
#### Replica-Side WriteTooOld Avoidance

Unfortunately, this currently breaks down because we have an optimization that avoids blocking transaction commit on WriteTooOld errors in two places. The first is during [1PC transaction evaluation](https://github.com/cockroachdb/cockroach/blob/8f30db0f07e940667bc34314ec6a446491a29790/pkg/storage/replica.go#L6036) and the second is during [EndTransaction evaluation](https://github.com/cockroachdb/cockroach/blob/8f30db0f07e940667bc34314ec6a446491a29790/pkg/storage/batcheval/cmd_end_transaction.go#L401). These cases allow a transaction to commit through a WriteTooOld error even without client-side approval. This breaks our ability to use MVCC to provide replay protection. It's unclear how important this optimization is as it's only applicable to heavily contended transactions with no refresh spans (i.e. blind writes). Blind writes from SQL are not very common and the optimization only saves a single round-trip to the client, who will never need an actual transaction retry because it can be trivially refreshed.
I hope we can remove this optimization. Are you planning to run an experiment?
docs/RFCS/20181209_lazy_txn_record_creation.md, line 176 at r1 (raw file):
Unfortunately, this currently breaks down because we have an optimization that avoids blocking transaction commit on WriteTooOld errors in two places. The first is during [1PC transaction evaluation](https://github.com/cockroachdb/cockroach/blob/8f30db0f07e940667bc34314ec6a446491a29790/pkg/storage/replica.go#L6036) and the second is during [EndTransaction evaluation](https://github.com/cockroachdb/cockroach/blob/8f30db0f07e940667bc34314ec6a446491a29790/pkg/storage/batcheval/cmd_end_transaction.go#L401). These cases allow a transaction to commit through a WriteTooOld error even without client-side approval. This breaks our ability to use MVCC to provide replay protection. It's unclear how important this optimization is as it's only applicable to heavily contended transactions with no refresh spans (i.e. blind writes). Blind writes from SQL are not very common and the optimization only saves a single round-trip to the client, who will never need an actual transaction retry because it can be trivially refreshed.

### 1PC Transactions
It also doesn't seem that there's a solution here yet. That's fine, but it'd be nice to call it out explicitly (TODO or just a statement).
docs/RFCS/20181209_lazy_txn_record_creation.md, line 182 at r1 (raw file):
This detection is moderately more difficult to make efficient without `BeginTransaction`. The straightforward but wrong approach would be to compare the key bounds of each of the `IntentSpans` in the batch's `EndTransaction` with the Range it is evaluating on. If all `IntentSpans` are in the Range's bounds, the batch can be evaluated as a 1PC transaction. This gets tripped up when the txn has written before but only to the range that its record is on. A workable approach would be to only increment the transaction's sequence number on writes and to check whether the batch contains all of the sequence numbers up to the exclusive upper bound sequence on the `EndTransaction` request. This simplifies to checking the first write in the batch to see if it has sequence number 1, but that approach will break down under parallel commits.
What would break under parallel commits? Under parallel commits, you'd still get a single batch that has the sequence numbers from 1, .. N, wouldn't you?
In the worst case, DistSender could send a 1PC hint.
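The gap-free sequence check discussed in this thread can be sketched directly. The function and parameter names are invented; the idea is only that the batch qualifies for 1PC if its writes cover sequence numbers 1..maxSeq with no gaps, where maxSeq is the exclusive upper bound carried by the `EndTransaction`:

```go
package main

import "sort"

// coversAllSeqs reports whether seqs is exactly the set {1, ..., maxSeq}.
// A gap or duplicate means some of the transaction's writes went to a
// different batch (or range), so the batch cannot commit as 1PC.
func coversAllSeqs(seqs []int32, maxSeq int32) bool {
	if int32(len(seqs)) != maxSeq {
		return false
	}
	sorted := append([]int32(nil), seqs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	for i, s := range sorted {
		if s != int32(i)+1 {
			return false
		}
	}
	return true
}
```

This is the "look at every write" variant rather than the "first write has seq 1" shortcut, which is the version that survives parallel commits per the discussion below.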
ddea58d to 9d104cb
Thanks for the thoughtful responses!
docs/RFCS/20181209_lazy_txn_record_creation.md, line 34 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
intents (here and below)
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 69 at r1 (raw file):
we'll have to change the behavior (i.e. always write at Timestamp, not OrigTimestamp) to avoid surprises down the line
I think @benesch is already on it!
Wrote a bit about this.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 99 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
SGTM
👍
docs/RFCS/20181209_lazy_txn_record_creation.md, line 109 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
rogue
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 132 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
transaction's
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 143 at r1 (raw file):
Maybe that's OK, but we could also keep a map[UUID] -> struct { TxnMeta Expiration } on the leaseholder that can be consulted by pushes and allows them to avoid the waiting period.
The problem with this (or any in-memory cache) is that a lease transfer could easily undermine the protection and make concurrent transactions susceptible to 2s waiting periods.
This looks like it'd be helpful for #31664 as well, but I don't really think that's true because that map would know only about transactions whose nearby intents are already gone
Yeah, the key for that change is that the information needs to be accessible wherever the intents are.
There could also be a perf impact if more txn records make it into L0 before being deleted.
Good point. All the more reason to explore a separate column family for these values 😄
One way I realized that we could look at this is that we won't GC a transaction record until it has "expired".
docs/RFCS/20181209_lazy_txn_record_creation.md, line 160 at r1 (raw file):
Will ResolveIntent accidentally commit these doubled up writes at the new timestamp?
I feared it would, but it looks like someone thought of this problem before:
`cockroach/pkg/storage/engine/mvcc.go`, line 2077 (at f524717): `timestampsValid := !intent.Txn.Timestamp.Less(hlc.Timestamp(meta.Timestamp))`
I think it should be called out that we're talking about EndTransaction which is side-effect free. If you have a commit trigger, you definitely don't want to apply it twice.
That's a great point. Seems like we'll need to make all commit triggers idempotent to make this work.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 174 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I hope we can remove this optimization. Are you planning to run an experiment?
Yeah, my current testing plan is to run `kv0 --cycle-length=16 --concurrency=<high>` on an 8 node cluster, with and without explicit SQL transactions. kv performs UPSERTs, which should translate to blind writes if we don't pass `--secondary-index`.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 176 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It also doesn't seem that there's a solution here yet. That's fine, but it'd be nice to call it out explicitly (TODO or just a statement).
There's kind of a solution, but I was curious if others had a better idea. See my comment below.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 182 at r1 (raw file):
What would break under parallel commits? Under parallel commits, you'd still get a single batch that has the sequence numbers from 1, .. N, wouldn't you?
I'll clarify. What I meant was that looking at only the first write and checking if it was seq num 1 would only work before parallel commits because seq num 1 and the EndTransaction implies that every seq num is in that batch. Parallel commits breaks this because we'll no longer split EndTransaction from the rest of the batch, even if part of it is sent to another range. We can always look at every write and ensure that we have 1 through N in the batch, without any gaps. That's what I think we'll end up doing.
Of course, we should go with the post parallel commits strategy immediately to avoid migration concerns down the road.
cc. @spencerkimball you might be interested in reading this RFC. If so, mind weighing in on my question about
I suspect that you'll have good insight into this. Note that regardless of what decision we make for this, we'll need to adjust
This detection is moderately more difficult to make efficient without `BeginTransaction`. The straightforward but wrong approach would be to compare the key bounds of each of the `IntentSpans` in the batch's `EndTransaction` with the Range it is evaluating on. If all `IntentSpans` are in the Range's bounds, the batch can be evaluated as a 1PC transaction. This gets tripped up when the txn has written before but only to the range that its record is on.

A workable approach would be to only increment the transaction's sequence number on writes and to check whether the batch contains all of the sequence numbers up to the exclusive upper bound sequence on the `EndTransaction` request. This simplifies to checking the first write in the batch to see if it has sequence number 1, but that simplification will break down under parallel commits, so it will not be pursued.
Seems reasonable to add a new field to the `EndTransaction` request which specifies whether the batch comprises the entire transaction and let the txn coord sender set the flag. Guessing down at the storage level seems fragile.
### QueryTxn / txnwait.Queue

The `txnwait.Queue` queries both the pusher and the pushee transaction using `QueryTxn` requests while waiting for changes to either. While doing so, the `txnwait.Queue` has access to `TxnMeta` objects for both of these transactions. Either the `QueryTxn` request or the `txnwait.Queue` itself will need to be made aware that transaction records are now creates lazily.
s/creates/created/
Currently the `txnwait.Queue` works properly even if the pusher does not have a transaction record but does not behave properly when the pushee does not have a transaction record. There are a few alternatives to fixing this:

- `QueryTxn` could use its provided `TxnMeta` object to synthesize a transaction record when one does not exist. This is nice from the perspective of hiding complication behind an abstraction, but might undermine the purpose of `QueryTxn` in certain situations.
Do you have "certain situations" in mind, or are you leaving the possibility open that there may be situations you're not aware of?
I prefer the approach of `QueryTxn` synthesizing the txn record. Since it already must return the list of dependent txns, it is implicitly agreeing that the queried txn exists. Fulfilling the contract of the API call by positing the state of the implicit txn and returning it maintains the abstraction and seems easier to reason about. Letting this bit of complexity leak into `txnwait.Queue` feels worse.
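The synthesis approach preferred here can be sketched as a toy. The struct shapes and names are simplified stand-ins for the real protos, invented purely for illustration:

```go
package main

// txnMeta is a stand-in for the TxnMeta carried by a QueryTxn request.
type txnMeta struct {
	ID string
	Ts int64
}

// txnRecord is a stand-in for a transaction record on disk.
type txnRecord struct {
	txnMeta
	Status string // PENDING, COMMITTED, or ABORTED
}

// queryTxn returns the on-disk record if one exists; otherwise it
// fulfills the API contract by positing a PENDING record from the
// caller-provided meta, so txnwait.Queue never sees a missing record.
func queryTxn(records map[string]txnRecord, meta txnMeta) txnRecord {
	if rec, ok := records[meta.ID]; ok {
		return rec
	}
	return txnRecord{txnMeta: meta, Status: "PENDING"}
}
```

The design choice is the same one argued above: the lazy-creation complexity stays behind the `QueryTxn` abstraction instead of leaking into every caller.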
9d104cb to 52e1aa4
docs/RFCS/20181209_lazy_txn_record_creation.md, line 147 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
s/creates/created/
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 150 at r2 (raw file):
Do you have "certain situations" in mind, or are you leaving the possibility open that there may be situations you're not aware of?
No, just leaving the possibility open. I agree with everything you said in the second comment. Changed the text to reflect that.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 182 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
Seems reasonable to add a new field to the `EndTransaction` request which specifies whether the batch comprises the entire transaction and let the txn coord sender set the flag. Guessing down at the storage level seems fragile.
I tend to fear this kind of state because any actor that modifies a batch needs to be made aware of it. For instance, DistSender would need to be careful when splitting up batches to correctly adjust the field. It seems less fragile for the 1PC fast-path to be triggered simply by BatchRequests that conform to certain functional criteria.
Did you consider sending the BeginTransaction request but just making it a noop, so you can minimize changes to the existing code?
Really nice RFC!
1. Only the transaction coordinator that owns a transaction can create its transaction record in the `PENDING` state or create/move its transaction record in/to the `COMMITTED` state.
2. Any transaction can move a transaction record to the `ABORTED` state.
3. Unless specified by `CanPushWithPriority`, a contending transaction will wait until a transaction has been inactive for at least `TxnLivenessThreshold` before aborting it. As we'll see below, this RFC actually strengthens this property.
This is the only reference in the document to `CanPushWithPriority`. I'd take it out.
### Properties Preserved

1. Only the transaction coordinator that owns a transaction can create its transaction record in the `PENDING` state or create/move its transaction record in/to the `COMMITTED` state.
Why? Consider giving a hint and linking to another section.
2. Any transaction can move a transaction record to the `ABORTED` state.
3. Unless specified by `CanPushWithPriority`, a contending transaction will wait until a transaction has been inactive for at least `TxnLivenessThreshold` before aborting it. As we'll see below, this RFC actually strengthens this property.

Without storing [local transaction keys in the read timestamp cache](#Successful-PUSH_TIMESTAMP), this first property is hard to keep.
This sentence seems out of context and out of place; consider scratching it.
1. they serve as the single linearization point of a transaction. Transactions are considered `COMMITTED` or `ABORTED` when the Raft entry that changes the transaction record to one of these two states is committed. This role interacts with the `EndTransaction` and `PushTxn` request types.
2. they serve as the source of truth for the liveness of a transaction. Transaction coordinators heartbeat their transaction record and concurrent transactions observe the state of a transaction's record to learn about its disposition. This role interacts with the `HeartbeatTxn`, `PushTxn`, and `QueryTxn` request types.
3. they perform bookkeeping on the resources held by finalized (`COMMITTED` or `ABORTED`) transactions. A finalized transaction record will point at each of the transaction's intents, allowing both the original transaction coordinator and other transactions to resolve and later garbage-collect the intents. This role interacts with the `ResolveIntent`, `ResolveIntentRange`, and `GC` request types.
s/each intent/intent key spans
Lazily creating transaction records adds a complication here. Now that we're no longer writing a transaction record as early as possible [1] in the lifetime of a transaction, it's very easy for a concurrent transaction to observe an intent of an ongoing transaction, push its record, and find nothing there. In fact, we expect this to be the case for any push in the first `min(txn duration, DefaultHeartbeatInterval)` of a transaction's lifetime. Currently, a push that doesn't find a record writes it as `ABORTED` and moves on. That isn't workable if we want to increase the amount of time without a transaction record.

We get around this by using the existence of an intent at a given timestamp as further evidence of client-instructed activity. Instead of just considering the transaction record's original and heartbeat timestamp to determine if it is expired, we also take into consideration the intent's timestamp. We know that the client had to have been alive at the timestamp that the intent was written to issue the write [2]. This means that it is now possible for a transaction to wait on another transaction without the pushee's transaction record existing yet. This works as expected - either the transaction record is eventually written by the transaction's first heartbeat and the concurrent transaction continues waiting, or the transaction record is never written and the "activity" attributed to the transaction due to the intent becomes stale enough that the transaction can be considered expired.
nit: s/further/
Now that we no longer have `BeginTransaction` requests, things are a little different. However, the same approach can be used. We can update the write timestamp cache on `EndTransaction` requests and consult it on committing `EndTransaction` requests. This will prevent more than a single `EndTransaction` request from executing successfully.

However, we should be able to do more. We can use MVCC to enforce this replay protection for us. If the EndTransaction request is replayed alone then it is harmless to let it write a new `COMMITTED` transaction record. If any writes are replayed after a successful commit then they will necessarily hit WriteTooOld errors and will be pushed to a new timestamp. If those writes are in the same batch as the EndTransaction then they will prevent it from committing. If those writes are in different batches from the EndTransaction then they will not be resolvable.
then they will not be resolvable
Mind putting some more words here? You mean to say that the `EndTransaction` batch will succeed, but no intents will have been laid down by the other write batches, so it all works out? But, wouldn't the intents still be laid down (albeit at a higher timestamp than the original)? Isn't that a problem?
| Initial Action | Replay | Protection |
|------------------------------|---------------------------------------|--------------------------------------------------------------|
| txn aborted by coordinator | replay commit | not allowed by client, see [TxnCoordSender](#TxnCoordSender) |
| txn aborted by coordinator | replay abort | not an issue |
nit: I think "not an issue" is a bit ambiguous. Maybe "could happen; no harm, no foul"
| txn committed by coordinator | replay abort | not allowed by client, see [TxnCoordSender](#TxnCoordSender) |
| txn committed by coordinator | replay commit with writes in batch | protected by WriteTooOld errors |
| txn committed by coordinator | replay commit without writes in batch | not an issue |
| txn aborted by other txn | replay commit | hit aborted txn record or TxnSpanGCThreshold |
"replay commit" but also simply a race between the coordinator commit and a 3rd party abort, right?
#### Replica-Side WriteTooOld Avoidance

Unfortunately, this currently breaks down because we have an optimization that avoids blocking transaction commit on WriteTooOld errors in two places. The first is during [1PC transaction evaluation](https://github.com/cockroachdb/cockroach/blob/8f30db0f07e940667bc34314ec6a446491a29790/pkg/storage/replica.go#L6036) and the second is during [EndTransaction evaluation](https://github.com/cockroachdb/cockroach/blob/8f30db0f07e940667bc34314ec6a446491a29790/pkg/storage/batcheval/cmd_end_transaction.go#L401). These cases allow a transaction to commit through a WriteTooOld error even without client-side approval. This breaks our ability to use MVCC to provide replay protection. It's unclear how important this optimization is as it's only applicable to heavily contended transactions with no refresh spans (i.e. blind writes). Blind writes from SQL are not very common and the optimization only saves a single round-trip to the client, who will never need an actual transaction retry because it can be trivially refreshed.
so, what do you suggest specifically?
This is a rare operation, so there is little concern of it adding strain to the read timestamp cache. Frequent page rotations in the cache will have the exact same effect as they do on any other transactional writes, so that is not a concern either.
#### Transaction Record GC |
It seems to me that it'd be a pity to remove the sync GC of txn records... Imagine workloads that just update some rows over and over - we'll end up with tons of txn records. I think we'll end up with huge ranges that can't even be split.
To solve the case where a Push doesn't know whether a txn record was GCed or not, can't we use the write timestamp cache? If the GC would populate it (as the EndTransaction already does), the Push could consult it and, if it gets a hit, the Push could be considered successful, I think.
Looks really good. My only concerns are about the loss of the TxnAutoGC optimization and the complexity of the "replay protection" section.
Reviewable status:
complete! 0 of 0 LGTMs obtained
docs/RFCS/20181209_lazy_txn_record_creation.md, line 176 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
There's kind of a solution, but I was curious if others had a better idea. See my comment below.
What about having the client send BeginTransaction exactly as it does now but make it a no-op on the server side? Then we could still detect 1PC. Or is it a goal to eliminate the complexity and bookkeeping that goes along with the client sending BeginTransaction?
docs/RFCS/20181209_lazy_txn_record_creation.md, line 49 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This sentence seems out of context and out of place; consider scratching it.
Or moving it somewhere else and expanding on it.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 57 at r3 (raw file):
1. they serve as the single linearization point of a transaction. Transactions are considered `COMMITTED` or `ABORTED` when the Raft entry that changes the transaction record to one of these two states is committed. This role interacts with the `EndTransaction` and `PushTxn` request types.
2. they serve as the source of truth for the liveness of a transaction. Transaction coordinators heartbeat their transaction record and concurrent transactions observe the state of a transaction's record to learn about its disposition. This role interacts with the `HeartbeatTxn`, `PushTxn`, and `QueryTxn` request types.
3. they perform bookkeeping on the resources held by finalized (`COMMITTED` or `ABORTED`) transactions. A finalized transaction record will point at each of the transaction's intents, allowing both the original transaction coordinator and other transactions to resolve and later garbage-collect the intents. This role interacts with the `ResolveIntent`, `ResolveIntentRange`, and `GC` request types.
A COMMITTED transaction record will point to the transaction's intents. An ABORTED one may or may not point to the intents, depending on who aborted it.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 81 at r3 (raw file):
### TxnCoordSender

The `TxnCoordSender` no longer sends `BeginTransaction` requests if it detects that every node in a cluster is at a version that can handle lazy transaction record creation.
If the upgrade is not finalized, it's still possible for updated nodes to be rolled back. I think this needs to be based on the cluster version setting instead of knowing what version the nodes are currently running.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 127 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
It seems to me that it'd be a pity to remove the sync GC of txn records... Imagine workloads that just update some rows over and over - we'll end up with tons of txn records. I think we'll end up with huge ranges that can't even be split.
To solve the case where a Push doesn't know if a txn record was GCed or not can't we use the write timestamp cache? If the GC would populate it (as the EndTransaction already does), the Push could consult it and, if it gets a hit, then the Push could be considered successful I think.
Yeah, the sync GC of txn records is pretty important (IIRC). If we remove that path then we are forced to write finalized txn records where we can currently elide them, which would negate the benefit of skipping the PENDING txn record in many cases.
What if when we start waiting on a missing txn record, we also query the txn again to see if it is still there, and rely on the fact that for committed txns all intents are resolved before the txn can be gc'd? I think this would allow us to keep the TxnAutoGC fast path for committed transactions (we might still need to write the txn record for aborted ones, but that's probably ok since the txn wait queue has made aborted txns less common).
docs/RFCS/20181209_lazy_txn_record_creation.md, line 141 at r3 (raw file):
Things are more interesting for concurrent transactions, which have no guarantee about the state that they observe a transaction in. It is completely valid that a concurrent transaction observes a missing transaction record that was either `COMMITTED` or `ABORTED` and later GCed without having any indication of what happened to it. This is already the case, and we opt for rewriting the transaction record as `ABORTED` in that case. This isn't great, but it also won't cause any issues. The situation becomes a little worse now that `batcheval.PushTxn` will handle (often wait) for missing transaction records. In fact, a `PushTxn` request has no way of determining whether a transaction record has not yet been written or has already been finalized and GCed. This won't cause any correctness issues, but it means that naively a `PushTxn` may end up waiting for a transaction record that has already been GCed simply because it saw activity on one of the transaction's intents within the `TxnLivenessThreshold`.
s/wait) for/wait for)/
docs/RFCS/20181209_lazy_txn_record_creation.md, line 162 at r3 (raw file):
However, we should be able to do more.
Why do more? If the write timestamp cache solves the basic problem, why introduce a more complex solution?
(force-pushed 52e1aa4 to dd6cbeb)
Did you consider sending the BeginTransaction request but just making it a noop, so you can minimize changes to the existing code?
Yes, Ben asked the same question. See below.
The only remaining discussion to resolve (and update RFC with) is about the TxnAutoGC optimization.
Also, @andy-kimball you might be interested in this RFC. I think your perspective would be valuable.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 176 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
What about having the client send BeginTransaction exactly as it does now but make it a no-op on the server side? Then we could still detect 1PC. Or is it a goal to eliminate the complexity and bookkeeping that goes along with the client sending BeginTransaction?
Spencer suggested this above. It will work before parallel commits, but once we introduce that change, an EndTransaction will no longer indicate that the suffix of a transaction is included in the same batch. I'd like a solution that solves the full parallel commits case.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 45 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Why? Consider giving a hint and linking to another section.
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 47 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This is the only reference in the document to `CanPushWithPriority`. I'd take it out.
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 49 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Or moving it somewhere else and expanding on it.
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 57 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/each intent/intent key spans
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 57 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
A COMMITTED transaction record will point to the transactions intents. An ABORTED one may or may not point to the intents, depending on who aborted it.
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 69 at r3 (raw file):
nit: s/further/
Done.
But I'm not sure I get it - if we're relying on the txn.Timestamp, that can be arbitrarily in the past. If it's way in the past, then the txn record should have been heartbeated and so it should exist, but still wouldn't it be a good idea to start recording some wallclock when the intent is laid down?
It can't be arbitrarily in the past. The minimum timestamp it can have is the transaction's OrigTimestamp. That is the timestamp that the coordinator will start counting from when waiting for the initial heartbeat interval. So its record should be heartbeated at or before `txn.Timestamp + DefaultHeartbeatInterval`, which will be before `txn.Timestamp + TxnLivenessThreshold`.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 81 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
If the upgrade is not finalized, it's still possible for updated nodes to be rolled back. I think this needs to be based on the cluster version setting instead of knowing what version the nodes are currently running.
That's what I was meaning, but the wording was off. Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 127 at r3 (raw file):
To solve the case where a Push doesn't know if a txn record was GCed or not can't we use the write timestamp cache? If the GC would populate it (as the EndTransaction already does), the Push could consult it and, if it gets a hit, then the Push could be considered successful I think.
That seems like it could work, but I don't see how the PUSH_ABORT case would work with a timestamp-cache like structure.
Yeah, the sync GC of txn records is pretty important (IIRC). If we remove that path then we are forced to write finalized txn records where we can currently elide them, which would negate the benefit of skipping the PENDING txn record in many cases.
Hmm, good point.
What if when we start waiting on a missing txn record, we also query the txn again to see if it is still there, and rely on the fact that for committed txns all intents are resolved before the txn can be gc'd? I think this would allow us to keep the TxnAutoGC fast path for committed transactions (we might still need to write the txn record for aborted ones, but that's probably ok since the txn wait queue has made aborted txns less common).
I think you mean "query the intent again". I like this idea and think it's a workable solution, although I still need to work my head around where this logic should live.
The only case where I see it breaking down is with replayed writes that hit WriteTooOld errors and get pushed forward but still lay down intents. Someone who runs into one of these intents will query the transaction record, find nothing, query the intent again, find it, and then wait up to `TxnLivenessThreshold`. These should be rare enough that this situation can be ignored, but it's worth pointing it out.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 141 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/wait) for/wait for)/
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 162 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
then they will not be resolvable
Mind putting some more words here? You mean to say that the `EndTransaction` batch will succeed, but no intents will have been laid down by the other write batches, so it all works out? But, wouldn't the intents still be laid down (albeit at a higher timestamp than the original)? Isn't that a problem?
No, it's not a problem. Those will be aborted if the transaction is recommitted. See the link I posted above.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 162 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
However, we should be able to do more.
Why do more? If the write timestamp cache solves the basic problem, why introduce a more complex solution?
Yeah, that's what I'm currently thinking. We can leave this for later, if we ever want to address it at all. Added a section.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 167 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: I think "not an issue" is a bit ambiguous. Maybe "could happen; no harm, no foul"
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 171 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
"replay commit" but also simply a race between the coordinator commit and a 3rd party abort, right?
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 176 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
so, what do you suggest specifically?
Done.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 69 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: s/further/
Done.
But I'm not sure I get it - if we're relying on the txn.Timestamp, that can be arbitrarily in the past. If it's way in the past, then the txn record should have been heartbeated and so it should exist, but still wouldn't it be a good idea to start recording some wallclock when the intent is laid down?
It can't be arbitrarily in the past. The minimum timestamp it can have is the transaction's OrigTimestamp. That is the timestamp that the coordinator will start counting from when waiting for the initial heartbeat interval. So its record should be heartbeated at or before `txn.Timestamp + DefaultHeartbeatInterval`, which will be before `txn.Timestamp + TxnLivenessThreshold`.
yeah, I was recognizing what you're saying, but still - don't you think it's better to record a timestamp taken from the intent range's clock when the intent is laid down? Doing so would eliminate the need for much reasoning.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 127 at r3 (raw file):
That seems like it could work, but I don't see how the PUSH_ABORT case would work with a timestamp-cache like structure.
I was thinking that a PUSH_ABORT would succeed and write an aborted txn record if it runs into the timestamp cache. Which means that we might leave some of these aborted records around to be cleaned up later, which isn't great... Unless we store some more info in the timestamp cache so that the push can know exactly what happened to that transaction...
Anyway, I think Ben's suggestion is better.
I'm not sure I understand your reply to Ben - what's wrong about the restarted txns and their intents? Intents from old epochs should also be cleaned up once the txn commits, right?
Remove transaction record creation from serving as the first step of every non-1PC transaction. Defer transaction record creation as late as possible in the lifetime of a transaction, often skipping a transaction record with the `PENDING` state entirely. This will provide a performance improvement, simplify the transaction model, and pave the way for the implementation of the more ambitious parallel commits RFC. Release note: None
(force-pushed dd6cbeb to c007c99)
I updated the RFC to reflect the current thinking about the TxnAutoGC optimization. I think I've convinced myself that querying the intent again when entering the `txnwait.Queue` is the right solution here.
I'm planning on merging this sometime tomorrow if no one else has comments.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 69 at r3 (raw file):
I used to think so, but we don't actually do something like this anywhere. Instead, most request types have `Now` fields (e.g. `HeartbeatTxnRequest`, `PushTxnRequest`, etc.). This wouldn't be much different.
Doing so would eliminate the need for much reasoning.
I don't think the reasoning is much of a stretch. We're just saying that the intent timestamp must be equal to or greater than the txn's OrigTimestamp. Also, I'd like to avoid a second timestamp on intents for backwards compatibility and for performance.
docs/RFCS/20181209_lazy_txn_record_creation.md, line 127 at r3 (raw file):
I just adjusted the text to reflect the strategy of issuing a `QueryIntent` request when entering the `txnwait.Queue`. With this approach, we won't need to adjust the txn record GC policy.
I'm not sure I understand you reply to Ben - what's wrong about the restarted txns and their intents? Intents from old epochs should also be cleaned up once the txn commits, right?
I'm not concerned with restarted txns; I'm concerned about writes that get replayed after a transaction has finished. These can leave stray intents around, fool our detection of transaction liveness, and cause a stall of up to `TxnLivenessThreshold` (2s). I suspect that these are rare enough to ignore in practice.
Reviewed 2 of 2 files at r4.
bors r+
32971: RFC: Lazy Transaction Record Creation r=nvanbenschoten a=nvanbenschoten Remove transaction record creation from serving as the first step of every non-1PC transaction. Defer transaction record creation as late as possible in the lifetime of a transaction, often skipping a transaction record with the `PENDING` state entirely. This will provide a performance improvement, simplify the transaction model, and pave the way for the implementation of the more ambitious parallel commits RFC. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
Build succeeded
33396: storage: allow HeartbeatTxn and EndTxn requests to create txn records r=nvanbenschoten a=nvanbenschoten Informs #25437. Informs #32971. This is the first part of addressing #32971. Part two will update concurrent txns to not immediately abort missing txn records and part three will update the txn client to stop sending BeginTxn records. The change closely resembles what was laid out in the corresponding RFC sections. `HeartbeatTxn` and `EndTxn` are both updated to permit the creation of transaction records when they find that one is missing. To prevent replays from causing issues, they *both* check the write timestamp cache when creating a transaction record. This is one area where the change diverges from the RFC. Instead of having `EndTxn` always check the write timestamp cache, it only does so if it is creating a txn record from scratch. To make this safe, `HeartbeatTxn` also checks the write timestamp cache if it is creating a txn record from scratch. This prevents a long-running transaction from increasingly running the risk of being aborted by a lease transfer as its `EndTxn` continues to be delayed. Instead, a lease transfer will only abort a transaction if it comes before the transaction record creation, which should be within a second of the transaction starting. The change pulls out a new `CanCreateTxnRecord` method, which has the potential of being useful for detecting eagerly GCed transaction records and solving the major problem from the RFC without an extra QueryIntent. This is what Andrei was pushing for before. More details are included in TODOs within the PR.

### Migration Safety

Most of the changes to the transaction state machine are straightforward to validate. Transaction records can still never be created without checking for replays and only an EndTxn from the client's sync path can GC an ABORTED txn record.
This means that it is still impossible for a transaction to move between the two finalized statuses (at some point it would be worth it to draw this state machine). The one tricky part is ensuring that the changes in this PR are safe when run in mixed-version clusters. The three areas of concern are:

1. lease transfers between a new and an old cluster on a range that should/does contain a transaction record.
2. an old txn client that is not expecting HeartbeatTxn and EndTxn requests to create txn records if they are found to be missing.
3. an old txn client may not expect a HeartbeatTxn to hit the write timestamp cache if run concurrently with an EndTxn.

The first concern is protected by the write timestamp cache. Regardless of which replica creates that transaction record from a BeginTxn req or a HeartbeatTxn req (on the new replica), it will first need to check the timestamp cache. This protects against any kind of replay that could create a transaction record that the old replica is not prepared to handle. We can break the second concern into two parts. First, an old txn client will never send an EndTxn without having sent a successful BeginTxn, so the new behavior there is not an issue. Second, an old txn client may send a HeartbeatTxn concurrently with a BeginTxn. If the heartbeat wins, it will create a txn record on a new replica. If the BeginTxn evaluates on a new replica, it will be a no-op. If it evaluates on an old replica, it will result in a retryable error. The third concern is specific to the implementation of the heartbeat loop itself. If a HeartbeatTxn loses a race with an EndTxn, it may get an aborted error after checking the timestamp cache. This is desirable from the replica-side, but we'd need to ensure that the client will handle it correctly.
This PR adds some protection (see `sendingEndTxn`) to the txn client to prevent this case from causing weirdness on the client, but I don't think this could actually cause issues even without this protection. The reason is that the txn client might mark itself as aborted due to the heartbeat, but this will be overwritten when the EndTxn returns and the sql client will still hear back from the successful EndTxn. This must have actually always been an issue because it was always possible for a committed txn record to be GCed and then written as aborted later, at which point a concurrent heartbeat could have observed it. I'd like Andrei to sign off on this last hazard, as he's thought about this kind of thing more than anybody. Co-authored-by: Nathan VanBenschoten <[email protected]>
Informs cockroachdb#25437. Informs cockroachdb#32971. This is the second part of addressing cockroachdb#32971. The final part will be updating the txn client to stop sending BeginTxn requests. The change closely resembles what is laid out in the corresponding RFC sections (`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn` requests are adjusted to tolerate missing transaction records and to synthesize them based on the information pulled from intents if necessary. By hiding these details behind the request API abstraction, we don't actually need to modify the `txnwait.Queue` at all! The change does diverge from the RFC in two distinct ways. The first is that it introduces a new invariant about transaction record creation. Transaction records can now only ever be created by requests originating from their own coordinator (`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any state (not even ABORTED) by concurrent actors. This is actually a much stronger invariant than what previously existed (and even what was proposed in the RFC), and it simplifies a lot of logic by dramatically refining the transaction state machine. We now know that for a transaction record to exist, it must have been created by its own coordinator and it must have checked with `CanCreateTxnRecord`. Making this new invariant work required the second major divergence from the RFC. The RFC suggested that we use the read timestamp cache for successful transaction timestamp pushes before a transaction record was written. This PR takes this a step further and uses the write timestamp cache for successful transaction aborts before a transaction record was written. In doing this, we emerge with a very sane timestamp cache policy - the read timestamp cache is used to push transaction timestamps and the write timestamp cache is used to abort them entirely (which it was already essentially being used for).
The txnID marker on these timestamp cache entries then becomes the transaction that initiated the push/abort. Transactions then consult both of these sources before creating their transaction record in `CanCreateTxnRecord`. In doing this, we avoid an entire class of situations where concurrent transactions created and abandoned transaction records for another transaction, requiring eventual GC. Release note: None
Based on cockroachdb#33523. Closes cockroachdb#25437. This is the final part of addressing cockroachdb#32971. The change gates the use of BeginTransaction on a cluster version. When a cluster's version is sufficiently high, it will stop sending them. To make this work, a small change was needed for the detection of 1PC transactions. We now use sequence numbers to determine that a batch includes all of the writes in a transaction. This is in line with the proposal from the RFC's `1PC Transactions` section and will work correctly with parallel commits. Release note (performance improvement): Reduce the network and storage overhead of multi-Range transactions.
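The sequence-number-based 1PC detection mentioned above can be sketched as follows. This is a hedged illustration under assumed types, not the real `batcheval` code: it only shows the idea that a batch ending in EndTxn is a complete transaction when its writes cover the contiguous sequence range starting at 1, so no BeginTxn marker is needed.

```go
package main

import "fmt"

// write is a simplified transactional write carrying a client-assigned
// sequence number (assigned in issue order, starting at 1).
type write struct {
	seq int32
}

// batchIsCompleteTxn reports whether a batch ending in an EndTxn
// contains every write the transaction ever issued, which is the
// condition for taking the one-phase-commit fast path. Rather than
// requiring a BeginTxn request at the front of the batch, it checks
// that the batch's writes span the contiguous sequence range 1..n.
func batchIsCompleteTxn(writes []write) bool {
	n := len(writes)
	if n == 0 {
		return false
	}
	// Writes in a batch arrive ordered by strictly increasing sequence
	// number, so checking the endpoints suffices for contiguity.
	return writes[0].seq == 1 && writes[n-1].seq == int32(n)
}

func main() {
	fmt.Println(batchIsCompleteTxn([]write{{1}, {2}, {3}})) // eligible for 1PC
	fmt.Println(batchIsCompleteTxn([]write{{2}, {3}}))      // an earlier write landed elsewhere: not 1PC
}
```

This check also composes with parallel commits, since it depends only on the sequence numbers present in the final batch.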
33523: storage: tolerate missing transaction records when pushing r=nvanbenschoten a=nvanbenschoten Informs #25437. Informs #32971. This is the second part of addressing #32971. The final part will be updating the txn client to stop sending BeginTxn requests. The change closely resembles what is laid out in the corresponding RFC sections (`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn` requests are adjusted to tolerate missing transaction records and to synthesize them based on the information pulled from intents if necessary. By hiding these details behind the request API abstraction, we don't actually need to modify the `txnwait.Queue` at all! The change does diverge from the RFC in two distinct ways. The first is that it introduces a new invariant about transaction record creation. Transaction records can now only ever be created by requests originating from their own coordinator (`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any state (not even ABORTED) by concurrent actors. This is a much stronger invariant than what previously existed (and even what was proposed in the RFC), and it simplifies a lot of logic by dramatically refining the transaction state machine. We now know that for a transaction record to exist, it must have been created by its own coordinator and it must have checked with `CanCreateTxnRecord`. Making this new invariant work required the second major divergence from the RFC. The RFC suggested that we use the read timestamp cache for successful transaction timestamp pushes before a transaction record is written. This PR takes this a step further and uses the write timestamp cache for successful transaction aborts before a transaction record is written. 
In doing this, we emerge with a very sane timestamp cache policy - the read timestamp cache is used to push transaction timestamps and the write timestamp cache is used to abort them entirely (which it was already essentially being used for). The txnID marker on these timestamp cache entries becomes the transaction that initiated the push/abort. Transactions then consult both of these sources before creating their transaction record in `CanCreateTxnRecord`. In doing this, we avoid an entire class of situations where concurrent transactions created and abandoned transaction records for another transaction, requiring eventual GC. Co-authored-by: Nathan VanBenschoten <[email protected]>
33566: kv: stop using BeginTransaction requests r=nvanbenschoten a=nvanbenschoten Based on #33523. Closes #25437. This is the final part of addressing #32971. The change gates the use of BeginTransaction on a cluster version. When a cluster's version is sufficiently high, it will stop sending them. To make this work, a small change was needed for the detection of 1PC transactions. We now use sequence numbers to determine that a batch includes all of the writes in a transaction. This is in line with the proposal from the RFC's `1PC Transactions` section and will work correctly with parallel commits. Release note (performance improvement): Reduce the network and storage overhead of multi-Range transactions. Co-authored-by: Nathan VanBenschoten <[email protected]>
This commit completes the migration started in cockroachdb#32971 (a year ago to the week). It removes all references to `BeginTransactionRequest` and deletes the request type entirely. This reduces the number of request types that can create transaction records down to just two: `EndTransactionRequest` and `HeartbeatTxnRequest` (see the state transition diagram in `replica_tscache.go`). It is safe to remove server-side handling of `BeginTransactionRequest` because we never use it in v19.2 binaries, so v20.1 binaries will never run in a cluster that is using the request type. Once this commit is merged, I'm going to `s/EndTransaction/EndTxn/g` to bring that request name in line with all other transaction metadata requests (e.g. `HeartbeatTxnRequest`, `PushTxnRequest`, `QueryTxnRequest`, and `RecoverTxnRequest`). Release note: None
42933: roachpb/storage: complete removal of BeginTransactionRequest r=nvanbenschoten a=nvanbenschoten This commit completes the migration started in #32971 (a year ago to the week). It removes all references to `BeginTransactionRequest` and deletes the request type entirely. This reduces the number of request types that can create transaction records down to just two: `EndTransactionRequest` and `HeartbeatTxnRequest` (see the state transition diagram in `replica_tscache.go`). It is safe to remove server-side handling of `BeginTransactionRequest` because we never use it in v19.2 binaries, so v20.1 binaries will never run in a cluster that is using the request type. Once this commit is merged, I'm going to `s/EndTransaction/EndTxn/g` to bring that request name in line with all other transaction metadata requests (e.g. `HeartbeatTxnRequest`, `PushTxnRequest`, `QueryTxnRequest`, and `RecoverTxnRequest`). Co-authored-by: Nathan VanBenschoten <[email protected]>
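With BeginTransaction removed, the record-creation side of the state machine is small enough to sketch directly. This is a hypothetical rendering (the authoritative diagram lives in `replica_tscache.go`): it only captures which request types may write a transaction record that does not yet exist.

```go
package main

import "fmt"

// reqType enumerates the transaction metadata requests that remain
// after BeginTransactionRequest is deleted.
type reqType string

const (
	HeartbeatTxn reqType = "HeartbeatTxn"
	EndTxn       reqType = "EndTxn"
	PushTxn      reqType = "PushTxn"
	QueryTxn     reqType = "QueryTxn"
	RecoverTxn   reqType = "RecoverTxn"
)

// canCreateRecord reports whether a request type is allowed to create
// a transaction record. Only requests originating from the
// transaction's own coordinator qualify; pushers and queries must
// instead synthesize transaction state from intents and the timestamp
// caches.
func canCreateRecord(r reqType) bool {
	switch r {
	case HeartbeatTxn, EndTxn:
		return true
	default:
		return false
	}
}

func main() {
	for _, r := range []reqType{HeartbeatTxn, EndTxn, PushTxn, QueryTxn} {
		fmt.Println(r, canCreateRecord(r))
	}
}
```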
Remove transaction record creation from serving as the first step of
every non-1PC transaction. Defer transaction record creation as late
as possible in the lifetime of a transaction, often skipping a
transaction record with the `PENDING` state entirely.

This will provide a performance improvement, simplify the transaction
model, and pave the way for the implementation of the more ambitious
parallel commits RFC.

Release note: None