Use full txn struct instead of just txn ID #55
Conversation
Force-pushed from 56d4350 to dfe6f93
// transaction should always be sent with subsequent requests which
// are part of the same transaction. If multiple requests are sent
// in parallel as part of the same transaction, then subsequent
// specification of Txn should use the maximum of all returned
It makes me nervous to make the Transaction non-opaque to clients; this could be an obstacle to future changes to the struct. What if we make EndTransactionRequest take a list of Transactions, and require that each ResponseHeader.Txn must be passed back to the server (either in some RequestHeader that trades it in for a new Transaction in the response, or in the final EndTransactionRequest). We could retain the "use the maximum timestamp" rule as a heuristic when applications have multiple Transactions to choose from, but it would no longer be required for correctness.
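To make the shape of that concrete, here is a rough Go sketch of the proposal; every name below is illustrative, not an actual cockroach definition:

```go
package proto // hypothetical -- illustrative shapes only

// Transaction stays opaque to clients, so the struct can evolve
// without breaking them.
type Transaction struct {
	opaque []byte
}

// ResponseHeader.Txn must be passed back to the server: either
// traded in for a new Transaction on a later request, or collected
// into the final EndTransactionRequest.
type ResponseHeader struct {
	Txn *Transaction
}

// EndTransactionRequest carries every Transaction the client
// received; the server computes the maximum timestamp itself, so the
// "use the maximum" rule becomes a client heuristic rather than a
// correctness requirement.
type EndTransactionRequest struct {
	Commit bool
	Txns   []Transaction
}
```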
Gotcha...I started work on EndTransaction and came to the conclusion that the Transaction object returned in the ResponseHeader shouldn't change / be supplied at all, except as a response to the first request in the transaction (specifying an empty, non-nil Transaction to start the transaction).
Instead of having the transaction timestamp munged, we always rely on the client to pass up the maximum timestamp seen so far via RequestHeader.Timestamp, taken over all ResponseHeader.Timestamps. This is far more reasonable and allows us to have an immutable transaction object for the length of a single run of a transaction.
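A minimal sketch of that client-side rule, assuming a wall/logical timestamp ordered the way MVCC timestamps are (the names here are mine, not the actual client code):

```go
package client // hypothetical client-side sketch

// Timestamp is a stand-in for the proto timestamp: ordered by wall
// time, with the logical component breaking ties.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(s Timestamp) bool {
	return t.WallTime < s.WallTime ||
		(t.WallTime == s.WallTime && t.Logical < s.Logical)
}

// forwardTimestamp ratchets the request timestamp up to the maximum
// of all response timestamps seen so far; the Txn struct itself is
// never mutated during a single run of the transaction.
func forwardTimestamp(reqTS *Timestamp, respTS Timestamp) {
	if reqTS.Less(respTS) {
		*reqTS = respTS
	}
}
```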
I changed this PR to include a BeginTransaction API call. Here are the pros:
- Eliminates ambiguity in the API as to when a txn should be supplied, whether it should be empty, whether isolation or priority should be included even if TxnID is missing, etc. It was getting much too confusing.
- BeginTransaction is satisfied immediately by the receiving node--despite being targeted at a key (this is how we build up the locality-aware txn ID), it's not passed to the range leader; all we need to return a valid Txn struct is a clock synchronized with the cluster. This means that BeginTransaction will be as fast as a round trip to the client's gateway node.
- Building on the previous point, many transactions will want to fan out parallel writes, and to do that you need a txn struct up front. Previously, we would have had to make a "real" request of some sort, which would have incurred the latency of either routing to the leader for a read or achieving raft consensus for a write (see the sketch after this list).
Cons:
- Requires a round trip to the gateway node even if the txn could have been run as a sequential set of commands, where the first command could have usefully created the Txn struct as well.
Making the API more explicit weighed most heavily for me. The cost seems minimal given we only hit the gateway node.
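For illustration, the fan-out case from the pros above might look like this on the client (method names and types are assumptions, not the actual kv client API):

```go
package client // hypothetical sketch, not the real client

// Stubs standing in for the real client and txn types.
type Txn struct{}
type Client struct{}

func (c *Client) BeginTransaction() (*Txn, error)          { return &Txn{}, nil } // gateway clock only
func (c *Client) Put(t *Txn, key string, v []byte) error   { return nil }
func (c *Client) EndTransaction(t *Txn, commit bool) error { return nil }

// writeAllInTxn fans out parallel writes under one Txn struct, which
// is exactly the case BeginTransaction makes cheap: no route-to-leader
// or raft consensus is needed before the writers can start.
func writeAllInTxn(c *Client, keys []string, value []byte) error {
	txn, err := c.BeginTransaction() // fast: answered at the gateway
	if err != nil {
		return err
	}
	errCh := make(chan error, len(keys))
	for _, key := range keys {
		go func(key string) {
			// Every writer shares the same immutable Txn struct.
			errCh <- c.Put(txn, key, value)
		}(key)
	}
	for range keys {
		if err := <-errCh; err != nil {
			return err
		}
	}
	return c.EndTransaction(txn, true /* commit */)
}
```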
LGTM
Force-pushed from 97b3979 to 44dc8cc
@bdarnell: might want to take another gander at this one. It's changed a lot.
// We can update the current metadata as long as the timestamp is
// greater than or equal and, if there's a transaction underway,
// the txn epoch is greater than or equal.
if !timestamp.Less(meta.Timestamp) && (meta.Txn == nil || txn.Epoch >= meta.Txn.Epoch) { |
don't we need to compare the txnID for the second condition?
No because it's already checked by the if block immediately preceding it. At this if statement, we're already guaranteed that if there's an existing intent, the txn IDs match.
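For anyone else reading along, the surrounding control flow is roughly this paraphrase (simplified; writeIntentError is a stand-in name, not the exact source):

```go
// Simplified paraphrase of the two checks, in order.
if meta.Txn != nil && (txn == nil || !bytes.Equal(meta.Txn.ID, txn.ID)) {
	// Someone else's intent: write/write conflict, bail out here.
	return &writeIntentError{Txn: meta.Txn}
}
// Past this point any existing intent is known to be ours, so the
// second condition only needs to compare epochs, not txn IDs.
if !timestamp.Less(meta.Timestamp) && (meta.Txn == nil || txn.Epoch >= meta.Txn.Epoch) {
	// ok to update the current metadata
}
```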
oh, yes, i missed that part. sgtm
LGTM with two comments.
@jiangmingyang @bdarnell I added another commit to the PR addressing some additional items. Please take another look.
lgtm for the new changes.
reply.Rows = kvs
reply.SetGoError(err)
}
// BeginTransaction returns an error, as it's handled trivially at
// any cockroach node and doesn't need to be addressed to a range.
func (r *Range) BeginTransaction(args *proto.BeginTransactionRequest, reply *proto.BeginTransactionResponse) {
why even have it here then?
Hmm, good point. Will remove.
Force-pushed from c843fe1 to 954d76a
…cation of Transaction and Timestamp. In particular, the Transaction object is now not supplied in the Response except in the case of BeginTransaction and HeartbeatTransaction requests. Clarified the comments for ResponseHeader to indicate that the request Timestamp must be continually upgraded to be the maximum of all response Timestamps.
- Use write batch for all multi-put ops in MVCC.
- Delete previous intent versions when overwriting an intent with a newer timestamp within same transaction. This can happen on transaction retries.
- When resolving an intent, make sure to properly update the intent value's timestamp if the commit timestamp is different from the original timestamp the value was written at.
Force-pushed from 954d76a to 5d5ec34
Use full txn struct instead of just txn ID
LGTM
Move from dealing with just the transaction ID to using the full transaction
struct in write intents and in request / response headers.
Cleaned up some code in MVCC that was unnecessarily complex and introduced
the concept of transaction epochs for Put() and for ResolveIntent(). Transaction
epochs start at 0 and are incremented every time a transaction is retried.
Transactions are retried (instead of aborted) in the event that a value is
encountered within the "window of uncertainty" between RequestHeader.Timestamp
and RequestHeader.MaxTimestamp, or in the case of an SSI transaction, if
the final transaction timestamp is not equal to the initial transaction
timestamp.
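In sketch form, the two retry triggers read roughly as follows (reusing the Timestamp stand-in from the earlier sketch; all names are assumptions, not the actual code):

```go
// Sketch only -- not the actual retry logic.
type IsolationType int

const (
	SerializableSnapshot IsolationType = iota // SSI
	Snapshot                                  // SI
)

// shouldRetry reports whether a txn restarts (with its epoch bumped)
// rather than aborting outright.
func shouldRetry(iso IsolationType, valueTS, origTS, maxTS, finalTS Timestamp) bool {
	// A value landed inside the uncertainty window between the
	// request Timestamp and MaxTimestamp.
	if origTS.Less(valueTS) && !maxTS.Less(valueTS) {
		return true
	}
	// SSI: the final txn timestamp was pushed past the initial one.
	if iso == SerializableSnapshot && origTS.Less(finalTS) {
		return true
	}
	return false
}
```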
Typically, when a transaction has to restart, it covers all of the same
keys on writes and increments the intents' epochs accordingly. However,
if there's conditional logic in the transaction, it might end up skipping
some keys which were written on the previous run--those earlier intents
should then not be committed when the intents are resolved after txn commit.
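A sketch of the epoch rule this implies at resolution time (simplified, not the actual ResolveIntent code):

```go
// Sketch only: decide an intent's fate when resolving after the txn
// has finished. Intents written in an earlier epoch were skipped by
// the final retry, so they are removed rather than committed.
func resolveAction(intentEpoch, txnEpoch int32, committed bool) string {
	switch {
	case !committed:
		return "remove" // txn aborted: discard the intent
	case intentEpoch == txnEpoch:
		return "commit" // rewrite the value at the commit timestamp
	default:
		return "remove" // stale epoch: key was skipped on the last run
	}
}
```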