
Use full txn struct instead of just txn ID #55

Merged
merged 3 commits into master from spencerkimball/store-txn on Sep 14, 2014

Conversation

spencerkimball
Member

Move from dealing with just transaction ID to using the full transaction
struct in write intents and in request / response headers.

Cleaned up some code in MVCC that was unnecessarily complex and introduced
concept of transaction epochs for Put() and for ResolveIntent(). Transaction
epochs start at 0 and are incremented every time a transaction is retried.
Transactions are retried (instead of aborted) in the event that a value is
encountered within the "window of uncertainty" between RequestHeader.Timestamp
and RequestHeader.MaxTimestamp, or in the case of an SSI transaction, if
the final transaction timestamp is not equal to the initial transaction
timestamp.

Typically, when a transaction has to restart, it covers all of the same
keys on writes and increments the intents' epochs accordingly. However,
if there's conditional logic in the transaction, it might end up skipping
some keys that were written in the previous run--those earlier intents should
then not be committed when the intents are resolved after txn commit.

@spencerkimball
Member Author

@bdarnell @Tobstar87

@spencerkimball spencerkimball force-pushed the spencerkimball/store-txn branch from 56d4350 to dfe6f93 Compare September 12, 2014 14:14
// transaction should always be sent with subsequent requests which
// are part of the same transaction. If multiple requests are sent
// in parallel as part of the same transaction, then subsequent
// specification of Txn should use the maximum of all returned
Contributor

It makes me nervous to make the Transaction non-opaque to clients; this could be an obstacle to future changes to the struct. What if we make EndTransactionRequest take a list of Transactions, and require that each ResponseHeader.Txn must be passed back to the server (either in some RequestHeader that trades it in for a new Transaction in the response, or in the final EndTransactionRequest). We could retain the "use the maximum timestamp" rule as a heuristic when applications have multiple Transactions to choose from, but it would no longer be required for correctness.

Member Author

Gotcha...I started work on EndTransaction and came to the conclusion that the Transaction object returned in the ResponseHeader shouldn't change / be supplied at all except for as a response to the first request in the transaction (specifying an empty, non-nil Transaction to start the transaction).

Instead of having the transaction timestamp munged, we now always rely on the client to pass up the maximum timestamp seen so far via RequestHeader.Timestamp from all ResponseHeader.Timestamps. This is far more reasonable and allows us to have an immutable transaction object for the length of a single run of a transaction.

I changed this PR to include a BeginTransaction API call. Here are the pros:

  • Eliminates confusion in the API as to when a txn should be supplied, whether it should be empty, whether or not isolation or priority should be included even if TxnID is missing, etc. It was getting much too confusing.
  • BeginTransaction is satisfied immediately by the receiving node--despite having a key which it's targeted at (this is how we build up the locality-aware txn ID), it's not passed to the range leader; all we need to return a valid Txn struct is a clock synchronized with the cluster. This means that BeginTransaction will be as fast as a round trip to the client's gateway node.
  • Building on the previous point, many transactions will want to fan out parallel writes. In order to do that, you need a txn struct up front. Previously, we would have had to do a "real" request of some sort, which would have incurred the latency of either route-to-leader for a read or raft consensus for a write.

Cons:

  • Requires a round trip to the gateway node even if the txn could have been done as a sequential set of commands where the first could have usefully created the Txn struct as well.

Feels like making the API more explicit weighed most heavily for me. The cost seems minimal given we only hit the gateway node.

@bdarnell
Contributor

LGTM

@spencerkimball spencerkimball force-pushed the spencerkimball/store-txn branch 2 times, most recently from 97b3979 to 44dc8cc Compare September 13, 2014 00:03
@spencerkimball
Member Author

@bdarnell: might want to take another gander at this one. It's changed a lot.

// We can update the current metadata as long as the timestamp is
// greater than or equal and if there's a transaction underway,
// the current epoch must be greater.
if !timestamp.Less(meta.Timestamp) && (meta.Txn == nil || txn.Epoch >= meta.Txn.Epoch) {
Contributor

don't we need to compare the txnID for the second condition?

Member Author

No because it's already checked by the if block immediately preceding it. At this if statement, we're already guaranteed that if there's an existing intent, the txn IDs match.

Contributor

oh, yes, i missed that part. sgtm

@jiangmingyang
Contributor

LGTM with two comments.

@spencerkimball
Member Author

@jiangmingyang @bdarnell I added another commit to the PR addressing some additional items. Please take another look.

@jiangmingyang
Contributor

lgtm for the new changes.

reply.Rows = kvs
reply.SetGoError(err)
}

// BeginTransaction returns an error, as it's handled trivially at
// any cockroach node and doesn't need to be addressed to a range.
func (r *Range) BeginTransaction(args *proto.BeginTransactionRequest, reply *proto.BeginTransactionResponse) {
Member

why even have it here then?

Member Author

Hmm, good point. Will remove.

@spencerkimball spencerkimball force-pushed the spencerkimball/store-txn branch 2 times, most recently from c843fe1 to 954d76a Compare September 14, 2014 14:05
Spencer Kimball added 3 commits September 14, 2014 10:50
…cation

of Transaction and Timestamp. In particular, the Transaction object is now
not supplied in the Response except for in the case of BeginTransaction
and HeartbeatTransaction requests.

Clarified the comments for ResponseHeader to indicate that the
request Timestamp must be continually upgraded to be the maximum
of all response Timestamps.
 - Use write batch for all multi-put ops in MVCC.

 - Delete previous intent versions when overwriting an intent with
   a newer timestamp within same transaction. This can happen on
   transaction retries.

 - When resolving an intent, make sure to properly update the intent
   value's timestamp if the commit timestamp is different from the
   original timestamp the value was written at.
@spencerkimball spencerkimball force-pushed the spencerkimball/store-txn branch from 954d76a to 5d5ec34 Compare September 14, 2014 14:51
spencerkimball added a commit that referenced this pull request Sep 14, 2014
Use full txn struct instead of just txn ID
@spencerkimball spencerkimball merged commit 5c66ec3 into master Sep 14, 2014
@spencerkimball spencerkimball deleted the spencerkimball/store-txn branch September 14, 2014 14:51
@spencerkimball spencerkimball restored the spencerkimball/store-txn branch September 14, 2014 14:51
@spencerkimball spencerkimball deleted the spencerkimball/store-txn branch September 14, 2014 15:25
@bdarnell
Contributor

LGTM

tbg referenced this pull request in tbg/cockroach Aug 31, 2016
We have been seeing long startup times which disappear spontaneously. During a
restart of the beta cluster, the following goroutine was observed, which suggests
that we were spending a lot of time GCing replicas on startup.

    engine              ??:0                     _Cfunc_DBIterNext(cockroachdb#324, cockroachdb#323, 0, 0, 0, 0, 0, 0, 0)
    engine              rocksdb.go:1135          (*rocksDBIterator).Next(cockroachdb#235)
    storage             replica_data_iter.go:104 (*ReplicaDataIterator).Next(cockroachdb#316)
    storage             store.go:1748            (*Store).destroyReplicaData(#109, cockroachdb#317, 0, 0)
    storage             store.go:841             (*Store).Start.func2(0x101b, cockroachdb#300, 0x36, 0x40, cockroachdb#301, 0x36, 0x40, cockroachdb#315, 0x3, 0x4, ...)
    storage             store.go:734             IterateRangeDescriptors.func1(cockroachdb#306, 0x40, 0x41, cockroachdb#307, 0x92, 0x92, cockroachdb#341, 0, 0x186c, 0x4000, ...)
    engine              mvcc.go:1593             MVCCIterate(cockroachdb#329, #68, #47, #81, cockroachdb#232, 0x9, 0x10, cockroachdb#233, 0xb, 0x10, ...)
    storage             store.go:738             IterateRangeDescriptors(cockroachdb#330, cockroachdb#196, #47, #81, cockroachdb#195, #179, #110)
    storage             store.go:867             (*Store).Start(#109, cockroachdb#330, cockroachdb#196, #179, cockroachdb#185, 0x1)
    server              node.go:405              (*Node).initStores(#78, cockroachdb#330, cockroachdb#196, #98, 0x1, 0x1, #179, 0, #55)
    server              node.go:330              (*Node).start(#78, cockroachdb#330, cockroachdb#196, #42, #129, #98, 0x1, 0x1, 0, 0, ...)
    server              server.go:431            (*Server).Start(#5, cockroachdb#330, cockroachdb#196, #95, 0x1)
    cli                 start.go:368             runStart(#34, #178, 0, 0x9, 0, 0)
    cobra               command.go:599           (*Command).execute(#34, #177, 0x9, 0x9, #34, #177)
    cobra               command.go:689           (*Command).ExecuteC(#33, #70, cockroachdb#343, #72)
    cobra               command.go:648           (*Command).Execute(#33, #71, cockroachdb#343)
    cli                 cli.go:96                Run(#64, 0xa, 0xa, 0, 0)
    main                main.go:37               main()