From 5c037e659d75679066589b39a8bac4299aac6f86 Mon Sep 17 00:00:00 2001 From: Nathan VanBenschoten Date: Wed, 25 Nov 2020 13:04:30 -0500 Subject: [PATCH] kv: return uncertainty error on intent in uncertainty interval MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This commit is a partial revert of #40600 and #46830. It solves the same problem that those PRs were solving, but in a different way. Those two PRs handled the case where a reading transaction observes an intent in its uncertainty interval. Before those fixes, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads, because an unresolved intent does not imply that its transaction is still uncommitted; the transaction may already have committed and be a causal ancestor of the scan. The approach the first PR took was to throw a WriteIntentError on intents in a scan's uncertainty interval. Effectively, it made scans consider intents in their uncertainty interval to be write-read conflicts. This had the benefit that the scan would push the intent and eventually resolve the conflict, either by aborting the intent, pushing it out of the read's uncertainty interval, or waiting for it to commit. In this last case (which is by far the most common), the scan would then throw an uncertainty error, because it now had a committed value in its uncertainty interval. The second PR introduced some awkward code to deal with the fallout from this decision. Once we started trying to keep the lock table in sync with the MVCC state, we needed to use latches to ensure proper synchronization between any operations that could conflict, because such conflicts would influence the state of the lock table. This meant that we needed to start holding latches across a read's uncertainty interval, because intent writes in this interval were considered write-read conflicts. This led to some moderately gross code and always felt a little wrong. Now that we are complicating the logic around uncertainty intervals even further, this becomes even more of a burden. This motivates a reworking of these interactions. This commit replaces the logic that considers intents in a transaction's uncertainty interval to be write-read conflicts with logic that considers such intents to be... uncertain. Doing so means that a transaction will immediately refresh/restart above the uncertain timestamp and will only then begin conflicting with the intent. This has a number of nice benefits: 1. it keeps all considerations of read uncertainty intervals down in the MVCC iteration logic. The lock table no longer needs to be aware of them. This is a big win now and an even bigger win once we introduce synthetic timestamps. 2. readers that are almost certainly bound to hit an uncertainty error and need to restart will now do so sooner. In rare cases, this may result in wasted work. In the vast majority of cases, this will allow the reader to be more responsive to the commit of its conflict. 3. uncertainty errors will only be thrown for locks in the uncertainty interval of a read that are protecting a provisional write (intents). Before, any form of lock in a read's uncertainty interval would be considered a write-read conflict, which was pessimistic and not needed for correctness. In a future with a fully segregated lock table, the change in semantic meaning here becomes even more clear. 
Instead of detecting the lock associated with an intent in a read's uncertainty interval and declaring a write-read conflict, the read will pass through the lock table untouched, detect the provisional value associated with the intent, and declare uncertainty. This seems closer to what we're actually trying to say about these interactions. Before making this change, I intend to validate the hypothesis that it will not affect performance (or may even slightly improve performance) by running it on the YCSB benchmark suite. --- pkg/kv/kvserver/batcheval/declare.go | 43 ++--- .../concurrency/concurrency_manager.go | 1 - .../kvserver/concurrency/lock_table_waiter.go | 6 + .../acquire_wrong_txn_race | 167 ------------------ .../testdata/concurrency_manager/uncertainty | 79 +++++++-- pkg/storage/pebble_mvcc_scanner.go | 17 +- .../mvcc_histories/uncertainty_interval | 24 +-- 7 files changed, 108 insertions(+), 229 deletions(-) delete mode 100644 pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race diff --git a/pkg/kv/kvserver/batcheval/declare.go b/pkg/kv/kvserver/batcheval/declare.go index 13a52666f279..ab65b790e457 100644 --- a/pkg/kv/kvserver/batcheval/declare.go +++ b/pkg/kv/kvserver/batcheval/declare.go @@ -50,39 +50,20 @@ func DefaultDeclareIsolatedKeys( access = spanset.SpanReadOnly if header.Txn != nil { // For transactional reads, acquire read latches all the way up to - // the transaction's MaxTimestamp, because reads may observe locks + // the transaction's MaxTimestamp, because reads may observe writes // all the way up to this timestamp. // - // TODO(nvanbenschoten): this parallels similar logic in - // concurrency.Request.readConflictTimestamp, which indicates that - // there is almost certainly a better way to structure this. There - // are actually two issues here that lead to this duplication: - // - // 1. latch spans and lock spans are declared separately. While these - // concepts are not identical, it appears that lock spans are always - // a subset of latch spans, which means that we can probably unify - // the concepts more closely than we have thus far. This would - // probably also have positive performance implications, as the - // duplication mandates extra memory allocations. - // - // 2. latch spans can each be assigned unique MVCC timestamps but lock - // spans inherit the timestamp of their request's transaction (see - // lockTable and concurrency.Request.{read,write}ConflictTimestamp). - // This difference is strange and confusing. It's not clear that the - // generality of latches each being declared at their own timestamp - // is useful. There may be an emergent pattern that arises here when - // we unify latch and lock spans (see part 1) where latches that are - // in the lock span subset can inherit their request's transaction's - // timestamp and latches that are not are non-MVCC latches. - // - // Note that addressing these issues does not necessarily need to - // lead to the timestamp that MVCC spans are interpretted at being - // the same for the purposes of the latch manager and lock-table. - // For instance, once the lock-table is segregated and all logic - // relating to "lock discovery" is removed, we no longer need to - // acquire read latches up to a txn's max timestamp, just to its - // read timestamp. However, we will still need to search the - // lock-table up to a txn's max timestamp. 
+ // It is critical that reads declare latches up through their + // uncertainty interval so that they are properly synchronized with + // earlier writes that may have a happened-before relationship with + // the read. These writes could not have completed and returned to + // the client until they were durable in the Range's Raft log. + // However, they may not have been applied to the replica's state + // machine by the time the write was acknowledged, because Raft + // entry application occurs asynchronously with respect to the + // writer (see AckCommittedEntriesBeforeApplication). Latching is + // the only mechanism that ensures that any observers of the write + // wait for the write to apply before reading. timestamp.Forward(header.Txn.MaxTimestamp) } } diff --git a/pkg/kv/kvserver/concurrency/concurrency_manager.go b/pkg/kv/kvserver/concurrency/concurrency_manager.go index 2e5e53a8d5d4..8e61972abdf0 100644 --- a/pkg/kv/kvserver/concurrency/concurrency_manager.go +++ b/pkg/kv/kvserver/concurrency/concurrency_manager.go @@ -419,7 +419,6 @@ func (r *Request) readConflictTimestamp() hlc.Timestamp { ts := r.Timestamp if r.Txn != nil { ts = r.Txn.ReadTimestamp - ts.Forward(r.Txn.MaxTimestamp) } return ts } diff --git a/pkg/kv/kvserver/concurrency/lock_table_waiter.go b/pkg/kv/kvserver/concurrency/lock_table_waiter.go index c0758b87bfb7..553b4d2b6eb0 100644 --- a/pkg/kv/kvserver/concurrency/lock_table_waiter.go +++ b/pkg/kv/kvserver/concurrency/lock_table_waiter.go @@ -613,6 +613,12 @@ func (w *lockTableWaiterImpl) pushHeader(req Request) roachpb.Header { // could race). Since the subsequent execution of the original request // might mutate the transaction, make a copy here. See #9130. h.Txn = req.Txn.Clone() + // We must push at least to req.readConflictTimestamp(), but for + // transactional requests we actually want to go all the way up to the + // top of the transaction's uncertainty interval. This allows us to not + // have to restart for uncertainty if the push succeeds and we come back + // and read. + h.Timestamp.Forward(req.Txn.MaxTimestamp) } return h } diff --git a/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race b/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race deleted file mode 100644 index a761a63c0545..000000000000 --- a/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race +++ /dev/null @@ -1,167 +0,0 @@ -# ------------------------------------------------------------- -# The following sequence of events used to trigger an assertion -# in lock-table that a lock was acquired by a different -# transaction than the current lock holder. This was due to an -# interleaving of concurrent requests holding compatible latches -# that allowed the lock-table state to get out of sync. If scans -# do not acquire latches all the way up to their max timestamp -# (i.e. the max time that they can observe conflicting locks), -# we risk hitting this race condition. 
-# -# Setup: txn1 acquires lock k at ts 15 -# lock-table is cleared, lock k removed -# -# Test: txn2 sequences to scan at ts 10 with max ts 20 -# txn2 observes txn1's lock while evaluation -# txn2 does not immediately handle-write-intent-error -# txn1 attempts to commits and resolve its lock -# txn1 blocks while acquiring latches [*] -# txn3 blocks while acquiring latches -# txn2 "discovers" and adds txn1's lock to lock-table -# txn2 re-sequences and blocks while acquiring latches -# txn1 proceeds in clearing its lock from the lock-table -# txn2 proceeds in reading -# txn3 proceeds in acquiring lock k at ts 25 -# -# [*] if txn2 was only holding latches up to its read timestamp -# and not up to its max timestamp then this wouldn't block, -# allowing the lock to be removed. txn3 could then proceed -# after this. Regardless of whether txn2 informed the -# lock-table about the discovered lock it found from txn1 -# first or txn3 informed the lock-table about the lock it -# acquired first, the second request would hit an assertion. -# ------------------------------------------------------------- - -new-txn name=txn1 ts=15 epoch=0 ----- - -new-txn name=txn2 ts=10 epoch=0 maxts=20 ----- - -new-txn name=txn3 ts=25 epoch=0 ----- - -new-request name=req1 txn=txn1 ts=15 - put key=k value=v ----- - -new-request name=req2 txn=txn2 ts=10 - get key=k ----- - -new-request name=req3 txn=txn3 ts=25 - put key=k value=v2 ----- - -sequence req=req1 ----- -[1] sequence req1: sequencing request -[1] sequence req1: acquiring latches -[1] sequence req1: scanning lock table for conflicting locks -[1] sequence req1: sequencing complete, returned guard - -on-lock-acquired req=req1 key=k ----- -[-] acquire lock: txn 00000001 @ k - -finish req=req1 ----- -[-] finish req1: finishing request - -debug-lock-table ----- -global: num=1 - lock: "k" - holder: txn: 00000001-0000-0000-0000-000000000000, ts: 15.000000000,0, info: unrepl epoch: 0, seqs: [0] -local: num=0 - -on-split ----- -[-] split range: complete - -debug-lock-table ----- -global: num=0 -local: num=0 - -# -------------------------------- -# Setup complete, test starts here -# -------------------------------- - -sequence req=req2 ----- -[2] sequence req2: sequencing request -[2] sequence req2: acquiring latches -[2] sequence req2: scanning lock table for conflicting locks -[2] sequence req2: sequencing complete, returned guard - -# txn2 observes txn1's lock while evaluation, but it does not immediately -# handle-write-intent-error. Maybe its goroutine was unscheduled for a while. In -# the interim period, requests come in to try to resolve the lock and to replace -# the lock. Neither should be allowed to proceed. 
- -on-txn-updated txn=txn1 status=committed ----- -[-] update txn: committing txn1 - -new-request name=reqRes1 txn=none ts=15 - resolve-intent txn=txn1 key=k status=committed ----- - -sequence req=reqRes1 ----- -[3] sequence reqRes1: sequencing request -[3] sequence reqRes1: acquiring latches -[3] sequence reqRes1: blocked on select in spanlatch.(*Manager).waitForSignal - -sequence req=req3 ----- -[4] sequence req3: sequencing request -[4] sequence req3: acquiring latches -[4] sequence req3: blocked on select in spanlatch.(*Manager).waitForSignal - -handle-write-intent-error req=req2 lease-seq=1 - intent txn=txn1 key=k ----- -[3] sequence reqRes1: sequencing complete, returned guard -[5] handle write intent error req2: handled conflicting intents on "k", released latches - -sequence req=req2 ----- -[6] sequence req2: re-sequencing request -[6] sequence req2: acquiring latches -[6] sequence req2: blocked on select in spanlatch.(*Manager).waitForSignal - -on-lock-updated req=reqRes1 txn=txn1 key=k status=committed ----- -[-] update lock: committing txn 00000001 @ k - -finish req=reqRes1 ----- -[-] finish reqRes1: finishing request -[4] sequence req3: scanning lock table for conflicting locks -[4] sequence req3: sequencing complete, returned guard -[6] sequence req2: scanning lock table for conflicting locks -[6] sequence req2: sequencing complete, returned guard - -finish req=req2 ----- -[-] finish req2: finishing request - -on-lock-acquired req=req3 key=k ----- -[-] acquire lock: txn 00000003 @ k - -finish req=req3 ----- -[-] finish req3: finishing request - -debug-lock-table ----- -global: num=1 - lock: "k" - holder: txn: 00000003-0000-0000-0000-000000000000, ts: 25.000000000,0, info: unrepl epoch: 0, seqs: [0] -local: num=0 - -reset ----- diff --git a/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/uncertainty b/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/uncertainty index 39edf98ed602..841ec0064efe 100644 --- a/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/uncertainty +++ b/pkg/kv/kvserver/concurrency/testdata/concurrency_manager/uncertainty @@ -1,5 +1,5 @@ # ------------------------------------------------------------- -# A transactional (txn2) read-only request runs into replicated +# A transactional (txn2) read-only request runs into a replicated # intent below its read timestamp. It informs the lock table and # pushes the intent's transaction (txn1) above its uncertainty # window. The push succeeds and the request is able to proceed. @@ -59,20 +59,31 @@ reset namespace ---- # ------------------------------------------------------------- -# A transactional (txn2) read-only request runs into replicated -# intent above its read timestamp but in its uncertainty window. -# It informs the lock table and pushes the intent's transaction -# (txn1) above its uncertainty window. The push succeeds and the -# request is able to proceed. +# A transactional (txn2) read-only request runs into a replicated +# intent below its read timestamp and informs the lock table. +# +# A second transactional (txn3) read-only request does not wait +# on the intent in the lock table because its read timestamp is +# below the intent timestamp. However, the intent's provisional +# value is in txn3's uncertainty window. The read-only request +# returns a ReadWithinUncertaintyIntervalError and does not +# inform the lock table of the intent. +# +# This causes txn3 to restart with a higher read timestamp. 
At +# this point, since txn1's intent is still present, it begins +# waiting in the lock table. # ------------------------------------------------------------- new-txn name=txn1 ts=14,1 epoch=0 ---- -new-txn name=txn2 ts=12,1 epoch=0 maxts=15,1 +new-txn name=txn2 ts=15,1 epoch=0 ---- -new-request name=req1 txn=txn2 ts=12,1 +new-txn name=txn3 ts=12,1 epoch=0 maxts=15,1 +---- + +new-request name=req1 txn=txn2 ts=15,1 get key=k ---- @@ -104,17 +115,63 @@ sequence req=req1 [3] sequence req1: pushing timestamp of txn 00000001 above 15.000000000,1 [3] sequence req1: blocked on select in concurrency_test.(*cluster).PushTransaction -on-txn-updated txn=txn1 status=pending ts=15,2 +# txn3 does not wait on lock in lock table, but does end up throwing a +# ReadWithinUncertaintyIntervalError when it scans and finds txn1's +# provisional value in its uncertainty interval. +new-request name=req2 txn=txn3 ts=12,1 + get key=k ---- -[-] update txn: increasing timestamp of txn1 -[3] sequence req1: resolving intent "k" for txn 00000001 with PENDING status + +sequence req=req2 +---- +[4] sequence req2: sequencing request +[4] sequence req2: acquiring latches +[4] sequence req2: scanning lock table for conflicting locks +[4] sequence req2: sequencing complete, returned guard + +finish req=req2 +---- +[-] finish req2: finishing request + +# txn3 refreshes/restarts with a higher read timestamp than that of the +# value it saw in its uncertainty interval. It then retries its request. +on-txn-updated txn=txn3 status=pending ts=14,2 +---- +[-] update txn: increasing timestamp of txn3 + +new-request name=req2-retry txn=txn3 ts=14,2 + get key=k +---- + +sequence req=req2-retry +---- +[5] sequence req2-retry: sequencing request +[5] sequence req2-retry: acquiring latches +[5] sequence req2-retry: scanning lock table for conflicting locks +[5] sequence req2-retry: waiting in lock wait-queues +[5] sequence req2-retry: pushing timestamp of txn 00000001 above 15.000000000,1 +[5] sequence req2-retry: blocked on select in concurrency_test.(*cluster).PushTransaction + +# txn1 commits and lets both reads through. +on-txn-updated txn=txn1 status=committed +---- +[-] update txn: committing txn1 +[3] sequence req1: resolving intent "k" for txn 00000001 with COMMITTED status [3] sequence req1: acquiring latches [3] sequence req1: scanning lock table for conflicting locks [3] sequence req1: sequencing complete, returned guard +[5] sequence req2-retry: resolving intent "k" for txn 00000001 with COMMITTED status +[5] sequence req2-retry: acquiring latches +[5] sequence req2-retry: scanning lock table for conflicting locks +[5] sequence req2-retry: sequencing complete, returned guard finish req=req1 ---- [-] finish req1: finishing request +finish req=req2-retry +---- +[-] finish req2-retry: finishing request + reset namespace ---- diff --git a/pkg/storage/pebble_mvcc_scanner.go b/pkg/storage/pebble_mvcc_scanner.go index 683f27d24e77..cd35d672c5bf 100644 --- a/pkg/storage/pebble_mvcc_scanner.go +++ b/pkg/storage/pebble_mvcc_scanner.go @@ -374,20 +374,23 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool { } ownIntent := p.txn != nil && p.meta.Txn.ID.Equal(p.txn.ID) - maxVisibleTS := p.ts - if p.checkUncertainty { - maxVisibleTS = p.txn.MaxTimestamp - } - otherIntentVisible := metaTS.LessEq(maxVisibleTS) || p.failOnMoreRecent + visibleIntent := metaTS.LessEq(p.ts) || p.failOnMoreRecent - if !ownIntent && !otherIntentVisible { + if !ownIntent && !visibleIntent { // 8. The key contains an intent, but we're reading below the intent. 
// Seek to the desired version, checking for uncertainty if necessary. // // Note that if we own the intent (i.e. we're reading transactionally) // we want to read the intent regardless of our read timestamp and fall // into case 8 below. - return p.seekVersion(maxVisibleTS, p.checkUncertainty) + if p.checkUncertainty { + if metaTS.LessEq(p.txn.MaxTimestamp) { + return p.uncertaintyError(metaTS) + } + + return p.seekVersion(p.txn.MaxTimestamp, true) + } + return p.seekVersion(p.ts, false) } if p.inconsistent { diff --git a/pkg/storage/testdata/mvcc_histories/uncertainty_interval b/pkg/storage/testdata/mvcc_histories/uncertainty_interval index 7ecacf277873..67b1ef07ecc2 100644 --- a/pkg/storage/testdata/mvcc_histories/uncertainty_interval +++ b/pkg/storage/testdata/mvcc_histories/uncertainty_interval @@ -150,13 +150,13 @@ run error get t=txn4 k=k2 ---- get: "k2" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 5.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 20.000000000,0`; observed timestamps: [] run error scan t=txn4 k=k2 ---- scan: "k2"-"k2\x00" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 5.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 20.000000000,0`; observed timestamps: [] run ok @@ -181,13 +181,13 @@ run error get t=txn5 k=k2 ---- get: "k2" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 5.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 25.000000000,0`; observed timestamps: [] run error scan t=txn5 k=k2 ---- scan: "k2"-"k2\x00" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 5.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 25.000000000,0`; observed timestamps: [] run ok @@ -266,13 +266,13 @@ run error get t=txn8 k=k2 ---- get: "k2" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 10.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 20.000000000,0`; observed timestamps: [] run error scan t=txn8 k=k2 ---- scan: "k2"-"k2\x00" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 10.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 20.000000000,0`; observed timestamps: [] run ok @@ -297,13 +297,13 @@ run error get t=txn9 k=k2 ---- get: "k2" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 10.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 25.000000000,0`; observed timestamps: [] run error scan t=txn9 k=k2 ---- 
scan: "k2"-"k2\x00" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 10.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 25.000000000,0`; observed timestamps: [] run ok @@ -355,13 +355,13 @@ run error get t=txn11 k=k2 ---- get: "k2" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 15.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 20.000000000,0`; observed timestamps: [] run error scan t=txn11 k=k2 ---- scan: "k2"-"k2\x00" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 15.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 20.000000000,0`; observed timestamps: [] run ok @@ -386,13 +386,13 @@ run error get t=txn12 k=k2 ---- get: "k2" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 15.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 25.000000000,0`; observed timestamps: [] run error scan t=txn12 k=k2 ---- scan: "k2"-"k2\x00" -> -error: (*roachpb.WriteIntentError:) conflicting intents on "k2" +error: (*roachpb.ReadWithinUncertaintyIntervalError:) ReadWithinUncertaintyIntervalError: read at time 15.000000000,0 encountered previous write with future timestamp 20.000000000,0 within uncertainty interval `t <= 25.000000000,0`; observed timestamps: [] run ok
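
For illustration only (not part of the patch): below is a minimal, self-contained Go sketch of the decision the MVCC scanner makes after this change when it encounters another transaction's intent. The hlcTimestamp type, the classifyIntent function, and the error values are hypothetical stand-ins for this sketch, not the actual CockroachDB types or APIs.

package main

import (
	"errors"
	"fmt"
)

// hlcTimestamp is a simplified stand-in for an HLC timestamp.
type hlcTimestamp struct{ wall int64 }

func (t hlcTimestamp) LessEq(o hlcTimestamp) bool { return t.wall <= o.wall }

var (
	errWriteIntent = errors.New("write-read conflict: push the intent's transaction")
	errUncertainty = errors.New("ReadWithinUncertaintyIntervalError: refresh/restart above the intent")
)

// classifyIntent mirrors the new behavior: an intent at or below the read
// timestamp remains a write-read conflict, an intent above the read timestamp
// but within the uncertainty interval (<= maxTS) is now treated as uncertain,
// and an intent above the uncertainty interval is simply not visible.
func classifyIntent(intentTS, readTS, maxTS hlcTimestamp) error {
	if intentTS.LessEq(readTS) {
		return errWriteIntent
	}
	if intentTS.LessEq(maxTS) {
		return errUncertainty
	}
	return nil // not visible; seek to an older version below the intent
}

func main() {
	readTS, maxTS := hlcTimestamp{10}, hlcTimestamp{20}
	for _, intentTS := range []hlcTimestamp{{5}, {15}, {25}} {
		fmt.Printf("intent at %d: %v\n", intentTS.wall, classifyIntent(intentTS, readTS, maxTS))
	}
}

Under the old behavior described at the top of this patch, the middle case (intent in the uncertainty interval) would have fallen into the write-read conflict branch instead of returning an uncertainty error.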