kv/kvserver: acquire read latches up to Txn.MaxTimestamp
Fixes #46460.
Fixes #43044.

This commit fixes the assertion failure observed in #46460. In that issue, we saw the lockTable's `existing lock cannot be acquired by different transaction` assertion fire. This was due to an interleaving of concurrent requests holding compatible latches that allowed the lock-table state to get out of sync. If scans do not acquire latches all the way up to their max timestamp (i.e. the max time that they can observe conflicting locks), we risk hitting this race condition.

Here is an example sequence of events that used to trigger the assertion:
- txn1 acquires lock k at ts 15
- lock-table is cleared, lock k removed
- txn2 sequences to scan at ts 10 with max ts 20
- txn2 observes txn1's lock during evaluation
- txn2 does not immediately handle-write-intent-error
- txn1 commits and clears lock [*]
- txn3 sequences to acquire lock k at ts 25
- txn2 "discovers" and adds txn1's lock to lock-table
- txn3 adds lock to lock-table, hits assertion

The step marked with a `[*]` is not possible if txn2 is holding read latches all the way up to its max timestamp because the intent resolution would not be allowed to evaluate concurrently with txn2's scan. Interestingly, the last two steps could just as easily be reversed and we would have hit a similar assertion in `lockTable.AddDiscoveredLock`.

This sequence of events is now encoded in a ConcurrencyManager data-driven test. Without the rest of this commit, it reliably hits the assertion.

Release note (bug fix): A rare assertion failure that contained the text "existing lock cannot be acquired by different transaction" was fixed. This assertion was only present in earlier v20.1 releases and not in any earlier releases.

Release justification: high-priority bug fix. Without this, we risk servers crashing unexpectedly due to improper synchronization of KV requests.
1 parent 10a05b5, commit 27d1520
Showing 4 changed files with 226 additions and 18 deletions.
pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race
166 changes: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# -------------------------------------------------------------
# The following sequence of events used to trigger an assertion
# in lock-table that a lock was acquired by a different
# transaction than the current lock holder. This was due to an
# interleaving of concurrent requests holding compatible latches
# that allowed the lock-table state to get out of sync. If scans
# do not acquire latches all the way up to their max timestamp
# (i.e. the max time that they can observe conflicting locks),
# we risk hitting this race condition.
#
# Setup: txn1 acquires lock k at ts 15
#        lock-table is cleared, lock k removed
#
# Test:  txn2 sequences to scan at ts 10 with max ts 20
#        txn2 observes txn1's lock during evaluation
#        txn2 does not immediately handle-write-intent-error
#        txn1 attempts to commit and resolve its lock
#        txn1 blocks while acquiring latches [*]
#        txn3 blocks while acquiring latches
#        txn2 "discovers" and adds txn1's lock to lock-table
#        txn2 re-sequences and blocks while acquiring latches
#        txn1 proceeds in clearing its lock from the lock-table
#        txn2 proceeds in reading
#        txn3 proceeds in acquiring lock k at ts 25
#
# [*] if txn2 was only holding latches up to its read timestamp
#     and not up to its max timestamp then this wouldn't block,
#     allowing the lock to be removed. txn3 could then proceed
#     after this. Regardless of whether txn2 informed the
#     lock-table about the discovered lock it found from txn1
#     first or txn3 informed the lock-table about the lock it
#     acquired first, the second request would hit an assertion.
# -------------------------------------------------------------

new-txn name=txn1 ts=15 epoch=0
----

new-txn name=txn2 ts=10 epoch=0 maxts=20
----

new-txn name=txn3 ts=25 epoch=0
----

new-request name=req1 txn=txn1 ts=15
  put key=k value=v
----

new-request name=req2 txn=txn2 ts=10
  get key=k
----

new-request name=req3 txn=txn3 ts=25
  put key=k value=v2
----

sequence req=req1
----
[1] sequence req1: sequencing request
[1] sequence req1: acquiring latches
[1] sequence req1: scanning lock table for conflicting locks
[1] sequence req1: sequencing complete, returned guard

on-lock-acquired req=req1 key=k
----
[-] acquire lock: txn 00000001 @ k

finish req=req1
----
[-] finish req1: finishing request

debug-lock-table
----
global: num=1
 lock: "k"
  holder: txn: 00000001-0000-0000-0000-000000000000, ts: 0.000000015,0, info: unrepl epoch: 0, seqs: [0]
local: num=0

on-split
----
[-] split range: complete

debug-lock-table
----
global: num=0
local: num=0

# --------------------------------
# Setup complete, test starts here
# --------------------------------

sequence req=req2
----
[2] sequence req2: sequencing request
[2] sequence req2: acquiring latches
[2] sequence req2: scanning lock table for conflicting locks
[2] sequence req2: sequencing complete, returned guard

# txn2 observes txn1's lock during evaluation, but it does not immediately
# handle-write-intent-error. Maybe its goroutine was unscheduled for a while. In
# the interim period, requests come in to try to resolve the lock and to replace
# the lock. Neither should be allowed to proceed.

on-txn-updated txn=txn1 status=committed
----
[-] update txn: committing txn1

new-request name=reqRes1 txn=none ts=15
  resolve-intent txn=txn1 key=k status=committed
----

sequence req=reqRes1
----
[3] sequence reqRes1: sequencing request
[3] sequence reqRes1: acquiring latches
[3] sequence reqRes1: blocked on select in spanlatch.(*Manager).waitForSignal

sequence req=req3
----
[4] sequence req3: sequencing request
[4] sequence req3: acquiring latches
[4] sequence req3: blocked on select in spanlatch.(*Manager).waitForSignal

handle-write-intent-error req=req2 txn=txn1 key=k
----
[3] sequence reqRes1: sequencing complete, returned guard
[5] handle write intent error req2: handled conflicting intents on "k", released latches

sequence req=req2
----
[6] sequence req2: re-sequencing request
[6] sequence req2: acquiring latches
[6] sequence req2: blocked on select in spanlatch.(*Manager).waitForSignal

on-lock-updated req=reqRes1 txn=txn1 key=k status=committed
----
[-] update lock: committing txn 00000001 @ k

finish req=reqRes1
----
[-] finish reqRes1: finishing request
[4] sequence req3: scanning lock table for conflicting locks
[4] sequence req3: sequencing complete, returned guard
[6] sequence req2: scanning lock table for conflicting locks
[6] sequence req2: sequencing complete, returned guard

finish req=req2
----
[-] finish req2: finishing request

on-lock-acquired req=req3 key=k
----
[-] acquire lock: txn 00000003 @ k

finish req=req3
----
[-] finish req3: finishing request

debug-lock-table
----
global: num=1
 lock: "k"
  holder: txn: 00000003-0000-0000-0000-000000000000, ts: 0.000000025,0, info: unrepl epoch: 0, seqs: [0]
local: num=0

reset
----
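
The blocking at the step marked `[*]` in the test comes down to which timestamp txn2's read latch is declared at. The following is a rough sketch with integer timestamps, assuming the simplified rule that a read latch at timestamp r conflicts with a write latch at timestamp w on the same key only when w <= r. The names `latchTimestamp` and `readBlocksWrite` are hypothetical and are not taken from the kvserver code.

```go
package main

import "fmt"

// latchTimestamp returns the timestamp at which a read declares its latches
// in this model: the larger of its read timestamp and its max timestamp.
// Treating maxTS == 0 as "no max timestamp set" is an assumption of the sketch.
func latchTimestamp(readTS, maxTS int64) int64 {
	if maxTS > readTS {
		return maxTS
	}
	return readTS
}

// readBlocksWrite reports whether a read latch held at readLatchTS conflicts
// with a write latch at writeTS on the same key. Under the simplified rule
// assumed here, writes above the read latch's timestamp do not conflict.
func readBlocksWrite(readLatchTS, writeTS int64) bool {
	return writeTS <= readLatchTS
}

func main() {
	// Values taken from the test: txn2 reads at ts 10 with max ts 20, and
	// the intent resolution for txn1's lock is sequenced at ts 15.
	const readTS, maxTS, resolveTS = 10, 20, 15

	// Latching only up to the read timestamp lets the resolution through.
	fmt.Println("latch at read ts blocks resolve:", readBlocksWrite(readTS, resolveTS))

	// Latching up to the max timestamp makes the resolution wait.
	fmt.Println("latch at max ts blocks resolve:", readBlocksWrite(latchTimestamp(readTS, maxTS), resolveTS))
}
```

With the values from the test, the first check reports false and the second reports true, which corresponds to reqRes1 blocking in `waitForSignal` while req2 holds its latches.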