
kv/kvserver: acquire read latches up to Txn.MaxTimestamp #46830

Merged

Conversation

@nvanbenschoten (Member)

Fixes #46460.
Fixes #43044.

This commit fixes the assertion failure observed in #46460. In that issue,
we saw the lockTable's `existing lock cannot be acquired by different
transaction` assertion fire. This was due to an interleaving of concurrent
requests holding compatible latches that allowed the lock-table state to get
out of sync. If scans do not acquire latches all the way up to their max
timestamp (i.e. the max time at which they can observe conflicting locks), we
risk hitting this race condition.

Here is an example sequence of events that used to trigger the assertion:

  • txn1 acquires lock k at ts 15
  • lock-table is cleared, lock k removed
  • txn2 sequences to scan at ts 10 with max ts 20
  • txn2 observes txn1's lock while evaluating
  • txn2 does not immediately handle-write-intent-error
  • txn1 commits and clears lock [*]
  • txn3 sequences to acquire lock k at ts 25
  • txn2 "discovers" and adds txn1's lock to lock-table
  • txn3 adds lock to lock-table, hits assertion

The step marked with a `[*]` is not possible if txn2 is holding read latches
all the way up to its max timestamp, because the intent resolution would not
be allowed to evaluate concurrently with txn2's scan. Interestingly, the
last two steps could just as easily be reversed and we would have hit a
similar assertion in `lockTable.AddDiscoveredLock`. This sequence of
events is now encoded in a ConcurrencyManager data-driven test. Without
the rest of this commit, it reliably hits the assertion.
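
To make the race and the fix concrete, here is a minimal, self-contained Go
sketch of a timestamp-aware latch conflict rule of the kind described above
(the `latch` type and `conflicts` helper are illustrative stand-ins, not the
actual latch manager API): a held read latch blocks an overlapping write only
at or below the read's declared timestamp, so declaring the scan's read latch
at its max timestamp (20) rather than its read timestamp (10) is what forces
txn1's intent resolution (a write at ts 15) to wait.

```go
package main

import "fmt"

// hlc is a simplified stand-in for a hybrid logical clock timestamp.
type hlc int64

// latch is a toy model of a per-key latch declared at a timestamp.
type latch struct {
	key   string
	write bool
	ts    hlc
}

// conflicts models the timestamp-aware conflict rule: reads never
// conflict with each other, writes always conflict with each other,
// and a read and a write conflict only if the read could observe
// the write (i.e. the write is at or below the read's timestamp).
func conflicts(held, incoming latch) bool {
	if held.key != incoming.key {
		return false
	}
	switch {
	case !held.write && !incoming.write:
		return false // read-read
	case held.write && incoming.write:
		return true // write-write
	case held.write:
		return incoming.ts >= held.ts // held write vs. incoming read
	default:
		return incoming.ts <= held.ts // held read vs. incoming write
	}
}

func main() {
	// txn2's scan reads at ts 10 but can observe locks up to max ts 20.
	scanAtReadTs := latch{key: "k", ts: 10}
	scanAtMaxTs := latch{key: "k", ts: 20}
	// txn1's intent resolution writes at ts 15 (clearing the lock).
	intentResolution := latch{key: "k", write: true, ts: 15}

	// Before the fix: latching only at the read timestamp lets the
	// resolution evaluate concurrently, enabling the race.
	fmt.Println(conflicts(scanAtReadTs, intentResolution)) // false
	// After the fix: latching up to the max timestamp blocks it.
	fmt.Println(conflicts(scanAtMaxTs, intentResolution)) // true
}
```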

Release note (bug fix): A rare assertion failure that contained the text
"existing lock cannot be acquired by different transaction" was fixed.
This assertion was only present in earlier v20.1 releases; it did not
exist in any release before v20.1.

Release justification: high-priority bug fix. Without this, we risk
servers crashing unexpectedly due to improper synchronization of
KV requests.

@cockroach-teamcity (Member)

This change is Reviewable

@sumeerbhola (Collaborator) left a comment

:lgtm:

  • txn1 acquires lock k at ts 15
  • lock-table is cleared, lock k removed
  • txn2 sequences to scan at ts 10 with max ts 20
  • txn2 observes txn1's lock while evaluating
  • txn2 does not immediately handle-write-intent-error
  • txn1 commits and clears lock [*]
  • txn3 sequences to acquire lock k at ts 25
  • txn2 "discovers" and adds txn1's lock to lock-table
  • txn3 adds lock to lock-table, hits assertion

The change here causes a read at ts 10 to prevent a concurrent write at ts 20. I don't have anything better in mind for now, but it seems to me that when we have the segregated lock table we could do better -- the lockTable can be checked for existing locks up to ts 20, but the latch only needs to be held for 10 -- if there is no existing lock, the read can proceed to evaluation since it does not care about concurrently added intents in the interval (10, ∞).
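
To make that concrete, here is a toy, self-contained Go sketch of the
proposed flow (the `lockTable` map and `sequenceRead` helper are hypothetical
models, not the real concurrency-manager interfaces); the latch, not modeled
here, would be held only at ts 10 while the lock-table check covers
everything up to ts 20:

```go
package main

import "fmt"

// hlc is a simplified stand-in for a hybrid logical clock timestamp.
type hlc int64

// lockTable is a toy stand-in for a segregated lock table, mapping
// each locked key to the timestamp of the lock held on it.
type lockTable map[string]hlc

// sequenceRead checks the lock table all the way up to maxTs, even
// though the caller's latch only covers readTs.
func sequenceRead(lt lockTable, key string, readTs, maxTs hlc) string {
	if lockTs, ok := lt[key]; ok && lockTs <= maxTs {
		// An existing lock at or below the max timestamp is a
		// conflict the read must wait on.
		return fmt.Sprintf("wait on lock at ts %d", lockTs)
	}
	// No conflicting lock: the read can evaluate. Intents added
	// concurrently in (readTs, ∞) are invisible to it, so the
	// narrower latch is safe.
	return fmt.Sprintf("evaluate at ts %d", readTs)
}

func main() {
	lt := lockTable{"k": 15}
	fmt.Println(sequenceRead(lt, "k", 10, 20)) // wait on lock at ts 15
	delete(lt, "k")
	fmt.Println(sequenceRead(lt, "k", 10, 20)) // evaluate at ts 10
}
```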

Reviewed 3 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/batcheval/declare.go, line 76 at r1 (raw file):

			//    we unify latch and lock spans (see part 1) where latches that are
			//    in the lock span subset can inherit their request's transaction's
			//    timestamp and latches that are not are non-MVCC latches.

I've not fully digested this comment. Do you have thoughts on how this should be structured in the future?
It seems to me that with the segregated lock table these are potentially different.


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race, line 31 at r1 (raw file):

#     after this. Regardless of whether txn2 informed the
#     lock-table about the discovered lock it found from txn1
#     first or txn3 informed the lock-table about the lock it

why would txn3 find the lock if it is no longer in the key space? And if it did find it, it would not try to acquire the lock so wouldn't hit the assertion. Am I missing something?

@nvanbenschoten (Member Author) left a comment

TFTR!

The change here causes a read at ts 10 to prevent a concurrent write at ts 20

Yes, specifically, it prevents the concurrent intent resolution at ts 20.

I don't have anything better in mind for now, but it seems to me that when we have the segregated lock table we could do better

The challenging part here is that we require a request to already be holding latches before it checks the lock-table. But I agree that it does seem like we could improve something in this area.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/batcheval/declare.go, line 76 at r1 (raw file):

Do you have thoughts on how this should be structured in the future?

I think we'll want to merge the latch spans and the lock spans and replace the timestamp in spanset.Span with a boolean indicating whether the operation is an MVCC operation or not. The current notion of "lock spans" would simply become the subset of declared spans that are for MVCC operations. We'd then pull the timestamps used for MVCC operations from the transaction in both the latch manager and the lock table (we already do that for the lock table).
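
For illustration, a rough Go sketch of what that merged declaration could
look like (the field names are hypothetical, not the actual spanset.Span
definition):

```go
package spansketch

// hlc is a simplified stand-in for a hybrid logical clock timestamp.
type hlc int64

// Span approximates today's shape (simplified): latch spans carry an
// explicit per-span timestamp, and lock spans are declared separately.
type Span struct {
	Key, EndKey []byte
	Timestamp   hlc
}

// DeclaredSpan sketches the proposal: one merged span set in which a
// boolean marks MVCC operations. The "lock spans" become exactly the
// subset with MVCC == true, and both the latch manager and the lock
// table derive the timestamp from the request's transaction.
type DeclaredSpan struct {
	Key, EndKey []byte
	MVCC        bool
}
```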

It seems to me that with the segregated lock table these are potentially different.

Do you have an example where they differ?


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race, line 31 at r1 (raw file):

Previously, sumeerbhola wrote…
#     after this. Regardless of whether txn2 informed the
#     lock-table about the discovered lock it found from txn1
#     first or txn3 informed the lock-table about the lock it

why would txn3 find the lock if it is no longer in the key space? And if it did find it, it would not try to acquire the lock so wouldn't hit the assertion. Am I missing something?

txn3 wouldn't find the lock; it would acquire its own. This sentence is saying that regardless of the order in which the following two operations happen, we'd hit an assertion on the second operation:

  • txn2 informs the lock-table about the lock from txn1 that it discovered
  • txn3 informs the lock-table about its own lock that it acquired

Does that clear things up? Do you have suggestions for improving the wording here to make it more clear?

@sumeerbhola (Collaborator) left a comment

The challenging part here is that we require a request to already be holding latches before it checks the lock-table.

We could continue to acquire latches before checking the segregated lock table -- the latch would be for ts 10 and the subsequent check in the lock table would be for ts 20.

Reviewed 1 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/batcheval/declare.go, line 76 at r1 (raw file):

Do you have an example where they differ?

I was referring to my PR level comment with locking at ts=20 and latching at ts=10.


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race, line 31 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

txn3 wouldn't find the lock; it would acquire its own. This sentence is saying that regardless of the order in which the following two operations happen, we'd hit an assertion on the second operation:

  • txn2 informs the lock-table about the lock from txn1 that it discovered
  • txn3 informs the lock-table about its own lock that it acquired

Does that clear things up? Do you have suggestions for improving the wording here to make it more clear?

Got it. The wording is fine.

@nvanbenschoten nvanbenschoten mentioned this pull request Apr 1, 2020
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/latchTimestamp branch from f296a2d to 27d1520 on April 1, 2020 19:25
@nvanbenschoten (Member Author) left a comment

Oh, I see. Yes, I think you're right that we only need this change while we still have the "lock discovery" logic. Once we remove that and reads can no longer result in updates to the lock-table, this can go away. I updated the TODO to reflect this point.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/batcheval/declare.go, line 76 at r1 (raw file):

Previously, sumeerbhola wrote…

Do you have an example where they differ?

I was referring to my PR level comment with locking at ts=20 and latching at ts=10.

Got it. Making the timestamp of MVCC latches a property of the request instead of a property of individual latches shouldn't block this. We could always interpret the timestamps of MVCC spans differently for the purposes of latches vs. locks. I updated the TODO to reflect this.


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/acquire_wrong_txn_race, line 31 at r1 (raw file):

Previously, sumeerbhola wrote…

Got it. The wording is fine.

Ack.

@nvanbenschoten (Member Author)

bors r+

@craig (Contributor)

craig bot commented Apr 1, 2020

Build succeeded

@craig craig bot merged commit eb5b73f into cockroachdb:master Apr 1, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/latchTimestamp branch April 3, 2020 22:57
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 25, 2020
This commit is a partial revert of cockroachdb#40600 and cockroachdb#46830. It solves the same
problem that those PRs were solving, but in a different way.

Those two PRs were handling the case where a reading transaction
observes an intent in its uncertainty interval. Before those fixes, we
were not considering intents in a scan's uncertainty interval to be
uncertain. This had the potential to cause stale reads: an unresolved
intent may belong to a transaction that has already committed and that
is a causal ancestor of the scan.

The approach the first PR took was to throw a WriteIntentError on
intents in a scan's uncertainty interval. Effectively, it made scans
consider intents in their uncertainty interval to be write-read
conflicts. This had the benefit that the scan would push the intent and
eventually resolve the conflict, either by aborting the intent, pushing
it out of the read's uncertainty interval, or waiting for it to commit.
In this last case (which is by far the most common), the scan would
then throw an uncertainty error, because it now had a committed value in
its uncertainty interval.

The second PR introduced some awkward code to deal with the fallout from
this decision. Once we started trying to keep the lock table in sync
with the MVCC state, we needed to use latches to ensure proper
synchronization between any operations that could conflict because such
conflicts would influence the state of the lock table. This meant that
we needed to start holding latches across a read's uncertainty interval,
because intent writes in this interval were considered write-read
conflicts. This led to some moderately gross code and always felt a little
wrong.

Now that we are complicating the logic around uncertainty intervals even
further, this becomes even more of a burden. This motivates a reworking
of these interactions. This commit replaces the logic that considers
intents in a transaction's uncertainty interval to be write-read
conflicts with logic that considers such intents to be... uncertain.
Doing so means that a transaction will immediately refresh/restart
above the uncertain timestamp and will only then begin conflicting
with the intent.

This has a number of nice benefits:
1. it keeps all considerations of read uncertainty intervals down in
   MVCC iteration logic. The lock table no longer needs to be aware of it.
   This is a big win now and an even bigger win once we introduce synthetic
   timestamps.
2. read latches no longer need to be acquired up to the edge of a read's
   uncertainty interval; they only need to be acquired up to its read
   timestamp. Similarly, the lock table only needs to consider true
   write-read conflicts.
3. readers that are almost certainly bound to hit an uncertainty error
   and need to restart will now do so sooner. In rare cases, this may
   result in wasted work. In the vast majority of cases, this will allow
   the reader to be more responsive to the commit of its conflict.
4. uncertainty errors will only be thrown for locks in the uncertainty
   interval of a read that are protecting a provisional write (intents).
   Before, any form of a lock in a read's uncertainty interval would be
   considered a write-read conflict, which was pessimistic and not needed
   for correctness.

   In a future with a fully segregated lock table, the change in semantic
   meaning here becomes even more clear. Instead of detecting the lock
   associated with an intent in a read's uncertainty interval and declaring
   a write-read conflict, the read will instead pass through the lock table
   untouched and will detect the provisional value associated with an intent
   and declare uncertainty. This seems closer to what we're actually trying
   to say about these interactions.
5. partially unblocks a change like cockroachdb#52610. Now that latches and locks
   consider the same time ranges for write-read conflicts, merging the
   latch spans and lock spans will be easier.

Before making this change, I intend to validate the hypothesis that it
will not affect performance (or may even slightly improve performance)
by running it on the YCSB benchmark suite.
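
As a toy illustration of that behavior change (the `checkIntent` helper and
error values below are hypothetical stand-ins for the real MVCC scanner
logic): an intent in (read ts, uncertainty limit] now yields an uncertainty
error rather than a write-intent error.

```go
package main

import (
	"errors"
	"fmt"
)

// hlc is a simplified stand-in for a hybrid logical clock timestamp.
type hlc int64

var (
	errWriteIntent = errors.New("write-intent conflict: push/resolve the intent")
	errUncertain   = errors.New("uncertainty error: refresh/restart above the intent")
)

// checkIntent sketches how a scan at readTs with uncertainty limit
// maxTs classifies an intent it encounters at intentTs.
func checkIntent(intentTs, readTs, maxTs hlc) error {
	switch {
	case intentTs <= readTs:
		// True write-read conflict, same as before.
		return errWriteIntent
	case intentTs <= maxTs:
		// New behavior: an intent in (readTs, maxTs] is simply
		// uncertain; the reader restarts above intentTs instead of
		// pushing the intent's transaction.
		return errUncertain
	default:
		// Above the uncertainty interval: invisible to this read.
		return nil
	}
}

func main() {
	fmt.Println(checkIntent(15, 10, 20)) // uncertain, not a write-read conflict
	fmt.Println(checkIntent(5, 10, 20))  // write-read conflict
	fmt.Println(checkIntent(25, 10, 20)) // <nil>: ignored
}
```
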
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 6, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 15, 2021
craig bot pushed a commit that referenced this pull request Jan 16, 2021
57136: kv: consider intent in uncertainty interval to be uncertain, not a write-read conflict r=sumeerbhola a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <[email protected]>
Successfully merging this pull request may close these issues:

  • roachtest: clearrange/checks=true failed
  • roachtest: clearrange/checks=true failed