
kv: ignore pushed intent without Raft consensus #94730

Open
nvanbenschoten opened this issue Jan 4, 2023 · 1 comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-read-committed Related to the introduction of Read Committed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments


nvanbenschoten commented Jan 4, 2023

Sibling to #94728.

If a reader succeeds in pushing the transaction record of a conflicting intent above its read timestamp, it should be able to proceed with its read without immediately resolving that intent at a higher timestamp. Instead, it should remember that the intent is not conflicting while scanning and ignore its provisional value.
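The "remember the pushed transaction" idea can be sketched as a small request-scoped cache consulted while scanning. This is an illustrative sketch only; the names (`Timestamp`, `PushedTxnCache`, `CanIgnoreIntent`) are hypothetical and not drawn from CockroachDB's codebase.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an MVCC timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// PushedTxnCache records, per transaction ID, the timestamp the transaction
// is known to have been pushed to. A reader that successfully pushed a txn
// record above its read timestamp populates this cache instead of eagerly
// resolving the txn's intents.
type PushedTxnCache struct {
	pushed map[string]Timestamp
}

func NewPushedTxnCache() *PushedTxnCache {
	return &PushedTxnCache{pushed: make(map[string]Timestamp)}
}

// RecordPush remembers the highest timestamp a txn is known to be pushed to.
func (c *PushedTxnCache) RecordPush(txnID string, ts Timestamp) {
	if cur, ok := c.pushed[txnID]; !ok || cur.Less(ts) {
		c.pushed[txnID] = ts
	}
}

// CanIgnoreIntent reports whether an intent written by txnID no longer
// conflicts with a read at readTS, because the txn is known to have been
// pushed above the read timestamp.
func (c *PushedTxnCache) CanIgnoreIntent(txnID string, readTS Timestamp) bool {
	ts, ok := c.pushed[txnID]
	return ok && readTS.Less(ts)
}

func main() {
	c := NewPushedTxnCache()
	c.RecordPush("txn1", Timestamp{WallTime: 200})
	fmt.Println(c.CanIgnoreIntent("txn1", Timestamp{WallTime: 100})) // pushed above read: ignorable
	fmt.Println(c.CanIgnoreIntent("txn1", Timestamp{WallTime: 300})) // read above push: still conflicts
}
```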

Today, rewriting the intent at its new, higher timestamp (intent resolution) incurs a Raft consensus round. As a result, reads need to perform writes to move conflicting intents out of their way, which is undesirable.

Jira issue: CRDB-23105

Epic CRDB-38938

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team labels Jan 4, 2023
@nvanbenschoten nvanbenschoten added the A-read-committed Related to the introduction of Read Committed label Mar 30, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 13, 2023
Fixes cockroachdb#103126.

This commit extends the infrastructure introduced in cockroachdb#49218 for transaction
timestamp pushes. It avoids redundant txn pushes of PENDING transactions and
batches the resolution of PENDING intents. This breaks the O(num_intents) work
performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into
something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of
transactions that are PENDING and are known to have been pushed to higher
timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp
   that is held by a txn already pushed above our read timestamp, we neither wait
   out the kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the
   transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have written multiple intents, so we begin deferring intent resolution
   to enable batching.
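The per-Range cache described above can be sketched as a small fixed-size LRU keyed by transaction ID. This is an illustrative sketch, not the actual txnStatusCache; all names here are hypothetical.

```go
package main

import (
	"container/list"
	"fmt"
)

// pushedTxnLRU is a bounded LRU of PENDING transactions known to have been
// pushed to higher timestamps. Bounding the size keeps per-Range memory use
// constant; evicting a txn only costs a redundant push later.
type pushedTxnLRU struct {
	maxSize int
	order   *list.List               // front = most recently used
	entries map[string]*list.Element // txn ID -> element in order
}

func newPushedTxnLRU(maxSize int) *pushedTxnLRU {
	return &pushedTxnLRU{
		maxSize: maxSize,
		order:   list.New(),
		entries: make(map[string]*list.Element),
	}
}

// Add marks a txn as pushed, promoting it to most recently used and evicting
// the least recently used entry if the cache is over capacity.
func (c *pushedTxnLRU) Add(txnID string) {
	if e, ok := c.entries[txnID]; ok {
		c.order.MoveToFront(e)
		return
	}
	c.entries[txnID] = c.order.PushFront(txnID)
	if c.order.Len() > c.maxSize {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(string))
	}
}

// Contains reports whether the txn is still cached as pushed.
func (c *pushedTxnLRU) Contains(txnID string) bool {
	_, ok := c.entries[txnID]
	return ok
}

func main() {
	c := newPushedTxnLRU(2)
	c.Add("a")
	c.Add("b")
	c.Add("c") // evicts "a", the least recently used
	fmt.Println(c.Contains("a"), c.Contains("b"), c.Contains("c"))
}
```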

Together, these two changes make us much more effective at pushing transactions
with a large number of intents. The following example (from cockroachdb#103126) demonstrates
this:
```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before cockroachdb#103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution
is only currently enabled for non-locking readers without uncertainty
intervals. Readers with uncertainty intervals must contend with the
possibility of pushing a conflicting intent up into their uncertainty
interval and causing more work for themselves, which is avoided with care
by the lockTableWaiter but difficult to coordinate through the
txnStatusCache. This limitation is acceptable because the most important
case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction,
which is that non-locking readers can ignore known pending intents without
the need to even resolve those intents (see cockroachdb#94730). This will require a
request-scoped cache of pending, pushed transactions, which does not have
the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work
proportional to the number of pending intents that they encounter, so they are
over 100x faster when encountering long-running, bulk writing transactions.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 19, 2023
craig bot pushed a commit that referenced this issue Jun 20, 2023
104784: kv/concurrency: batch intent resolution of pushed intents from same txn r=arulajmani a=nvanbenschoten

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 13, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 13, 2023
@exalate-issue-sync exalate-issue-sync bot added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 7, 2023
@nvanbenschoten
Member Author

To address this, we will need to adjust the lockTableWaiter to not immediately ResolveIntent(PENDING) intents that non-locking reads encounter and are able to push to a higher timestamp using a PushTxn(PUSH_TIMESTAMP).

The original thinking here was that we would instead retain some information on the concurrency.Guard about pushed transactions and plumb this information down into pebbleMVCCScanner. The pebbleMVCCScanner, upon seeing an intent whose transaction is known to have been pushed to a higher timestamp, would ignore the intent and present the key's next version to the reader. This does not seem terribly difficult but does involve some plumbing of state around.
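The scanner-side behavior described above can be sketched as follows: when an intent's transaction is known to have been pushed above the read timestamp, the scanner skips the provisional value and returns the key's latest visible committed version instead. The types here are illustrative, not pebbleMVCCScanner's real interface.

```go
package main

import "fmt"

// version is a committed MVCC version of a key.
type version struct {
	ts    int64
	value string
}

// keyState is a simplified view of one key: an optional intent plus its
// committed version history, newest first.
type keyState struct {
	intentTxn string // "" if no intent on this key
	intentTS  int64  // timestamp of the intent, valid if intentTxn != ""
	versions  []version
}

// readKey returns the value visible to a non-locking read at readTS.
// An intent below readTS normally conflicts, but if its txn is in
// pushedAbove (known pushed above readTS), the intent is ignored and the
// key's next committed version is presented to the reader.
func readKey(ks keyState, readTS int64, pushedAbove map[string]bool) (string, bool) {
	if ks.intentTxn != "" && ks.intentTS <= readTS && !pushedAbove[ks.intentTxn] {
		return "", false // conflicting intent: would block and push the txn
	}
	for _, v := range ks.versions {
		if v.ts <= readTS {
			return v.value, true
		}
	}
	return "", true // no version visible at readTS
}

func main() {
	ks := keyState{
		intentTxn: "txn1",
		intentTS:  8,
		versions:  []version{{ts: 5, value: "committed"}},
	}
	pushed := map[string]bool{"txn1": true}
	v, ok := readKey(ks, 10, pushed)
	fmt.Println(v, ok)
}
```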


However, we arrived at a more elegant design which generalizes to other forms of intent resolution and enables fused "resolve-and-replace" consensus proposals. The idea is that we first begin deferring all intent resolution in the lockTableWaiter, similar to how we handle ResolveBeforeScanning today. A request's deferred resolution set is taken into account when determining which locks it conflicts with, so eventually every lock the request still conflicts with has a corresponding deferred "resolution instruction".

We then give requests the choice of whether to realize the deferred resolution immediately, before latching and evaluation, or to virtualize/fuse it during evaluation. To realize it immediately, the request simply issues the ResolveIntent requests and pushes them through Raft, as happens today. This is a useful fallback option.

However, requests can also handle the deferred resolution during evaluation. Read-only requests have the option to virtualize the resolution, and read-write requests have the option to fuse with it. Doing so starts with the storage.Engine constructed during evaluation. Read-write requests continue to create a storage.Batch. For the first time, read-only requests also create a storage.Batch. Then, regardless of request path (read-only vs. read-write), command evaluation is run using the Batch and the deferred ResolveIntent requests. The result is a write batch with all conflicting intents resolved such that they no longer conflict with the rest of the BatchRequest. The BatchRequest can then evaluate its original requests on top of the Batch, knowing that it is observing the post-resolution state.

The final trick is that read-write requests can then propose the entire write batch to Raft. This allows them to propose a single Raft entry that contains both the intent resolution and the subsequent intent replacement.
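The resolve-and-replace flow can be sketched at a high level: deferred resolutions are applied to a write batch first, the original request then evaluates on top of that post-resolution state, and the combined batch is proposed through Raft once. Everything below is hypothetical scaffolding, not CockroachDB's evaluation code.

```go
package main

import "fmt"

// resolveInstruction is a deferred "resolution instruction" for one intent:
// rewrite the intent for key at its pushed timestamp.
type resolveInstruction struct {
	key    string
	pushTS int64
}

// writeBatch accumulates the effects that would be proposed to Raft as a
// single entry.
type writeBatch struct {
	ops []string
}

// evaluateFused applies the deferred resolutions to the batch before
// evaluating the request's own writes on top, so the request never observes
// the conflicting intents and a single Raft proposal carries both the
// resolution and the replacement intents.
func evaluateFused(deferred []resolveInstruction, writeKeys []string) writeBatch {
	var b writeBatch
	// Step 1: fuse the deferred resolutions into the batch.
	for _, r := range deferred {
		b.ops = append(b.ops, fmt.Sprintf("resolve %s @ %d", r.key, r.pushTS))
	}
	// Step 2: evaluate the original request against the post-resolution state.
	for _, k := range writeKeys {
		b.ops = append(b.ops, fmt.Sprintf("write-intent %s", k))
	}
	// Step 3 (not modeled): propose b through Raft as one entry.
	return b
}

func main() {
	b := evaluateFused(
		[]resolveInstruction{{key: "k1", pushTS: 20}},
		[]string{"k1"},
	)
	// One entry resolves the old intent; the next replaces it, in order.
	fmt.Println(b.ops)
}
```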

The benefits of this approach are:

  • request evaluation remains ignorant of the virtualized intent resolution. We don't need to teach pebbleMVCCScanner how to ignore certain intents; it simply won't see them in its view of the storage engine.
  • read-only requests can ignore pushed intents without Raft consensus.
  • read-write requests can replace one intent with another in a single round of Raft consensus (which will often be pipelined and async).
  • in both cases, a round of synchronous (i.e. blocking) Raft consensus is avoided.

I created a prototype of this in nvanbenschoten/virtualResolve, but it still needs a lot of work.
