
kv: ignore pushed intent without Raft consensus #94730

Open
nvanbenschoten opened this issue Jan 4, 2023 · 1 comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-read-committed Related to the introduction of Read Committed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments


nvanbenschoten commented Jan 4, 2023

Sibling to #94728.

If a reader succeeds in pushing the transaction record of a conflicting intent above its read timestamp, it should be able to proceed with its read without immediately resolving that intent at a higher timestamp. Instead, it should remember that the intent is not conflicting while scanning and ignore its provisional value.
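The "remember the pushed transaction" idea can be sketched as a small request-scoped cache consulted while scanning. This is an illustrative sketch only; the names (`Timestamp`, `PushedTxnCache`, `CanIgnoreIntent`) are hypothetical and not drawn from CockroachDB's codebase.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an MVCC timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// PushedTxnCache records, per transaction ID, the timestamp the transaction
// is known to have been pushed to. A reader that successfully pushed a txn
// record above its read timestamp populates this cache instead of eagerly
// resolving the txn's intents.
type PushedTxnCache struct {
	pushed map[string]Timestamp
}

func NewPushedTxnCache() *PushedTxnCache {
	return &PushedTxnCache{pushed: make(map[string]Timestamp)}
}

// RecordPush remembers the highest timestamp a txn is known to be pushed to.
func (c *PushedTxnCache) RecordPush(txnID string, ts Timestamp) {
	if cur, ok := c.pushed[txnID]; !ok || cur.Less(ts) {
		c.pushed[txnID] = ts
	}
}

// CanIgnoreIntent reports whether an intent written by txnID no longer
// conflicts with a read at readTS, because the txn is known to have been
// pushed above the read timestamp.
func (c *PushedTxnCache) CanIgnoreIntent(txnID string, readTS Timestamp) bool {
	ts, ok := c.pushed[txnID]
	return ok && readTS.Less(ts)
}

func main() {
	c := NewPushedTxnCache()
	c.RecordPush("txn1", Timestamp{WallTime: 200})
	fmt.Println(c.CanIgnoreIntent("txn1", Timestamp{WallTime: 100})) // pushed above read: ignorable
	fmt.Println(c.CanIgnoreIntent("txn1", Timestamp{WallTime: 300})) // read above push: still conflicts
}
```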

Today, rewriting the intent at its new, higher timestamp (intent resolution) incurs a Raft consensus round. As a result, reads need to perform writes to move conflicting intents out of their way, which is undesirable.

Jira issue: CRDB-23105

Epic CRDB-38938

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team labels Jan 4, 2023
@nvanbenschoten nvanbenschoten added the A-read-committed Related to the introduction of Read Committed label Mar 30, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 13, 2023
Fixes cockroachdb#103126.

This commit extends the infrastructure introduced in cockroachdb#49218 for transaction
timestamp pushes. It avoids redundant txn pushes of PENDING transactions and
batches the resolution of PENDING intents. This breaks the O(num_intents) work
performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into
something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of
transactions that are PENDING and are known to have been pushed to higher
timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp
   that is held by a txn already pushed above our read timestamp, we neither wait
   out the kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the
   transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have written multiple intents, so we begin deferring intent resolution
   to enable batching.
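The per-Range cache described above can be sketched as a small fixed-size LRU keyed by transaction ID. This is an illustrative sketch, not the actual txnStatusCache; all names here are hypothetical.

```go
package main

import (
	"container/list"
	"fmt"
)

// pushedTxnLRU is a bounded LRU of PENDING transactions known to have been
// pushed to higher timestamps. Bounding the size keeps per-Range memory use
// constant; evicting a txn only costs a redundant push later.
type pushedTxnLRU struct {
	maxSize int
	order   *list.List               // front = most recently used
	entries map[string]*list.Element // txn ID -> element in order
}

func newPushedTxnLRU(maxSize int) *pushedTxnLRU {
	return &pushedTxnLRU{
		maxSize: maxSize,
		order:   list.New(),
		entries: make(map[string]*list.Element),
	}
}

// Add marks a txn as pushed, promoting it to most recently used and evicting
// the least recently used entry if the cache is over capacity.
func (c *pushedTxnLRU) Add(txnID string) {
	if e, ok := c.entries[txnID]; ok {
		c.order.MoveToFront(e)
		return
	}
	c.entries[txnID] = c.order.PushFront(txnID)
	if c.order.Len() > c.maxSize {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(string))
	}
}

// Contains reports whether the txn is still cached as pushed.
func (c *pushedTxnLRU) Contains(txnID string) bool {
	_, ok := c.entries[txnID]
	return ok
}

func main() {
	c := newPushedTxnLRU(2)
	c.Add("a")
	c.Add("b")
	c.Add("c") // evicts "a", the least recently used
	fmt.Println(c.Contains("a"), c.Contains("b"), c.Contains("c"))
}
```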

Together, these two changes make us much more effective at pushing transactions
with a large number of intents. The following example (from cockroachdb#103126) demonstrates
this:
```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before cockroachdb#103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution
is only currently enabled for non-locking readers without uncertainty
intervals. Readers with uncertainty intervals must contend with the
possibility of pushing a conflicting intent up into their uncertainty
interval and causing more work for themselves, which is avoided with care
by the lockTableWaiter but difficult to coordinate through the
txnStatusCache. This limitation is acceptable because the most important
case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction,
which is that non-locking readers can ignore known pending intents without
the need to even resolve those intents (see cockroachdb#94730). This will require a
request-scoped cache of pending, pushed transactions, which does not have
the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work
proportional to the number of pending intents that they encounter, so they are
over 100x faster when encountering long-running, bulk writing transactions.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 19, 2023
craig bot pushed a commit that referenced this issue Jun 20, 2023
104784: kv/concurrency: batch intent resolution of pushed intents from same txn r=arulajmani a=nvanbenschoten

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 13, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 13, 2023
@exalate-issue-sync exalate-issue-sync bot added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 7, 2023
@nvanbenschoten
Member Author

To address this, we will need to adjust the lockTableWaiter to not immediately ResolveIntent(PENDING) intents that non-locking reads encounter and are able to push to a higher timestamp using a PushTxn(PUSH_TIMESTAMP).

The original thinking here was that we would instead retain some information on the concurrency.Guard about pushed transactions and plumb this information down into pebbleMVCCScanner. The pebbleMVCCScanner, upon seeing an intent whose transaction is known to have been pushed to a higher timestamp, would ignore the intent and present the key's next version to the reader. This does not seem terribly difficult but does involve some plumbing of state around.
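The scanner-side behavior described above can be sketched as follows: when an intent's transaction is known to have been pushed above the read timestamp, the scanner skips the provisional value and returns the key's latest visible committed version instead. The types here are illustrative, not pebbleMVCCScanner's real interface.

```go
package main

import "fmt"

// version is a committed MVCC version of a key.
type version struct {
	ts    int64
	value string
}

// keyState is a simplified view of one key: an optional intent plus its
// committed version history, newest first.
type keyState struct {
	intentTxn string // "" if no intent on this key
	intentTS  int64  // timestamp of the intent, valid if intentTxn != ""
	versions  []version
}

// readKey returns the value visible to a non-locking read at readTS.
// An intent below readTS normally conflicts, but if its txn is in
// pushedAbove (known pushed above readTS), the intent is ignored and the
// key's next committed version is presented to the reader.
func readKey(ks keyState, readTS int64, pushedAbove map[string]bool) (string, bool) {
	if ks.intentTxn != "" && ks.intentTS <= readTS && !pushedAbove[ks.intentTxn] {
		return "", false // conflicting intent: would block and push the txn
	}
	for _, v := range ks.versions {
		if v.ts <= readTS {
			return v.value, true
		}
	}
	return "", true // no version visible at readTS
}

func main() {
	ks := keyState{
		intentTxn: "txn1",
		intentTS:  8,
		versions:  []version{{ts: 5, value: "committed"}},
	}
	pushed := map[string]bool{"txn1": true}
	v, ok := readKey(ks, 10, pushed)
	fmt.Println(v, ok)
}
```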


However, we arrived at a more elegant design which generalizes to other forms of intent resolution and enables fused "resolve-and-replace" consensus proposals. The idea is that we first begin deferring all intent resolution in the lockTableWaiter, similar to how we handle ResolveBeforeScanning today. A request's deferred resolution set is taken into account when determining which locks it conflicts with, so eventually every lock the request still conflicts with has a corresponding deferred "resolution instruction".

We then give requests the choice of whether to realize the deferred resolution immediately, before latching and evaluation, or to virtualize/fuse it during evaluation. To realize it immediately, the request simply issues the ResolveIntent requests and pushes them through Raft, as happens today. This is a useful fallback option.

However, requests can also handle the deferred resolution during evaluation. Read-only requests have the option to virtualize the resolution, and read-write requests have the option to fuse with it. Doing so starts with the storage.Engine constructed during evaluation. Read-write requests continue to create a storage.Batch. For the first time, read-only requests also create a storage.Batch. Then, regardless of request path (read-only vs. read-write), command evaluation is run using the Batch and the deferred ResolveIntent requests. The result is a write batch with all conflicting intents resolved such that they no longer conflict with the rest of the BatchRequest. The BatchRequest can then evaluate its original requests on top of the Batch, knowing that it is observing the post-resolution state.

The final trick is that read-write requests can then propose the entire write batch to Raft. This allows them to propose a single Raft entry that contains both the intent resolution and the subsequent intent replacement.
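The resolve-and-replace flow can be sketched at a high level: deferred resolutions are applied to a write batch first, the original request then evaluates on top of that post-resolution state, and the combined batch is proposed through Raft once. Everything below is hypothetical scaffolding, not CockroachDB's evaluation code.

```go
package main

import "fmt"

// resolveInstruction is a deferred "resolution instruction" for one intent:
// rewrite the intent for key at its pushed timestamp.
type resolveInstruction struct {
	key    string
	pushTS int64
}

// writeBatch accumulates the effects that would be proposed to Raft as a
// single entry.
type writeBatch struct {
	ops []string
}

// evaluateFused applies the deferred resolutions to the batch before
// evaluating the request's own writes on top, so the request never observes
// the conflicting intents and a single Raft proposal carries both the
// resolution and the replacement intents.
func evaluateFused(deferred []resolveInstruction, writeKeys []string) writeBatch {
	var b writeBatch
	// Step 1: fuse the deferred resolutions into the batch.
	for _, r := range deferred {
		b.ops = append(b.ops, fmt.Sprintf("resolve %s @ %d", r.key, r.pushTS))
	}
	// Step 2: evaluate the original request against the post-resolution state.
	for _, k := range writeKeys {
		b.ops = append(b.ops, fmt.Sprintf("write-intent %s", k))
	}
	// Step 3 (not modeled): propose b through Raft as one entry.
	return b
}

func main() {
	b := evaluateFused(
		[]resolveInstruction{{key: "k1", pushTS: 20}},
		[]string{"k1"},
	)
	// One entry resolves the old intent; the next replaces it, in order.
	fmt.Println(b.ops)
}
```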

The benefits of this approach are:

  • request evaluation remains ignorant of the virtualized intent resolution. We don't need to teach pebbleMVCCScanner how to ignore certain intents; it simply won't see them in its view of the storage engine.
  • read-only requests can ignore pushed intents without Raft consensus.
  • read-write requests can replace one intent with another in a single round of Raft consensus (which will often be pipelined and async).
  • in both cases, a round of synchronous (i.e. blocking) Raft consensus is avoided.

I created a prototype of this in nvanbenschoten/virtualResolve, but it still needs a lot of work.
