kv: block less on intent resolutions, reduce transaction contention footprint #22349

Open · tbg opened this issue Feb 3, 2018 · 4 comments

Labels: A-kv-transactions (Relating to MVCC and the transactional model.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) · T-kv (KV Team)

tbg (Member) commented Feb 3, 2018:

This is a brain dump on reducing command queue congestion caused by intent resolutions. I have no concrete data pinpointing this as a problem, though I suspect it can be one. In many workloads, the number of intent resolutions scales with the number of writes, so there can be a lot of them.

While an intent resolution is in flight, we block concurrent reads and writes to the affected keys. This seems overly strict. For example, consider a read that conflicts with an in-flight resolution attempting to COMMIT the value. Today, the read patiently waits for the value to be resolved (eating a few WAN latencies) and then reads it. If we didn't enforce that, the read could sneak under the resolution. It would see the intent, which doesn't cause an anomaly, but is kind of pointless because the read would then try to resolve the intent again. But imagine opportunistically keeping a map around (inflight intent resolution's transaction ID -> the resolution's eventual status). Upon seeing the intent, the read could query that map and decide what the outcome will be once the intent is resolved (note that the existence of the inflight intent resolution alone is proof that this outcome will eventually be achieved).
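
A minimal sketch of such an opportunistic map, assuming hypothetical names (`inflightResolutions`, `begin`, `finish`, `outcome`); the `roachpb` and `uuid` types are CockroachDB's:

```go
package concurrency

import (
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// inflightResolutions tracks intent resolutions that have been sent but have
// not yet applied. Because an inflight resolution is proof of its eventual
// outcome, a read that sees an intent can consult this map and act on the
// recorded status instead of blocking in the command queue.
type inflightResolutions struct {
	mu sync.RWMutex
	m  map[uuid.UUID]roachpb.TransactionStatus
}

func newInflightResolutions() *inflightResolutions {
	return &inflightResolutions{m: make(map[uuid.UUID]roachpb.TransactionStatus)}
}

// begin records a resolution that is about to be proposed.
func (r *inflightResolutions) begin(txnID uuid.UUID, status roachpb.TransactionStatus) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[txnID] = status
}

// finish removes the entry once the resolution has applied.
func (r *inflightResolutions) finish(txnID uuid.UUID) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.m, txnID)
}

// outcome returns the eventual status of an intent's transaction, if a
// resolution for it is currently in flight.
func (r *inflightResolutions) outcome(txnID uuid.UUID) (roachpb.TransactionStatus, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	s, ok := r.m[txnID]
	return s, ok
}
```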

This is even more relevant with large ranged resolutions, such as those introduced in #21078, where we put large spans in the command queue that block many reads which aren't even going to see any intents.

Something similar is potentially possible for write requests running into intents, but it's tricky (and I'm much less optimistic that it's a win) because there is write skew in emitting the proper stats update: naively, you would think the write could just do the intent resolution itself and then do its write, but the real intent resolution may have evaluated as well, and the two need to synchronize and decide who will emit the MVCCStats update. The simple way to deal with this is to prepare the write under the assumption that the intent resolution applies, and then submit the write with a bit of below-Raft logic that checks that there isn't an intent. But this verges on the territory of allowing general pipelining of writes (i.e., not blocking in the command queue), and that would be a much bigger win.

Perhaps it's useful here to have two command queues, where one is used only for intent resolutions and is queried on demand (i.e., when a write runs into an intent, it checks the intent command queue; a read doesn't).
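
A rough sketch of what that two-queue split could look like (all names hypothetical; a slice with a linear scan stands in for the real interval-based command queue):

```go
package concurrency

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// commandQueue stands in for the real interval-based command queue.
type commandQueue struct{ spans []roachpb.Span }

func (q *commandQueue) add(spans []roachpb.Span) { q.spans = append(q.spans, spans...) }

func (q *commandQueue) overlaps(key roachpb.Key) bool {
	for _, s := range q.spans {
		if s.ContainsKey(key) { // spans assumed to have EndKey set
			return true
		}
	}
	return false
}

// replicaQueues keeps intent resolutions out of the main command queue so
// they never block reads up front.
type replicaQueues struct {
	main      commandQueue // all requests except intent resolutions
	intentRes commandQueue // intent resolutions only
}

func (r *replicaQueues) sequence(spans []roachpb.Span, isIntentResolution bool) {
	if isIntentResolution {
		r.intentRes.add(spans)
		return
	}
	r.main.add(spans)
}

// pendingResolution is consulted on demand: a write that runs into an intent
// checks it before pushing the intent's transaction; reads skip it entirely.
func (r *replicaQueues) pendingResolution(key roachpb.Key) bool {
	return r.intentRes.overlaps(key)
}
```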

cc @nvanbenschoten for thoughts.

Jira issue: CRDB-5860

nvanbenschoten (Member) commented:

I like the idea of maintaining a map from inflight intent resolution txn ID to resolution status. Whenever we see an intent, we can simply check this map to determine whether we can safely ignore the intent.

The way I've pictured this in the past is that we'd actually merge this logic into the command queue itself, because that gives us a convenient synchronization point and allows us to make certain guarantees that would be impossible with an opportunistic synchronized map. To each command queue cmd, we would add an intentResolutionStatus roachpb.TransactionStatus field along with a txnID; these fields would only be populated for intent resolution requests. Reads would no longer mark commands with a status of COMMITTED or ABORTED as dependencies, but would instead build up an interval tree of overlapping intent resolutions when scanning over overlapping requests (IntervalTree<(interval, txnID), status>). Let's call this the inflightIntentResolutionTree for now.
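
A sketch of that bookkeeping (names hypothetical; a slice with a linear scan stands in for the interval tree):

```go
package concurrency

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// inflightResolution is recorded for each overlapping intent resolution
// whose command carries a finalized intentResolutionStatus.
type inflightResolution struct {
	span   roachpb.Span // assumed to have EndKey set
	txnID  uuid.UUID
	status roachpb.TransactionStatus // COMMITTED or ABORTED only
}

// inflightIntentResolutionTree is built while scanning the command queue for
// commands overlapping a read; finalized resolutions are recorded here
// rather than being treated as dependencies.
type inflightIntentResolutionTree []inflightResolution

// lookup returns the eventual status of txnID's intent at key, if an
// overlapping resolution is in flight.
func (t inflightIntentResolutionTree) lookup(
	key roachpb.Key, txnID uuid.UUID,
) (roachpb.TransactionStatus, bool) {
	for _, r := range t {
		if r.txnID == txnID && r.span.ContainsKey(key) {
			return r.status, true
		}
	}
	return roachpb.PENDING, false
}
```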

This tree would be passed all the way down with the read request to the MVCC layer, where it would be consulted in buildScanResults for each intent observed. The question would be what to do when we see an intent that overlaps this tree. First, we'd make sure we have a matching transaction in the inflightIntentResolutionTree. If we don't, we handle the intent as usual, returning a WriteIntentError. If we do, we check whether the status is COMMITTED or ABORTED. On COMMITTED we'd return the first version of that key. On ABORTED, we'd return the second. This would require us to change MVCCScan to return the first two versions of a key when it runs into an intent, instead of no versions like it does currently.
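
Sketched in terms of the hypothetical tree above, the MVCC-side check could look like the following; it assumes MVCCScan has been extended as described to surface the first two versions at the key:

```go
package concurrency

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// resolveObservedIntent decides what a scan should return for an intent it
// encounters, given the inflight resolutions collected from the command
// queue. provisional is the intent's own (first) version, previous the
// (second) version beneath it.
func resolveObservedIntent(
	intent roachpb.Intent,
	tree inflightIntentResolutionTree,
	provisional, previous roachpb.Value,
) (roachpb.Value, error) {
	status, ok := tree.lookup(intent.Key, intent.Txn.ID)
	if !ok {
		// No inflight resolution covers this intent; conflict as usual.
		return roachpb.Value{}, &roachpb.WriteIntentError{Intents: []roachpb.Intent{intent}}
	}
	switch status {
	case roachpb.COMMITTED:
		// The provisional value is guaranteed to become the committed version.
		return provisional, nil
	case roachpb.ABORTED:
		// The provisional value is guaranteed to be removed.
		return previous, nil
	default:
		return roachpb.Value{}, &roachpb.WriteIntentError{Intents: []roachpb.Intent{intent}}
	}
}
```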

With respect to the writes, I agree with you that the proposal "verges on the territory of allowing general pipelining of writes". For now, let's just focus on reads.

petermattis added this to the 2.1 milestone Feb 21, 2018
nvanbenschoten added the C-performance and A-kv-transactions labels Apr 24, 2018
nvanbenschoten (Member) commented:

Nothing is going to happen here for 2.1. Moving to 2.2.

nvanbenschoten modified the milestones: 2.1, 2.2 Aug 15, 2018
petermattis removed this from the 2.2 milestone Oct 5, 2018
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 20, 2018
Informs cockroachdb#22349.

This is inspired by the trace in cockroachdb#18684 (comment).

The change adds new state to `roachpb.Transaction`, allowing it to remember
the authoritative dispositions of a limited number of ABORTED and COMMITTED
transactions. The transaction can use this new knowledge for two purposes:
1. when a write runs into a conflicting intent for a transaction that it
   knows about, it can immediately resolve the intent and write its new
   intent, all in the same WriteBatch and Raft proposal.
2. when a scan runs into an intent for a transaction it knows about, it
   still throws a WriteIntentError, but the intentResolver doesn't need
   to push the transaction before resolving the intents.

Transactions use this local "memory" to remember the state of intents
that they have run into in the past. This change is founded on the conjecture
that transactions which contend at one location have a high probability of
contending in other locations. The reasoning for this is that clients
typically have a fixed set of queries they run, each of which takes a set
of parameters. If one or more of the parameters line up for the same
transaction type between two transactions, one or more of their statements
will touch the same rows.

It is therefore beneficial for transactions to carry around some memory
about their related transactions so that they can optimize for interactions
with them. For instance, a transaction A that pushes a transaction B after
seeing one of B's intents and finds that B is ABORTED should not need to
push transaction B again to know that it can clean up any other intents
that B has abandoned.

To test this, I ran `kv0 --batch=100 --splits=5` for 10 seconds with async
intent resolution disabled and a 1ms delay added to `PushTxnRequest`s to
simulate transaction records living on different nodes than their external
intents. I then ran the same command again with the exact same write sequence.
The effect of this is that the second run hit all of the intents abandoned by
the first run and had to clean them up as it went. This is the situation
we saw in the linked trace. Here are the results:

First run:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           4080          407.8     19.5     17.8     33.6     75.5    142.6
```

Second run without this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0            454           45.4    173.2    167.8    260.0    302.0    352.3
```

Second run with this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           2383          238.2     33.5     30.4     60.8    104.9    209.7
```

Remembering the status of related transactions allows us to improve throughput
in this degenerate case by **425%** and reduce average latency by **81%**. Of
course, this is the best-case scenario for this change. I don't necessarily
think we even need to implement it like this, but we should start thinking
about this problem. For instance, another alternative would be to introduce
an in-memory LRU cache on each `Store` that held similar information. This
has a number of trade-offs compared to the state being local to transactions.

Release note: None
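
A minimal sketch of the bounded per-transaction "memory" the commit message describes (names and the eviction policy are hypothetical; the actual change adds state to `roachpb.Transaction`):

```go
package kv

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// maxObservedTxns caps the memory carried around in the transaction proto.
const maxObservedTxns = 16

// observedTxn records the authoritative disposition of a related transaction.
type observedTxn struct {
	id     uuid.UUID
	status roachpb.TransactionStatus // ABORTED or COMMITTED only
}

// recordObserved remembers a finalized disposition, evicting the oldest
// entry once the cap is reached. A write consulting this slice can resolve a
// known transaction's intent and lay down its own in a single Raft proposal;
// a scan can skip the push before resolving.
func recordObserved(
	observed []observedTxn, id uuid.UUID, status roachpb.TransactionStatus,
) []observedTxn {
	for _, o := range observed {
		if o.id == id {
			return observed // already known
		}
	}
	if len(observed) >= maxObservedTxns {
		observed = append(observed[:0], observed[1:]...) // drop the oldest
	}
	return append(observed, observedTxn{id: id, status: status})
}
```
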
nvanbenschoten changed the title from "storage: block less on intent resolutions in command queue" to "storage: block less on intent resolutions, reduce transaction contention footprint" Aug 12, 2020
andreimatei (Contributor) commented:

Is this discussion still relevant, now that we have a separate lock table? We now discover more locks at once, and we also resolve more intents in batches. But I'm not entirely sure how that changes the calculus here.
cc @sumeerbhola

sumeerbhola (Collaborator) commented:

I think what is discussed above is trying to reduce the contention footprint, analogous to what is discussed in #41720 (comment) once we are fully converted to separated intents. It isn't implemented yet, nor have we fully implemented discovering more locks at once (we discover multiple locks only if they are in the in-memory lock table data structure), but the plan is to do both once interleaved intents are in the past.

jlinder added the T-kv label Jun 16, 2021
nvanbenschoten changed the title from "storage: block less on intent resolutions, reduce transaction contention footprint" to "kv: block less on intent resolutions, reduce transaction contention footprint" Sep 10, 2021