kv: block less on intent resolutions, reduce transaction contention footprint #22349
I like the idea of maintaining a map from inflight intent resolution txn id to resolution status. Whenever we see an intent we can simply check this to determine whether we can safely ignore the intent or not. The way I've pictured this in the past is that we'd actually merge this logic into the command queue itself, because this gives us a convenient synchronization point and allows us to make certain guarantees that would be impossible with an opportunistic synchronized map. On each command queue insertion, this tree would be passed all the way down with the read request to the MVCC layer, where it would be consulted. With respect to the writes, I agree with you that the proposal "verges into the territory of allowing general pipelining of writes". For now, let's just focus on reads.
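To make the read-side idea concrete, here is a minimal sketch in Go of how a read that encounters an intent might consult such a structure once it has been handed down to the MVCC layer. All names (`resolutionLookup`, `valueForIntent`, etc.) are hypothetical and this is not actual CockroachDB code; it only illustrates the decision logic.

```go
package main

import "fmt"

type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// resolutionLookup answers: "is an intent resolution for this transaction
// already in flight, and what status will it install?" In the proposal above
// this would be backed by state maintained alongside the command queue and
// passed down with the read request.
type resolutionLookup func(txnID string) (TxnStatus, bool)

// valueForIntent decides what a read should do when it sees an intent laid
// down by txnID with the given provisional value.
func valueForIntent(lookup resolutionLookup, txnID, provisional string) (value string, blocked bool) {
	status, ok := lookup(txnID)
	if !ok {
		// No in-flight resolution known: fall back to today's behavior and
		// block until the intent is resolved.
		return "", true
	}
	switch status {
	case Committed:
		// The resolution will commit the intent, so the read can already
		// treat the provisional value as committed.
		return provisional, false
	case Aborted:
		// The resolution will remove the intent, so the read can ignore it
		// and use the previous committed value (elided here).
		return "", false
	default:
		return "", true
	}
}

func main() {
	inflight := map[string]TxnStatus{"txn-1": Committed}
	lookup := func(id string) (TxnStatus, bool) {
		s, ok := inflight[id]
		return s, ok
	}
	v, blocked := valueForIntent(lookup, "txn-1", "provisional-value")
	fmt.Println(v, blocked) // "provisional-value false"
}
```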
Nothing is going to happen here for 2.1. Moving to 2.2.
Informs cockroachdb#22349. This is inspired by the trace in cockroachdb#18684 (comment).

The change adds new state to `roachpb.Transaction`, allowing it to remember the authoritative dispositions of a limited number of ABORTED and COMMITTED transactions. The transaction can use this new knowledge for two purposes:
1. when a write runs into a conflicting intent for a transaction that it knows about, it can immediately resolve the intent and write its new intent, all in the same WriteBatch and Raft proposal.
2. when a scan runs into an intent for a transaction it knows about, it still throws a WriteIntentError, but the intentResolver doesn't need to push the transaction before resolving the intents.

Transactions use this local "memory" to remember the state of intents they have run into in the past. This change is founded on the conjecture that transactions which contend at one location have a high probability of contending at other locations. The reasoning for this is that clients typically have a fixed set of queries they run, each of which takes a set of parameters. If one or more of the parameters line up for the same transaction type between two transactions, one or more of their statements will touch the same rows. It is therefore beneficial for transactions to carry around some memory about their related transactions so that they can optimize for interactions with them. For instance, a transaction A that pushes a transaction B after seeing one of B's intents and finds that B is ABORTED should not need to push transaction B again to know that it can clean up any other intents that B has abandoned.

To test this, I ran `kv0 --batch=100 --splits=5` for 10 seconds with async intent resolution disabled and a 1ms delay added to `PushTxnRequest`s to simulate transaction records living on different nodes than their external intents. I then ran the same command again with the exact same write sequence. The effect of this is that the second run hit all of the intents abandoned by the first run and had to clean them up as it went. This is the situation we saw in the linked trace. Here are the results:

First run:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           4080          407.8     19.5     17.8     33.6     75.5    142.6
```

Second run without this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0            454           45.4    173.2    167.8    260.0    302.0    352.3
```

Second run with this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           2383          238.2     33.5     30.4     60.8    104.9    209.7
```

Remembering the status of related transactions allows us to improve throughput in this degenerate case by **425%** and reduce average latency by **81%**. Of course, this is the best case scenario for this change. I don't necessarily think we even need to implement it like this, but we should start thinking about this problem. For instance, another alternative would be to introduce an in-memory LRU cache on each `Store` that held similar information. This has a number of trade-offs compared to the state being local to transactions.

Release note: None
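As a rough illustration of the per-transaction "memory" described above, here is a small Go sketch of a bounded record of observed transaction dispositions. The names (`observedTxns`, `maxObservedTxns`, `record`, `known`) are invented for the example and are not the fields actually added to `roachpb.Transaction`; the point is only that the structure stays small and answers the "do I already know this transaction's fate?" question without another push.

```go
package main

import "fmt"

type TxnStatus int

const (
	Aborted TxnStatus = iota
	Committed
)

const maxObservedTxns = 4 // keep the footprint on the transaction proto small

// observedTxns remembers finalized transactions in insertion order and
// evicts the oldest entry once the bound is reached.
type observedTxns struct {
	order  []string
	status map[string]TxnStatus
}

func newObservedTxns() *observedTxns {
	return &observedTxns{status: make(map[string]TxnStatus)}
}

// record stores the authoritative disposition learned from a push (or from a
// previously resolved intent).
func (o *observedTxns) record(txnID string, s TxnStatus) {
	if _, ok := o.status[txnID]; ok {
		o.status[txnID] = s
		return
	}
	if len(o.order) == maxObservedTxns {
		oldest := o.order[0]
		o.order = o.order[1:]
		delete(o.status, oldest)
	}
	o.order = append(o.order, txnID)
	o.status[txnID] = s
}

// known reports whether we already hold an authoritative disposition for the
// conflicting transaction, letting us skip the PushTxn round trip.
func (o *observedTxns) known(txnID string) (TxnStatus, bool) {
	s, ok := o.status[txnID]
	return s, ok
}

func main() {
	o := newObservedTxns()
	o.record("txn-b", Aborted)
	if s, ok := o.known("txn-b"); ok && s == Aborted {
		fmt.Println("can resolve txn-b's intents without pushing again")
	}
}
```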
Is this discussion still relevant, now that we have a separate lock table? We now discover more locks at once, and we also resolve more intents in batches. But I'm not entirely sure how that changes the calculus here.
I think what is discussed above is trying to reduce the contention footprint, analogous to what is discussed in #41720 (comment) once we are fully converted to separated intents. It isn't implemented yet, nor have we fully implemented discovering more locks at once (we discover multiple locks only if they are in the in-memory lock table data-structure), but the plan is to do these when interleaved intents are in the past.
This is a brain dump on improving command queue congestion caused by intent resolutions. I have no concrete data to pinpoint this as a problem, though I suspect it can be. In lots of workloads, the number of intent resolutions scales with the number of writes, so there can be a lot of them.
While an intent resolution is in flight, we block concurrent reads and writes to the affected keys. This seems overly strict. For example, consider a read that today conflicts with an in-flight resolution that attempts to `COMMIT` the value. Today, the read will patiently wait for the value to be resolved (eating a few WAN latencies) and then read it. But if we didn't enforce that, the read could sneak under the resolution. It would see the intent, which does not cause an anomaly, but is kind of pointless because it will then try to resolve the intent again. But imagine opportunistically keeping a map (inflight intent resolution's transaction id -> inflight request cache) around. Upon seeing the intent, the read could query that map and decide what the outcome will be once the intent is resolved (note that the existence of the inflight intent resolution alone is proof that this outcome will eventually be achieved). This is even more relevant with large ranged resolutions, such as those introduced in #21078, where we put large spans in the command queue that block many reads that aren't even going to see any intents.
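For illustration, here is a minimal Go sketch of the opportunistic synchronized map described above, keyed by the transaction ID whose intents are being resolved. All names are hypothetical and the command-queue integration discussed elsewhere in this thread is ignored; this only shows the bookkeeping a reader would consult.

```go
package main

import (
	"fmt"
	"sync"
)

type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// inflightResolutions tracks intent resolutions that have been sent but not
// yet applied, keyed by the transaction ID whose intents they resolve.
type inflightResolutions struct {
	mu sync.RWMutex
	m  map[string]TxnStatus // txn ID -> status the resolution will install
}

func newInflightResolutions() *inflightResolutions {
	return &inflightResolutions{m: make(map[string]TxnStatus)}
}

// begin records that a resolution with the given outcome is in flight.
func (ir *inflightResolutions) begin(txnID string, outcome TxnStatus) {
	ir.mu.Lock()
	defer ir.mu.Unlock()
	ir.m[txnID] = outcome
}

// done removes the entry once the resolution has applied.
func (ir *inflightResolutions) done(txnID string) {
	ir.mu.Lock()
	defer ir.mu.Unlock()
	delete(ir.m, txnID)
}

// lookup lets a reader that encounters an intent ask what the intent's fate
// will be; the existence of the in-flight resolution guarantees the outcome.
func (ir *inflightResolutions) lookup(txnID string) (TxnStatus, bool) {
	ir.mu.RLock()
	defer ir.mu.RUnlock()
	s, ok := ir.m[txnID]
	return s, ok
}

func main() {
	ir := newInflightResolutions()
	ir.begin("txn-a", Committed)
	if status, ok := ir.lookup("txn-a"); ok && status == Committed {
		fmt.Println("read can treat the intent's value as committed")
	}
	ir.done("txn-a")
}
```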
Something similar is potentially possible for write requests running into intents, but it's tricky (and I'm much less optimistic that it's a win) because there is write skew in emitting the proper stats update: naively you would think the write could just do the intent resolution itself and then do its write, but the real intent resolution may have evaluated as well, and the two need to synchronize and decide who will emit the MVCCStats update. The simple way to deal with this is to prepare the write under the assumption that the intent resolution applies, and then submit the write with a bit of below-raft logic that checks that there isn't an intent. But this verges into the territory of allowing general pipelining of writes (i.e. not blocking in the command queue) and that would be a much bigger win.
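Purely to illustrate the "below-raft check" idea, here is a very rough Go sketch: the write is prepared assuming the conflicting intent will be resolved, and a guard at apply time verifies the intent really is gone, otherwise the proposal is rejected so the stats update isn't emitted twice. Everything here (`engine`, `applyGuardedWrite`) is a stand-in, not the real evaluation or apply path.

```go
package main

import (
	"errors"
	"fmt"
)

// engine is a stand-in for the storage engine's view at apply time.
type engine struct {
	intents map[string]string // key -> txn ID holding an intent on it
}

var errIntentStillPresent = errors.New("conflicting intent not yet resolved; retry the write")

// applyGuardedWrite applies a prepared write only if no intent remains on key.
func applyGuardedWrite(e *engine, key, value string, write func(key, value string)) error {
	if txnID, ok := e.intents[key]; ok {
		// The assumed intent resolution has not applied yet; applying the
		// write now would double-count the MVCCStats update, so bail out.
		return fmt.Errorf("%w (held by %s)", errIntentStillPresent, txnID)
	}
	write(key, value)
	return nil
}

func main() {
	e := &engine{intents: map[string]string{"k1": "txn-9"}}
	err := applyGuardedWrite(e, "k1", "v2", func(k, v string) { fmt.Println("wrote", k, v) })
	fmt.Println(err) // conflicting intent not yet resolved; retry the write (held by txn-9)
}
```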
Perhaps it's useful here to have two command queues, where one is only used for intent resolutions and is queried on demand (i.e., when a write runs into an intent, check the intent command queue. When it's a read, don't.)
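To illustrate the on-demand query pattern, here is a toy Go sketch with two queues, where reads only synchronize with the main queue and a write that ran into an intent additionally checks the resolution-only queue. Spans, queue types, and method names are all invented for the example and bear no relation to the real command queue implementation.

```go
package main

import "fmt"

type span struct{ start, end string }

// overlaps treats spans as half-open intervals [start, end).
func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

// cmdQueue is a trivial stand-in for an interval-based command queue.
type cmdQueue struct{ spans []span }

func (q *cmdQueue) add(s span) { q.spans = append(q.spans, s) }

func (q *cmdQueue) conflicts(s span) bool {
	for _, o := range q.spans {
		if overlaps(o, s) {
			return true
		}
	}
	return false
}

type replica struct {
	mainQueue    cmdQueue // ordinary reads and writes
	resolveQueue cmdQueue // intent resolutions only, queried on demand
}

// mustWaitRead: reads only synchronize with the main queue and never with
// in-flight intent resolutions.
func (r *replica) mustWaitRead(s span) bool { return r.mainQueue.conflicts(s) }

// mustWaitWriteOnIntent: a write that ran into an intent additionally checks
// whether a resolution covering that key is already in flight.
func (r *replica) mustWaitWriteOnIntent(s span) bool {
	return r.mainQueue.conflicts(s) || r.resolveQueue.conflicts(s)
}

func main() {
	r := &replica{}
	r.resolveQueue.add(span{"a", "m"}) // a large ranged resolution
	fmt.Println(r.mustWaitRead(span{"b", "c"}))          // false: the read sneaks under
	fmt.Println(r.mustWaitWriteOnIntent(span{"b", "c"})) // true: the write waits
}
```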
cc @nvanbenschoten for thoughts.
Jira issue: CRDB-5860