kv: block less on intent resolutions, reduce transaction contention footprint #22349
I like the idea of maintaining a map from inflight intent resolution txn id to resolution status. Whenever we see an intent we can simply check this to determine whether we can safely ignore the intent or not. The way I've pictured this in the past is that we'd actually merge this logic into the command queue itself, because this gives us a convenient synchronization point and allows us to make certain guarantees that would be impossible with an opportunistic synchronized map. On each command queue insertion, this tree would be passed all the way down with the read request to the MVCC layer, where it would be consulted. With respect to the writes, I agree with you that the proposal "verges into the territory of allowing general pipelining of writes". For now, let's just focus on reads.
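To make the read-side idea concrete, here is a minimal sketch in Go of how a read that encounters an intent might consult such a structure once it has been handed down to the MVCC layer. All names (`resolutionLookup`, `valueForIntent`, etc.) are hypothetical and this is not actual CockroachDB code; it only illustrates the decision logic.

```go
package main

import "fmt"

type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// resolutionLookup answers: "is an intent resolution for this transaction
// already in flight, and what status will it install?" In the proposal above
// this would be backed by state maintained alongside the command queue and
// passed down with the read request.
type resolutionLookup func(txnID string) (TxnStatus, bool)

// valueForIntent decides what a read should do when it sees an intent laid
// down by txnID with the given provisional value.
func valueForIntent(lookup resolutionLookup, txnID, provisional string) (value string, blocked bool) {
	status, ok := lookup(txnID)
	if !ok {
		// No in-flight resolution known: fall back to today's behavior and
		// block until the intent is resolved.
		return "", true
	}
	switch status {
	case Committed:
		// The resolution will commit the intent, so the read can already
		// treat the provisional value as committed.
		return provisional, false
	case Aborted:
		// The resolution will remove the intent, so the read can ignore it
		// and use the previous committed value (elided here).
		return "", false
	default:
		return "", true
	}
}

func main() {
	inflight := map[string]TxnStatus{"txn-1": Committed}
	lookup := func(id string) (TxnStatus, bool) {
		s, ok := inflight[id]
		return s, ok
	}
	v, blocked := valueForIntent(lookup, "txn-1", "provisional-value")
	fmt.Println(v, blocked) // "provisional-value false"
}
```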
Nothing is going to happen here for 2.1. Moving to 2.2.
Informs cockroachdb#22349. This is inspired by the trace in cockroachdb#18684 (comment).

The change adds new state to `roachpb.Transaction`, allowing it to remember the authoritative dispositions of a limited number of ABORTED and COMMITTED transactions. The transaction can use this new knowledge for two purposes:
1. when a write runs into a conflicting intent for a transaction that it knows about, it can immediately resolve the intent and write its new intent, all in the same WriteBatch and Raft proposal.
2. when a scan runs into an intent for a transaction it knows about, it still throws a WriteIntentError, but the intentResolver doesn't need to push the transaction before resolving the intents.

Transactions use this local "memory" to remember the state of intents they have run into in the past. This change is founded on the conjecture that transactions which contend at one location have a high probability of contending at other locations. The reasoning for this is that clients typically have a fixed set of queries they run, each of which takes a set of parameters. If one or more of the parameters line up for the same transaction type between two transactions, one or more of their statements will touch the same rows. It is therefore beneficial for transactions to carry around some memory about their related transactions so that they can optimize for interactions with them. For instance, a transaction A that pushes a transaction B after seeing one of B's intents and finds that B is ABORTED should not need to push transaction B again to know that it can clean up any other intents that B has abandoned.

To test this, I ran `kv0 --batch=100 --splits=5` for 10 seconds with async intent resolution disabled and a 1ms delay added to `PushTxnRequest`s to simulate transaction records living on different nodes than their external intents. I then ran the same command again with the exact same write sequence. The effect of this is that the second run hit all of the intents abandoned by the first run and had to clean them up as it went. This is the situation we saw in the linked trace. Here are the results:

First run:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           4080          407.8     19.5     17.8     33.6     75.5    142.6
```

Second run without this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0            454           45.4    173.2    167.8    260.0    302.0    352.3
```

Second run with this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           2383          238.2     33.5     30.4     60.8    104.9    209.7
```

Remembering the status of related transactions allows us to improve throughput in this degenerate case by **425%** and reduce average latency by **81%**. Of course, this is the best case scenario for this change. I don't necessarily think we even need to implement it like this, but we should start thinking about this problem. For instance, another alternative would be to introduce an in-memory LRU cache on each `Store` that held similar information. This has a number of trade-offs compared to the state being local to transactions.

Release note: None
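As a rough illustration of the per-transaction "memory" described above, here is a small Go sketch of a bounded record of observed transaction dispositions. The names (`observedTxns`, `maxObservedTxns`, `record`, `known`) are invented for the example and are not the fields actually added to `roachpb.Transaction`; the point is only that the structure stays small and answers the "do I already know this transaction's fate?" question without another push.

```go
package main

import "fmt"

type TxnStatus int

const (
	Aborted TxnStatus = iota
	Committed
)

const maxObservedTxns = 4 // keep the footprint on the transaction proto small

// observedTxns remembers finalized transactions in insertion order and
// evicts the oldest entry once the bound is reached.
type observedTxns struct {
	order  []string
	status map[string]TxnStatus
}

func newObservedTxns() *observedTxns {
	return &observedTxns{status: make(map[string]TxnStatus)}
}

// record stores the authoritative disposition learned from a push (or from a
// previously resolved intent).
func (o *observedTxns) record(txnID string, s TxnStatus) {
	if _, ok := o.status[txnID]; ok {
		o.status[txnID] = s
		return
	}
	if len(o.order) == maxObservedTxns {
		oldest := o.order[0]
		o.order = o.order[1:]
		delete(o.status, oldest)
	}
	o.order = append(o.order, txnID)
	o.status[txnID] = s
}

// known reports whether we already hold an authoritative disposition for the
// conflicting transaction, letting us skip the PushTxn round trip.
func (o *observedTxns) known(txnID string) (TxnStatus, bool) {
	s, ok := o.status[txnID]
	return s, ok
}

func main() {
	o := newObservedTxns()
	o.record("txn-b", Aborted)
	if s, ok := o.known("txn-b"); ok && s == Aborted {
		fmt.Println("can resolve txn-b's intents without pushing again")
	}
}
```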
Is this discussion still relevant, now that we have a separate lock table? We now discover more locks at once, and we also resolve more intents in batches. But I'm not entirely sure how that changes the calculus here.
I think what is discussed above is trying to reduce the contention footprint, analogous to what is discussed in #41720 (comment) once we are fully converted to separated intents. It isn't implemented yet, nor have we fully implemented discovering more locks at once (we discover multiple locks only if they are in the in-memory lock table data-structure), but the plan is to do these when interleaved intents are in the past.
This is a brain dump on improving command queue congestion caused by intent resolutions. I have no concrete data to pinpoint this as a problem, though I suspect it can be. In lots of workloads, the number of intent resolutions scales with the number of writes, so there can be a lot of them.
While an intent resolution is in flight, we block concurrent reads and writes to the affected keys. This seems overly strict. For example, consider a read that today conflicts with an in-flight resolution that attempts to `COMMIT` the value. Today, the read will patiently wait for the value to be resolved (eating a few WAN latencies) and then read it. But if we didn't enforce that, the read could sneak under the resolution. It would see the intent, which does not cause an anomaly, but is kind of pointless because it will then try to resolve the intent again. But imagine opportunistically keeping a map (inflight intent resolution's transaction id -> inflight request cache) around. Upon seeing the intent, the read could query that map and decide what the outcome will be once the intent is resolved (note that the existence of the inflight intent resolution alone is proof that this outcome will eventually be achieved). This is even more relevant with large ranged resolutions, such as those introduced in #21078, where we put large spans in the command queue that block many reads that aren't even going to see any intents.
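For illustration, here is a minimal Go sketch of the opportunistic synchronized map described above, keyed by the transaction ID whose intents are being resolved. All names are hypothetical and the command-queue integration discussed elsewhere in this thread is ignored; this only shows the bookkeeping a reader would consult.

```go
package main

import (
	"fmt"
	"sync"
)

type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// inflightResolutions tracks intent resolutions that have been sent but not
// yet applied, keyed by the transaction ID whose intents they resolve.
type inflightResolutions struct {
	mu sync.RWMutex
	m  map[string]TxnStatus // txn ID -> status the resolution will install
}

func newInflightResolutions() *inflightResolutions {
	return &inflightResolutions{m: make(map[string]TxnStatus)}
}

// begin records that a resolution with the given outcome is in flight.
func (ir *inflightResolutions) begin(txnID string, outcome TxnStatus) {
	ir.mu.Lock()
	defer ir.mu.Unlock()
	ir.m[txnID] = outcome
}

// done removes the entry once the resolution has applied.
func (ir *inflightResolutions) done(txnID string) {
	ir.mu.Lock()
	defer ir.mu.Unlock()
	delete(ir.m, txnID)
}

// lookup lets a reader that encounters an intent ask what the intent's fate
// will be; the existence of the in-flight resolution guarantees the outcome.
func (ir *inflightResolutions) lookup(txnID string) (TxnStatus, bool) {
	ir.mu.RLock()
	defer ir.mu.RUnlock()
	s, ok := ir.m[txnID]
	return s, ok
}

func main() {
	ir := newInflightResolutions()
	ir.begin("txn-a", Committed)
	if status, ok := ir.lookup("txn-a"); ok && status == Committed {
		fmt.Println("read can treat the intent's value as committed")
	}
	ir.done("txn-a")
}
```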
Something similar is potentially possible for write requests running into intents, but it's tricky (and I'm much less optimistic that it's a win) because there is write skew in emitting the proper stats update: naively you would think the write could just do the intent resolution itself and then do its write, but the real intent resolution may have evaluated as well, and the two need to synchronize and decide who will emit the MVCCStats update. The simple way to deal with this is to prepare the write under the assumption that the intent resolution applies, and then submit the write with a bit of below-raft logic that checks that there isn't an intent. But this verges into the territory of allowing general pipelining of writes (i.e. not blocking in the command queue) and that would be a much bigger win.
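Purely to illustrate the "below-raft check" idea, here is a very rough Go sketch: the write is prepared assuming the conflicting intent will be resolved, and a guard at apply time verifies the intent really is gone, otherwise the proposal is rejected so the stats update isn't emitted twice. Everything here (`engine`, `applyGuardedWrite`) is a stand-in, not the real evaluation or apply path.

```go
package main

import (
	"errors"
	"fmt"
)

// engine is a stand-in for the storage engine's view at apply time.
type engine struct {
	intents map[string]string // key -> txn ID holding an intent on it
}

var errIntentStillPresent = errors.New("conflicting intent not yet resolved; retry the write")

// applyGuardedWrite applies a prepared write only if no intent remains on key.
func applyGuardedWrite(e *engine, key, value string, write func(key, value string)) error {
	if txnID, ok := e.intents[key]; ok {
		// The assumed intent resolution has not applied yet; applying the
		// write now would double-count the MVCCStats update, so bail out.
		return fmt.Errorf("%w (held by %s)", errIntentStillPresent, txnID)
	}
	write(key, value)
	return nil
}

func main() {
	e := &engine{intents: map[string]string{"k1": "txn-9"}}
	err := applyGuardedWrite(e, "k1", "v2", func(k, v string) { fmt.Println("wrote", k, v) })
	fmt.Println(err) // conflicting intent not yet resolved; retry the write (held by txn-9)
}
```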
Perhaps it's useful here to have two command queues, where one is only used for intent resolutions and is queried on demand (i.e., when a write runs into an intent, check the intent command queue. When it's a read, don't.)
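To illustrate the on-demand query pattern, here is a toy Go sketch with two queues, where reads only synchronize with the main queue and a write that ran into an intent additionally checks the resolution-only queue. Spans, queue types, and method names are all invented for the example and bear no relation to the real command queue implementation.

```go
package main

import "fmt"

type span struct{ start, end string }

// overlaps treats spans as half-open intervals [start, end).
func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

// cmdQueue is a trivial stand-in for an interval-based command queue.
type cmdQueue struct{ spans []span }

func (q *cmdQueue) add(s span) { q.spans = append(q.spans, s) }

func (q *cmdQueue) conflicts(s span) bool {
	for _, o := range q.spans {
		if overlaps(o, s) {
			return true
		}
	}
	return false
}

type replica struct {
	mainQueue    cmdQueue // ordinary reads and writes
	resolveQueue cmdQueue // intent resolutions only, queried on demand
}

// mustWaitRead: reads only synchronize with the main queue and never with
// in-flight intent resolutions.
func (r *replica) mustWaitRead(s span) bool { return r.mainQueue.conflicts(s) }

// mustWaitWriteOnIntent: a write that ran into an intent additionally checks
// whether a resolution covering that key is already in flight.
func (r *replica) mustWaitWriteOnIntent(s span) bool {
	return r.mainQueue.conflicts(s) || r.resolveQueue.conflicts(s)
}

func main() {
	r := &replica{}
	r.resolveQueue.add(span{"a", "m"}) // a large ranged resolution
	fmt.Println(r.mustWaitRead(span{"b", "c"}))          // false: the read sneaks under
	fmt.Println(r.mustWaitWriteOnIntent(span{"b", "c"})) // true: the write waits
}
```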
cc @nvanbenschoten for thoughts.
Jira issue: CRDB-5860