
kv: buffered writes #72614

Open
andreimatei opened this issue Nov 10, 2021 · 6 comments
Labels
A-kv-transactions (Relating to MVCC and the transactional model), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), C-performance (Perf of queries or internals; solution not expected to change functional behavior), T-kv (KV Team)

Comments

@andreimatei
Contributor

andreimatei commented Nov 10, 2021

This issue explores the potential benefits (and costs) of introducing a new mode of operation for transactions, whereby writes would be buffered on the (KV) client until commit time. Currently, of course, writes are eagerly sent to their respective leaseholders and begin (async) replication.

Motivation

  1. Avoiding pipeline stalls on future reads of the written key within the same txn. As things stand, if a txn writes a key and then reads it, the read will block on latches until the prior write finishes its async replication. This can be seen as a stall in the pipeline of transaction operations. But, given that the transaction knows what it's writing, and it knows that the successful replication of the write is checked at commit time, blocking on the latches seems unnecessary. We can imagine the value of the respective key coming from the transaction's write buffer, rather than from the leaseholder's storage (a rough sketch of such a buffer follows this list).
    • Common reads could avoid going to the leaseholder, being served instead exclusively on the client. For example, FK checks satisfied by a write in the same txn, e.g. an insert into the parent table followed by an insert into the child table in the same txn, as seen in TPCC (citation needed).
    • This idea of avoiding blocking on latches seems to only be applicable for writes from the same txn. If there's an in-flight write from another txn, even if the reader somehow had access to the proposed value, it couldn't simply read it; the reader would have to block (on a lock) until the writer commits or aborts.
  2. Avoiding writers blocking readers during the writer txn's evaluation. As things stand, writers eagerly take locks, which then block readers for the writer's lifetime. If instead writes were buffered until commit time, and no locks were taken eagerly, then readers would only conflict with committing writers, not with evaluating writers. This would mean that locks would be held for unbounded amounts of time only in cases of failures, not in cases of long-running writing transactions.
    • There are also downsides to not taking locks eagerly: the writer does not get the protection of the respective locks, and is more susceptible to being forced to change its timestamp at commit time (and thus perform a refresh). We can imagine that the policy of acquiring locks eagerly on writes would be beneficial sometimes (perhaps on transaction retries - epochs >= 1). Even when locks are acquired eagerly, the buffering of writes still gives most other advantages listed here.
  3. Amortizing write latency. At the moment, write pipelining amortizes the replication latency for writes. Still, each write performs a synchronous round-trip to the leaseholder. Buffering would avoid that.
    • This only works for blind writes, which are not common in SQL. Read-write operations (e.g. CPut) would need to be split into a read phase (which could also optionally take locks; see above) and a (buffered) write.
  4. More 1PC. At the moment, many stars need to align to get a coveted 1PC execution - the SQL statement needs to be run as an implicit txn, the statement needs to be simple enough, and the SQL execution node needs to support committing the txn in the same batch as its mutations. It's impossible to get 1PC in an explicit txn (e.g. BEGIN; INSERT; COMMIT never gets it), and it's impossible to get 1PC when inserting into two tables (e.g. even when we had interleaved tables, we still couldn't get it). By buffering writes, we'd no longer need SQL's cooperation for getting 1PC execution; it can all be under the KV client's control.
    • 1PC is good for throughput (less work to do per txn) and for tail latency (fewer round-trips between the client and the leaseholder).
    • Interleaved tables are gone. But maybe colocated tables will come back through kv,*: non-contiguous ranges #65726, allowing us to get 1PC across tables.
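
To make the buffering in 1) concrete, here is a minimal sketch of a client-side write buffer with read-your-writes and a commit-time flush. It is written in Go, but the types and names (txnWriteBuffer, bufferedWrite) are hypothetical illustrations, not CockroachDB's actual implementation; keys are plain strings and ordering uses Go's string comparison.

```go
package writebuffer

import "sort"

// bufferedWrite is a hypothetical key/value pair held in the client-side buffer.
type bufferedWrite struct {
	key   string
	value []byte // nil could represent a buffered delete
}

// txnWriteBuffer is a hypothetical client-side buffer: a txn's writes stay
// here until commit, and its reads of the same keys are served locally.
type txnWriteBuffer struct {
	writes map[string][]byte
}

func newTxnWriteBuffer() *txnWriteBuffer {
	return &txnWriteBuffer{writes: make(map[string][]byte)}
}

// Put buffers a write instead of sending it to the leaseholder.
func (b *txnWriteBuffer) Put(key string, value []byte) {
	b.writes[key] = value
}

// Get serves a read-your-writes lookup from the buffer; ok == false means
// the read still has to go to the leaseholder.
func (b *txnWriteBuffer) Get(key string) (value []byte, ok bool) {
	value, ok = b.writes[key]
	return value, ok
}

// flush returns the buffered writes in key order, as they would be attached
// to the commit-time batch alongside the EndTxn request.
func (b *txnWriteBuffer) flush() []bufferedWrite {
	out := make([]bufferedWrite, 0, len(b.writes))
	for k, v := range b.writes {
		out = append(out, bufferedWrite{key: k, value: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out
}
```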

I think 1) and 4) are big.

Drawbacks

  • There'd be complexity involved in supporting read-your-writes within a transaction if writes were buffered. We'd have to decide who's in a good position to interleave the results coming from the buffer with results coming from storage. It could be the client or the leaseholder. To elide reads to the leaseholder when they're satisfied by the buffer, the client needs to be involved.
    • One implementation idea would be to ship the buffer to leaseholders serving reads and have the MVCC layer use iterators to interleave the buffer contents and scan results (see the merge sketch after this list).
    • DistSQL is an extra complication because it seems we'd need SQL to collaborate in shipping the buffer around with the flows it schedules remotely.
  • Buffering writes until commit time on the client will have a memory footprint. Presumably we'd bound it, like we do with other txn memory footprints.
  • Splitting non-blind writes into a read and a buffered write may cost up to one extra request to the leaseholder.
  • Replication of write intents would not start until commit time, though it would still run in parallel with the replication of the STAGING txn record.
  • Writers are more susceptible to needing refreshes (see above).
  • Write-write contention likes the existence of write locks. But maybe this point is moot now with SFU locking, which does not necessarily imply blocking non-locking (snapshot, in Spanner terms) reads (think Upgrade locks).
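
To make the interleaving idea above concrete, here is a rough sketch of merging the transaction's sorted buffer with sorted scan results, with the buffered value shadowing storage on equal keys. It reuses the hypothetical bufferedWrite type from the earlier sketch; a real implementation would presumably operate at the MVCC iterator level rather than on materialized slices.

```go
// mergeBufferAndScan interleaves the txn's sorted buffered writes with
// sorted scan results fetched from storage. On a key collision the
// buffered value wins, which is what gives read-your-writes semantics.
func mergeBufferAndScan(buffer, scanned []bufferedWrite) []bufferedWrite {
	out := make([]bufferedWrite, 0, len(buffer)+len(scanned))
	i, j := 0, 0
	for i < len(buffer) && j < len(scanned) {
		switch {
		case buffer[i].key < scanned[j].key:
			out = append(out, buffer[i])
			i++
		case buffer[i].key > scanned[j].key:
			out = append(out, scanned[j])
			j++
		default: // same key: the transaction's own write shadows storage
			out = append(out, buffer[i])
			i++
			j++
		}
	}
	out = append(out, buffer[i:]...)
	out = append(out, scanned[j:]...)
	return out
}
```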

Jira issue: CRDB-11233

@andreimatei andreimatei added the A-kv-transactions, C-enhancement, C-performance, and T-kv labels and removed the C-bug label Nov 10, 2021
@ajwerner
Contributor

Relates to #60968.


> Write-write contention likes the existence of write locks. But maybe this point is moot now with SFU locking, which does not necessarily imply blocking non-locking (snapshot, in Spanner terms) reads (think Upgrade locks).

Relates to #55896 and #52768

@ajwerner
Contributor

Problem 1. seems almost orthogonal. It seems like we could solve that with more code in the MVCC layer interacting with in-memory data structures.

Problem 2 is legit and interesting. It comes with a cost. One thought I've had on this front is that instead of buffering the write in the coordinator, we could send it to the leaseholder, which could store it in memory and treat it as an upgrade lock as opposed to an exclusive lock. Then, before commit, we'd need to go and replicate. That'd help with the problems related to visibility and with coordinator RAM usage. It relates in implementation to 1. (A rough sketch of this leaseholder-side buffering follows this comment.)

  3. is somewhat legit, though I wonder if there are other approaches. Namely, parallelizing batch requests.

  4. is big but depends on kv,*: non-contiguous ranges #65726

All in all, my vote is that we do something about 1. and make a plan on #65726, leaving the rest for a later day.
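
Here is a rough sketch, under stated assumptions, of the leaseholder-side alternative mentioned above: the provisional value is held in memory under an upgrade (non-exclusive) lock and only replicated when the transaction commits. The names and shapes (leaseholderBuffer, pendingWrite) are hypothetical illustrations, not CockroachDB types.

```go
// pendingWrite is an unreplicated, in-memory provisional value held by the
// leaseholder on behalf of an in-progress transaction.
type pendingWrite struct {
	txnID string
	value []byte
}

// leaseholderBuffer holds provisional writes keyed by row key. Conceptually,
// each entry is protected by an upgrade lock rather than an exclusive lock,
// so non-locking readers are not blocked by it.
type leaseholderBuffer struct {
	pending map[string]pendingWrite
}

// stage records a provisional write without starting replication.
func (b *leaseholderBuffer) stage(key, txnID string, value []byte) {
	if b.pending == nil {
		b.pending = make(map[string]pendingWrite)
	}
	b.pending[key] = pendingWrite{txnID: txnID, value: value}
}

// promote removes and returns the writes that a committing transaction must
// now replicate as intents.
func (b *leaseholderBuffer) promote(txnID string) map[string][]byte {
	out := make(map[string][]byte)
	for k, w := range b.pending {
		if w.txnID == txnID {
			out[k] = w.value
			delete(b.pending, k)
		}
	}
	return out
}
```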

@ajwerner
Contributor

ajwerner commented Jan 5, 2022

Any opposition to pulling 1 into its own issue? I think it's important and pretty independent? I was thinking about contention footprints after @irfansharif's talk and feel like whatever we do for #22349 / #41720 (comment) we might be able to use the same data structure for this.

@andreimatei
Contributor Author

> Any opposition to pulling 1 into its own issue? I think it's important and pretty independent?

You see it as independent from the other points because it can be addressed with buffering on the leaseholder, as opposed to buffering on the client?
I personally would rather have fewer issues than more. But I'm not against you doing anything you want.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 22, 2022
First half of cockroachdb#71074.

This commit replaces all client-side use of InitPut request with the
equivalent use of CPut. It uses the following equivalence to perform
this mapping:
```
InitPut(key, value) -> CPutAllowingIfNotExists(key, value, value)
```

A future change in v23.2 (or do we now allow skipping one major
version?) can remove the server-side handling of InitPut and remove the
proto message entirely, once we no longer need to ensure mixed-version
compatibility with v22.2.

This is primarily a clean-up that reduces the KV API surface area.
However, it's also a useful simplification for cockroachdb#72614.

Release justification: None.

Release note: None.
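
As an aside on why the mapping above is safe (this paraphrases the request semantics and assumes import "bytes"; it is not the server-side code, and it ignores InitPut's failOnTombstones option): both requests allow the write only when the key is missing or already holds exactly the given value.

```go
// writeAllowed paraphrases the shared success condition: the write is
// allowed only if the key is absent (existing == nil) or already holds the
// same value. Both InitPut(key, value) and
// CPutAllowingIfNotExists(key, value, value) reduce to this check.
func writeAllowed(existing, value []byte) bool {
	return existing == nil || bytes.Equal(existing, value)
}
```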
@github-actions github-actions bot closed this as completed Sep 4, 2023
@irfansharif irfansharif reopened this Sep 4, 2023
@nvanbenschoten nvanbenschoten changed the title *: buffered writes kv: buffered writes Apr 19, 2024
@nvanbenschoten
Member

We've recently made substantial improvements to key-level locking infrastructure that should assist this project. We now support replicated key-level locks (#100193) and support pipelined replicated lock acquisition (#117978).

These new capabilities can allow a mutation statement (UPDATE, DELETE, INSERT) to acquire a replicated lock on its initial row fetch and avoid traversing back into KV just to write an intent value. Instead, the statement can buffer the intent value in the client with assurance that the row it is modifying has been locked by the transaction.
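
A rough sketch of that flow, assuming the hypothetical txnWriteBuffer from the earlier sketch: the row fetch takes a replicated exclusive lock, so the new value can stay buffered in the client until commit. rowLocker and GetForUpdateReplicated are stand-ins for illustration, not the real kv.Txn API.

```go
// rowLocker is a stand-in for the KV client; GetForUpdateReplicated is
// hypothetical shorthand for a locking read that acquires a replicated
// exclusive lock on the fetched row (per #100193 / #117978).
type rowLocker interface {
	GetForUpdateReplicated(key string) ([]byte, error)
}

// updateRow fetches and locks the existing row, then buffers the new value
// client-side instead of writing an intent immediately. The buffered value
// is flushed (and replicated) as part of the commit.
func updateRow(txn rowLocker, buf *txnWriteBuffer, key string, newValue []byte) error {
	if _, err := txn.GetForUpdateReplicated(key); err != nil {
		return err
	}
	buf.Put(key, newValue)
	return nil
}
```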
