
kv: buffered writes #72614

Open
andreimatei opened this issue Nov 10, 2021 · 6 comments
Labels
A-kv-transactions (Relating to MVCC and the transactional model), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), C-performance (Perf of queries or internals; solution not expected to change functional behavior), T-kv (KV Team)

Comments

@andreimatei
Contributor

andreimatei commented Nov 10, 2021

This issue explores the potential benefits (and costs) of introducing a new mode of operation for transactions, whereby writes would be buffered on the (KV) client until commit time. Currently, of course, writes are eagerly sent to their respective leaseholders and begin (async) replication.

Motivation

  1. Avoiding pipeline stalls on future reads of the written key within the same txn. As things stand, if a txn writes a key and then reads it, the read will block on latches until the prior write finishes its async replication. This can be seen as a stall in the pipeline of transaction operations. But, given that the transaction knows what it's writing, and it knows that the successful replication of the write is checked at commit time, blocking on the latches seems unnecessary. We can imagine the value of the respective key coming from the transaction's write buffer, rather than from the leaseholder's storage (a rough sketch of such a buffer follows this list).
    • Common reads could avoid going to the leaseholder, being served instead exclusively on the client. For example, FK checks satisfied by a write in the same txn, e.g. an insert into the parent table followed by an insert into the child table in the same txn, as seen in TPCC (citation needed).
    • This idea of avoiding blocking on latches seems to only be applicable for writes from the same txn. If there's an in-flight write from another txn, even if the reader somehow had access to the proposed value, it couldn't simply read it; the reader would have to block (on a lock) until the writer commits or aborts.
  2. Avoiding writers blocking readers during the writer txn's evaluation. As things stand, writers eagerly take locks, which then block readers for the writer's lifetime. If instead writes were buffered until commit time, and no locks were taken eagerly, then readers would only conflict with committing writers, not with evaluating writers. This would mean that locks would be held for unbounded amounts of time only in cases of failures, not in cases of long-running writing transactions.
    • There are also downsides to not taking locks eagerly: the writer does not get the protection of the respective locks, and is more susceptible to being forced to change its timestamp at commit time (and thus perform a refresh). We can imagine that the policy of acquiring locks eagerly on writes would be beneficial sometimes (perhaps on transaction retries - epochs >= 1). Even when locks are acquired eagerly, the buffering of writes still gives most other advantages listed here.
  3. Amortizing write latency. At the moment, write pipelining amortizes the replication latency for writes. Still, each write performs a synchronous round-trip to the leaseholder. Buffering would avoid that.
    • This only works for blind writes, which are not common in SQL. Read-write operations (e.g. CPut) would need to be split into a read phase (which could also optionally take locks; see above) and a (buffered) write.
  4. More 1PC. At the moment, many stars need to align to get a coveted 1PC execution - the SQL statement needs to be run as an implicit txn, the statement needs to be simple enough, and the SQL execution node needs to support committing the txn in the same batch as its mutations. It's impossible to get 1PC in an explicit txn (e.g. BEGIN; INSERT; COMMIT never gets it), and it's impossible to get 1PC when inserting into two tables (e.g. even when we had interleaved tables, we still couldn't get it). By buffering writes, we'd no longer need SQL's cooperation for getting 1PC execution; it can all be under the KV client's control.
    • 1PC is good for throughput (less work to do per txn) and for tail latency (fewer round-trips between the client and the leaseholder).
    • Interleaved tables are gone. But maybe colocated tables will come back through kv,*: non-contiguous ranges #65726, allowing us to get 1PC across tables.
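
To make the buffering in 1) concrete, here is a minimal sketch of a client-side write buffer with read-your-writes and a commit-time flush. It is written in Go, but the types and names (txnWriteBuffer, bufferedWrite) are hypothetical illustrations, not CockroachDB's actual implementation; keys are plain strings and ordering uses Go's string comparison.

```go
package writebuffer

import "sort"

// bufferedWrite is a hypothetical key/value pair held in the client-side buffer.
type bufferedWrite struct {
	key   string
	value []byte // nil could represent a buffered delete
}

// txnWriteBuffer is a hypothetical client-side buffer: a txn's writes stay
// here until commit, and its reads of the same keys are served locally.
type txnWriteBuffer struct {
	writes map[string][]byte
}

func newTxnWriteBuffer() *txnWriteBuffer {
	return &txnWriteBuffer{writes: make(map[string][]byte)}
}

// Put buffers a write instead of sending it to the leaseholder.
func (b *txnWriteBuffer) Put(key string, value []byte) {
	b.writes[key] = value
}

// Get serves a read-your-writes lookup from the buffer; ok == false means
// the read still has to go to the leaseholder.
func (b *txnWriteBuffer) Get(key string) (value []byte, ok bool) {
	value, ok = b.writes[key]
	return value, ok
}

// flush returns the buffered writes in key order, as they would be attached
// to the commit-time batch alongside the EndTxn request.
func (b *txnWriteBuffer) flush() []bufferedWrite {
	out := make([]bufferedWrite, 0, len(b.writes))
	for k, v := range b.writes {
		out = append(out, bufferedWrite{key: k, value: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out
}
```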

I think 1) and 4) are big.

Drawbacks

  • There'd be complexity involved in supporting read-your-writes within a transaction if writes were buffered. We'd have to decide who's in a good position to interleave the results coming from the buffer with results coming from storage. It could be the client or the leaseholder. To elide reads to the leaseholder when they're satisfied by the buffer, the client needs to be involved.
    • One implementation idea would be to ship the buffer to leaseholders serving reads and have the MVCC layer use iterators to interleave the buffer contents and scan results (see the merge sketch after this list).
    • DistSQL is an extra complication because it seems we'd need SQL to collaborate in shipping the buffer around with the flows it schedules remotely.
  • Buffering writes until commit time on the client will have a memory footprint. Presumably we'd bound it, like we do with other txn memory footprints.
  • Splitting non-blind writes into a read and a buffered write may cost up to one extra request to the leaseholder.
  • Replication of write intents would not start until commit time, though it would still run in parallel with the replication of the STAGING txn record.
  • Writers are more susceptible to needing refreshes (see above).
  • Write-write contention likes the existence of write locks. But maybe this point is moot now with SFU locking, which does not necessarily imply blocking non-locking (snapshot, in Spanner terms) reads (think Upgrade locks).
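
To make the interleaving idea above concrete, here is a rough sketch of merging the transaction's sorted buffer with sorted scan results, with the buffered value shadowing storage on equal keys. It reuses the hypothetical bufferedWrite type from the earlier sketch; a real implementation would presumably operate at the MVCC iterator level rather than on materialized slices.

```go
// mergeBufferAndScan interleaves the txn's sorted buffered writes with
// sorted scan results fetched from storage. On a key collision the
// buffered value wins, which is what gives read-your-writes semantics.
func mergeBufferAndScan(buffer, scanned []bufferedWrite) []bufferedWrite {
	out := make([]bufferedWrite, 0, len(buffer)+len(scanned))
	i, j := 0, 0
	for i < len(buffer) && j < len(scanned) {
		switch {
		case buffer[i].key < scanned[j].key:
			out = append(out, buffer[i])
			i++
		case buffer[i].key > scanned[j].key:
			out = append(out, scanned[j])
			j++
		default: // same key: the transaction's own write shadows storage
			out = append(out, buffer[i])
			i++
			j++
		}
	}
	out = append(out, buffer[i:]...)
	out = append(out, scanned[j:]...)
	return out
}
```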

Jira issue: CRDB-11233

@andreimatei andreimatei added the A-kv-transactions, C-enhancement, C-performance, and T-kv labels and removed the C-bug label Nov 10, 2021
@ajwerner
Contributor

Relates to #60968.


> Write-write contention likes the existence of write locks. But maybe this point is moot now with SFU locking, which does not necessarily imply blocking non-locking (snapshot, in Spanner terms) reads (think Upgrade locks).

Relates to #55896 and #52768

@ajwerner
Contributor

Problem 1. seems almost orthogonal. It seems like we could solve that with more code in the MVCC layer interacting with in-memory data structures.

Problem 2 is legit and interesting. It comes with a cost. One thought I've had on this front is that instead of buffering the write in the coordinator, we could send it to the leaseholder, which could store it in memory and treat it as an upgrade lock as opposed to an exclusive lock. Then, before commit, we'd need to go and replicate. That'd help with the problems related to visibility and with coordinator RAM usage. It relates in implementation to 1. (A rough sketch of this leaseholder-side buffering follows this comment.)

  3. is somewhat legit, though I wonder if there are other approaches. Namely, parallelizing batch requests.

  4. is big but depends on kv,*: non-contiguous ranges #65726

All in all, my vote is that we do something about 1. and make a plan on #65726, leaving the rest for a later day.
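
Here is a rough sketch, under stated assumptions, of the leaseholder-side alternative mentioned above: the provisional value is held in memory under an upgrade (non-exclusive) lock and only replicated when the transaction commits. The names and shapes (leaseholderBuffer, pendingWrite) are hypothetical illustrations, not CockroachDB types.

```go
// pendingWrite is an unreplicated, in-memory provisional value held by the
// leaseholder on behalf of an in-progress transaction.
type pendingWrite struct {
	txnID string
	value []byte
}

// leaseholderBuffer holds provisional writes keyed by row key. Conceptually,
// each entry is protected by an upgrade lock rather than an exclusive lock,
// so non-locking readers are not blocked by it.
type leaseholderBuffer struct {
	pending map[string]pendingWrite
}

// stage records a provisional write without starting replication.
func (b *leaseholderBuffer) stage(key, txnID string, value []byte) {
	if b.pending == nil {
		b.pending = make(map[string]pendingWrite)
	}
	b.pending[key] = pendingWrite{txnID: txnID, value: value}
}

// promote removes and returns the writes that a committing transaction must
// now replicate as intents.
func (b *leaseholderBuffer) promote(txnID string) map[string][]byte {
	out := make(map[string][]byte)
	for k, w := range b.pending {
		if w.txnID == txnID {
			out[k] = w.value
			delete(b.pending, k)
		}
	}
	return out
}
```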

@ajwerner
Contributor

ajwerner commented Jan 5, 2022

Any opposition to pulling 1 into its own issue? I think it's important and pretty independent? I was thinking about contention footprints after @irfansharif's talk and feel like whatever we do for #22349 / #41720 (comment) we might be able to use the same data structure for this.

@andreimatei
Contributor Author

> Any opposition to pulling 1 into its own issue? I think it's important and pretty independent?

You see it as independent from the other points because it can be addressed with buffering on the leaseholder, as opposed to buffering on the client?
I personally would rather have fewer issues than more. But I'm not against you doing anything you want.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 22, 2022
First half of cockroachdb#71074.

This commit replaces all client-side use of InitPut request with the
equivalent use of CPut. It uses the following equivalence to perform
this mapping:
```
InitPut(key, value) -> CPutAllowingIfNotExists(key, value, value)
```

A future change in v23.2 (or do we now allow skipping one major
version?) can remove the server-side handling of InitPut and remove the
proto message entirely, once we no longer need to ensure mixed-version
compatibility with v22.2.

This is primarily a clean-up that reduces the KV API surface area.
However, it's also a useful simplification for cockroachdb#72614.

Release justification: None.

Release note: None.
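
As an aside on why the mapping above is safe (this paraphrases the request semantics and assumes import "bytes"; it is not the server-side code, and it ignores InitPut's failOnTombstones option): both requests allow the write only when the key is missing or already holds exactly the given value.

```go
// writeAllowed paraphrases the shared success condition: the write is
// allowed only if the key is absent (existing == nil) or already holds the
// same value. Both InitPut(key, value) and
// CPutAllowingIfNotExists(key, value, value) reduce to this check.
func writeAllowed(existing, value []byte) bool {
	return existing == nil || bytes.Equal(existing, value)
}
```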
@github-actions github-actions bot closed this as completed Sep 4, 2023
@irfansharif irfansharif reopened this Sep 4, 2023
@nvanbenschoten nvanbenschoten changed the title *: buffered writes kv: buffered writes Apr 19, 2024
@nvanbenschoten
Member

We've recently made substantial improvements to key-level locking infrastructure that should assist this project. We now support replicated key-level locks (#100193) and support pipelined replicated lock acquisition (#117978).

These new capabilities can allow a mutation statement (UPDATE, DELETE, INSERT) to acquire a replicated lock on its initial row fetch and avoid traversing back into KV just to write an intent value. Instead, the statement can buffer the intent value in the client with assurance that the row it is modifying has been locked by the transaction.
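
A rough sketch of that flow, assuming the hypothetical txnWriteBuffer from the earlier sketch: the row fetch takes a replicated exclusive lock, so the new value can stay buffered in the client until commit. rowLocker and GetForUpdateReplicated are stand-ins for illustration, not the real kv.Txn API.

```go
// rowLocker is a stand-in for the KV client; GetForUpdateReplicated is
// hypothetical shorthand for a locking read that acquires a replicated
// exclusive lock on the fetched row (per #100193 / #117978).
type rowLocker interface {
	GetForUpdateReplicated(key string) ([]byte, error)
}

// updateRow fetches and locks the existing row, then buffers the new value
// client-side instead of writing an intent immediately. The buffered value
// is flushed (and replicated) as part of the commit.
func updateRow(txn rowLocker, buf *txnWriteBuffer, key string, newValue []byte) error {
	if _, err := txn.GetForUpdateReplicated(key); err != nil {
		return err
	}
	buf.Put(key, newValue)
	return nil
}
```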
