perf: analyze INSERT INTO .... VALUES(...) ON CONFLICT DO UPDATE SET n = n+1 #18657
@arctica Thanks for the detailed issue. First off, I'm curious why you set
Unless you were running for a long period of time, the LSM write amplification should be a non-factor. Note that every write involves a 4x write amplification outside of the LSM due to writing the Raft log and then subsequently committing the command. Both the Raft log write and the committing of the command involve a 2x write amplification due to RocksDB writing to its internal WAL and then subsequently flushing the memtable. There is a possibility this will be reduced in the future. See #16948. But that write amplification still doesn't explain the 26MiB/sec that you're seeing. The 64 bytes per entry might be a bit pessimistic given some of the other Raft and internal state that is written on each write. I'm going to file an issue to investigate in detail where the space overhead for each write is coming from.
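Spelled out as rough arithmetic (a restatement of the explanation above, not an exact accounting):

$$
\text{bytes written per SQL write} \;\approx\;
\underbrace{2}_{\text{Raft log + applied command}}
\times
\underbrace{2}_{\text{RocksDB WAL + memtable flush}}
\times \text{payload}
\;=\; 4 \times \text{payload},
$$

before any additional LSM compaction amplification on top.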
No, ranges are never merged. Merging ranges is actually more difficult than splitting ranges and is still TODO.
I believe we just fixed this bug.
Can you file separate issues about these admin UI problems?
Yes, please file a separate issue about this. We frequently run load against clusters where machines are taken down and brought back up without problem. It is possible this is a problem in your load generator.
There is a short-term option to manually split ranges using the SPLIT AT statement.
@couchand could you scan through the 10 notes above and create separate UI/backend issues as appropriate? I have a PR for item 9 already.
Item 7 should be fixed by #18260, and item 9 should be fixed by #18603. Here is the issue for the
The disk usage concerns may have been taken care of by #17733, but I made a new issue to look into it anyway: #18667. The issue of initial load performance is already on our radar, but since I couldn't find an existing issue I went ahead and made one: #18668.
I'll create tickets if I find some time. In the meantime I have now expanded the cluster to 8 nodes. Performance did not increase, which was surprising and disappointing. Adding the load generator to 2 nodes also did not increase qps; instead it seemed the cluster got overloaded as I started seeing
At this point I'd be interested in what kind of numbers other people have achieved in their clusters and how. I'm just wondering if I am pushing things to the limit or if I'm just running into a bug or a wrong configuration.
@petermattis Could you give an example of this statement, as the documentation seems to be pending? And can this be undone to revert back to size-based range splitting? I would like to experiment a bit.
We see 20k ops/sec using
Yes, this isn't documented. See https://github.com/cockroachdb/loadgen/blob/master/kv/main.go#L364 for how this is done in the kv load generator.
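For anyone following along, a minimal sketch of pre-splitting a table in the same spirit as the kv load generator linked above might look like this; the connection string, the choice of split points, and the exact SPLIT AT syntax for your CockroachDB version are assumptions to adapt.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

// preSplit manually splits the bench table so writes spread across many
// ranges (and therefore nodes) before the table is large enough to split
// on its own. Split points assume the random lower-case alphanumeric keys
// used by the load generator described in this issue.
func preSplit(db *sql.DB) error {
	const alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
	for _, c := range alphabet {
		stmt := fmt.Sprintf("ALTER TABLE bench SPLIT AT VALUES ('%c')", c)
		if _, err := db.Exec(stmt); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := preSplit(db); err != nil {
		log.Fatal(err)
	}
}
```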
@petermattis thanks. I have yet to try this SPLIT AT statement, but I think it might help because I just saw this:
Note how "Leaseholds per Store" converge nicely but the "Keys written per Second per Store" are extremely unevenly distributed. 3 nodes seem to do nearly all the kv writes in the cluster. At the time of the screenshot, the table was 1014MiB as indicated in the admin UI and had 256 ranges. The table has a replica zone set to 4MiB ranges. It's now processing close to 5000 qps.
I have now waited for another split to 512 ranges but that didn't change the qps or the load distribution. I have also confirmed that the actual load on the nodes varies heavily. The 3 nodes that are doing all those kv operations are using around 40% CPU and are doing a crazy 70-80 MiB/s in disk writes. The other nodes are all using around 15% CPU and 17-19 MiB/s in disk writes. It seems for some reason there are serious issues with the load balancing. Note: the 3 nodes that are doing the writes are not the same as the 3 nodes that are running the load generators.
I've set the range max size to 1MiB, which caused the creation of many more ranges. After a while the distribution of kv writes/s has improved but is still uneven between the nodes (a factor of 3x). Overall qps in the cluster improved to about 5500 qps. I also tried a
After a couple hours, I've taken out 2 of the nodes and performance is back down to 3200 qps. An unexpectedly big drop.
cc @a-robinson
Thanks for the great descriptions of what you're doing and seeing, @arctica.
Unfortunately the current release has load-based balancing of ranges disabled, so it's possible that things are working correctly and you just got really unlucky with the placement of replicas. If you want to try that out, you can enable it by running
However, given that in this case just three nodes were handling almost all the load, it looks like your writes may not have been distributed properly across the ranges. I realize you've changed things since then, but it would have been nice to check the
I've been playing around with this load generator on a 3-node GCE cluster today (using n1-highmem-8 with 100GB PD-SSDs). The bottleneck in the workload (assuming no splits) is single-replica raft throughput. A single raft instance of ours simply can't process requests very quickly, topping out around 750 qps for me. A lot of raft messages get dropped due to having too many requests already queued up. However, bumping up the max number of queued requests eliminates dropped raft messages but doesn't help throughput. Adding some extra instrumentation shows that the lack of batching when we're syncing new raft entries to disk appears to be the bottleneck. There's almost no batching on the followers, and each sync takes around 1.5ms. The diff that I'm running with (on top of
Example logging on the leader:
Example logging on a follower:
I'll need to page in this code a little more to figure out what sort of parallelism we should add. At a high level, it certainly doesn't seem like there's any good reason why we couldn't be syncing all of the queued-up raft log entries to disk at once.
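As a purely illustrative sketch and not CockroachDB's actual storage code, the kind of group sync being discussed here, where all of the queued-up log entries share a single fsync, could look roughly like this (all names are hypothetical):

```go
package main

import "sync"

type entry []byte

// groupSyncer batches concurrently queued log entries so that many entries
// share one fsync instead of paying one fsync per entry. syncFn stands in
// for "append these entries to the log and fsync once".
type groupSyncer struct {
	mu      sync.Mutex
	pending []entry
	waiters []chan error
	busy    bool
	syncFn  func([]entry) error
}

// Append queues e and blocks until the batch containing e has been synced.
// The first caller to find no sync in flight becomes the leader and keeps
// draining whatever accumulated while each fsync was running.
func (g *groupSyncer) Append(e entry) error {
	g.mu.Lock()
	g.pending = append(g.pending, e)
	ch := make(chan error, 1)
	g.waiters = append(g.waiters, ch)
	if g.busy {
		g.mu.Unlock()
		return <-ch // an in-flight leader will sync this entry and signal us
	}
	g.busy = true
	for len(g.pending) > 0 {
		batch, waiters := g.pending, g.waiters
		g.pending, g.waiters = nil, nil
		g.mu.Unlock()
		err := g.syncFn(batch) // one write + one fsync for the whole batch
		for _, w := range waiters {
			w <- err
		}
		g.mu.Lock()
	}
	g.busy = false
	g.mu.Unlock()
	return <-ch // our own result was already delivered above
}
```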
Hmm,
We already batch up writes to the raft log. We don't batch the application of commands, but in the log snippets you've provided there's only one of those happening at a time anyway.
@petermattis you're saying that we're consuming the raft readies too frequently, right? As in, we consume them each time the committed index gets bumped by one, at which point we'll sync that (and at least on a follower, nothing else is going on, so no synergy happens).
@bdarnell the batching that you mean is that multiple entries get put in the log before a ready is emitted? That's right (and actually that should translate to the followers, right?). Maybe we're also just committing a lot of empty batches for some reason (judging by the log messages).
I was talking about the fact that if a ready contains multiple entries, those get written as one rocksdb batch. But I think Peter is right; the fact that we call into handleRaftReady after every message prevents this batching from having much effect (on the other hand, I think this call was introduced because it improved performance for some workloads). Raft will also batch multiple entries into one message on the leader side, if multiple proposals come in before we get through a ready cycle. These entries would be batched through the log write.
Yes, that's my concern.
Probably workloads with multiple ranges. Should be easy to pass a flag to
Sorry for being away since my last message; I had to do an interview. But @petermattis has stated what I was trying to say more clearly. When we pull off, for example, 111 requests in
Doesn't the sync batching (if there's only ever one caller) also eat the millisecond timeout we have on the batch-internal batching mechanism every time? That would explain why we never get near 1000 qps and easily go past that without sync commits.
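A back-of-the-envelope reading of this concern, assuming each synced commit from a single serial caller really does wait out a ~1 ms batching window:

$$
\text{max synced commits per second} \;\lesssim\; \frac{1}{1\,\text{ms}} = 1000,
$$

which would keep a single range well under 1000 qps once the rest of the per-request work is added, in line with the ~750 qps ceiling mentioned earlier.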
That makes a lot of sense.
I'll try throwing a patch together and testing it out.
Is there a separate knob for that other than the
The 10th issue is filed here: #19262.
The leases were initially indeed all on one node but then got spread out over the other nodes over a timeframe of about 5h. Regarding the Raft Other graph: before the upgrade there usually were a few Vote/VoteResp messages and a tiny bit of Prop/TimeoutNow. After the initial peak of Vote/VoteResp after the upgrade, the blue graph shows a ton of Prop messages. These Prop messages went down slowly over a couple hours but then stabilized at around 10, which is 100x the amount from before the upgrade (0.1 at that time). Vote and VoteResp are also stable at about 3x the old amounts. There are no Snap messages as far as I can see.
OK. I bet you also have a small but non-zero value for "Leaders w/o lease" on the Ranges graph of the replication page (and it was zero before the upgrade). That's the usual cause of a steady stream of Prop/Vote/VoteResp/TimeoutNow messages. If you go to |
In addition to what @bdarnell is saying, the QPS increase in #19056 is primarily for the case where you only have one range (or just a few ranges) -- if your writes are well distributed across a lot of ranges, its effect is considerably smaller (as you've observed).
@bdarnell Correct, "Leaders w/o lease" was always zero before the upgrade and now has a small but non-zero value: 0.x out of 25K+ ranges. Unfortunately I cannot load the problemranges page. It errors out with an error 500, and in the network tab I can see
I've been looking at
Specifically, the current implementation of
An alternative would be to skip the initial scan batch and write all the keys with CPuts. CPut failures on the conflict index would then be caught. At that point we'd re-scan the conflicting rows and issue the correct write. That's 1 round-trip (the CPut batch) for the non-conflicting path, and 3 round-trips (CPut batch, read batch, write batch) in the conflicting path.
In my opinion, the current implementation chooses the wrong side of the tradeoff: the common case in most workloads is that the rows won't be conflicted. That common case should get preferential treatment and only have to do a single round-trip.
After writing this I searched around in the git history and discovered #14482, which I suppose is the issue tracking this deficiency. Right @danhhz?
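To make the round-trip structure of that proposal concrete, here is a toy, self-contained sketch over an in-memory map standing in for the KV layer; the helper names (cputBatch, scanBatch, writeBatch) are hypothetical, and each corresponds to one KV batch, i.e. one round trip:

```go
package main

import "fmt"

// A toy in-memory "KV store" used only to illustrate the round-trip
// structure discussed above; it is not the real KV client API.
var kv = map[string]int{}

// cputBatch conditionally puts every row, expecting no existing value, and
// returns the keys whose condition failed (the conflicting rows).
func cputBatch(rows map[string]int) []string {
	var conflicts []string
	for k, v := range rows {
		if _, ok := kv[k]; ok {
			conflicts = append(conflicts, k)
		} else {
			kv[k] = v
		}
	}
	return conflicts
}

// scanBatch reads the current values of the conflicting keys.
func scanBatch(keys []string) map[string]int {
	out := map[string]int{}
	for _, k := range keys {
		out[k] = kv[k]
	}
	return out
}

// writeBatch applies the ON CONFLICT DO UPDATE expression from this issue
// (n = bench.n + 1) to the conflicting rows and writes them back.
func writeBatch(existing map[string]int) {
	for k, n := range existing {
		kv[k] = n + 1
	}
}

// upsert mirrors the proposed flow: 1 round trip when nothing conflicts,
// 3 round trips (CPut batch, read batch, write batch) when something does.
func upsert(rows map[string]int) {
	conflicts := cputBatch(rows)
	if len(conflicts) == 0 {
		return
	}
	writeBatch(scanBatch(conflicts))
}

func main() {
	upsert(map[string]int{"abc": 1}) // fast path: inserted
	upsert(map[string]int{"abc": 1}) // conflict path: n becomes 2
	fmt.Println(kv["abc"])           // 2
}
```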
Currently, we assume that the transaction is not continued after a ConditionFailedError. This shows up in a few places, such as the fact that the batch is not continued (so if you send several CPuts and the first one fails, the later ones are not even attempted) and the fact that the timestamp cache is not updated to reflect the read (this was just raised in #21140 too). I think that to have a 1RTT fast path for this, we need to reclassify ConditionFailedError as a return value instead of an error (at least for CPut requests that opt in to this). Note that the slow path for conflict handling should only need two round trips: the failed CPut can return the current values, so I don't think you'd need another read. I think #14482 is slightly different, and has to do with the way INSERT ON CONFLICT DO UPDATE works when not all the columns are specified (and how IOCDU is different from UPSERT, for reasons I can't remember).
In my case, and I'd imagine in pretty much all other cases involving tracking statistics via UPSERTs or ON CONFLICT DO queries, the case where a row already exists is the common case. Imagine
I understand from the above discussion that a fast path for conflict-free queries is possible and the worst case for conflicting writes can remain at 2 RTT. Is it theoretically possible to have 1 RTT conflicting writes that reference current values? The first thought that comes to mind would be no, because it has to read the old value before writing back new values that reference the old one, but maybe there's a trick with committing only the query instead of the actual value, similar to statement-based replication.
Anyway, I have a feeling the disk writes are causing a lot of the slowness, especially if everything goes through fsync() all the time. There are simply too many writes to disk going on for the given workload.
Yes. The proposed fast-path would slow conflicting cases down a little bit (due to increased queuing as we try and fail a write instead of a read), but network-wise the conflicting case should have the same number of round trips as today.
Yes, it's possible to push the read-modify-write steps down into the KV layer so it can all be done in one round trip. We have a KV
Great to hear. I looked around a bit and it seems the KV store isn't really exposed to clients, nor is that on the roadmap. Totally understandable that there are other priorities. I guess in an ideal world there would be a KV store API that could accept either some simple scripting language or bytecode, so one could push down a bit of logic right into the storage layer while the consensus layer basically just ensures proper routing, ordering and committing of commands.
Anyway, coming back to the current situation: I've updated to the latest build yesterday and am re-running the load test. On the positive side, QPS start right away at around 2k. On the flip side, QPS slowly go down to around 1.5k - 1.6k over a course of 2h. Then, for some reason, performance improves gradually over a timeframe of 2h or so up to 2k - 2.1k, slightly better than the starting QPS. 99th percentile latency follows a similar pattern. It starts at around 150ms and increases up to 500ms over ~2.5h, then it makes a drastic dive to around 200ms and stays pretty stable. The latency improvement happens a bit after the QPS improve and is more drastic; it's a big jump (the improvement, that is; the degradation is gradual). As far as I can tell, the change in performance does not correlate with range split operations. I couldn't find any correlation with any other graphs.
I have also modified the load generator script so that I can specify a chance for a query doing a
So between reads and writes there is an order of magnitude difference, and judging by the numbers in the table above, writes also heavily impact the performance of concurrent reads in the system. For example, at 50% writes we are at 20% of the QPS of the 0% writes case.
The 2.1 release will contain several improvements to contended workloads, such as #25014. (There were other improvements as well that I haven't been able to track down right now.) I've created #29431 to track the idea of adding a KV read-modify-write primitive to reduce the round-trips for
@arctica have you gotten the chance to test this on 2.1? I'm going to close this for now because we made a few big wins in that release around contended workloads, but please feel free to re-open if you'd like to continue the discussion.
Running v1.1-beta.20170907 on Debian Stretch with the ext4 filesystem.
Nodes use --cache=20% --max-sql-memory=10GB
3 nodes in the same datacenter, all with the following hardware:
CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz (8 threads)
RAM: 64GiB
Disks: 2x 512GiB NVME in software RAID1
Test benchmark table:
CREATE TABLE bench (s string primary key, n int);
A simple load-generating script written in Go creates 100 concurrent connections and runs the following query on one of the nodes, where $1 is a random string of length 3:
INSERT INTO bench (s, n) VALUES($1, 1) ON CONFLICT (s) DO UPDATE SET n = bench.n + 1
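For reference, a minimal version of the kind of load generator described here might look like the sketch below; the connection string, database name, and key alphabet are assumptions, not the exact script used.

```go
package main

import (
	"database/sql"
	"log"
	"math/rand"
	"sync"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

const (
	workers = 100 // ~100 concurrent connections, as in the test
	query   = `INSERT INTO bench (s, n) VALUES ($1, 1)
	           ON CONFLICT (s) DO UPDATE SET n = bench.n + 1`
	letters = "abcdefghijklmnopqrstuvwxyz0123456789"
)

// randKey returns a random string of length n, the conflict knob in this test.
func randKey(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, err := db.Exec(query, randKey(3)); err != nil {
					log.Print(err)
				}
			}
		}()
	}
	wg.Wait() // runs until killed
}
```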
Base performance without any further modifications is around 300 qps, which is extremely slow. This is likely due to the small data size resulting in just 1 range, which seems to be the unit of concurrency in the system. The default max range size is 64 MiB.
Then I dropped and recreated the table and modified the range size to a min/max of 64/128 KiB to force it to split ranges very quickly. Inserts start slowly again at under 300 qps, but by something like 8 ranges qps already rise to around 700.
Once the system reaches 64 ranges after a few minutes, performance stabilizes around 2500 qps.
This is still quite low as some other databases can do nearly an order of magnitude more qps on the same hardware.
At the stable 2.5k qps, each node is using less than 50% of the available CPU power, has plenty of RAM free, and network throughput is about 1MiB/s up and down each on a gigabit network.
The disk I/O though is quite worrying, at around 26MiB/s in writes and 8% CPU spent in iowait as indicated by dstat.
The data being updated is very small (one integer). Granted, CockroachDB keeps all past values, so let's assume each update is like an insert. The string is 3 bytes plus a 4-byte integer plus overhead for metadata and encoding. Let's assume a generous 64 bytes per entry. At 2500 qps, that would be around 256KiB/s. LSM storage engines have write amplification. I'm not sure how many levels were generated in this test but I'd assume not too many, so let's assume each row is actually written 4 times as time goes by. That's 1MiB/s. Still off by a factor of 26. Not sure where all this disk I/O comes from but it seems excessive.
Batching, as suggested on Gitter, didn't help. I tried to write 10 rows per query and qps dropped by a factor of 10 accordingly. KV operations seemed stable, so it's writing the same number of rows. Additionally, one has to be very careful because the same primary key can't appear twice in a single insert, so one has to pre-process the batch items before executing the query or the query will fail with an error.
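A minimal sketch of that pre-processing step, collapsing duplicate keys client-side before a multi-row insert, could look like this; whether the excluded pseudo-table is usable this way in this version is an assumption to verify.

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"
)

// upsertBatch collapses duplicate keys into per-key counts, then issues a
// single multi-row INSERT ... ON CONFLICT. The per-key count is added to n
// so repeated keys in the batch are not lost when deduplicated.
func upsertBatch(db *sql.DB, keys []string) error {
	counts := map[string]int{}
	for _, k := range keys {
		counts[k]++
	}

	// Build one row per distinct key: VALUES ($1,$2),($3,$4),...
	var (
		placeholders []string
		args         []interface{}
	)
	i := 1
	for k, c := range counts {
		placeholders = append(placeholders, fmt.Sprintf("($%d, $%d)", i, i+1))
		args = append(args, k, c)
		i += 2
	}

	q := fmt.Sprintf(
		`INSERT INTO bench (s, n) VALUES %s
		 ON CONFLICT (s) DO UPDATE SET n = bench.n + excluded.n`,
		strings.Join(placeholders, ", "))
	_, err := db.Exec(q, args...)
	return err
}
```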
@tschottdorf asked on Gitter to see a SHOW KV TRACE of an example query. Please see below.
This was run while the load generator was still running.
I couldn't observe any benefit from larger ranges. I think if a table started out with a small range size and automatically increased this as it grows, performance could be greatly improved. At least the default of 64MiB seems way too high.
Side observations:
1. When using a shorter length for the random primary key string, like 2, which creates a lot more conflicts, the load generator quickly dies with this error:
   ERROR: TransactionStatusError: does not exist (SQLSTATE XX000)
   I am not sure what this error indicates. It might warrant its own issue.
2. Doing a TRUNCATE TABLE bench; while inserts are running results in the table not being displayed in the admin UI. It re-appears once the TRUNCATE is finished.
3. Changing the queries to pure SELECTs for a single row results in around 2200 qps.
4. Changing the queries to ON CONFLICT DO NOTHING results in around 7100 qps.
5. Refreshing the table overview in the admin UI takes several seconds because nearly 900KiB (3.46MiB uncompressed) of javascript is downloaded each time. The servers are not close to me so this causes quite some lag. CockroachDB prevents the browser from caching the assets and I think that should be changed. It should at least support ETags so the browser can cache the assets as long as the files didn't change. An alternative solution would be to use a URL which contains the hash or mtime of the binary. (A minimal sketch of the ETag idea follows after this list.)
6. Increasing the range size after over 1000 ranges were created didn't seem to result in a lower number of ranges. Are ranges ever merged?
7. The admin UI seems sensitive to the machine running the browser having a synchronized clock. I saw nodes being randomly reported as suspect and couldn't figure out what was wrong until I noticed my laptop's clock was off by a bit. It also causes the queries-per-second value to be 0 every now and then.
8. The database size in the admin UI might be off. For one table it shows me a size of 9.3GiB while the whole cluster in the overview shows a usage of 3.6GiB, which also matches the 1.2GiB size of the cockroach-data directory on each node.
9. The number of indices on the admin UI tables page seems wrong. I have a table with a primary key over 3 columns and it lists 3 indices while it should be 1.
10. I shut down node 3 via "cockroach quit", which made the load generator get stuck. No errors. After restarting the load generator, it quickly becomes stuck again. Once I brought node 3 back up, queries continued. That's a real problem for a production setup. Note that the load generator only connects to node 1. The admin UI correctly identified node 3 as dead. This also probably warrants its own issue.
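On the caching point in item 5, a minimal sketch of content-hash ETag handling in Go, purely illustrative of the suggestion and not the admin UI's actual code:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"log"
	"net/http"
)

// serveAsset writes a static asset with an ETag derived from its content,
// so browsers can revalidate cheaply and reuse their cached copy whenever
// the binary (and therefore the asset) hasn't changed.
func serveAsset(asset []byte, contentType string) http.HandlerFunc {
	etag := fmt.Sprintf(`"%x"`, sha256.Sum256(asset))
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		w.Header().Set("Content-Type", contentType)
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified) // browser keeps its cached copy
			return
		}
		w.Write(asset)
	}
}

func main() {
	js := []byte("console.log('admin ui bundle')") // placeholder asset
	http.Handle("/bundle.js", serveAsset(js, "application/javascript"))
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```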