
perf: analyze INSERT INTO .... VALUES(...) ON CONFLICT DO UPDATE SET n = n+1 #18657

Closed
arctica opened this issue Sep 21, 2017 · 54 comments

Labels: C-performance (Perf of queries or internals. Solution not expected to change functional behavior.), O-community (Originated from the community)

Comments

@arctica

arctica commented Sep 21, 2017

Running v1.1-beta.20170907 on debian stretch with ext4 filesystem.
Nodes use --cache=20% --max-sql-memory=10GB

3 nodes in the same datacenter, all with the following hardware:
CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz (8 threads)
RAM: 64GiB
Disks: 2x 512GiB NVME in software RAID1

Test benchmark table:
CREATE TABLE bench (s string primary key, n int);

A simple load-generating script written in Go creates 100 concurrent connections and runs the following query on one of the nodes, where $1 is a random string of length 3:

INSERT INTO bench (s, n) VALUES($1, 1) ON CONFLICT (s) DO UPDATE SET n = bench.n + 1
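For reference, a minimal Go sketch of such a load generator (hedged: the lib/pq driver, connection string, and key alphabet are illustrative, not necessarily what the original script used):

package main

import (
    "database/sql"
    "log"
    "math/rand"
    "sync"

    _ "github.com/lib/pq" // Postgres wire protocol driver; works against CockroachDB
)

const (
    workers = 100 // concurrent connections
    keyLen  = 3   // length of the random primary key string
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
)

func randKey() string {
    b := make([]byte, keyLen)
    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }
    return string(b)
}

func main() {
    // Connection string is illustrative; point it at one node of the cluster.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    db.SetMaxOpenConns(workers)

    const stmt = `INSERT INTO bench (s, n) VALUES ($1, 1)
                  ON CONFLICT (s) DO UPDATE SET n = bench.n + 1`

    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                if _, err := db.Exec(stmt, randKey()); err != nil {
                    log.Println(err)
                }
            }
        }()
    }
    wg.Wait() // the workers loop forever; stop with Ctrl-C
}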

Base performance without any further modifications is around 300 qps, which is extremely slow. This is likely due to the small data size resulting in just one range, which seems to be a unit of concurrency in the system. The default maximum range size is 64 MiB.

Then I dropped and recreated the table and set the range min/max size to 64/128 KiB to force ranges to split very quickly. Inserts again start slowly at under 300 qps, but at something like 8 ranges, qps already rise to around 700.

Once the system reaches 64 ranges after a few minutes, performance stabilizes at around 2500 qps.

This is still quite low as some other databases can do nearly an order of magnitude more qps on the same hardware.

At the stable 2.5k qps, each node is using less than 50% of the available CPU, has plenty of RAM free, and network throughput is about 1MiB/s up and down per node on a gigabit network.

The disk I/O though is quite worrying at around 26MiB/s of writes and 8% CPU spent in iowait as indicated by dstat.
The data being updated is very small (one integer). Granted, CockroachDB keeps all past values, so let's assume each update is like an insert. The string is 3 bytes plus a 4-byte integer plus overhead for metadata and encoding. Let's assume a generous 64 bytes per entry. At 2500 qps, that would be around 256KiB/s. LSM storage engines have write amplification; I'm not sure how many levels were generated in this test, but I'd assume not too many, so let's assume each row is actually written 4 times as time goes by. That's 1MiB/s. Still off by a factor of 26. Not sure where all this disk I/O comes from, but it seems excessive.

Batching, as suggested on Gitter, didn't help. I tried writing 10 rows per query and qps dropped by a factor of 10 accordingly; KV operations seemed stable, so it's writing the same number of rows. Additionally, one has to be careful because the same primary key can't appear twice in a single INSERT, so the batch items have to be de-duplicated before executing the query or it will fail with an error (see the sketch below).
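As an illustration of that pre-processing, a hedged sketch of the batched variant, meant to slot into the load-generator sketch above (the use of the excluded pseudo-table and the statement construction are illustrative):

// upsertBatch writes one multi-row INSERT ... ON CONFLICT statement.
// Duplicate keys are collapsed client-side first, because the same primary
// key may not appear twice in a single INSERT ... ON CONFLICT DO UPDATE.
func upsertBatch(db *sql.DB, keys []string) error {
    counts := make(map[string]int, len(keys))
    for _, k := range keys {
        counts[k]++ // de-duplicate, remembering how many increments each key needs
    }

    var (
        sb   strings.Builder // needs "strings" and "fmt" in addition to "database/sql"
        args []interface{}
    )
    sb.WriteString("INSERT INTO bench (s, n) VALUES ")
    i := 0
    for k, c := range counts {
        if i > 0 {
            sb.WriteString(", ")
        }
        fmt.Fprintf(&sb, "($%d, $%d)", 2*i+1, 2*i+2)
        args = append(args, k, c)
        i++
    }
    // excluded.n is the collapsed per-key count, so duplicates still add up correctly.
    sb.WriteString(" ON CONFLICT (s) DO UPDATE SET n = bench.n + excluded.n")

    _, err := db.Exec(sb.String(), args...)
    return err
}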

@tschottdorf asked on Gitter to see a SHOW KV TRACE of an example query. Please see below.
This was run while the load generator was still running.

root@:26257/test> SHOW KV TRACE FOR INSERT INTO bench(s, n) VALUES ('ABC', 1) ON CONFLICT (s) DO UPDATE SET n = bench.n + 1;
+----------------------------------+---------------+---------------------------------------------------------------+------------------------------------------------------+------------------+-------+
|            timestamp             |      age      |                            message                            |                       context                        |    operation     | span  |
+----------------------------------+---------------+---------------------------------------------------------------+------------------------------------------------------+------------------+-------+
| 2017-09-21 09:24:35.115653+00:00 | 0s            | output row: []                                                | [client=127.0.0.1:36494,user=root,n1] output row: [] | consuming rows   | (0,2) |
| 2017-09-21 09:24:35.115674+00:00 | 21µs28ns      | querying next range at /Table/56/1/"ABC"                      | [client=127.0.0.1:36494,user=root,n1]                | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.115696+00:00 | 42µs735ns     | r2522: sending batch 1 Scan to (n3,s3):1                      | [client=127.0.0.1:36494,user=root,n1]                | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.117193+00:00 | 1ms539µs422ns | fetched: /bench/primary/'ABC'/n -> /305                       | [client=127.0.0.1:36494,user=root,n1]                | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.117218+00:00 | 1ms565µs47ns  | Put /Table/56/1/"ABC"/0 -> /TUPLE/2:2:Int/306                 | [client=127.0.0.1:36494,user=root,n1]                | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.11724+00:00  | 1ms586µs925ns | querying next range at /Table/56/1/"ABC"/0                    | [client=127.0.0.1:36494,user=root,n1]                | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.117266+00:00 | 1ms612µs181ns | r2522: sending batch 1 Put, 1 BeginTxn, 1 EndTxn to (n3,s3):1 | [client=127.0.0.1:36494,user=root,n1]                | sql txn implicit | (0,0) |
+----------------------------------+---------------+---------------------------------------------------------------+------------------------------------------------------+------------------+-------+
(7 rows)

I couldn't observe any benefit from larger ranges. I think if a table started out with a small range size and automatically increased this as it grows, performance could be greatly improved. At least the default of 64MiB seems way too high.

Side observations:

  1. When using a shorter length for the random primary key string, such as 2, which creates a lot more conflicts, the load generator quickly dies with this error:
    ERROR: TransactionStatusError: does not exist (SQLSTATE XX000)
    I am not sure what this error indicates. It might warrant its own issue.

  2. Doing a TRUNCATE TABLE bench; while inserts are running results in the table not being displayed in the admin UI. It re-appears once the TRUNCATE is finished.

  3. Changing the queries to pure SELECTs for a single row results in around 2200 qps.

  4. Changing the queries to ON CONFLICT DO NOTHING results in around 7100 qps.

  5. Refreshing the table overview in the admin UI takes several seconds because nearly 900KiB (3.46MiB uncompressed) of javascript are downloaded each time. The servers are not close to me so this causes quite some lag. CockroachDB prevents the browser from caching the assets and I think that should be changed. It should at least support ETags so the browser can cache an asset as long as the file didn't change (a generic sketch follows after this list). An alternative solution would be to use a URL which contains the hash or mtime of the binary.

  6. Increasing the range size after over 1000 ranges were created didn't seem to result in a lower amount of ranges. Are ranges ever merged?

  7. The admin UI seems sensitive to the machine running the browser having a synchronized clock. I saw nodes being randomly reported as suspect and couldn't figure out what was wrong until I noticed my laptop's clock was off by a bit. It also causes the queries per second value to be 0 every now and then.

  8. The database size in the admin UI might be off. For one table it shows me a size of 9.3GiB while the whole cluster in the overview shows a usage of 3.6GiB which also matches the 1.2GiB size of the cockroach-data directory on each node.

  9. The number of indices in the admin ui tables page seems wrong. I have a table with a primary key over 3 columns and it lists 3 indices while it should be 1.

  10. I shut down node 3 via "cockroach quit" which made the load generator get stuck. No errors. After restarting the load generator, it quickly becomes stuck again. Once I brought node 3 back up, queries continued. That's a real problem for a production setup. Note that the load generator only connects to node 1. The admin UI correctly identified node 3 as dead. This also probably warrants its own issue.
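Regarding item 5: a generic sketch of ETag-based revalidation for static assets using Go's net/http (illustrative only; this is not how CockroachDB actually serves its admin UI bundle, and the file names are made up):

package main

import (
    "crypto/sha256"
    "fmt"
    "net/http"
    "os"
)

// etagHandler serves a static file and lets browsers revalidate it cheaply:
// the ETag is derived from the file contents, so an unchanged asset comes
// back as a 304 Not Modified with no body.
func etagHandler(path, contentType string) http.HandlerFunc {
    data, err := os.ReadFile(path)
    if err != nil {
        panic(err)
    }
    etag := fmt.Sprintf(`"%x"`, sha256.Sum256(data))

    return func(w http.ResponseWriter, r *http.Request) {
        if r.Header.Get("If-None-Match") == etag {
            w.WriteHeader(http.StatusNotModified)
            return
        }
        w.Header().Set("ETag", etag)
        w.Header().Set("Content-Type", contentType)
        w.Write(data)
    }
}

func main() {
    http.Handle("/bundle.js", etagHandler("bundle.js", "application/javascript"))
    http.ListenAndServe(":8080", nil)
}

The URL-with-hash alternative mentioned in the item achieves the same effect without any revalidation request, just keyed by the URL instead of the ETag.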

@petermattis petermattis self-assigned this Sep 21, 2017
@petermattis petermattis added this to the 1.2 milestone Sep 21, 2017
@petermattis
Collaborator

@arctica Thanks for the detailed issue.

First off, I'm curious why you set --max-sql-memory to 10GB. That's significantly higher than I think would be useful. Is there something in our docs that led you to the setting? You've got beefy machines, so I don't think this is actually causing a problem.

The disk I/O though is quite worrying at around 26MiB/s of writes and 8% CPU spent in iowait as indicated by dstat. The data being updated is very small (one integer). Granted, CockroachDB keeps all past values, so let's assume each update is like an insert. The string is 3 bytes plus a 4-byte integer plus overhead for metadata and encoding. Let's assume a generous 64 bytes per entry. At 2500 qps, that would be around 256KiB/s. LSM storage engines have write amplification; I'm not sure how many levels were generated in this test, but I'd assume not too many, so let's assume each row is actually written 4 times as time goes by. That's 1MiB/s. Still off by a factor of 26. Not sure where all this disk I/O comes from, but it seems excessive.

Unless you were running for a long period of time, the LSM write amplification should be a non-factor. Note that every write involves a 4x write amplification outside of the LSM due to writing the Raft log and then subsequently committing the command. Both the Raft log write and the committing of the command involve a 2x write amplification due to RocksDB writing to its internal WAL and then subsequently flushing the memtable. There is a possibility this will be reduced in the future. See #16948.

But that write amplification still doesn't explain the 26MiB/sec that you're seeing. The 64 bytes per entry might be a bit low given some of the other Raft and internal state that is written on each write. I'm going to file an issue to investigate in detail where the space overhead for each write is coming from.

Increasing the range size after over 1000 ranges were created didn't seem to result in a lower amount of ranges. Are ranges ever merged?

No, ranges are never merged. Merging ranges is actually more difficult than splitting ranges and is still TODO.

The admin UI seems sensitive to the machine running the browser having a synchronized clock. I saw nodes being randomly reported as suspect and couldn't figure out what was wrong until I noticed my laptop's clock was off by a bit. It also causes the queries per second value to be 0 every now and then.

I believe we just fixed this bug.

Refreshing the table overview in the admin UI takes several seconds because nearly 900KiB (3.46MiB uncompressed) of javascript are downloaded each time. The servers are not close to me so this causes quite some lag. CockroachDB prevents the browser from caching the assets and I think that should be changed. It should at least support ETags so the browser can cache an asset as long as the file didn't change. An alternative solution would be to use a URL which contains the hash or mtime of the binary.

The database size in the admin UI might be off. For one table it shows me a size of 9.3GiB while the whole cluster in the overview shows a usage of 3.6GiB which also matches the 1.2GiB size of the cockroach-data directory on each node.

The number of indices in the admin ui tables page seems wrong. I have a table with a primary key over 3 columns and it lists 3 indices while it should be 1.

Can you file separate issues about these admin UI problems?

I shut down node 3 via "cockroach quit" which made the load generator get stuck. No errors. After restarting the load generator, it quickly becomes stuck again. Once I brought node 3 back up, queries continued. That's a real problem for a production setup. Note that the load generator only connects to node 1. The admin UI correctly identified node 3 as dead. This also probably warrants its own issue.

Yes, please file a separate issue about this. We frequently run load against clusters where machines are taken down and brought back up without problem. It is possible this is a problem in your load generator.

@petermattis
Collaborator

I couldn't observe any benefit from larger ranges. I think if a table started out with a small range size and automatically increased this as it grows, performance could be greatly improved. At least the default of 64MiB seems way too high.

There is a short term option to manually split ranges using ALTER TABLE ... SPLIT AT. This isn't very satisfying. We also have a roadmap item to perform load-based splitting of ranges.

@knz
Contributor

knz commented Sep 21, 2017

@couchand could you scan through the 10 notes above and create separate UI/backend issues as appropriate. I have a PR for item 9 already.

@couchand
Contributor

Item 7 should be fixed by #18260, and item 9 should be fixed by #18603.

Here is the issue for the TRUNCATE TABLE observation: #18666

The disk usage concerns may have been taken care of by #17733, but I made a new issue to look into it anyway: #18667

The issue of initial load performance is already on our radar, but since I couldn't find an existing issue I went ahead and made one: #18668

@arctica
Author

arctica commented Sep 21, 2017

I'll create tickets if I find some time.

In the meantime I have now expanded the cluster to 8 nodes. Performance did not increase, which was surprising and disappointing. Adding the load generator to 2 nodes also did not increase qps. Instead it seemed the cluster got overloaded, as I started seeing TransactionStatusError: transaction deadline exceeded errors.

At this point I'd be interested in what kind of numbers other people have achieved in their clusters and how. I'm just wondering if I am pushing things to the limit or if I'm just somehow running into a bug or wrong configuration.

There is a short term option to manually split ranges using ALTER TABLE ... SPLIT AT. This isn't very satisfying. We also have a roadmap item to perform load-based splitting of ranges.

@petermattis Could you give an example of this statement, as the documentation seems to be pending? And can this be undone to revert back to size-based range splitting? I would like to experiment a bit.

@petermattis
Collaborator

We see 20k ops/sec using kv on a 6-node cluster where each node has 8 cores. That's for inserts to new keys, which is a somewhat different workload than the updates you're performing.

@petermattis Could you give an example of this statement, as the documentation seems to be pending? And can this be undone to revert back to size-based range splitting? I would like to experiment a bit.

Yes, this isn't documented. See https://github.com/cockroachdb/loadgen/blob/master/kv/main.go#L364 for how this is done in the kv tool.
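For anyone following along, a hedged sketch of what pre-splitting the bench table could look like from a Go client (the SPLIT AT VALUES form is shown here; the exact syntax may differ between versions, so check the docs for your release):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq"
)

// splitBench pre-splits the bench table at each single-letter prefix so the
// keyspace starts out spread over many ranges instead of a single one.
func splitBench(db *sql.DB) error {
    for c := 'A'; c <= 'Z'; c++ {
        stmt := fmt.Sprintf("ALTER TABLE bench SPLIT AT VALUES ('%c')", c)
        if _, err := db.Exec(stmt); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    if err := splitBench(db); err != nil {
        log.Fatal(err)
    }
}

Note that, per the earlier comment, ranges are never merged in this version, so manual splits cannot be undone back to purely size-based splitting.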

@arctica
Author

arctica commented Sep 22, 2017

@petermattis thanks. I have yet to try this SPLIT AT statement, but I think it might help because I just saw this:

[Screenshot: CockroachDB admin console graphs, 2017-09-22]

Note how "Leaseholds per Store" converge nicely but the "Keys written per Second per Store" are extremely unevenly distributed. 3 nodes seem to do nearly all the kv writes in the cluster.

At the time of the screenshot, the admin UI showed the table at 1014MiB with 256 ranges. The table has a replication zone set to 4MiB ranges. It's now processing close to 5000 qps.

@arctica
Author

arctica commented Sep 22, 2017

I have now waited for another split to 512 ranges but that didn't change the qps nor the load distribution.

I have also confirmed the actual load on the nodes varies heavily. The 3 nodes that are doing all those kv operations are using around 40% CPU and are doing a crazy 70-80 MiB/s in disk writes. The other nodes are all using around 15% CPU and 17-19 MiB/s in disk writes.

It seems that for some reason there are serious issues with the load balancing. Note: the 3 nodes that are doing the writes/s are not the same 3 nodes that are running the load generators.

@arctica
Author

arctica commented Sep 22, 2017

I've set the range max size to 1MiB, which caused the creation of many more ranges. After a while the distribution of kv writes/s improved but is still uneven between the nodes (a factor of about 3x). Overall qps in the cluster improved to about 5500. I also tried an ALTER TABLE bench SCATTER, which didn't seem to do much.

After a couple hours, I've taken out 2 of the nodes and performance is back to 3200 qps. An unexpectedly big drop.

@knz
Contributor

knz commented Sep 25, 2017

cc @a-robinson

@a-robinson
Contributor

Thanks for the great descriptions of what you're doing and seeing, @arctica.

Note how "Leaseholds per Store" converge nicely but the "Keys written per Second per Store" are extremely unevenly distributed. 3 nodes seem to do nearly all the kv writes in the cluster.

Unfortunately the current release has load-based balancing of ranges disabled, so it's possible that things are working correctly and you just got really unlucky with the placement of replicas. If you want to try that out, you can enable it by running SET CLUSTER SETTING kv.allocator.load_based_lease_rebalancing.enabled = true;. The issue tracking enabling it for 1.2 is #17979.

However, in this case, given that just three nodes were handling almost all the load, it kind of looks like your writes may not have been distributing properly across the ranges. I realize you've changed things since then, but it would have been nice to check the writesPerSecond stat on all the ranges via the /_status/ranges/local UI endpoint to see how many ranges are actually receiving the load.

I also tried an ALTER TABLE bench SCATTER, which didn't seem to do much.

SCATTER doesn't do a whole lot right now unless it's run right after a split, since all it does is expedite the rebalancing process that would happen anyway. We'll be changing this for the next release (#17341).

@a-robinson
Contributor

a-robinson commented Oct 4, 2017

I've been playing around with this load generator on a 3 node GCE cluster today (using n1-highmem-8 with 100GB PD-SSDs).

The bottleneck in the workload (assuming no splits) is single-replica raft throughput. A single raft instance of ours simply can't process requests very quickly, topping out around 750 qps for me. A lot of raft messages get dropped due to having too many requests already queued up. However, bumping up the max number of queued requests eliminates dropped raft messages but doesn't help throughput.

Adding some extra instrumentation shows that the lack of batching when we're syncing new raft entries to disk appears to be the bottleneck. There's almost no batching on the followers, and each sync takes around 1.5ms.

The diff that I'm running with (on top of v1.1-beta.20170928):

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index 3351713f1..c6ecc0f01 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -3355,6 +3355,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
                return stats, expl, errors.Wrap(err, expl)
        }
        elapsed := timeutil.Since(start)
+       log.Infof(ctx, "committing %d raft log entries took %v", len(rd.Entries), elapsed)
        r.store.metrics.RaftLogCommitLatency.RecordValue(elapsed.Nanoseconds())

        if len(rd.Entries) > 0 {
@@ -3433,6 +3434,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
                }
        }

+       applyStart := timeutil.Now()
        for _, e := range rd.CommittedEntries {
                switch e.Type {
                case raftpb.EntryNormal:
@@ -3543,6 +3545,8 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
                        log.Fatalf(ctx, "unexpected Raft entry: %v", e)
                }
        }
+       applyElapsed := timeutil.Since(applyStart)
+       log.Infof(ctx, "applying %d raft log entries took %v", len(rd.CommittedEntries), applyElapsed)
        if refreshReason != noReason {
                r.mu.Lock()
                r.refreshProposalsLocked(0, refreshReason)
diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index a63d2e297..8e7d66ed6 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -78,7 +78,7 @@ const (

        // replicaRequestQueueSize specifies the maximum number of requests to queue
        // for a replica.
-       replicaRequestQueueSize = 100
+       replicaRequestQueueSize = 10000

        defaultGossipWhenCapacityDeltaExceedsFraction = 0.01

@@ -3537,6 +3537,7 @@ func (s *Store) processRequestQueue(ctx context.Context, rangeID roachpb.RangeID
        q.infos = nil
        q.Unlock()

+       start := timeutil.Now()
        for _, info := range infos {
                if pErr := s.processRaftRequest(info.respStream.Context(), info.req, IncomingSnapshot{}); pErr != nil {
                        // If we're unable to process the request, clear the request queue. This
@@ -3556,6 +3557,8 @@ func (s *Store) processRequestQueue(ctx context.Context, rangeID roachpb.RangeID
                        }
                }
        }
+       elapsed := timeutil.Since(start)
+       log.Infof(ctx, "processing %d infos from raftRequestQueue took %v", len(infos), elapsed)
 }

Example logging on the leader:

I171004 19:41:07.210129 79 storage/store.go:3561  [n1,s1] processing 2 infos from raftRequestQueue took 166.2µs
I171004 19:41:07.211898 88 storage/replica.go:3358  [n1,s1,r18/1:/{Table/51-Max}] committing 3 raft log entries took 1.667688ms
I171004 19:41:07.211938 88 storage/replica.go:3549  [n1,s1,r18/1:/{Table/51-Max}] applying 0 raft log entries took 133ns
I171004 19:41:07.212004 88 storage/replica.go:3358  [n1,s1,r18/1:/{Table/51-Max}] committing 0 raft log entries took 24.741µs
I171004 19:41:07.212122 88 storage/replica.go:3549  [n1,s1,r18/1:/{Table/51-Max}] applying 1 raft log entries took 97.993µs
I171004 19:41:07.213874 88 storage/replica.go:3358  [n1,s1,r18/1:/{Table/51-Max}] committing 0 raft log entries took 1.713828ms
I171004 19:41:07.213895 88 storage/replica.go:3549  [n1,s1,r18/1:/{Table/51-Max}] applying 0 raft log entries took 123ns
I171004 19:41:07.213908 88 storage/store.go:3561  [n1,s1] processing 2 infos from raftRequestQueue took 1.954853ms
I171004 19:41:07.213972 88 storage/replica.go:3358  [n1,s1,r18/1:/{Table/51-Max}] committing 0 raft log entries took 23.518µs
I171004 19:41:07.214072 88 storage/replica.go:3549  [n1,s1,r18/1:/{Table/51-Max}] applying 1 raft log entries took 80.328µs
I171004 19:41:07.215766 88 storage/replica.go:3358  [n1,s1,r18/1:/{Table/51-Max}] committing 0 raft log entries took 1.666169ms
I171004 19:41:07.215786 88 storage/replica.go:3549  [n1,s1,r18/1:/{Table/51-Max}] applying 0 raft log entries took 134ns

Example logging on a follower:

I171004 19:44:40.455127 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.709744ms
I171004 19:44:40.455154 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 0 raft log entries took 96ns
I171004 19:44:40.456791 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.587003ms
I171004 19:44:40.456911 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 94.834µs
I171004 19:44:40.458605 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 4 raft log entries took 1.639141ms
I171004 19:44:40.458735 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 79.508µs
I171004 19:44:40.460454 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.671677ms
I171004 19:44:40.460561 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 74.532µs
I171004 19:44:40.460573 102 storage/store.go:3561  [n2,s2] processing 111 infos from raftRequestQueue took 137.886315ms

I'll need to page in this code a little more to figure out what sort of parallelism we should add. At a high level, it certainly doesn't seem like there's any good reason why we couldn't be syncing all of the queued-up raft log entries to disk at once.

@a-robinson
Contributor

And, indeed, setting kv.raft_log.synchronize = false drastically reduces the log commit latency (from 2.5ms to 100µs) and the raft work time (shown below), and increases the qps for a single range to about 2200.

[Screenshot: Raft processing time graphs, 2017-10-04 3:58 PM]

@petermattis
Collaborator

Hmm, Store.processRaftRequest is calling Replica.handleRaftReadyRaftMuLocked, but perhaps it shouldn't. Or it should only do so if this is the last request being processed for a queue.

@bdarnell
Contributor

bdarnell commented Oct 4, 2017

We already batch up writes to the raft log. We don't batch the application of commands, but in the log snippets you've provided there's only one of those happening at a time anyway. kv.raft_log.synchronize only affects log writes, not command application.

@tbg
Member

tbg commented Oct 4, 2017

@petermattis you're saying that we're consuming the raft readies too frequently, right? As in, we consume them each time the committed index gets bumped by one, at which point we'll sync that (and at least on a follower, nothing else is going on, so no synergy happens).

@tbg
Member

tbg commented Oct 4, 2017

@bdarnell the batching that you mean is that multiple entries get put in the log before a ready is emitted? That's right (and actually that should translate to the followers, right?). Maybe we're also just committing a lot of empty batches for some reason (judging by the log messages)

@bdarnell
Contributor

bdarnell commented Oct 4, 2017

the batching that you mean is that multiple entries get put in the log before a ready is emitted? That's right (and actually that should translate to the followers, right?). Maybe we're also just committing a lot of empty batches for some reason (judging by the log messages)

I was talking about the fact that if a ready contains multiple entries, those get written as one rocksdb batch. But I think Peter is right; the fact that we call into handleRaftReady after every message prevents this batching from having much effect (on the other hand, I think this call was introduced because it improved performance for some workloads).

Raft will also batch multiple entries into one message on the leader side, if multiple proposals come in before we get through a ready cycle. These entries would be batched through the log write.

@petermattis
Collaborator

@petermattis you're saying that we're consuming the raft readies too frequently, right? As in, we consume them each time the committed index gets bumped by one, at which point we'll sync that (and at least on a follower, nothing else is going on, so no synergy happens).

Yes, that's my concern.

@petermattis
Collaborator

(on the other hand, I think this call was introduced because it improved performance for some workloads).

Probably workloads with multiple ranges. Should be easy to pass a flag to processRaftRequest which indicates that it should call handleRaftReady.

@a-robinson
Contributor

Sorry for being away since my last message; I had to do an interview. But @petermattis has stated what I was trying to say more clearly. When we pull off, for example, 111 requests in processRequestQueue on a follower, each with a single entry, we shouldn't be syncing each entry to the raft log individually:

I171004 19:44:40.460573 102 storage/store.go:3561  [n2,s2] processing 111 infos from raftRequestQueue took 137.886315ms
I171004 19:44:40.462339 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.716379ms
I171004 19:44:40.462450 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 78.747µs
I171004 19:44:40.462502 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 0 raft log entries took 14.664µs
I171004 19:44:40.462582 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 55.624µs
I171004 19:44:40.462615 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 0 raft log entries took 10.081µs
I171004 19:44:40.462680 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 41.54µs
I171004 19:44:40.464327 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.622787ms
I171004 19:44:40.464350 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 0 raft log entries took 114ns
I171004 19:44:40.466104 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.725859ms
I171004 19:44:40.466123 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 0 raft log entries took 85ns
I171004 19:44:40.466175 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 0 raft log entries took 22.239µs
I171004 19:44:40.466238 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 1 raft log entries took 44.284µs
I171004 19:44:40.467845 102 storage/replica.go:3358  [n2,s2,r18/3:/{Table/51-Max}] committing 1 raft log entries took 1.57756ms
I171004 19:44:40.467865 102 storage/replica.go:3549  [n2,s2,r18/3:/{Table/51-Max}] applying 0 raft log entries took 60ns
...
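To put numbers on that, a toy Go sketch (not CockroachDB code) comparing one sync per queued entry against a single batched sync; the ~1.5 ms per-sync figure is taken from the log lines above:

package main

import (
    "fmt"
    "time"
)

// fakeSync stands in for an fsync of the Raft log / RocksDB WAL.
func fakeSync(entries int) {
    time.Sleep(1500 * time.Microsecond) // ~1.5ms regardless of how many entries are in the batch
}

func main() {
    const queued = 111 // as in the follower log above

    // One sync per queued entry: roughly what the follower is doing today.
    start := time.Now()
    for i := 0; i < queued; i++ {
        fakeSync(1)
    }
    fmt.Println("per-entry syncs:    ", time.Since(start)) // ~167ms, in the ballpark of the observed 138ms

    // One sync for the whole drained queue: the batching being proposed.
    start = time.Now()
    fakeSync(queued)
    fmt.Println("single batched sync:", time.Since(start)) // ~1.5ms
}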

@tbg
Member

tbg commented Oct 4, 2017

Doesn't the sync batching (if there's only ever one caller) also eat the millisecond timeout we have on the batch-internal batching mechanism every time? That would explain why we never get near 1000qps and easily go past that without sync commits.

Should be easy to pass a flag to processRaftRequest which indicates that it should call handleRaftReady.

That makes a lot of sense.

@a-robinson
Contributor

I'll try throwing a patch together and testing it out.

Doesn't the sync batching (if there's only ever one caller) also eat the millisecond timeout we have on the batch-internal batching mechanism every time? That would explain why we never get near 1000qps and easily go past that without sync commits.

Is there a separate knob for that other than the rocksdb.min_wal_sync_interval setting? Because that's set to 0 by default (including on this test cluster).

@dianasaur323
Contributor

10th issue is filed here: #19262

@arctica
Author

arctica commented Oct 16, 2017

Something strange is going on in that last set of graphs. All the leases are on one node, the "Raft Other" section is showing something new after the restart (is that Snap messages?), the raft log queue is growing, and the replication queue is more active than before. (My guess is that the leases being concentrated on one node is causing the other things, but it's hard to be sure).

The leases were initially indeed all on one node but then got spread out over the other nodes over a timeframe of about 5h.

Regarding the Raft Other graph: before the upgrade there usually were a few Vote/VoteResp messages and a tiny bit of Prop/TimeoutNow. After the initial peak of Vote/VoteResp following the upgrade, the blue graph shows a ton of Prop messages. These Prop messages went down slowly over a couple of hours but then stabilized at around 10, which is 100x the amount before the upgrade (0.1 at that time). Vote and VoteResp are also stable at about 3x the old amounts. There are no Snap messages as far as I can see.

@bdarnell
Contributor

OK. I bet you also have a small but non-zero value for "Leaders w/o lease" on the Ranges graph of the replication page (and it was zero before the upgrade). That's the usual cause of a steady stream of Prop/Vote/VoteResp/TimeoutNow messages.

If you go to /#/reports/problemranges it should show links to the affected range(s). Can we see a screenshot of that page, and also of one of the range pages when you click through?

@a-robinson
Contributor

Unfortunately I am not seeing any performance improvement as big as I'd expected from the changes in #19056. There's maybe a 10% overall increase in qps throughput.

In addition to what @bdarnell is saying, the QPS increase in #19056 is primarily for the case where you only have one range (or just a few ranges) -- if your writes are well distributed across a lot of ranges, its effect is considerably smaller (as you've observed).

@arctica
Author

arctica commented Oct 16, 2017

@bdarnell Correct, "Leaders w/o lease" was always zero before the upgrade and after has a small but non-zero value. 0.x out of 25K+ ranges. Unfortunately I cannot load the problemranges page. It errors out with an error 500 and in the network tab I can see error | failed to marshal error message. Couldn't spot anything in the log either to see what exactly fails.

@jordanlewis
Member

I've been looking at INSERT ON CONFLICT for #20985. I think there's a more obvious performance problem than the one we're discussing here actually.

Specifically, the current implementation of INSERT ON CONFLICT always does 2 round-trips even in the non-conflicting case: a scan or get batch (depending on the kind of conflict index) to get all the current values for all of the keys to write, and a put batch to write the final values for all of the keys.

An alternative would be to skip the initial scan batch and write all the keys with CPuts. CPut failures on the conflict index would then be caught. At that point we'd re-scan the conflicting rows and issue the correct write. That's 1 round-trip (the CPut batch) for the non-conflicting path, and 3 round-trips (CPut batch, read batch, write batch) in the conflicting path.

In my opinion, the current implementation chooses the wrong side of the tradeoff - the common case in most workloads is that the rows won't be conflicted. That common case should get preferential treatment and only have to do a single roundtrip.
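A rough sketch of that proposed ordering, written against a hypothetical KV interface rather than CockroachDB's actual client API (CPut, Get, and Put here are stand-ins for the batch operations discussed above):

package upsertsketch

import "errors"

// errConditionFailed stands in for the KV layer's ConditionFailedError.
var errConditionFailed = errors.New("condition failed")

// kv is a hypothetical abstraction over the KV batches discussed above;
// the real client API differs.
type kv interface {
    CPut(key, val, expVal []byte) error // conditional put: fails unless the existing value equals expVal
    Get(key []byte) ([]byte, error)
    Put(key, val []byte) error
}

// upsertOptimistic tries the conditional write first and only falls back to
// read-then-write when a conflict is actually hit.
func upsertOptimistic(db kv, key, newVal []byte, merge func(existing, proposed []byte) []byte) error {
    // Fast path: expect no existing row (expVal == nil). Non-conflicting
    // upserts finish in this single round trip.
    err := db.CPut(key, newVal, nil)
    if err == nil {
        return nil
    }
    if !errors.Is(err, errConditionFailed) {
        return err
    }

    // Slow path: the row exists, so read it and write the merged result,
    // two more round trips on top of the failed CPut.
    old, err := db.Get(key)
    if err != nil {
        return err
    }
    return db.Put(key, merge(old, newVal))
}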

After writing this I searched around in the git history and discovered #14482, which I suppose is the issue that is tracking this deficiency. Right @danhhz?

@bdarnell
Contributor

bdarnell commented Jan 3, 2018

Currently, we assume that the transaction is not continued after a ConditionFailedError. This shows up in a few places, such as the fact that the batch is not continued (so if you send several CPuts and the first one fails, the later ones are not even attempted) and the timestamp cache is not updated to reflect the read (this was just raised in #21140 too).

I think that to have a 1RTT fast path for this, we need to reclassify ConditionFailedError as a return value instead of an error (at least for CPut requests that opt in to this). Note that the slow path for conflict handling should only need two round trips: the failed CPut can return the current values, so I don't think you'd need another read.

I think #14482 is slightly different, and has to do with the way INSERT ON CONFLICT DO UPDATE works when not all the columns are specified (and how IOCDU is different from UPSERT, for reasons I can't remember).

@jordanlewis
Member

You're right about #14482 - after talking with @danhhz I've confirmed that it's about the UPSERT-style fast path where the write can happen with a single batch of non-conditional puts when there are no secondary indexes (and possibly some other conditions).

@arctica
Author

arctica commented Jan 4, 2018

That's 1 round-trip (the CPut batch) for the non-conflicting path, and 3 round-trips (CPut batch, read batch, write batch) in the conflicting path.

In my opinion, the current implementation chooses the wrong side of the tradeoff - the common case in most workloads is that the rows won't be conflicted. That common case should get preferential treatment and only have to do a single roundtrip.

In my case, and I imagine in pretty much all other cases involving tracking statistics via UPSERTs or ON CONFLICT DO queries, the case where a row already exists is the common one. Imagine INSERT INTO stats (domain, country, hits) VALUES('example.com', 'US', 1) ON CONFLICT (domain, country) DO UPDATE SET hits = stats.hits + 1. Naturally the keys with the most activity will generate a lot of conflicts unless someone has an extremely wide spread of stats dimensions. A ton of projects use UPSERTs in databases to do this kind of metric tracking.

I understand from the above discussion that a fast path for conflict free queries is possible and the worst case for conflicting writes can remain at 2 RTT.

Is it theoretically possible to have 1 RTT conflicting writes that reference current values? My first thought would be no, because it has to read the old value before writing back new values that reference the old one, but maybe there's a trick with committing only the query instead of the actual value, similar to statement-based replication.

Anyway, I have a feeling the disk writes are causing a lot of the slowness, especially if everything goes through fsync() all the time. There are simply too many writes to disk going on for the given workload.

@bdarnell
Contributor

bdarnell commented Jan 4, 2018

I understand from the above discussion that a fast path for conflict free queries is possible and the worst case for conflicting writes can remain at 2 RTT.

Yes. The proposed fast-path would slow conflicting cases down a little bit (due to increased queuing as we try and fail a write instead of a read), but network-wise the conflicting case should have the same number of round trips as today.

Is it theoretically possible to have 1 RTT conflicting writes that reference current values?

Yes, it's possible to push the read-modify-write steps down into the KV layer so it can all be done in one round trip. We have a KV Increment command that serves as a proof of concept for this, although it's not hooked up for use from SQL. This would be the ideal solution for your ON CONFLICT DO UPDATE SET x=x+1 (as opposed to UPSERT which just does a blind write of specified values).

@arctica
Author

arctica commented Jan 5, 2018

Yes, it's possible to push the read-modify-write steps down into the KV layer so it can all be done in one round trip. We have a KV Increment command that serves as a proof of concept for this, although it's not hooked up for use from SQL. This would be the ideal solution for your ON CONFLICT DO UPDATE SET x=x+1 (as opposed to UPSERT which just does a blind write of specified values).

Great to hear. I looked around a bit and it seems the KV store isn't really exposed to clients, nor is that on the roadmap. Totally understandable that there are other priorities. I guess in an ideal world there would be a KV store API that could accept either some simple scripting language or bytecode, so one could push a bit of logic right down into the storage layer and the consensus layer would basically just ensure proper routing, ordering and committing of commands.

Anyway, coming back to the current situation, I updated to the latest build yesterday and am re-running the load test. On the positive side: QPS start right away at around 2k. On the flip side: QPS slowly drop to around 1.5k - 1.6k over the course of 2h. Then, for some reason, performance improves gradually over a timeframe of 2h or so up to 2k - 2.1k, slightly better than the starting QPS. 99th percentile latency follows a similar pattern: it starts at around 150ms, increases to around 500ms over ~2.5h, then makes a drastic dive to around 200ms and stays pretty stable. The latency improvement happens a bit after the QPS improve and is more drastic: the improvement is a big jump, while the degradation is gradual. As far as I can tell, the change in performance does not correlate with range split operations. I couldn't find any correlation with any other graphs either.

[Screenshot: QPS and latency graphs, 2018-01-05 12:06]

I have also modified the load generator script so that I can specify a chance of a query doing ON CONFLICT DO NOTHING instead of ON CONFLICT DO UPDATE. This resulted in some interesting numbers, as seen below. Note that all tests are now run on a table that is already filled with all possible random values for s, so no new rows are ever actually inserted.

Update Chance    ~QPS
0%               22k+
10%              13k+
25%              8-9k
50%              4.5k
75%              2.5k - 3.2k
100%             2k - 2.2k

A SELECT n on a random row results in roughly 21k QPS.
A SELECT as above but fetching AS OF SYSTEM TIME 1 minute ago results in roughly 1.2k QPS, which is surprisingly slow.
A pure UPDATE bench SET n = n + 1 WHERE s = ? results in roughly the same QPS as the INSERT with a 100% update chance on conflict. I guess this means the whole issue applies just as much to plain UPDATEs as to INSERT ON CONFLICT DO UPDATE?

So there is an order-of-magnitude difference between reads and writes, and judging by the numbers in the table above, writes also heavily impact the performance of concurrent reads in the system. For example, at 50% writes we get only 20% of the QPS of the 0% writes case.

@a-robinson a-robinson modified the milestones: 2.0, 2.1 Feb 27, 2018
@knz knz added C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-community Originated from the community and removed O-community-questions labels Apr 24, 2018
@tbg tbg added the A-coreperf label Aug 21, 2018
@petermattis
Collaborator

The 2.1 release will contain several improvements to contended workloads such as #25014. (There were other improvements as well that I haven't been able to track down right now). I've created #29431 to track the idea of adding a KV read-modify-write primitive to reduce the round-trips for UPDATE operations.

@petermattis petermattis removed this from the 2.1 milestone Aug 31, 2018
@a-robinson a-robinson removed their assignment Jan 11, 2019
@nvanbenschoten
Member

@arctica have you gotten the chance to test this on 2.1? I'm going to close this for now because we made a few big wins in that release around contended workloads, but please feel free to re-open if you'd like to continue the discussion.
