
changefeedccl: occasional inconsistent performance on long-running tpcc-1000 #32104

Closed
danhhz opened this issue Nov 1, 2018 · 13 comments · Fixed by #32909
Assignees: danhhz
Labels: A-cdc (Change Data Capture), C-investigation (Further steps needed to qualify. C-label will change.), C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)

Comments

danhhz commented Nov 1, 2018

If the cdc/w=1000/nodes=3/init=false roachtest is altered to run indefinitely instead of 2h, it will eventually see inconsistent performance.

Usually the latest changefeed-level resolved timestamp lags real time by 1-2s. Occasionally it will hit some badness and spike up to single-digit minutes. This can go on for several hours. I keep accidentally cutting the test short, so it's unclear if it ever recovers on its own.

My first guess is that some range activity (splits, rebalances) is happening and causing the ExportRequests to get slower.

danhhz added the C-investigation, C-performance, and A-cdc labels Nov 1, 2018
danhhz self-assigned this Nov 1, 2018
danhhz commented Nov 1, 2018

During the badness, the number of emitted messages and bytes squiggles a little, but not as much as other metrics. The amount of time spent emitting goes down, which I don't have an explanation for. (Ignore the drop, here and in other graphs, it's where I accidentally ended the test.)

[screenshot: 2018-11-01 12:35 PM]

However, the sum of the time it takes for all ExportRequests to return goes wild and varies quite a bit by node. Nothing on the other admin UI graphs immediately jumps out at me as correlating with this.

[screenshot: 2018-11-01 12:31 PM]

It appears that the fraction of local vs. non-local requests served by DistSender stays constant, though the number of requests varies more than it usually does.

[screenshot: 2018-11-01 12:37 PM]

danhhz commented Nov 1, 2018

At the same time as the badness, SQL and KV latencies are elevated, as are log commit, command commit, and Raft log queue latencies (though log commit latency seems to trail the start of the badness). The number of batches, RPCs, and average queries per store goes up. CPU and disk read bytes go up, but seem to be trailing indicators. There happens to be a small spike of splits, but it may not be related, because there are similarly sized spikes elsewhere that seem okay.

danhhz commented Nov 2, 2018

I added a histogram of the individual ExportRequest latencies and waited for a repro. The long tail spikes during the badness. It looks like 2 nodes were pseudo-healthy and 1 node had highly elevated p50 and pMax. It's unclear what the cause of the slowness is, but an elevated p50 suggests it's happening at the node/store level instead of the range level. We should definitely dig into the ExportRequest timings. Note that there is a limiter on the number of concurrently served ExportRequests, which means a couple of slow requests will back up all of them.

[screenshot: 2018-11-01 9:06 PM]
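For context, the limiter mentioned above behaves like a counting semaphore: there are a fixed number of slots, so a few slow requests hold slots and everything else queues behind them. A minimal sketch in Go (hypothetical names, not the actual CockroachDB limiter):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// exportLimiter caps how many ExportRequests are served at once. A buffered
// channel acts as a counting semaphore: acquiring a slot blocks once all
// slots are taken, which is why a couple of slow requests can back up every
// request queued behind them.
type exportLimiter struct {
	slots chan struct{}
}

func newExportLimiter(maxConcurrent int) *exportLimiter {
	return &exportLimiter{slots: make(chan struct{}, maxConcurrent)}
}

func (l *exportLimiter) acquire(ctx context.Context) error {
	select {
	case l.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l *exportLimiter) release() { <-l.slots }

func main() {
	lim := newExportLimiter(2) // pretend only two exports may run at once
	for i := 0; i < 4; i++ {
		i := i
		go func() {
			if err := lim.acquire(context.Background()); err != nil {
				return
			}
			defer lim.release()
			fmt.Printf("serving export %d\n", i)
			time.Sleep(100 * time.Millisecond) // stand-in for the actual export work
		}()
	}
	time.Sleep(time.Second) // crude wait for the demo goroutines
}
```

With a setup like this, a single 60s export occupies a slot for the full minute, so with a small slot count the queueing delay quickly dominates the end-to-end latency.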

We've seen this sort of long-tail ExportRequest latency in backup enough that I suspected it in tpcc-1000 at one point and filed #28660 with an idea for how to contain the damage. I'll prototype that and see if it has any effect.

tbg commented Nov 12, 2018

Are you seeing lots of snapshots? Just making sure you're not hitting something like #32046.

danhhz commented Nov 12, 2018

I looked at all the graphs; everything that seemed to line up, I noted in #32104 (comment). I'll be sure to look closely at snapshots the next time this happens.

So now there are two things to be done here. One is to look into the ExportRequest timings. The other is to be more resilient to a small number of ExportRequests being slow.

To address the latter, I prototyped #28660 and have it running on roachprod now. It seems to be doing okay, but there's been no badness in the ExportRequest durations yet, so that doesn't prove much beyond the basic validity of the idea. End-to-end latency is 4-5s instead of the normal 1-2s; I haven't looked into why yet. Currently deciding whether I should leave this overnight to see what happens or try to force the issue with some randomized sleeps.

danhhz commented Nov 13, 2018

Well, my overnight test isn't very interesting because somehow I ended up with two clusters and didn't notice until this morning. Nevertheless, here are the results: no badness. I still don't have a theory on why the floor moved up so much, but p50-p99 are tightly grouped.

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   4343.7   4295.0   4831.8   5100.3   5368.7   6174.0  36507.2

This morning, I ran a (short, 15m) test where a 5s sleep is added with a 1/10000 random chance. There are about 5000 ranges in tpcc-1000, so this means it should happen every 2s or so. This was intentionally picked so the sleeps would overlap a bit, but on different ranges. The results are quite encouraging. The pMax graph pegged at 5.1s, but the latency percentiles never really got over the 30s we always seem to see at startup (this 30s is probably because the time-bound iterator is less effective for the first bit; I'd guess it would go away if we warmed up tpcc). This suggests the #28660 prototype is indeed more resilient to ExportRequest latency long tails.

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   6873.7   5905.6   7247.8  10737.4  13958.6  27917.3  32212.3
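For reference, the randomized sleep described above can be injected with something like the following (a testing sketch with a hypothetical hook name, not the actual prototype code):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// maybeInjectSleep delays roughly 1 in oneIn calls by d. Wired into the
// ExportRequest path of a test build, it simulates an occasional slow request
// on a random range so we can see how well the changefeed tolerates latency
// long tails.
func maybeInjectSleep(oneIn int, d time.Duration) {
	if rand.Intn(oneIn) == 0 {
		fmt.Printf("injecting %s sleep\n", d)
		time.Sleep(d)
	}
}

func main() {
	// With ~5000 ranges polled roughly once a second, a 1/10000 chance means
	// a 5s sleep fires every couple of seconds, occasionally overlapping.
	for i := 0; i < 10000; i++ {
		maybeInjectSleep(10000, 5*time.Second)
	}
}
```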

danhhz commented Nov 13, 2018

Re-ran my experiment with 3 nodes this time. The floor is back to normal (so it must have been an issue with the accidental 1-node cluster). Percentiles look good modulo two spikes at the beginning (which tend to happen right after a restore and were actually even less pronounced than usual). I'm confident that if I'd let the test run longer, the p90 would have dropped to around where p75 is.

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  10638.5   5637.1   6174.0  24696.1  53687.1 107374.2 137439.0

One last thing to consider before polishing this and merging is that the table history blocking (table_metadata_nanos) went up considerably. It was essentially zero before.

[screenshot: 2018-11-13 9:41 AM]

danhhz commented Nov 14, 2018

Ran a couple more experiments last night, but I'm having trouble making sense of what happened. I started up tpcc-1000 with my #28660 prototype and it was acting super funny. This was right before bed and I didn't have time to look into it, so I just started up a fresh cluster with the test from scratch, also with the prototype build.

The good news is that both kept end-to-end latency under 10m overnight. They both had several periods of instability and were able to recover from them. The bad news is I still have no idea why. In fact, not a single metric I've come up with tracks how far behind the resolved timestamp is.

Additional bad news: after I shut the tests down, they both had tons of zombie changefeedccl goroutines that stick around forever. It's possible this is an issue with my prototype build, so I'll try reproducing it using master. Filed #32319.

We're still seeing elevated tail latencies in ExportRequests, but the number of requests being sent, number of bytes being received from ExportRequests, sum of walltime covered by the ExportRequest, and number of bytes being emitted to the sink are all stable while this is happening.

[screenshot: 2018-11-14 10:33 AM]

[screenshot: 2018-11-14 10:38 AM]


The odd test that I let run did some crazy stuff, and I don't even begin to know how to make sense of it. Everything in the changefeed graphs was pretty jagged (this included a bunch of the normal metrics, so I think tpcc was also quite unhappy during this period). Many hours into the test, suddenly a bunch of splits happened and the emitted bytes smoothed out, while the rest of the changefeed metrics went crazy.

What's seriously odd about this is it was making ~300 ExportRequests per second before this happened and ~3000 after. Ideally we're processing each range once per second, but the ranges went from ~4600 to ~6000, which doesn't match either of those numbers.

[screenshot: 2018-11-14 10:42 AM]

danhhz commented Nov 14, 2018

Okay, tonight's test will be the following. In ChangeFrontier, periodically print every span that's more than some amount of time behind. This should shed some light on whether it's one request taking a long time or many. And if the latter, it will tell whether everything is backed up as of some particular time or different spans are behind at different times.

Also, I'll add a log that prints whenever an ExportRequest takes longer than the polling period. This will validate the poll_request_nanos histogram (a lot of analysis depends on this max being exactly accurate; I've got the downsampler and aggregator both set to max, but who knows). It will also help track down what's happening with the slow spans being printed by the ChangeFrontier logging.
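A rough sketch of the kind of slow-span logging planned here (hypothetical types; the real ChangeFrontier tracks spans with an hlc-timestamp frontier, not wall-clock time):

```go
package main

import (
	"log"
	"time"
)

// trackedSpan pairs a key span with the most recent resolved timestamp the
// ChangeFrontier has seen for it.
type trackedSpan struct {
	start, end string
	resolved   time.Time
}

// logSlowSpans prints every span whose resolved timestamp lags now by more
// than threshold. Run periodically, this distinguishes one stuck span from
// everything being uniformly behind.
func logSlowSpans(spans []trackedSpan, now time.Time, threshold time.Duration) {
	for _, sp := range spans {
		if lag := now.Sub(sp.resolved); lag > threshold {
			log.Printf("span [%s,%s) is %s behind", sp.start, sp.end, lag)
		}
	}
}

func main() {
	now := time.Now()
	spans := []trackedSpan{
		{start: "/Table/56/1/961", end: "/Table/56/1/962", resolved: now.Add(-90 * time.Second)},
		{start: "/Table/56/1/962", end: "/Table/56/1/963", resolved: now.Add(-2 * time.Second)},
	}
	logSlowSpans(spans, now, time.Minute) // only the first span gets logged
}
```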

danhhz commented Nov 15, 2018

Progress! Summary: pretty sure we're hitting the same time-bound iterator issue as #28358.

Discovered the prototype is checkpointing the job's progress constantly. The polling strategy on master doesn't have this problem because it moves all the spans forward in lockstep, but the prototype was doing it O(ranges) times per second, which is nuts. I suspect the craziness I saw in the one test last night is because of this.

Most of my confusion debugging this has been trying to figure out how a handful of ExportRequests taking 10s of seconds could hold up the overall progress for ~10m. After some rubber ducking today, I double-checked the histogram and found out it's lying to me (sort of). The max latency trackable by metrics.NewLatency is 10s. I'd seen clipping at 10s but handwaved it away as the max transaction duration. Found this by logging slow ExportRequests.
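For illustration, here's why the clipped histogram hid the problem (a toy sketch; the real metric is a windowed latency histogram whose configured maximum is the 10s ceiling mentioned above):

```go
package main

import (
	"fmt"
	"time"
)

// recordLatency mimics what a histogram with a fixed maximum does to an
// observation above that maximum: it gets clamped, so a ~59s ExportRequest
// shows up as the histogram's 10s ceiling and pMax looks merely "slow".
func recordLatency(observed, histogramMax time.Duration) time.Duration {
	if observed > histogramMax {
		return histogramMax
	}
	return observed
}

func main() {
	histogramMax := 10 * time.Second
	for _, d := range []time.Duration{2 * time.Second, 59 * time.Second} {
		fmt.Printf("observed %s -> recorded %s\n", d, recordLatency(d, histogramMax))
	}
}
```

That clamping is why logging individual slow requests, rather than trusting the histogram's max, was what finally exposed the minute-long ExportRequests.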

On a hunch from @benesch, I added some logging to see if ExportRequests were getting stuck in a loop waiting for an intent to resolve, and the logs are pretty definitive. Looks like it eventually finishes when a compaction is run.

I181114 23:37:11.033598 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
I181114 23:37:11.060211 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
... thousands of lines omitted ...
I181114 23:38:10.258637 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
I181114 23:38:10.301715 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
I181114 23:38:10.304733 1213 ccl/changefeedccl/poller.go:541  [n2] finished ExportRequest [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3) took 59.397918718s

danhhz commented Nov 16, 2018

Went back to master (without the #28660 prototype) and added a workaround for the TBI bug that simply logs and skips over any unresolved intents in export.go. Normal contention with tpcc was triggering this, so I also moved the poller to do a 1-second read-behind. After an initial spike that happens at the beginning of every run (still curious what's happening there, but it could just be tpcc warmup), the only logs I see are from the TBI bug. There are huge runs of them, sometimes spanning multiple minutes, but they eventually go away (presumably compactions). Seems like this one's figured out, though we'll have to wait on benesch's permanent TBI fix. Running overnight tonight to see if we get any multi-minute lag.
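A minimal sketch of the two mitigations described above (hypothetical helper names and a toy key/value representation, not the actual export.go code): poll at a timestamp slightly in the past, and log-and-skip any keys the time-bound iterator claims are intents instead of erroring out.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// kv is a stand-in for what the export iteration sees; intent marks keys the
// (possibly buggy) time-bound iterator reports as unresolved intents.
type kv struct {
	key, value string
	intent     bool
}

// readBehindTimestamp has the poller issue ExportRequests as of a moment in
// the past, so ordinary transaction contention has usually resolved its
// intents by the time we read.
func readBehindTimestamp(now time.Time, behind time.Duration) time.Time {
	return now.Add(-behind)
}

// exportSpan emits every value, logging and skipping keys reported as intents
// instead of erroring and retrying until a compaction clears them.
func exportSpan(kvs []kv, emit func(key, value string)) {
	for _, e := range kvs {
		if e.intent {
			log.Printf("skipping unresolved intent at %s (possible TBI phantom)", e.key)
			continue
		}
		emit(e.key, e.value)
	}
}

func main() {
	_ = readBehindTimestamp(time.Now(), time.Second) // the 1-second read-behind
	kvs := []kv{
		{key: "/Table/56/1/961/5", value: "a"},
		{key: "/Table/56/1/962/2", intent: true}, // phantom intent from the TBI bug
		{key: "/Table/56/1/962/4", value: "b"},
	}
	exportSpan(kvs, func(k, v string) { fmt.Println(k, v) })
}
```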

danhhz commented Nov 16, 2018

Wow, the overnight run was a complete success. After running for ~8h:

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   2647.8   2684.4   2952.8   3087.0   3221.2   3221.2   5637.1

Sounds like once the TBI bug is resolved for ExportRequest we can call this closed.

danhhz added a commit to danhhz/cockroach that referenced this issue Nov 20, 2018
I've been running a branch with a bunch of extra metrics while debugging
issue cockroachdb#32104. PR'ing the ones that have proven useful in case someone
has performance issues in the wild.

Release note (enterprise change): Added timeseries metrics for debugging
CHANGEFEED performance issues.
craig bot pushed a commit that referenced this issue Nov 28, 2018
32241: changefeedccl: add metrics for performance debugging r=mrtracy a=danhhz
danhhz commented Dec 3, 2018

The TBI bug is being tracked in #32799.

danhhz added a commit to danhhz/cockroach that referenced this issue Dec 5, 2018
danhhz added a commit to danhhz/cockroach that referenced this issue Dec 11, 2018
For reasons described in cockroachdb#28358, a time-bound iterator will sometimes
see an unresolved intent where there is none. A normal iterator doesn't
have this problem, so we work around it in MVCCIncrementalIterator by
double checking all non-value keys. If the normal iterator has a
different value for the key, it's used instead. If the normal iterator
doesn't have that key, it's skipped.

This fixes both changefeeds and incremental backup.

Closes cockroachdb#32104
Closes cockroachdb#32799

Release note (bug fix): `CHANGEFEED`s and incremental `BACKUP`s no
longer indefinitely hang under an infrequent condition.
danhhz added a commit to danhhz/cockroach that referenced this issue Dec 13, 2018
craig bot pushed a commit that referenced this issue Dec 13, 2018
32909: engineccl/mvcc: work around time-bound iterator bug r=bdarnell,benesch a=danhhz
craig bot closed this as completed in #32909 Dec 13, 2018
danhhz added a commit to danhhz/cockroach that referenced this issue Dec 13, 2018