
changefeedccl: occasional inconsistent performance on long-running tpcc-1000 #32104

Closed
danhhz opened this issue Nov 1, 2018 · 13 comments · Fixed by #32909
Assignees: danhhz
Labels: A-cdc (Change Data Capture), C-investigation (Further steps needed to qualify. C-label will change.), C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)

Comments

danhhz commented Nov 1, 2018

If the cdc/w=1000/nodes=3/init=false roachtest is altered to run indefinitely instead of 2h, it will eventually see inconsistent performance.

Usually the latest changefeed-level resolved timestamp lags real time by 1-2s. Occasionally it will hit some badness and spike up to single-digit minutes. This can go on for several hours. I keep accidentally cutting the test short, so it's unclear if it ever recovers on its own.

My first guess is that some range activity (splits, rebalances) is happening and causing the ExportRequests to get slower.

danhhz added the C-investigation, C-performance, and A-cdc labels Nov 1, 2018
danhhz self-assigned this Nov 1, 2018
danhhz commented Nov 1, 2018

During the badness, the number of emitted messages and bytes squiggles a little, but not as much as other metrics. The amount of time spent emitting goes down, which I don't have an explanation for. (Ignore the drop, here and in other graphs, it's where I accidentally ended the test.)

[screenshot: 2018-11-01 12:35 PM]

However, the sum of the time it takes for all ExportRequests to return goes wild and varies quite a bit by node. Nothing on the other admin UI graphs immediately jumps out at me as correlating with this.

[screenshot: 2018-11-01 12:31 PM]

It appears that the fraction of local vs. non-local requests served by DistSender stays constant, though the number of requests varies more than it usually does.

[screenshot: 2018-11-01 12:37 PM]

danhhz commented Nov 1, 2018

At the same time as the badness, SQL and KV latencies are elevated, as are log commit, command commit, and Raft log queue latencies (though log commit latency seems to trail the start of the badness). The number of batches, RPCs, and average queries per store goes up. CPU and disk read bytes go up, but seem to be trailing indicators. There happens to be a small spike of splits, but it may not be related, because there are similarly sized spikes elsewhere that seem okay.

danhhz commented Nov 2, 2018

I added a histogram of the individual ExportRequest latencies and waited for a repro. The long tail spikes during the badness. It looks like 2 nodes were pseudo-healthy and 1 node had highly elevated p50 and pMax. It's unclear what the cause of the slowness is, but an elevated p50 suggests it's happening at the node/store level instead of the range level. We should definitely dig into the ExportRequest timings. Note that there is a limiter on the number of concurrently served ExportRequests, which means a couple of slow requests will back up all of them.

[screenshot: 2018-11-01 9:06 PM]
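For context, the limiter mentioned above behaves like a counting semaphore: there are a fixed number of slots, so a few slow requests hold slots and everything else queues behind them. A minimal sketch in Go (hypothetical names, not the actual CockroachDB limiter):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// exportLimiter caps how many ExportRequests are served at once. A buffered
// channel acts as a counting semaphore: acquiring a slot blocks once all
// slots are taken, which is why a couple of slow requests can back up every
// request queued behind them.
type exportLimiter struct {
	slots chan struct{}
}

func newExportLimiter(maxConcurrent int) *exportLimiter {
	return &exportLimiter{slots: make(chan struct{}, maxConcurrent)}
}

func (l *exportLimiter) acquire(ctx context.Context) error {
	select {
	case l.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l *exportLimiter) release() { <-l.slots }

func main() {
	lim := newExportLimiter(2) // pretend only two exports may run at once
	for i := 0; i < 4; i++ {
		i := i
		go func() {
			if err := lim.acquire(context.Background()); err != nil {
				return
			}
			defer lim.release()
			fmt.Printf("serving export %d\n", i)
			time.Sleep(100 * time.Millisecond) // stand-in for the actual export work
		}()
	}
	time.Sleep(time.Second) // crude wait for the demo goroutines
}
```

With a setup like this, a single 60s export occupies a slot for the full minute, so with a small slot count the queueing delay quickly dominates the end-to-end latency.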

We've seen this sort of long-tail ExportRequest latency in backup enough that I suspected it in tpcc-1000 at one point and filed #28660 with an idea for how to contain the damage. I'll prototype that and see if it has any effect.

tbg commented Nov 12, 2018

Are you seeing lots of snapshots? Just making sure you're not hitting something like #32046.

danhhz commented Nov 12, 2018

I looked at all the graphs; everything that seemed to line up, I noted in #32104 (comment). I'll be sure to look closely at snapshots the next time this happens.

So now there are two things to be done here. One is to look into the ExportRequest timings. The other is to be more resilient to a small number of ExportRequests being slow.

To address the latter, I prototyped #28660 and have it running on roachprod now. It seems to be doing okay, but there's been no badness in the ExportRequest durations yet, so that doesn't prove much beyond the basic validity of the idea. End-to-end latency is 4-5s instead of the normal 1-2s; I haven't looked into why yet. Currently deciding whether I should leave this overnight to see what happens or try to force the issue with some randomized sleeps.

danhhz commented Nov 13, 2018

Well, my overnight test isn't very interesting because somehow I ended up with two clusters and didn't notice until this morning. Nevertheless, here are the results: no badness. I still don't have a theory on why the floor moved up so much, but p50-p99 are tightly grouped.

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   4343.7   4295.0   4831.8   5100.3   5368.7   6174.0  36507.2

This morning, I ran a (short, 15m) test where a 5s sleep is added with a 1/10000 random chance. There are about 5000 ranges in tpcc-1000, so this means it should happen every 2s or so. This was intentionally picked so the sleeps would overlap a bit, but on different ranges. The results are quite encouraging. The pMax graph pegged at 5.1s, but the latency percentiles never really got over the 30s we always seem to see at startup (this 30s is probably because the time-bound iterator is less effective for the first bit; I'd guess it would go away if we warmed up tpcc). This suggests the #28660 prototype is indeed more resilient to ExportRequest latency long tails.

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   6873.7   5905.6   7247.8  10737.4  13958.6  27917.3  32212.3
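For reference, the randomized sleep described above can be injected with something like the following (a testing sketch with a hypothetical hook name, not the actual prototype code):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// maybeInjectSleep delays roughly 1 in oneIn calls by d. Wired into the
// ExportRequest path of a test build, it simulates an occasional slow request
// on a random range so we can see how well the changefeed tolerates latency
// long tails.
func maybeInjectSleep(oneIn int, d time.Duration) {
	if rand.Intn(oneIn) == 0 {
		fmt.Printf("injecting %s sleep\n", d)
		time.Sleep(d)
	}
}

func main() {
	// With ~5000 ranges polled roughly once a second, a 1/10000 chance means
	// a 5s sleep fires every couple of seconds, occasionally overlapping.
	for i := 0; i < 10000; i++ {
		maybeInjectSleep(10000, 5*time.Second)
	}
}
```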

danhhz commented Nov 13, 2018

Re-ran my experiment with 3 nodes this time. The floor is back to normal (so it must have been an issue with the accidental 1-node cluster). Percentiles look good modulo two spikes at the beginning (which tend to happen right after a restore and were actually even less pronounced than usual). I'm confident that if I'd let the test run longer, the p90 would have dropped to around where p75 is.

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  10638.5   5637.1   6174.0  24696.1  53687.1 107374.2 137439.0

One last thing to consider before polishing this and merging is that the table history blocking (table_metadata_nanos) went up considerably. It was essentially zero before.

[screenshot: 2018-11-13 9:41 AM]

danhhz commented Nov 14, 2018

Ran a couple more experiments last night, but I'm having trouble making sense of what happened. I started up tpcc-1000 with my #28660 prototype and it was acting super funny. This was right before bed and I didn't have time to look into it, so I just started up a fresh cluster with the test from scratch, also with the prototype build.

The good news is that both kept end-to-end latency under 10m overnight. They both had several periods of instability and were able to recover from them. The bad news is I still have no idea why. In fact, not a single metric I've come up with tracks how far behind the resolved timestamp is.

Additional bad news: after I shut the tests down, they both had tons of zombie changefeedccl goroutines that stick around forever. It's possible this is an issue with my prototype build, so I'll try reproducing it using master. Filed #32319.

We're still seeing elevated tail latencies in ExportRequests, but the number of requests being sent, number of bytes being received from ExportRequests, sum of walltime covered by the ExportRequest, and number of bytes being emitted to the sink are all stable while this is happening.

[screenshot: 2018-11-14 10:33 AM]

[screenshot: 2018-11-14 10:38 AM]


The odd test that I let run did some crazy stuff, and I don't even begin to know how to make sense of it. Everything in the changefeed graphs was pretty jagged (this included a bunch of the normal metrics, so I think tpcc was also quite unhappy during this period). Many hours into the test, suddenly a bunch of splits happened and the emitted bytes smoothed out, while the rest of the changefeed metrics went crazy.

What's seriously odd about this is it was making ~300 ExportRequests per second before this happened and ~3000 after. Ideally we're processing each range once per second, but the ranges went from ~4600 to ~6000, which doesn't match either of those numbers.

[screenshot: 2018-11-14 10:42 AM]

danhhz commented Nov 14, 2018

Okay, tonight's test will be the following. In ChangeFrontier, periodically print every span that's more than some amount of time behind. This should shed some light on whether it's one request taking a long time or many. And if the latter, it will tell whether everything is backed up as of some particular time or different spans are behind at different times.

Also, I'll add a log that prints whenever an ExportRequest takes longer than the polling period. This will validate the poll_request_nanos histogram (a lot of analysis depends on this max being exactly accurate; I've got the downsampler and aggregator both set to max, but who knows). It will also help track down what's happening with the slow spans being printed by the ChangeFrontier logging.
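A rough sketch of the kind of slow-span logging planned here (hypothetical types; the real ChangeFrontier tracks spans with an hlc-timestamp frontier, not wall-clock time):

```go
package main

import (
	"log"
	"time"
)

// trackedSpan pairs a key span with the most recent resolved timestamp the
// ChangeFrontier has seen for it.
type trackedSpan struct {
	start, end string
	resolved   time.Time
}

// logSlowSpans prints every span whose resolved timestamp lags now by more
// than threshold. Run periodically, this distinguishes one stuck span from
// everything being uniformly behind.
func logSlowSpans(spans []trackedSpan, now time.Time, threshold time.Duration) {
	for _, sp := range spans {
		if lag := now.Sub(sp.resolved); lag > threshold {
			log.Printf("span [%s,%s) is %s behind", sp.start, sp.end, lag)
		}
	}
}

func main() {
	now := time.Now()
	spans := []trackedSpan{
		{start: "/Table/56/1/961", end: "/Table/56/1/962", resolved: now.Add(-90 * time.Second)},
		{start: "/Table/56/1/962", end: "/Table/56/1/963", resolved: now.Add(-2 * time.Second)},
	}
	logSlowSpans(spans, now, time.Minute) // only the first span gets logged
}
```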

danhhz commented Nov 15, 2018

Progress! Summary: pretty sure we're hitting the same time-bound iterator issue as #28358.

Discovered the prototype is checkpointing the job's progress constantly. The polling strategy on master doesn't have this problem because it moves all the spans forward in lockstep, but the prototype was doing it O(ranges) times per second, which is nuts. I suspect the craziness I saw in the one test last night is because of this.

Most of my confusion debugging this has been trying to figure out how a handful of ExportRequests taking 10s of seconds could hold up the overall progress for ~10m. After some rubber ducking today, I double-checked the histogram and found out it's lying to me (sort of). The max latency trackable by metrics.NewLatency is 10s. I'd seen clipping at 10s but handwaved it away as the max transaction duration. Found this by logging slow ExportRequests.
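For illustration, here's why the clipped histogram hid the problem (a toy sketch; the real metric is a windowed latency histogram whose configured maximum is the 10s ceiling mentioned above):

```go
package main

import (
	"fmt"
	"time"
)

// recordLatency mimics what a histogram with a fixed maximum does to an
// observation above that maximum: it gets clamped, so a ~59s ExportRequest
// shows up as the histogram's 10s ceiling and pMax looks merely "slow".
func recordLatency(observed, histogramMax time.Duration) time.Duration {
	if observed > histogramMax {
		return histogramMax
	}
	return observed
}

func main() {
	histogramMax := 10 * time.Second
	for _, d := range []time.Duration{2 * time.Second, 59 * time.Second} {
		fmt.Printf("observed %s -> recorded %s\n", d, recordLatency(d, histogramMax))
	}
}
```

That clamping is why logging individual slow requests, rather than trusting the histogram's max, was what finally exposed the minute-long ExportRequests.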

On a hunch from @benesch, I added some logging to see if ExportRequests were getting stuck in a loop waiting for an intent to resolve, and the logs are pretty definitive. Looks like it eventually finishes when a compaction is run.

I181114 23:37:11.033598 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
I181114 23:37:11.060211 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
... thousands of lines omitted ...
I181114 23:38:10.258637 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
I181114 23:38:10.301715 1213 ccl/storageccl/export.go:173  [n2,s2,r7374/2:/Table/56/1/96{1/5/-…-2/4/-…}] error exporting [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3): conflicting intents on /Table/56/1/962/2/-4300/1/0
I181114 23:38:10.304733 1213 ccl/changefeedccl/poller.go:541  [n2] finished ExportRequest [/Table/56/1/961/5/-2110/14,/Table/56/1/962/4/-730/3) took 59.397918718s

danhhz commented Nov 16, 2018

Went back to master (without the #28660 prototype) and added a workaround for the TBI bug that simply logs and skips over any unresolved intents in export.go. Normal contention with tpcc was triggering this, so I also moved the poller to do a 1-second read-behind. After an initial spike that happens at the beginning of every run (still curious what's happening there, but it could just be tpcc warmup), the only logs I see are from the TBI bug. There are huge runs of them, sometimes spanning multiple minutes, but they eventually go away (presumably compactions). Seems like this one's figured out, though we'll have to wait on benesch's permanent TBI fix. Running overnight tonight to see if we get any multi-minute lag.
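A minimal sketch of the two mitigations described above (hypothetical helper names and a toy key/value representation, not the actual export.go code): poll at a timestamp slightly in the past, and log-and-skip any keys the time-bound iterator claims are intents instead of erroring out.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// kv is a stand-in for what the export iteration sees; intent marks keys the
// (possibly buggy) time-bound iterator reports as unresolved intents.
type kv struct {
	key, value string
	intent     bool
}

// readBehindTimestamp has the poller issue ExportRequests as of a moment in
// the past, so ordinary transaction contention has usually resolved its
// intents by the time we read.
func readBehindTimestamp(now time.Time, behind time.Duration) time.Time {
	return now.Add(-behind)
}

// exportSpan emits every value, logging and skipping keys reported as intents
// instead of erroring and retrying until a compaction clears them.
func exportSpan(kvs []kv, emit func(key, value string)) {
	for _, e := range kvs {
		if e.intent {
			log.Printf("skipping unresolved intent at %s (possible TBI phantom)", e.key)
			continue
		}
		emit(e.key, e.value)
	}
}

func main() {
	_ = readBehindTimestamp(time.Now(), time.Second) // the 1-second read-behind
	kvs := []kv{
		{key: "/Table/56/1/961/5", value: "a"},
		{key: "/Table/56/1/962/2", intent: true}, // phantom intent from the TBI bug
		{key: "/Table/56/1/962/4", value: "b"},
	}
	exportSpan(kvs, func(k, v string) { fmt.Println(k, v) })
}
```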

danhhz commented Nov 16, 2018

Wow, the overnight run was a complete success. After running for ~8h:

__avg(ms)__p50(ms)__p75(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   2647.8   2684.4   2952.8   3087.0   3221.2   3221.2   5637.1

Sounds like once the TBI bug is resolved for ExportRequest we can call this closed.

danhhz added a commit to danhhz/cockroach that referenced this issue Nov 20, 2018
I've been running a branch with a bunch of extra metrics while debugging
issue cockroachdb#32104. PR'ing the ones that have proven useful in case someone
has performance issues in the wild.

Release note (enterprise change): Added timeseries metrics for debugging
CHANGEFEED performance issues.
craig bot pushed a commit that referenced this issue Nov 28, 2018
32241: changefeedccl: add metrics for performance debugging r=mrtracy a=danhhz
danhhz commented Dec 3, 2018

The TBI bug is being tracked in #32799.

danhhz added a commit to danhhz/cockroach that referenced this issue Dec 5, 2018
danhhz added a commit to danhhz/cockroach that referenced this issue Dec 11, 2018
For reasons described in cockroachdb#28358, a time-bound iterator will sometimes
see an unresolved intent where there is none. A normal iterator doesn't
have this problem, so we work around it in MVCCIncrementalIterator by
double checking all non-value keys. If the normal iterator has a
different value for the key, it's used instead. If the normal iterator
doesn't have that key, it's skipped.

This fixes both changefeeds and incremental backup.

Closes cockroachdb#32104
Closes cockroachdb#32799

Release note (bug fix): `CHANGEFEED`s and incremental `BACKUP`s no
longer indefinitely hang under an infrequent condition.
danhhz added a commit to danhhz/cockroach that referenced this issue Dec 13, 2018
craig bot pushed a commit that referenced this issue Dec 13, 2018
32909: engineccl/mvcc: work around time-bound iterator bug r=bdarnell,benesch a=danhhz
craig bot closed this as completed in #32909 Dec 13, 2018
danhhz added a commit to danhhz/cockroach that referenced this issue Dec 13, 2018