distsql: saturating disk IOPS leads to liveness failures and canceled contexts #15332
Comments
I would enable traces + Lightstep and look at the traces in Lightstep. Also, #15323 should reduce the memory usage of this query (or at least error out instead of causing an OOM).
#15323 errors this query out and the errors haven't recurred. It might be worth investigating the reason behind the context cancellation further, but not necessarily for 1.0. Moving to 1.1.
@asubiotto are we taking action as part of 1.1?
This should be fixed with the better memory accounting as well as spilling to external storage. We need to attempt a repro, and then close this issue if it's fixed.
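For readers unfamiliar with the "memory accounting plus spill to external storage" idea mentioned above, here is a minimal, hypothetical Go sketch of the concept: rows are buffered in memory against a byte budget, and once the budget would be exceeded the buffer is flushed to a temp file on disk. The `spillingBuffer` type and its fields are illustrative only, not CockroachDB's actual row-container implementation.

```go
package main

import (
	"fmt"
	"os"
)

// spillingBuffer holds rows in memory up to a byte budget and spills the
// in-memory contents to a temp file on disk once the budget is exceeded.
type spillingBuffer struct {
	budget int64    // max bytes to hold in memory
	used   int64    // bytes currently accounted for in memory
	rows   [][]byte // in-memory buffer
	spill  *os.File // external storage, created lazily on first spill
}

func (b *spillingBuffer) Add(row []byte) error {
	if b.used+int64(len(row)) > b.budget {
		if err := b.spillToDisk(); err != nil {
			return err
		}
	}
	b.rows = append(b.rows, row)
	b.used += int64(len(row))
	return nil
}

func (b *spillingBuffer) spillToDisk() error {
	if b.spill == nil {
		f, err := os.CreateTemp("", "tempstore-*")
		if err != nil {
			return err
		}
		b.spill = f
	}
	for _, r := range b.rows {
		if _, err := b.spill.Write(r); err != nil {
			return err
		}
	}
	b.rows, b.used = nil, 0 // the memory budget is released after the spill
	return nil
}

func main() {
	b := &spillingBuffer{budget: 4 << 10} // 4 KiB in-memory budget
	for i := 0; i < 100; i++ {
		if err := b.Add(make([]byte, 128)); err != nil {
			panic(err)
		}
	}
	fmt.Println("rows still in memory:", len(b.rows))
}
```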
I am trying to reproduce this on cluster
I've reproduced this on
tl;dr: node liveness updates slow way down due to disk throttling on Azure caused by our external storage disk usage, and this causes badness (and somehow leads to a context cancellation). @vivekmenezes mentioned some fixes to node liveness projected for 1.2 which might or might not fix the underlying context cancellation. I think we proceed by documenting this as a known limitation for 1.1; the node liveness issues will hopefully be fixed for 1.2 (if not, we can introduce write throttling, but this does not fix the underlying issue).

I've been running the query and inserted a sleep in the external storage code to slow down the frequency of writes. This resulted in the query completing (though it ran into an unrelated OOM on the client side, see #18329), so it seems that throttling writes lets the query complete, but this does not fix the underlying issue.

The next step in the investigation should be to figure out why slow liveness updates result in context cancellation. Making node liveness updates more resilient is part of the 1.2 work, so this investigation could form part of that effort. This would mean that we don't have to do anything on our end. However, if the issue does not get fixed and we need this to work, we could introduce write throttling. I see this more as a band-aid than a cure, though. We might also want external storage write throttling to avoid affecting the system. Moving the milestone to 1.2.
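A minimal sketch of the sleep-based write throttling experiment described above, assuming a hypothetical `throttledWriter` wrapper around the temp-store flush path. The names here are illustrative and not CockroachDB's actual external-storage API; the point is simply that a fixed delay between flushes keeps bursts of temp-store writes from saturating the VM's IOPS budget.

```go
package main

import (
	"fmt"
	"time"
)

// throttledWriter delays each write by a fixed interval so that bursts of
// temp-store flushes don't saturate the VM's IOPS budget.
type throttledWriter struct {
	delay time.Duration
	write func(batch []byte) error // underlying disk write
}

func (w *throttledWriter) Write(batch []byte) error {
	time.Sleep(w.delay) // crude back-pressure: at most one write per delay interval
	return w.write(batch)
}

func main() {
	w := &throttledWriter{
		delay: 10 * time.Millisecond,
		write: func(batch []byte) error {
			fmt.Printf("flushed %d bytes\n", len(batch))
			return nil
		},
	}
	for i := 0; i < 5; i++ {
		if err := w.Write(make([]byte, 1<<20)); err != nil {
			panic(err)
		}
	}
}
```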
Seeing exactly this when doing large CSV imports, too, which use the temp store heavily.
Backup/Restore ran into these issues a lot too, and various throttles were added to prevent this problem. I think it's worth looking into those techniques for 1.1, because otherwise CSV importing and some distsql queries could break the cluster.
Let's try to reproduce some of this badness on a
@jordanlewis This isn't Azure-specific, but this isn't something we see anymore. It's essentially that when we saturate the VM's IOPS with query processing work, we starve the node liveness system of IO, which causes badness. We've also made great strides in understanding and rate limiting the tempstore (thanks to @mjibson), which DistSQL queries also benefit from. Thus, this isn't something we see anymore. I'm closing this issue as #18765 has fixed things sufficiently.
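For illustration of the "rate limiting the tempstore" idea mentioned above, here is a hypothetical Go sketch that caps temp-store write throughput with a token bucket (`golang.org/x/time/rate`), leaving IOPS headroom for node liveness. This is an assumed design sketch, not CockroachDB's actual code; `limitedWriter` and the chosen byte rates are made up for the example.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// limitedWriter blocks until the limiter grants enough byte tokens before
// issuing the underlying write.
type limitedWriter struct {
	limiter *rate.Limiter
	write   func(p []byte) (int, error)
}

func (w *limitedWriter) Write(ctx context.Context, p []byte) (int, error) {
	// WaitN blocks until len(p) byte tokens are available (or ctx is canceled).
	if err := w.limiter.WaitN(ctx, len(p)); err != nil {
		return 0, err
	}
	return w.write(p)
}

func main() {
	// Allow roughly 32 MiB/s of temp-store writes with bursts up to 1 MiB.
	lim := rate.NewLimiter(rate.Limit(32<<20), 1<<20)
	w := &limitedWriter{
		limiter: lim,
		write: func(p []byte) (int, error) {
			fmt.Printf("wrote %d bytes\n", len(p))
			return len(p), nil
		},
	}
	for i := 0; i < 3; i++ {
		if _, err := w.Write(context.Background(), make([]byte, 1<<20)); err != nil {
			panic(err)
		}
	}
}
```

Unlike a fixed sleep, a token bucket adapts to the actual size of each write, so small writes are barely delayed while large flushes pay proportionally.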
@arjunravinarayan it's great to hear that this isn't something we see anymore, but #18765 was never merged. Was any other rate limiting specifically put in place? Or were the few disk-syncing PRs the only relevant changes?
Also note I'm still seeing this error in #22924. The exact source may be different (i.e., not identical to the title of this issue), but it does appear there are still issues related to distsql, disk writes, node liveness, and context cancellation.
Running the query with 1227734 on `navy` fails with a context canceled error. As far as I can see, the logs don't really shed light on the reason for the cancellation and I'm not sure where to start looking.
Subsequent runs sometimes end with the same failure or with an OOM. I think this is caused by aggregating the result set before outputting it, and it seems to be the same sort of memory issue we're seeing with other queries. Still, it's unfortunate that this crash happens when sorting 2.5 GiB of data.