
distsql: saturating disk IOPS leads to liveness failures and canceled contexts #15332

Closed
asubiotto opened this issue Apr 25, 2017 · 13 comments

@asubiotto (Contributor)

Running

SELECT * FROM tpch.lineitem ORDER BY l_extendedprice

with 1227734 on navy fails with

Error: pq: [n1] communication error: rpc error: code = Canceled desc = context canceled

As far as I can see, the logs don't really shed light on the reason for the cancellation and I'm not sure where to start looking.

Subsequent runs sometimes end with the same failure or with an OOM. I think this is caused by aggregating the result set before outputting it, and it seems to be the same sort of memory issue we're seeing with other queries. Still, it's unfortunate that this crash happens when sorting 2.5GiB of data.

[screenshot taken 2017-04-25 2:23 PM]

@RaduBerinde (Member)

I would enable traces+lightstep and look at the traces in lightstep. Also #15323 should reduce memory usage of this query (or at least error out instead of causing OOM).

@asubiotto (Contributor, Author)

#15323 errors this query out and the errors haven't recurred. It might be worth investigating the reason behind the context cancellation further, but not necessarily for 1.0. Moving to 1.1.

@cuongdo (Contributor) commented Aug 22, 2017

@asubiotto are we taking action as part of 1.1?

@rjnn (Contributor) commented Aug 22, 2017

This should be fixed by the better memory accounting as well as by spilling to external storage. We need to attempt a repro, and then close this issue if it's fixed.

@rjnn (Contributor) commented Aug 23, 2017

I am trying to reproduce this on cluster navy, but the query is not finishing, and the logs aren't showing anything obvious. The query just seems to take a while and then vanish, returning no output. Very puzzling.

@asubiotto (Contributor, Author)

I've reproduced this on navy and will look into it.

@asubiotto (Contributor, Author)

tl;dr: node liveness updates slow way down due to disk throttling on Azure caused by our external storage disk usage, and this causes badness (and somehow leads to a context cancellation). @vivekmenezes mentioned some fixes to node liveness projected for 1.2 which may or may not fix the underlying context cancellation. I think we should proceed by documenting this as a known limitation for 1.1; the node liveness issues will hopefully be fixed for 1.2 (if not, we can introduce write throttling, but that does not fix the underlying issue).

I've been running SELECT * FROM tpch.lineitem ORDER BY l_extendedprice and noticed that a bunch of operations were slowing down to a couple of seconds (raft commits, liveness updates, handle raft ready). Ben suggested that this could be happening because of write throttling on Azure triggered by our disk usage in external storage, and that these slow updates could somehow lead to context cancellation.

I inserted a sleep in the external storage code to slow down the frequency of writes, and the query then completed (though it ran into an unrelated OOM on the client side, see #18329). So throttling writes lets the query complete, but it does not fix the underlying issue; a sketch of the experiment is below. The next step in the investigation should be to figure out why slow liveness updates result in context cancellation. Making node liveness updates more resilient is part of the 1.2 work, so this investigation could form part of that effort, which would mean we don't have to do anything on our end. However, if the issue does not get fixed and we need this to work, we could introduce write throttling. I see this as more of a band-aid than a cure, though. We might also want external storage write throttling in any case to avoid affecting the rest of the system.
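For reference, the experiment amounted to pacing the temp-store writes, roughly along these lines (a minimal sketch; `paceWrites`, the chunk size, and the 10ms pause are illustrative assumptions, not the actual RocksDB temp-store code):

```go
package tempstore

import (
	"os"
	"time"
)

// paceWrites writes data to the temp-store file in fixed-size chunks and
// sleeps between chunks. This is the crude pacing used in the experiment:
// it spreads the IO out over time so liveness writes can get through, but
// it does not address the underlying starvation.
func paceWrites(f *os.File, data []byte) error {
	const chunkSize = 1 << 20           // 1MiB per write (illustrative)
	const pause = 10 * time.Millisecond // sleep between writes (illustrative)

	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := f.Write(data[:n]); err != nil {
			return err
		}
		data = data[n:]
		time.Sleep(pause) // throttle: caps the write rate, band-aid only
	}
	return nil
}
```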

Moving the milestone to 1.2.

@asubiotto asubiotto modified the milestones: 1.2, 1.1 Sep 7, 2017
@asubiotto asubiotto changed the title distsql: sorting TPCH lineitem (2.5GiB) failure due to canceled context (and sometimes OOM) distsql: write throttling on azure leads to liveness failures and canceled context Sep 19, 2017
@maddyblue (Contributor)

Seeing exactly this when doing large CSV imports too, which use the temp store heavily.

@maddyblue (Contributor)

Backup/restore ran into these issues a lot too, and various throttles were added to prevent this problem. I think it's worth looking into those techniques for 1.1, because otherwise CSV importing and some distsql queries could break the cluster. A sketch of that kind of throttle follows.
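As a sketch of the kind of bandwidth throttle this implies, a token-bucket limiter wrapped around the temp-store write path might look like this (assumptions: the names, the 16MiB/s budget, and `limitedWrite` are illustrative, not CockroachDB's actual backup/restore or temp-store code; golang.org/x/time/rate is a standard rate-limiter package):

```go
package tempstore

import (
	"context"

	"golang.org/x/time/rate"
)

// writeLimiter caps temp-store write bandwidth so bulk work (CSV import,
// distsql spilling to disk) cannot consume every available IOPS.
// 16MiB/s with a 1MiB burst is an arbitrary illustrative budget.
var writeLimiter = rate.NewLimiter(rate.Limit(16<<20), 1<<20)

// limitedWrite performs a write in chunks, blocking on the limiter before
// each chunk so the overall write rate stays under the budget.
func limitedWrite(ctx context.Context, p []byte, write func([]byte) error) error {
	const maxChunk = 1 << 20 // must not exceed the limiter's burst
	for len(p) > 0 {
		n := maxChunk
		if n > len(p) {
			n = len(p)
		}
		// WaitN blocks until n tokens are available or ctx is canceled.
		if err := writeLimiter.WaitN(ctx, n); err != nil {
			return err
		}
		if err := write(p[:n]); err != nil {
			return err
		}
		p = p[n:]
	}
	return nil
}
```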

@jordanlewis (Member)

Let's try to reproduce some of this badness on a roachprod cluster - we should close it if it was an Azure-specific performance issue.

@rjnn (Contributor) commented Mar 5, 2018

@jordanlewis This isn't Azure-specific, but I don't think this is something we see anymore. It's essentially that when we saturate the VM's IOPS with query processing work, we starve the node liveness system of IO, which causes badness.
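Roughly, the mechanism is something like the following (a minimal sketch of the failure mode, not CockroachDB's actual liveness code; the function names and the 4.5s deadline are assumptions):

```go
package liveness

import (
	"context"
	"time"
)

// heartbeat illustrates the failure mode: the liveness record update must be
// durably written, but on a disk saturated by query spilling the write can
// outlive the heartbeat's deadline. The heartbeat then fails, the node's
// liveness lapses, and in-flight RPCs against that node see their contexts
// canceled.
func heartbeat(ctx context.Context, syncLivenessRecord func(context.Context) error) error {
	// A short deadline is the point: liveness must be refreshed promptly.
	ctx, cancel := context.WithTimeout(ctx, 4500*time.Millisecond)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- syncLivenessRecord(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// Disk-saturated case: the write outlived the deadline.
		return ctx.Err() // context.DeadlineExceeded
	}
}
```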

We've also made great strides in understanding and rate limiting the temp store (thanks to @mjibson), which DistSQL queries also benefit from. Thus, this isn't something we see anymore. I'm closing this issue, as #18765 has fixed things sufficiently.

@rjnn rjnn closed this as completed Mar 5, 2018
@a-robinson (Contributor)

@arjunravinarayan it's great to hear that this isn't something we see anymore, but #18765 was never merged. Was any other rate limiting specifically put in place? Or were the few disk syncing PRs the only relevant changes?

@maddyblue (Contributor)

Also note I'm still seeing this error in #22924. The exact source may be different (i.e., not identical to the title of this issue), but it does appear there are still issues related to distsql, disk writes, node liveness, and context cancellation.

@rjnn rjnn changed the title distsql: write throttling on azure leads to liveness failures and canceled context distsql: saturating disk IOPS leads to liveness failures and canceled contexts Mar 5, 2018