distsql: saturating disk IOPS leads to liveness failures and canceled contexts #15332
Comments
I would enable traces + Lightstep and look at the traces in Lightstep. Also, #15323 should reduce the memory usage of this query (or at least error out instead of causing an OOM).
#15323 errors this query out and the errors haven't recurred. It might be worth investigating the reason behind the context cancellation further, but not necessarily for 1.0. Moving to 1.1.
@asubiotto are we taking action as part of 1.1?
This should be fixed with the better memory accounting as well as spilling to external storage. We need to attempt a repro, and then close this issue if it's fixed.
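For readers unfamiliar with the "memory accounting plus spill to external storage" idea mentioned above, here is a minimal, hypothetical Go sketch of the concept: rows are buffered in memory against a byte budget, and once the budget would be exceeded the buffer is flushed to a temp file on disk. The `spillingBuffer` type and its fields are illustrative only, not CockroachDB's actual row-container implementation.

```go
package main

import (
	"fmt"
	"os"
)

// spillingBuffer holds rows in memory up to a byte budget and spills the
// in-memory contents to a temp file on disk once the budget is exceeded.
type spillingBuffer struct {
	budget int64    // max bytes to hold in memory
	used   int64    // bytes currently accounted for in memory
	rows   [][]byte // in-memory buffer
	spill  *os.File // external storage, created lazily on first spill
}

func (b *spillingBuffer) Add(row []byte) error {
	if b.used+int64(len(row)) > b.budget {
		if err := b.spillToDisk(); err != nil {
			return err
		}
	}
	b.rows = append(b.rows, row)
	b.used += int64(len(row))
	return nil
}

func (b *spillingBuffer) spillToDisk() error {
	if b.spill == nil {
		f, err := os.CreateTemp("", "tempstore-*")
		if err != nil {
			return err
		}
		b.spill = f
	}
	for _, r := range b.rows {
		if _, err := b.spill.Write(r); err != nil {
			return err
		}
	}
	b.rows, b.used = nil, 0 // the memory budget is released after the spill
	return nil
}

func main() {
	b := &spillingBuffer{budget: 4 << 10} // 4 KiB in-memory budget
	for i := 0; i < 100; i++ {
		if err := b.Add(make([]byte, 128)); err != nil {
			panic(err)
		}
	}
	fmt.Println("rows still in memory:", len(b.rows))
}
```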
I am trying to reproduce this on cluster
I've reproduced this on
tl;dr: node liveness updates slow way down due to disk throttling on Azure caused by our external storage disk usage, and this causes badness (and somehow leads to a context cancellation). @vivekmenezes mentioned some fixes to node liveness projected for 1.2 which might or might not fix the underlying context cancellation. I think we proceed by documenting this as a known limitation for 1.1; the node liveness issues will hopefully be fixed for 1.2 (if not, we can introduce write throttling, but this does not fix the underlying issue).

I've been running the query and inserted a sleep in the external storage code to slow down the frequency of writes. This resulted in the query completing (though it ran into an unrelated OOM on the client side, see #18329), so it seems that throttling writes lets the query complete, but this does not fix the underlying issue.

The next step in the investigation should be to figure out why slow liveness updates result in context cancellation. Making node liveness updates more resilient is part of the 1.2 work, so this investigation could form part of that effort. This would mean that we don't have to do anything on our end. However, if the issue does not get fixed and we need this to work, we could introduce write throttling. I see this more as a band-aid than a cure, though. We might also want external storage write throttling to avoid affecting the system. Moving the milestone to 1.2.
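A minimal sketch of the sleep-based write throttling experiment described above, assuming a hypothetical `throttledWriter` wrapper around the temp-store flush path. The names here are illustrative and not CockroachDB's actual external-storage API; the point is simply that a fixed delay between flushes keeps bursts of temp-store writes from saturating the VM's IOPS budget.

```go
package main

import (
	"fmt"
	"time"
)

// throttledWriter delays each write by a fixed interval so that bursts of
// temp-store flushes don't saturate the VM's IOPS budget.
type throttledWriter struct {
	delay time.Duration
	write func(batch []byte) error // underlying disk write
}

func (w *throttledWriter) Write(batch []byte) error {
	time.Sleep(w.delay) // crude back-pressure: at most one write per delay interval
	return w.write(batch)
}

func main() {
	w := &throttledWriter{
		delay: 10 * time.Millisecond,
		write: func(batch []byte) error {
			fmt.Printf("flushed %d bytes\n", len(batch))
			return nil
		},
	}
	for i := 0; i < 5; i++ {
		if err := w.Write(make([]byte, 1<<20)); err != nil {
			panic(err)
		}
	}
}
```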
Seeing exactly this when doing large CSV imports, too, which use the temp store heavily.
Backup/Restore ran into these issues a lot too, and various throttles were added to prevent this problem. I think it's worth looking into those techniques for 1.1, because otherwise CSV importing and some distsql queries could break the cluster.
Let's try to reproduce some of this badness on a
@jordanlewis This isn't Azure-specific, but this isn't something we see anymore. It's essentially that when we saturate the VM's IOPS with query processing work, we starve the node liveness system of IO, which causes badness. We've also made great strides in understanding and rate limiting the tempstore (thanks to @mjibson), which DistSQL queries also benefit from. Thus, this isn't something we see anymore. I'm closing this issue as #18765 has fixed things sufficiently.
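For illustration of the "rate limiting the tempstore" idea mentioned above, here is a hypothetical Go sketch that caps temp-store write throughput with a token bucket (`golang.org/x/time/rate`), leaving IOPS headroom for node liveness. This is an assumed design sketch, not CockroachDB's actual code; `limitedWriter` and the chosen byte rates are made up for the example.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// limitedWriter blocks until the limiter grants enough byte tokens before
// issuing the underlying write.
type limitedWriter struct {
	limiter *rate.Limiter
	write   func(p []byte) (int, error)
}

func (w *limitedWriter) Write(ctx context.Context, p []byte) (int, error) {
	// WaitN blocks until len(p) byte tokens are available (or ctx is canceled).
	if err := w.limiter.WaitN(ctx, len(p)); err != nil {
		return 0, err
	}
	return w.write(p)
}

func main() {
	// Allow roughly 32 MiB/s of temp-store writes with bursts up to 1 MiB.
	lim := rate.NewLimiter(rate.Limit(32<<20), 1<<20)
	w := &limitedWriter{
		limiter: lim,
		write: func(p []byte) (int, error) {
			fmt.Printf("wrote %d bytes\n", len(p))
			return len(p), nil
		},
	}
	for i := 0; i < 3; i++ {
		if _, err := w.Write(context.Background(), make([]byte, 1<<20)); err != nil {
			panic(err)
		}
	}
}
```

Unlike a fixed sleep, a token bucket adapts to the actual size of each write, so small writes are barely delayed while large flushes pay proportionally.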
@arjunravinarayan it's great to hear that this isn't something we see anymore, but #18765 was never merged. Was any other rate limiting specifically put in place? Or were the few disk-syncing PRs the only relevant changes?
Also note I'm still seeing this error in #22924. The exact source may be different (i.e., not identical to the title of this issue), but it does appear there are still issues related to distsql, disk writes, node liveness, and context cancellation.
Running the query with 1227734 on `navy` fails with a context canceled error. As far as I can see, the logs don't really shed light on the reason for the cancellation and I'm not sure where to start looking.
Subsequent runs sometimes end with the same failure or with an OOM. I think this is caused by aggregating the result set before outputting it, and it seems to be the same sort of memory issue we're seeing with other queries. Still, it's unfortunate that this crash happens when sorting 2.5 GiB of data.