storage: Improve reliability of node liveness #19699
I think there's an opportunity to add more upstream-of-raft deduping (for this and certain other operations including intent resolution). When lots of heartbeat updates occur, we queue them up in the command queue and apply them in order. If we have trouble keeping up, this queuing pushes us further and further behind. We could be more intelligent about this: for liveness updates, we would generally prefer LIFO rather than FIFO behavior in the "queue".
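As a rough illustration of the LIFO-style coalescing described above (not the actual command-queue code; the types and the last-write-wins rule here are made up), a newer pending liveness update for a node would simply replace any older one still waiting to apply:

```go
// Hypothetical sketch: coalesce queued liveness updates per node so that
// only the newest one is applied (LIFO-ish), instead of replaying a FIFO
// backlog of stale heartbeats.
package liveness

import "sync"

// Update is a simplified stand-in for a liveness record update.
type Update struct {
	NodeID     int32
	Epoch      int64
	Expiration int64 // simplified expiration timestamp
}

// Coalescer keeps at most one pending update per node.
type Coalescer struct {
	mu      sync.Mutex
	pending map[int32]Update
}

func NewCoalescer() *Coalescer {
	return &Coalescer{pending: make(map[int32]Update)}
}

// Add enqueues an update; a newer update for the same node replaces an
// older queued one rather than lining up behind it.
func (c *Coalescer) Add(u Update) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.pending[u.NodeID]; !ok || u.Expiration > cur.Expiration {
		c.pending[u.NodeID] = u
	}
}

// Drain returns the newest pending update per node and clears the queue.
func (c *Coalescer) Drain() []Update {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]Update, 0, len(c.pending))
	for _, u := range c.pending {
		out = append(out, u)
	}
	c.pending = make(map[int32]Update)
	return out
}
```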
Yeah, that's a good point. I'll add it to the list. However, I think it would only help us after nodes had already started having heartbeats time out and losing their liveness, since otherwise each node just sends one heartbeat at a time.
GC is expensive too, which has been known to exacerbate problems here. A better solution would be to migrate liveness to non-versioned keys. This would enable various rocksdb optimizations when keys are overwritten (as opposed to new keys written and old ones deleted). The reason we have versioned liveness keys is just so we can (ab)use the commit trigger mechanism; in order to move to non-versioned keys we'd need to introduce some sort of trigger that can be used for normal KV ops. The migration will probably be the hardest part of making this change. I think this would be a very big win, though.
Even if we still need heartbeats to go to disk (we need something to go to disk, so that a new lease holder of the liveness range doesn't allow epoch increments that contradict a heartbeat acked by the previous lease holder), we could use batching/buffering of heartbeats. Heartbeats are never latency-sensitive (as long as they complete within the expiration time, which is relatively high), so the owner of the liveness range could run a new heartbeat service that would sit above raft and collect heartbeats to write as a batch.
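A minimal sketch of that batching idea, assuming a hypothetical heartbeat type and a WriteBatch hook standing in for whatever durable KV write would actually be issued below raft:

```go
// Hypothetical sketch: a heartbeat batching service on the liveness
// range's leaseholder. Incoming heartbeats are buffered and flushed as a
// single durable write per interval, so many heartbeats share one sync.
package heartbeat

import "time"

// Heartbeat is a simplified stand-in for a node liveness heartbeat.
type Heartbeat struct {
	NodeID int32
	Epoch  int64
}

// Batcher collects heartbeats and periodically writes them in one batch.
type Batcher struct {
	In         chan Heartbeat
	Interval   time.Duration
	WriteBatch func([]Heartbeat) error // assumed to perform one durable batched write
}

// Run loops until stop is closed, flushing buffered heartbeats every
// Interval. Latency is bounded by Interval, which can stay comfortably
// below the liveness expiration.
func (b *Batcher) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(b.Interval)
	defer ticker.Stop()
	var buf []Heartbeat
	for {
		select {
		case hb := <-b.In:
			buf = append(buf, hb)
		case <-ticker.C:
			if len(buf) > 0 {
				_ = b.WriteBatch(buf) // error handling elided in this sketch
				buf = nil
			}
		case <-stop:
			return
		}
	}
}
```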
There are never any intents on the liveness range, so the only GC foe is removing old versions, which I believe is now optimized enough to not present a problem (given the default GC settings) on that range. At the end of the day, the biggest culprit seems to be disk i/o or more specifically large sync latencies, right? I don't feel we (or at least I) understand this well enough. It'll be near impossible to have things working smoothly in the presence of 40s+ commit latencies, so while addressing "compound errors" as things go wrong seems important, it doesn't seem that we can make things "work" unless we also increase the liveness duration well above the sync latency.
Yeah, the disk overload issue is what I'm currently experimenting with. Pulling out the CSV sst-writing code and experimenting with it in isolation has already revealed some pretty pathological behavior (seconds of blocking by RocksDB) even when it's the only thing running on a machine.
Does the sstable writing that occurs outside of RocksDB "trickle" fsyncs? RocksDB contains an option for this.
I can look at that as well, but I'm not even getting to that part yet. Just pushing lots of Puts into RocksDB really fast is enough to get multi-second stalls.
To provide more detail, I'm running this experimental code, which mimics the writing done by the import code. Here's the output from running a couple different configurations for 3 minutes:

- disableWal=true, low_pri=false on top of https://github.com/cockroachdb/rocksdb/tree/crl-release-5.7.4: batch-16-disablewal-57.txt
- disableWal=true, low_pri=true on top of https://github.com/cockroachdb/rocksdb/tree/crl-release-5.7.4: batch-16-disablewal-lowpri.txt

I still need to test how it'll affect a second RocksDB instance given that our temp store is separate from our main store, but it's looking promising enough to be worth bumping RocksDB and using it for temporary store writes.
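The linked experimental code isn't reproduced here; the following is only an approximation of its general shape (batched Puts with the WAL disabled, watching for multi-second commit stalls), written against a hypothetical storage interface rather than the real RocksDB bindings:

```go
// Hypothetical sketch approximating the experiment: push key/value pairs
// into the store in small batches with the WAL disabled and report any
// commit that stalls for more than a second. The Engine/Batch interfaces
// are placeholders, not the real RocksDB bindings.
package writeload

import (
	"fmt"
	"math/rand"
	"time"
)

// Batch and Engine are minimal stand-ins for a storage engine API.
type Batch interface {
	Put(key, value []byte) error
	Commit(syncWAL bool) error
}

type Engine interface {
	NewBatch() Batch
}

// RunWriteLoad writes random data in batches of batchSize for dur,
// printing a line whenever a commit stalls for over a second.
func RunWriteLoad(eng Engine, batchSize int, dur time.Duration) {
	deadline := time.Now().Add(dur)
	val := make([]byte, 1<<10)
	for time.Now().Before(deadline) {
		b := eng.NewBatch()
		for i := 0; i < batchSize; i++ {
			key := []byte(fmt.Sprintf("k%016d", rand.Int63()))
			_, _ = rand.Read(val)
			_ = b.Put(key, val)
		}
		start := time.Now()
		_ = b.Commit(false /* syncWAL disabled, mirroring disableWal=true */)
		if stall := time.Since(start); stall > time.Second {
			fmt.Printf("commit stalled for %s\n", stall)
		}
	}
}
```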
Wow, really? That strikes me as completely unexpected; I'm not familiar with that behavior. I also assume this is specific to the cloud VMs and you don't see it locally?
Do you see any problems running on your laptop? Or does this only occur on GCE local SSD? Also, I love having a small reproducible case of badness like this. We'll track down what is going on here very soon.
I don't see the same issues on my laptop. The performance takes some extended dips to half its original throughput, but never all the way down to 0. Also, I tried experimenting with your suggestion.
Have you set the option mentioned above?
@a-robinson What GCE machine type are you using? I just tried to reproduce the badness you're seeing on a GCE instance of my own.
It looks better than with the defaults, but still has sizable stalls (with the longest being 11 seconds in my few minutes of testing).
My gceworker -- i.e. 24 vCPUs, 32 GB memory, 100 GB non-local SSD.
Ah, I've seen problems with throttling when using Persistent SSD. Never investigated where they are coming from. I recall reading that Persistent SSD throttles based on the size of the disk. Can you test on Local SSD and verify you're not seeing a problem?
Yeah, using local SSD avoids the extended zero QPS periods. It's pretty frustrating that the throttling is so bad given how inconvenient local SSDs are to use and how things like our Kubernetes config completely rely on remote disks. I wonder whether the GCE PD team cares about the effects of the throttling behavior.
Have the IMPORT tests that have been experiencing problems been using PD SSD or Local SSD?
Doesn't hurt to file a bug. Especially if we can narrow down the workload that exhibits badness.
The import tests I was running were all on local SSD.
Alright, so I have a simple program that tickles the disk badness even on local SSD. It basically just runs what I had before in parallel with synctest and only prints the sync throughput/latencies, although I slightly tweaked what was there before to even more closely mimic the import code.

I'm going to try writing a basic adaptive rate-limiter that limits writes to the temp store in response to the primary store's sync latencies and see how it affects things. Another avenue to explore is just changing the import code to restrict its own parallelism; spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Also, as suggested for a similar situation recently, there's a good chance that #19229 would help with node liveness in the case that all nodes' disks are overloaded, since it'd change what we end up waiting on.
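For illustration, a minimal sketch of what such an adaptive rate-limiter could look like, using golang.org/x/time/rate; the latency thresholds and the feedback rule are invented, not measured:

```go
// Hypothetical sketch: throttle temp-store writes based on the primary
// store's observed WAL sync latency. Thresholds and the adjustment rule
// are invented for illustration.
package ratelimit

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// AdaptiveWriteLimiter hands out write "tokens" (bytes) at a rate that
// shrinks when the primary store's syncs slow down.
type AdaptiveWriteLimiter struct {
	lim      *rate.Limiter
	min, max rate.Limit
}

func NewAdaptiveWriteLimiter(minBytesPerSec, maxBytesPerSec float64) *AdaptiveWriteLimiter {
	return &AdaptiveWriteLimiter{
		lim: rate.NewLimiter(rate.Limit(maxBytesPerSec), int(maxBytesPerSec)),
		min: rate.Limit(minBytesPerSec),
		max: rate.Limit(maxBytesPerSec),
	}
}

// ObserveSyncLatency halves the allowed rate when syncs are slow and
// slowly recovers it when they are fast again.
func (a *AdaptiveWriteLimiter) ObserveSyncLatency(d time.Duration) {
	cur := a.lim.Limit()
	switch {
	case d > 500*time.Millisecond:
		cur /= 2
	case d < 50*time.Millisecond:
		cur *= 1.1
	}
	if cur < a.min {
		cur = a.min
	}
	if cur > a.max {
		cur = a.max
	}
	a.lim.SetLimit(cur)
}

// WaitToWrite blocks an n-byte temp-store write until the limiter allows
// it (n must not exceed the configured burst).
func (a *AdaptiveWriteLimiter) WaitToWrite(ctx context.Context, n int) error {
	return a.lim.WaitN(ctx, n)
}
```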
Two updates:
I don't think it's unreasonable to require sysctl or similar tweaks for heavily loaded servers. We already have to advise raising the default file descriptor limit and every database will have a page of settings that may need to be tweaked. Of course it's better if we can find a solution that doesn't require this, but that may not always be possible (especially when we're contending with non-cockroach processes).
Yikes! We should definitely stop doing that.
Agreed. I wasn't aware we were doing this.
Moving this to 2.2. There hasn't been a lot of noise about node liveness lately, so maybe we should just close this instead. Thoughts @a-robinson? Is there more you'd like to explore here?
I'd probably close this issue. There are still things that can be done to improve the reliability of the node liveness system (up to and including replacing the whole thing with something like @andy-kimball's failure detector), but I'm not sure there's any value left in keeping this broad issue open.
I think the bulk IO team (@dt/@mjibson) might feel otherwise. Last I heard it was still fairly easy for node liveness failures to cause a restore to fail. I think there's probably something to be done with prioritizing node liveness batches. Would Andy K's failure detection not be subject to the same problem where maxing out disk bandwidth causes heartbeats to be missed?
At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.
One difficulty with prioritizing node liveness batches is that they're currently handled exactly the same way as regular traffic, so prioritizing them seems to require hacky checks on the keys/ranges involved. Andy K's failure detector would at least move any disk IO involved into a separate subsystem so the boundaries for prioritization would be more clear. (I'm not sure whether his scheme even requires disk access for heartbeats or only on failure). Another problem with the current scheme is that there are two critical ranges: both range 1 and the liveness range must be up for the system to work. I believe Andy's plan would make failure detection independent of range 1 (and of any single range).
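A sketch of the kind of "hacky check" being referred to: inspect the spans a batch touches and flag it for elevated priority if any overlap the liveness key range. The span boundaries and types below are placeholders rather than the real keys package:

```go
// Hypothetical sketch of a key-based priority check: report whether any
// span in a batch overlaps the node liveness key range. The liveness span
// below is a placeholder value, not the real key layout.
package priority

import "bytes"

// Span is a simplified [Key, EndKey) key span; an empty EndKey means a
// single key.
type Span struct {
	Key, EndKey []byte
}

// livenessSpan stands in for the node liveness key range.
var livenessSpan = Span{
	Key:    []byte("\x00liveness-"),
	EndKey: []byte("\x00liveness."),
}

func endKey(s Span) []byte {
	if len(s.EndKey) > 0 {
		return s.EndKey
	}
	// Treat a point key as the span [Key, Key+"\x00").
	return append(append([]byte{}, s.Key...), 0)
}

func overlaps(a, b Span) bool {
	return bytes.Compare(a.Key, endKey(b)) < 0 && bytes.Compare(b.Key, endKey(a)) < 0
}

// IsLivenessBatch reports whether any request span in the batch touches
// the liveness range and so could be given elevated priority.
func IsLivenessBatch(spans []Span) bool {
	for _, s := range spans {
		if overlaps(s, livenessSpan) {
			return true
		}
	}
	return false
}
```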
There are six proposed experiments in the issue description that don't seem to be associated with a more specific issue. It'd be a shame to lose track of those!
I would also leave it open with the goal of (in the 2.2 time frame) examining/prototyping Andy's failure detector and coming to a decision about whether to try to implement it in the foreseeable future.
@ajwerner I know you made progress here recently. Have you updated the checklist at the top and reviewed whether it is still relevant?
Many of the mitigations proposed in this issue deal with I/O starvation. We have yet to make progress on that front. In recent conversations with @dt it seems like ideas such as those are as relevant as ever. That being said, #39172 separates the network connection for range 1 and node liveness and seems to be effective at protecting node liveness in the face of CPU overload, which has proven to be a bigger problem than this issue anticipated. It's not clear that this is the "prioritization" mechanism envisioned by this issue.
Opening a tracking/organizational issue for the work behind trying to make node liveness more reliable in clusters with very heavy workloads (e.g. #15332). More thoughts/ideas very welcome.
Problem definition
Node liveness heartbeats time out when a cluster is overloaded. This typically makes things even worse in the cluster, since nodes losing their liveness prevents pretty much all other work from completing. Slow node liveness heartbeats are particularly common/problematic during bulk I/O jobs like imports or non-rate-limited restores. Slow heartbeats have also become a problem due to GC queue badness in at least one case.
Options
Experiments/TODOs