
storage: Improve reliability of node liveness #19699

Closed
4 of 13 tasks
a-robinson opened this issue Oct 31, 2017 · 58 comments
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. no-issue-activity S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@a-robinson
Contributor

a-robinson commented Oct 31, 2017

Opening a tracking/organizational issue for the work behind trying to make node liveness more reliable in clusters with very heavy workloads (e.g. #15332). More thoughts/ideas are very welcome.

Problem definition

Node liveness heartbeats time out when a cluster is overloaded. This typically makes things even worse in the cluster, since nodes losing their liveness prevents pretty much all other work from completing. Slow node liveness heartbeats are particularly common/problematic during bulk I/O jobs like imports or non-rate-limited restores. Slow heartbeats have also become a problem due to GC queue badness in at least one case.

Options

  • Detect bad disks on startup and just refuse to run if they’re bad (see the sketch after this list)
    • This wouldn’t fix the IMPORT issues we’ve been seeing on GCE - GCE’s disks aren’t that bad
  • Implement prioritization of different kinds of work (e.g. liveness > normal KV operations > bulk I/O)
    • Would just doing this for disk I/O be enough? Or would we need network prioritization as well?
    • Likely a pretty large project to propagate priority information everywhere, and the only notion of priority that RocksDB exposes is a low-priority write option
  • Move node liveness in-memory
    • Would have the broadest benefits, but we’d need to be very careful, since node liveness is intimately tied to correctness. I’ve spent some time thinking about this but haven’t been able to convince myself of a way to make it safe. We may be able to do something that allows heartbeats to work in memory and only put epoch increments to disk, but the liveness range would need a lot more special handling. Details TBD.
    • Even if we did this, bulk I/O still might slow down OLTP workloads more than is acceptable, necessitating other work to keep it under control.
  • Reduce GC threshold for node liveness range
    • Seems like an easy, but very small, win. Unlikely to fix much by itself.
  • Learn how to detect and avoid I/O throttling / slowdown
    • Something along the lines of TCP’s flow control, e.g. start slow, measure latency, speed up writes until latency starts increasing beyond acceptable levels, then back off to more stable I/O rates.
  • Tweak RocksDB and/or filesystem settings to reduce impact of bulk I/O
    • Peter has already found a couple sysctl settings that help, and there may be others. This wouldn’t really improve node liveness in general, but might be able to avoid the worst issues during imports/restores.
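
For the first option above, a minimal sketch of what a startup disk probe might look like: write and fsync a small file a few times and refuse to start if the worst latency is far beyond what a healthy disk should deliver. The package name, function, and thresholds are all invented for illustration, not actual CockroachDB code.

```go
// Package diskprobe sketches a hypothetical startup check that refuses to run
// on a disk with pathological write+sync latency. Thresholds are illustrative.
package diskprobe

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const (
	probeIterations = 10
	probeWriteSize  = 64 << 10 // 64 KiB per write
	maxAcceptable   = 500 * time.Millisecond
)

// CheckDisk returns an error if the worst observed write+fsync latency in the
// store directory exceeds maxAcceptable.
func CheckDisk(storeDir string) error {
	path := filepath.Join(storeDir, "disk-probe.tmp")
	defer os.Remove(path)

	buf := make([]byte, probeWriteSize)
	var worst time.Duration
	for i := 0; i < probeIterations; i++ {
		f, err := os.Create(path)
		if err != nil {
			return err
		}
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			f.Close()
			return err
		}
		if err := f.Sync(); err != nil {
			f.Close()
			return err
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
		f.Close()
	}
	if worst > maxAcceptable {
		return fmt.Errorf("disk at %s looks unhealthy: worst write+sync latency %s exceeds %s",
			storeDir, worst, maxAcceptable)
	}
	return nil
}
```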

Experiments/TODOs

@a-robinson a-robinson added A-disaster-recovery in progress C-performance Perf of queries or internals. Solution not expected to change functional behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Oct 31, 2017
@a-robinson a-robinson added this to the 1.2 milestone Oct 31, 2017
@a-robinson a-robinson self-assigned this Oct 31, 2017
@bdarnell
Contributor

bdarnell commented Nov 1, 2017

I think there's an opportunity to add more upstream-of-raft deduping (for this and certain other operations including intent resolution). When lots of heartbeat updates occur, we queue them up in the command queue and apply them in order. If we have trouble keeping up, this queuing pushes us further and further behind. We could be more intelligent about this: for liveness updates, we would generally prefer LIFO rather than FIFO behavior in the "queue".
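
To make the LIFO idea concrete, here is a minimal sketch (not the actual command-queue code) of a per-node "latest wins" buffer: instead of applying every queued heartbeat update in FIFO order, only the most recent update per node is retained and older ones are dropped. All type names are hypothetical.

```go
package liveness

import "sync"

// LivenessUpdate is a stand-in for a heartbeat/liveness write destined for
// the liveness range.
type LivenessUpdate struct {
	NodeID     int32
	Expiration int64 // wall time, for illustration
}

// latestWinsBuffer coalesces pending liveness updates so that when the
// replica falls behind, it only applies the newest update per node rather
// than working through a FIFO backlog.
type latestWinsBuffer struct {
	mu      sync.Mutex
	pending map[int32]LivenessUpdate
}

func newLatestWinsBuffer() *latestWinsBuffer {
	return &latestWinsBuffer{pending: map[int32]LivenessUpdate{}}
}

// Add records an update, overwriting any older pending update for the same node.
func (b *latestWinsBuffer) Add(u LivenessUpdate) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if prev, ok := b.pending[u.NodeID]; !ok || u.Expiration > prev.Expiration {
		b.pending[u.NodeID] = u
	}
}

// Drain returns the newest pending update per node and clears the buffer;
// the caller proposes these (and only these) updates.
func (b *latestWinsBuffer) Drain() []LivenessUpdate {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]LivenessUpdate, 0, len(b.pending))
	for _, u := range b.pending {
		out = append(out, u)
	}
	b.pending = map[int32]LivenessUpdate{}
	return out
}
```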

@a-robinson
Contributor Author

Yeah, that's a good point. I'll add it to the list.

However, I think it would only help us after nodes had already started having heartbeats time out and losing their liveness, since otherwise each node just sends one heartbeat at a time.

@bdarnell
Contributor

bdarnell commented Nov 1, 2017

  • Reduce GC threshold for node liveness range
    • Seems like an easy, but very small, win. Unlikely to fix much by itself.

GC is expensive too, which has been known to exacerbate problems here. A better solution would be to migrate liveness to non-versioned keys. This would enable various rocksdb optimizations when keys are overwritten (as opposed to new keys written and old ones deleted). The reason we have versioned liveness keys is just so we can (ab)use the commit trigger mechanism; in order to move to non-versioned keys we'd need to introduce some sort of trigger that can be used for normal KV ops. The migration will probably be the hardest part of making this change. I think this would be a very big win, though.

  • Move node liveness in-memory
    • Would have the broadest benefits, but we’d need to be very careful, since node liveness is intimately tied to correctness. I’ve spent some time thinking about this but haven’t been able to convince myself of a way to make it safe. We may be able to do something that allows heartbeats to work in memory and only put epoch increments to disk, but the liveness range would need a lot more special handling. Details TBD.

Even if we still need heartbeats to go to disk (we need something to go to disk, so that a new lease holder of the liveness range doesn't allow epoch increments that contradict a heartbeat acked by the previous lease holder), we could use batching/buffering of heartbeats. Heartbeats are never latency-sensitive (as long as they complete within the expiration time, which is relatively high), so the owner of the liveness range could run a new heartbeat service that would sit above raft and collect heartbeats to write as a batch.
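
A rough sketch of that batching idea, assuming the holder of the liveness range lease runs a small service that collects incoming heartbeats for a short window and writes them as one batch. The types, window size, and writeBatch callback are invented for illustration, not an actual API.

```go
package liveness

import (
	"context"
	"log"
	"time"
)

// heartbeat is a minimal stand-in for a node liveness heartbeat record.
type heartbeat struct {
	nodeID     int32
	expiration int64
}

// heartbeatBatcher collects heartbeats for up to flushInterval and hands them
// to writeBatch as a single write. Heartbeats are not latency-sensitive (they
// only need to land well before the liveness expiration), so trading a few
// milliseconds of delay for far fewer writes to the liveness range is fine.
type heartbeatBatcher struct {
	incoming      chan heartbeat
	flushInterval time.Duration
	// writeBatch stands in for the single write the lease holder would issue
	// for the whole batch.
	writeBatch func(context.Context, []heartbeat) error
}

func (b *heartbeatBatcher) run(ctx context.Context) {
	ticker := time.NewTicker(b.flushInterval)
	defer ticker.Stop()
	var buf []heartbeat
	for {
		select {
		case hb := <-b.incoming:
			buf = append(buf, hb)
		case <-ticker.C:
			if len(buf) == 0 {
				continue
			}
			if err := b.writeBatch(ctx, buf); err != nil {
				// A real implementation would propagate the error back to the
				// heartbeating nodes; this sketch just logs it.
				log.Printf("heartbeat batch write failed: %v", err)
			}
			buf = nil
		case <-ctx.Done():
			return
		}
	}
}
```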

@tbg
Member

tbg commented Nov 1, 2017

There are never any intents on the liveness range, so the only GC foe is removing old versions, which I believe is now optimized enough to not present a problem (given the default GC settings) on that range.

At the end of the day, the biggest culprit seems to be disk i/o or more specifically large sync latencies, right? I don't feel we (or at least I) understand this well enough.
@petermattis has made some progress in understanding these large latencies when contending processes write to the file system (which is now the case, since we have multiple RocksDB instances thanks to DistSQL, as well as sstable ingestion, which writes directly to disk), but that likely doesn't explain everything. Do we feel we understand well enough why an import on a decent GCE disk still causes liveness failures (unless sufficiently I/O restricted)?

It'll be near impossible to have things working smoothly in the presence of 40s+ commit latencies, so while addressing "compound errors" as things go wrong seems important, it doesn't seem that we can make things "work" unless we also increase the liveness duration well above the sync latency.

@a-robinson
Contributor Author

Yeah, the disk overload issue is what I'm currently experimenting with. Pulling out the CSV sst-writing code and experimenting with it in isolation has already revealed some pretty pathological behavior (seconds of blocking by RocksDB) even when it's the only thing running on a machine.

@petermattis
Collaborator

Does the sstable writing that occurs outside of RocksDB "trickle" fsyncs? RocksDB contains a bytes_per_sync option which we currently have disabled. The idea is to limit the amount of dirty data. Probably worthwhile to experiment with enabling this and adding something similar to all of our sstable writing.
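
For the sstables we write ourselves, the equivalent of bytes_per_sync would be something like the sketch below: sync incrementally every few megabytes so the OS never accumulates a huge backlog of dirty pages that it then flushes all at once. RocksDB's option uses sync_file_range; this is a crude approximation using full fsyncs, and the syncEvery value is a placeholder, not a tuned number.

```go
package sstwriter

import "os"

// trickleSyncWriter wraps a file and calls Sync every syncEvery bytes, so a
// large sstable write never builds up hundreds of megabytes of dirty pages
// whose eventual flush stalls everyone else's syncs.
type trickleSyncWriter struct {
	f         *os.File
	syncEvery int64 // e.g. 1 << 20 (1 MiB); a placeholder value
	sinceSync int64
}

func newTrickleSyncWriter(f *os.File, syncEvery int64) *trickleSyncWriter {
	return &trickleSyncWriter{f: f, syncEvery: syncEvery}
}

// Write passes data through to the file and syncs whenever enough unsynced
// bytes have accumulated.
func (w *trickleSyncWriter) Write(p []byte) (int, error) {
	n, err := w.f.Write(p)
	w.sinceSync += int64(n)
	if err == nil && w.sinceSync >= w.syncEvery {
		err = w.f.Sync()
		w.sinceSync = 0
	}
	return n, err
}

// Close flushes any remaining dirty data and closes the file.
func (w *trickleSyncWriter) Close() error {
	if err := w.f.Sync(); err != nil {
		w.f.Close()
		return err
	}
	return w.f.Close()
}
```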

@a-robinson
Contributor Author

I can look at that as well, but I'm not even getting to that part yet. Just pushing lots of Puts into RocksDB really fast is enough to get multi-second stalls.

@a-robinson
Contributor Author

The low_pri RocksDB write option from the 5.6 release is very promising. Without it, doing Puts in a tight loop on a non-local GCE SSD regularly gets blocked for seconds, often for more than 10 seconds. After enabling it, a three-minute run only got blocked for one second once, with no other complete gaps. It hurts throughput noticeably, but that's for the best if maxing out throughput is killing latencies.

To provide more detail, I'm running this experimental code, which mimics the writing done by the import code. Here's the output from running a couple of different configurations for 3 minutes:

disableWal=true,low_pri=false on top of https://github.com/cockroachdb/rocksdb/tree/crl-release-5.7.4: batch-16-disablewal-57.txt

disableWal=true,low_pri=true on top of https://github.com/cockroachdb/rocksdb/tree/crl-release-5.7.4: batch-16-disablewal-lowpri.txt

I still need to test how it'll affect a second RocksDB instance, given that our temp store is separate from our main store, but it's looking promising enough to be worth bumping RocksDB and using it for temporary store writes.
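
For reference, the stall measurement in these experiments amounts to something like the sketch below: run batched writes in a tight loop and report any commit that blocks longer than a threshold. The writeBatch callback is a placeholder for whatever engine call the experiment exercises (a RocksDB batch commit in this case); it is not the actual experiment code linked above.

```go
package main

import (
	"fmt"
	"time"
)

// measureStalls runs writeBatch in a tight loop for runFor and prints every
// call that blocks for longer than stallThreshold, mimicking the "complete
// gaps" reported in the experiments above.
func measureStalls(writeBatch func() error, runFor, stallThreshold time.Duration) error {
	deadline := time.Now().Add(runFor)
	var batches int
	for time.Now().Before(deadline) {
		start := time.Now()
		if err := writeBatch(); err != nil {
			return err
		}
		if d := time.Since(start); d > stallThreshold {
			fmt.Printf("batch %d stalled for %s\n", batches, d)
		}
		batches++
	}
	fmt.Printf("completed %d batches in %s\n", batches, runFor)
	return nil
}

func main() {
	// Toy stand-in for a batch commit; the real experiment writes batches of
	// random Puts with the WAL disabled and low_pri on or off.
	fakeCommit := func() error { time.Sleep(time.Millisecond); return nil }
	_ = measureStalls(fakeCommit, 10*time.Second, time.Second)
}
```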

@tbg
Member

tbg commented Nov 1, 2017

The low_pri RocksDB write option from the 5.6 release is very promising. Without it, doing Puts in a tight loop on a non-local GCE SSD regularly gets blocked for seconds, often for more than 10 seconds.

Wow, really? That strikes me as completely unexpected. I'm not familiar with the RocksDBMultiMap used, but essentially this is just doing batched writes in a single goroutine? Does this all reproduce if you go really basic and just use a rocksdb.Batch()? I'd assume yes.

I also assume this is specific to the cloud VMs and you don't see it locally?

The low_pri option looks promising, but I'm curious to understand better why it's so bad when you don't use it in the first place.

@petermattis
Collaborator

To provide more detail, I'm running this experimental code, which mimics the writing done by the import code.

Do you see any problems running on your laptop? Or does this only occur on GCE local SSD? Also, I love having a small reproducible case of badness like this. We'll track down what is going on here very soon.

@a-robinson
Contributor Author

I don't see the same issues on my laptop. The performance takes some extended dips to half its original throughput, but never all the way down to 0.

Also, I tried experimenting with using your debug synctest (#19618) command at the same time as my disk contention command (on GCE), and while setting low_pri on my code helps a little bit compared to not doing so, it's far from preventing long periods of zero sync throughput.

@petermattis
Collaborator

Have you set the vm.dirty_* sysctls with your diskcontention test?

@petermattis
Collaborator

@a-robinson What GCE machine type are you using? I just tried to reproduce the badness you're seeing using an n1-highcpu-8 machine and saw a few dips, but no periods of 0 throughput.

@a-robinson
Contributor Author

Have you set the vm.dirty_* sysctls with your diskcontention test?

It looks better than with the defaults, but still has sizable stalls (with the longest being 11 seconds in my few minutes of testing).

@a-robinson
Contributor Author

@a-robinson What GCE machine type are you using? I just tried to reproduce the badness you're seeing using an n1-highcpu-8 machine and saw a few dips, but no periods of 0 throughput.

My gceworker -- i.e. 24 vCPUs, 32 GB memory, 100GB non-local SSD.

@petermattis
Collaborator

My gceworker -- i.e. 24 vCPUs, 32 GB memory, 100GB non-local SSD.

Ah, I've seen problems with throttling when using Persistent SSD. Never investigated where they are coming from. I recall reading that Persistent SSD throttles based on the size of the disk. Can you test on Local SSD and verify you're not seeing a problem?

@a-robinson
Contributor Author

Yeah, using local SSD avoids the extended zero QPS periods. It's pretty frustrating that the throttling is so bad given how inconvenient local SSDs are to use and how things like our Kubernetes config completely rely on remote disks. I wonder whether the GCE PD team cares about the effects of the throttling behavior.

@petermattis
Collaborator

Have the IMPORT tests that have been experiencing problems been using PD SSD or Local SSD?

I wonder whether the GCE PD team cares about the effects of the throttling behavior.

Doesn't hurt to file a bug. Especially if we can narrow down the workload that exhibits badness.

@a-robinson
Contributor Author

The import tests I was running were all on local SSD.

@a-robinson
Contributor Author

Alright, so I have a simple program that tickles the disk badness even on local SSD. It basically just runs what I had before in parallel with synctest and only prints the sync throughput/latencies, although I slightly tweaked what was there before to even more closely mimic the import sstWriter distsql flow.

The low_pri and disableWAL options don't appear sufficient to help, at least not enough to matter. Setting the vm.dirty_bytes and vm.dirty_background_bytes sysctls does, keeping latencies bounded to no more than a few hundred milliseconds. However, how realistic is it for us to expect users to modify these settings? It seems like quite a stretch.

I'm going to try writing a basic adaptive rate-limiter that limits writes to the temp store in response to the primary store's sync latencies and see how it affects things. Another avenue to explore is just changing the import code to restrict its own parallelism. Spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Also, as suggested for a similar situation recently, there's a good chance that #19229 would help with node liveness in the case that all nodes' disks are overloaded, since it'd switch us from waiting for the <leader's sync> + <fastest follower's sync> time to waiting for max(<leader's sync>, <fastest follower's sync>) time.
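
A sketch of what that adaptive rate limiter might look like, in the spirit of the TCP-style flow control mentioned in the issue description: additively raise the allowed bulk-write rate while observed sync latency on the primary store stays low, and multiplicatively back off when it spikes. All names and constants here are illustrative guesses, not tuned values or actual CockroachDB code.

```go
package bulkio

import (
	"sync"
	"time"
)

// adaptiveLimiter adjusts the byte rate allowed for bulk (temp store) writes
// based on recently observed sync latency on the primary store: additive
// increase, multiplicative decrease, like TCP congestion control.
type adaptiveLimiter struct {
	mu          sync.Mutex
	rateBytes   float64 // currently allowed bytes/sec for bulk writes
	minRate     float64
	maxRate     float64
	latencyGoal time.Duration // back off when syncs exceed this
}

func newAdaptiveLimiter() *adaptiveLimiter {
	return &adaptiveLimiter{
		rateBytes:   16 << 20,  // start at 16 MB/s
		minRate:     1 << 20,   // never below 1 MB/s
		maxRate:     256 << 20, // never above 256 MB/s
		latencyGoal: 100 * time.Millisecond,
	}
}

// recordSyncLatency is fed each observed commit/sync latency on the primary
// store and adjusts the bulk-write budget accordingly.
func (l *adaptiveLimiter) recordSyncLatency(d time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if d > l.latencyGoal {
		// Multiplicative decrease: the disk is falling behind, back off hard.
		l.rateBytes /= 2
		if l.rateBytes < l.minRate {
			l.rateBytes = l.minRate
		}
	} else {
		// Additive increase: probe for more bandwidth, 1 MB/s at a time.
		l.rateBytes += 1 << 20
		if l.rateBytes > l.maxRate {
			l.rateBytes = l.maxRate
		}
	}
}

// currentRate returns the bulk-write budget; callers would feed this into a
// token-bucket limiter placed in front of temp-store writes.
func (l *adaptiveLimiter) currentRate() float64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.rateBytes
}
```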

@a-robinson
Contributor Author

Two updates:

  1. The sysctl settings have made for a successful distributed scale-factor-10 IMPORT: see #19436 (comment)
  2. It looks like investigating different RocksDB settings may be more fruitful than pure rate limiting (assuming we can't rely on changing the sysctl settings). Even with Puts rate limited to just 16MB/s, after enough time (3 minutes with low_pri/disableWal off, 10 minutes with them on) sync latencies still start to break down. This is almost certainly due to compactions in RocksDB -- I regularly see 150+MB/s of disk writes, with occasional spikes to 220+MB/s.

@bdarnell
Contributor

bdarnell commented Nov 9, 2017

However, how realistic is it for us to expect users to modify these settings? It seems like quite a stretch.

I don't think it's unreasonable to require sysctl or similar tweaks for heavily loaded servers. We already have to advise raising the default file descriptor limit and every database will have a page of settings that may need to be tweaked. Of course it's better if we can find a solution that doesn't require this, but that may not always be possible (especially when we're contending with non-cockroach processes).

Spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Yikes! We should definitely stop doing that.

@petermattis
Collaborator

Spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Yikes! We should definitely stop doing that.

Agreed. I wasn't aware we were doing this.

@tbg tbg removed A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@nvanbenschoten
Member

Moving this to 2.2. There hasn't been a lot of noise about node liveness lately, so maybe we should just close this instead. Thoughts @a-robinson? Is there more you'd like to explore here?

@nvanbenschoten nvanbenschoten modified the milestones: 2.1, 2.2 Sep 28, 2018
@bdarnell
Contributor

I'd probably close this issue. There are still things that can be done to improve the reliability of the node liveness system (up to and including replacing the whole thing with something like @andy-kimball's failure detector), but I'm not sure there's any value left in keeping this broad issue open.

@benesch
Contributor

benesch commented Sep 28, 2018 via email

@a-robinson
Contributor Author

At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.

@bdarnell
Contributor

I think there’s probably something to be done with prioritizing node liveness batches. Would Andy K’s failure detection not be subject to the same problem where maxing out disk bandwidth causes heartbeats to be missed?

One difficulty with prioritizing node liveness batches is that they're currently handled exactly the same way as regular traffic, so prioritizing them seems to require hacky checks on the keys/ranges involved. Andy K's failure detector would at least move any disk IO involved into a separate subsystem so the boundaries for prioritization would be more clear. (I'm not sure whether his scheme even requires disk access for heartbeats or only on failure).

Another problem with the current scheme is that there are two critical ranges: both range 1 and the liveness range must be up for the system to work. I believe Andy's plan would make failure detection independent of range 1 (and of any single range).

@benesch
Contributor

benesch commented Sep 28, 2018

At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.

There are six proposed experiments in the issue description that don't seem to be associated with a more specific issue. It'd be a shame to lose track of those!

@tbg
Member

tbg commented Sep 28, 2018

I would also leave it open with the goal of (in the 2.2 time frame) examining/prototyping Andy's failure detector and coming to a decision of whether to try to implement it in the foreseeable future.

@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@tbg tbg added A-kv-client Relating to the KV client and the KV interface. and removed A-coreperf labels Oct 11, 2018
@a-robinson a-robinson assigned tbg and unassigned a-robinson Jan 11, 2019
@awoods187
Contributor

@ajwerner I know you made progress here recently. Have you updated the checklist at the top/reviewed if it is still relevant?

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 12, 2019
@ajwerner
Contributor

Many of the mitigations proposed in this issue deal with I/O starvation. We have yet to make progress on that front. In recent conversations with @dt, it seems like ideas such as

Prototype a throttler that adjusts bulk I/O throughput in reaction to the latency of disk syncs

are as relevant as ever.

That being said, #39172 separates the network connection for range 1 and node liveness, and seems to be effective at protecting node liveness in the face of CPU overload, which has proven to be a bigger problem than this issue anticipated. It’s not clear that this is the “prioritization” mechanism envisioned by this issue.

@github-actions

github-actions bot commented Jun 8, 2021

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@nvanbenschoten
Member

Closing. We made improvements in this area over the past few years. Most notably, we've isolated network resources for node liveness traffic in #39172 and improved the behavior of the Raft scheduler and its handling of the node liveness range under CPU starvation in #56860.
