
storage: Backpressure for extreme rebalancing needs #34752

Closed
bdarnell opened this issue Feb 8, 2019 · 6 comments
Labels
A-admission-control · A-kv-distribution (Relating to rebalancing and leasing) · C-bug (Code not up to spec/doc; specs & docs deemed correct; solution expected to change code/behavior) · S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting) · X-stale

Comments

@bdarnell
Contributor

bdarnell commented Feb 8, 2019

A customer performed a load test by truncating a table and then running a large number of INSERT statements (uniformly distributed). The newly truncated table started from a single range and split and rebalanced from there. The problem is that rebalancing is throttled (2MB/s by default until #34747) but the incoming write traffic is not. This is not an apples-to-apples comparison, but the SQL connections were receiving a total of 9MB/s of queries. Initially all of those writes went to the single range in the truncated table. The node(s) holding that range were receiving data faster than they could rebalance ranges away. That node grew until it eventually failed, even though there would have been plenty of capacity if the ranges had been more evenly distributed.
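For a sense of scale, a back-of-envelope sketch (the 9 MB/s and 2 MB/s figures are the ones above; the headroom number is made up, and replication/index write amplification is ignored):

```go
package main

import "fmt"

func main() {
	const (
		ingestMBps    = 9.0   // incoming SQL write traffic, concentrated on one node's ranges
		rebalanceMBps = 2.0   // default snapshot rebalance rate at the time (pre-#34747)
		headroomGB    = 100.0 // hypothetical free space on the hot node
	)
	netMBps := ingestMBps - rebalanceMBps
	hours := headroomGB * 1024 / netMBps / 3600
	fmt.Printf("net accumulation ~%.0f MB/s; %.0f GB of headroom lasts ~%.1f hours\n",
		netMBps, headroomGB, hours)
}
```

However much headroom the node starts with, it fills at a constant rate as long as ingest outpaces the rebalance cap.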

Note that pre-splitting wouldn't have helped without a working SCATTER command (#23358). Nodes were well-balanced in terms of number of ranges (from the pre-truncation table whose ranges were sticking around); there would be no pressure to move the split ranges until load began.

Ultimately it would have been better in this case to throttle writes for a while to allow the rebalancing to proceed. I'm not sure exactly how to tell when things have gotten bad enough that we should do this. It would be easier to buy some headroom by increasing the snapshot throttle cluster settings (on a case-by-case basis, or should we change the defaults? The defaults were set back in #14718). There's also #14768 for a more dynamic rate limit.
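For reference, a minimal sketch of what bumping those settings might look like from a client, assuming the setting names of this era (`kv.snapshot_rebalance.max_rate` / `kv.snapshot_recovery.max_rate`); the 8MiB values are just an example, not a recommendation:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise both snapshot throttles; the right values depend on available
	// disk and network bandwidth on the cluster in question.
	for _, stmt := range []string{
		"SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '8MiB'",
		"SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '8MiB'",
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}
```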

@bdarnell
Contributor Author

bdarnell commented Feb 8, 2019

Note that even doing a slow ramp-up of the test wouldn't necessarily solve this. Load-based rebalancing doesn't kick in until load crosses a certain threshold; below that we only have count-based rebalancing. It would be a delicate balance to ramp up the load enough to trigger LBR while avoiding this problem.

Another option would be to make recently-split ranges "repel" each other even in the absence of load, although we'd have to be careful that this heuristic wouldn't fight with the merge queue's attempt to attract adjacent ranges together.
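A purely hypothetical sketch of what such a heuristic could look like (none of these names exist in the allocator; the decay is just one way to keep it from fighting the merge queue forever):

```go
package main

import (
	"fmt"
	"time"
)

// rangeInfo is a made-up stand-in for whatever the allocator would track.
type rangeInfo struct {
	splitAt   time.Time // when this range was created by a split
	coLocated int       // replicas still sharing a store with siblings from that split
}

// splitRepulsionBonus returns extra rebalance priority for a recently split
// range that still shares stores with its split siblings. The bonus decays to
// zero after repelWindow so it cannot oppose the merge queue indefinitely.
func splitRepulsionBonus(r rangeInfo, now time.Time, repelWindow time.Duration) float64 {
	age := now.Sub(r.splitAt)
	if age >= repelWindow || r.coLocated == 0 {
		return 0
	}
	decay := 1 - float64(age)/float64(repelWindow)
	return decay * float64(r.coLocated)
}

func main() {
	r := rangeInfo{splitAt: time.Now().Add(-2 * time.Minute), coLocated: 3}
	fmt.Println(splitRepulsionBonus(r, time.Now(), 10*time.Minute)) // ~2.4
}
```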

@knz added the C-bug, A-kv-distribution, and S-1-stability labels on Feb 11, 2019
@tbg
Member

tbg commented Feb 12, 2019

I think it's worth revisiting the limits in the short term. Preemptive snapshots are often recovery snapshots and Raft snapshots shouldn't play a large role in practice (assuming absence of the anomalous conditions in which they're plentiful).

That suggests that instead of throttling based on snapshot type, we should throttle by the "purpose" of the snapshot. But to be honest I'm not sure the two classes of throttling buy us much, and in particular the 2MB/s limit seems bad.

@andreimatei
Contributor

I've got an opinion, but I don't know if I agree with it. What if we stopped splitting when rebalancing is backed up? That way, writes would eventually be throttled by the range size check (which, I think, currently attempts to perform a split itself, but we'd turn that off). The idea being that splits without rebalancing only serve leaseholder rebalancing purposes, so if we're not about to rebalance the newly created ranges, why even split?
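To make the idea concrete, a hypothetical sketch (the names and the backlog signal are invented; this is not real split-queue code):

```go
package main

import "fmt"

// splitDecision is a made-up bundle of the inputs the split queue would need.
type splitDecision struct {
	rangeBytes       int64 // current range size
	maxRangeBytes    int64 // configured size-based split threshold
	rebalanceBacklog int   // replicas waiting on rebalance/snapshot transfer
	backlogThreshold int   // above this, stop manufacturing new ranges
}

// shouldSizeSplit suppresses size-based splits while rebalancing is backed up,
// so that the existing range-size backpressure eventually throttles the
// writers instead of the cluster creating ranges it cannot move.
func shouldSizeSplit(d splitDecision) bool {
	if d.rebalanceBacklog > d.backlogThreshold {
		return false
	}
	return d.rangeBytes > d.maxRangeBytes
}

func main() {
	d := splitDecision{
		rangeBytes:       96 << 20, // 96 MiB, over a 64 MiB threshold
		maxRangeBytes:    64 << 20,
		rebalanceBacklog: 500,
		backlogThreshold: 100,
	}
	fmt.Println(shouldSizeSplit(d)) // false: splitting pauses until the backlog drains
}
```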

@bdarnell
Contributor Author

Blocking splits is something that had occurred to me too, but given our history of problems in the other direction (ranges that get too large are unable to rebalance), I think it might create a cycle in the queue/subsystem dependencies that could get wedged. And I think in this case there were a lot of ranges (1000+?) that would have to fill up individually before we'd finally throttle writes to the point that rebalancing could keep up.

@andreimatei
Contributor

Yeah...

cc @ajwerner for his general interest and work around admission control

I think the most interesting thing here is understanding better how egregious our behavior was and, if it was indeed very egregious, why exactly a node completely "failed".
Do you have info on the exact workload (queries and schema)?
How many inserts were being executed concurrently when the node "failed" - was it tens or tens of thousands?
What exactly was the symptom at the end? None of the inserts were returning? If so, what did the goroutine dump show around command evaluation?
And I think you mentioned it took an hour for the test to get into this complete node failure state. Any idea why it took so long / what might have degraded over time? Or perhaps was load increased over time? Like, why had we been (presumably) making progress over 1h?

@bdarnell
Contributor Author

Do you have info on the exact workload (queries and schema)?

Yes. One table with a UUID primary key and three secondary indexes. Two of the indexes were on timestamp columns (the third on an integer), but the values for all indexed columns were being chosen randomly in this test so there was no hotspot on the timestamp indexes.

The queries were single inserts in implicit transactions.
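For completeness, a rough reconstruction of that workload shape (the table and column names are invented; only the shape matches: UUID primary key, three secondary indexes, uniformly random indexed values, single-row inserts in implicit transactions):

```go
package main

import (
	"database/sql"
	"log"
	"math/rand"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// UUID primary key plus three secondary indexes: two timestamps, one integer.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS load_test (
		id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
		ts1 TIMESTAMP,
		ts2 TIMESTAMP,
		n   INT,
		INDEX (ts1),
		INDEX (ts2),
		INDEX (n)
	)`); err != nil {
		log.Fatal(err)
	}

	// Single-row INSERTs in implicit transactions; all indexed values are
	// drawn uniformly at random, so no index has a hotspot.
	for i := 0; i < 100000; i++ {
		if _, err := db.Exec(
			`INSERT INTO load_test (ts1, ts2, n) VALUES ($1, $2, $3)`,
			time.Unix(rand.Int63n(1e9), 0),
			time.Unix(rand.Int63n(1e9), 0),
			rand.Int63(),
		); err != nil {
			log.Fatal(err)
		}
	}
}
```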

[image: graph of query traffic during the test]

Ignore the "select" portion on the left; that's a separate test. We're concerned about the period from 19:15 to 21:30 (the right portion of the graph was also a separate test, with the clients throttled to a lower rate). The test ramped up to 20k, experienced a brief interruption at 20:30, and then fell off a cliff at 21:00.

What exactly was the symptom at the end? None of the inserts were returning? If so, what did the goroutine dump show around command evaluation?

The inserts didn't completely stop returning, but latency (and variance) increased (to minutes in the worst case). We only learned of the incident after the fact, so no goroutine dumps while it was happening.

And I think you mentioned it took an hour for the test to get into this complete node failure state. Any idea why it took so long / what might have degraded over time? Or perhaps was load increased over time? Like, why had we been (presumably) making progress over 1h?

The insert workload was uniform, so we saw that every range in this table split at about the same time. I think (although it's hard to be sure at this point) that the 20:30 outage corresponded to the set of splits that increased the number of ranges in this table to 512, and the 21:00 outage was when it went to 1024. This is part of why I blame the raft scheduler - until the non-quiesced range count had ramped up, there weren't enough active ranges to occupy enough threads.
