storage: Backpressure for extreme rebalancing needs #34752
Note that even doing a slow ramp-up of the test wouldn't necessarily solve this. Load-based rebalancing doesn't kick in until a certain threshold of load; until then we only have count-based rebalancing. It would be a delicate balance to ramp up the load enough to trigger LBR but avoid this problem. Another option would be to make recently-split ranges "repel" each other even in the absence of load, although we'd have to be careful that this heuristic wouldn't fight with the merge queue's attempt to attract adjacent ranges together.
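If we went the "repel" route, it might look something like the sketch below (purely illustrative, with invented names and thresholds, not existing allocator code): treat a recently split range that still shares a store with its split sibling as a soft signal to move it, and let the signal expire so it can't fight the merge queue indefinitely.

```go
// Hypothetical sketch of "recently split ranges repel each other".
// All names and thresholds here are invented for illustration.
package allocator

import "time"

// splitRepelWindow is how long after a split the repulsion applies; once it
// expires, the merge queue is free to pull adjacent ranges back together.
const splitRepelWindow = 10 * time.Minute

type rangeInfo struct {
	splitTime      time.Time // when this range was created by a split
	storeID        int       // store currently holding this range's replica of interest
	siblingStoreID int       // store holding the sibling range from the same split
}

// shouldRepel reports whether the allocator should try to move this range
// away from its split sibling even in the absence of load.
func shouldRepel(r rangeInfo, now time.Time) bool {
	return now.Sub(r.splitTime) <= splitRepelWindow && r.storeID == r.siblingStoreID
}
```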
I think it's worth revisiting the limits in the short term. Preemptive snapshots are often recovery snapshots, and Raft snapshots shouldn't play a large role in practice (assuming the absence of the anomalous conditions in which they're plentiful). That suggests that instead of throttling based on snapshot type, we should throttle by the "purpose" of the snapshot. But to be honest, I'm not sure the two classes of throttling buy us much, and in particular the 2MB/s limit seems bad.
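For illustration, throttling by purpose rather than by snapshot type could reduce to something like this (hypothetical names, not the actual store code); the point is that a recovery snapshot gets the larger budget regardless of whether it happens to be sent preemptively or via Raft:

```go
// Hypothetical sketch: choose the rate limit from the reason a snapshot is
// sent, not from whether it is a preemptive or a Raft-initiated snapshot.
package storage

type snapshotPurpose int

const (
	purposeRebalance snapshotPurpose = iota // shedding a replica for balance
	purposeRecovery                         // restoring redundancy after a replica is lost
)

// snapshotRateLimit returns a bytes-per-second budget for a snapshot.
// Recovery snapshots get the higher budget because they restore redundancy;
// rebalance snapshots are the ones we are willing to slow down.
func snapshotRateLimit(p snapshotPurpose, rebalanceRate, recoveryRate int64) int64 {
	if p == purposeRecovery {
		return recoveryRate
	}
	return rebalanceRate
}
```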
I've got an opinion, but I don't know if I agree with it. What if we stopped splitting when rebalancing is backed up? That way, writes would eventually be throttled by the range-size check (which currently, I think, attempts to perform a split itself, but we'd turn that off). The idea being that splits without rebalancing only serve leaseholder-rebalancing purposes, so if we're not about to do that for the newly created ranges, why even split?
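A minimal sketch of that gate, assuming we had some measure of rebalance backlog to consult (the names and threshold are invented for illustration):

```go
// Hypothetical: gate size-based splits on how far behind rebalancing is, so
// the range-size backpressure on writes kicks in instead of producing more
// ranges that can't be moved anywhere.
package storage

// maxPendingRebalances is an invented threshold; in practice it would likely
// be a cluster setting or derived from snapshot/replicate queue state.
const maxPendingRebalances = 32

func shouldAllowSplit(pendingRebalances int) bool {
	return pendingRebalances < maxPendingRebalances
}
```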
Blocking splits is something that had occurred to me too, but given our history of problems in the other direction (ranges that get too large are unable to rebalance) I think it might create a cycle in the queue/subsystem dependencies that could get wedged up. And I think in this case there were a lot of ranges (1000+?) that would have to fill up individually before we'd finally throttle writes to the point that rebalancing could keep up.
Yeah... cc @ajwerner for his general interest and work around admission control. I think the most interesting thing here is understanding better how egregious our behavior was and, if it was indeed very egregious, why exactly a node completely "failed".
Yes. One table with a UUID primary key and three secondary indexes. Two of the indexes were on timestamp columns (the third on an integer), but the values for all indexed columns were being chosen randomly in this test, so there was no hotspot on the timestamp indexes. The queries were single inserts in implicit transactions. Ignore the "select" portion on the left; that's a separate test. We're concerned about the period from 19:15 to 21:30 (the right portion of the graph was also a separate test, with the clients throttled to a lower rate). The test ramped up to 20k, experiencing a brief interruption at 20:30 and then falling off a cliff at 21:00.
The inserts didn't completely stop returning, but latency (and variance) increased (to minutes in the worst case). We only learned of the incident after the fact, so no goroutine dumps while it was happening.
The insert workload was uniform, so we saw that every range in this table split at about the same time. I think (although it's hard to be sure at this point) that the 20:30 outage corresponded to the set of splits that increased the number of ranges in this table to 512, and the 21:00 outage was when it went to 1024. This is part of why I blame the raft scheduler - until the non-quiesced range count had ramped up, there weren't enough active ranges to tie up all of its threads.
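As a toy model of that claim (not the actual raftScheduler code, just an illustration of a fixed-size worker pool): once the number of non-quiesced ranges producing work exceeds the worker count, per-range latency grows with the backlog instead of staying flat.

```go
// Toy model: a fixed pool of workers draining "ready" ranges. With 1024
// active ranges and 8 workers, each worker must process ~128 ranges per
// pass, so per-range scheduling delay scales with the backlog.
package main

import (
	"fmt"
	"sync"
)

func main() {
	const workers = 8         // fixed scheduler concurrency (illustrative)
	const activeRanges = 1024 // non-quiesced ranges all producing Raft work

	work := make(chan int, activeRanges)
	for r := 0; r < activeRanges; r++ {
		work <- r
	}
	close(work)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range work {
				// each range's Raft processing happens here, serially per worker
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d ranges drained by %d workers (~%d ranges per worker per pass)\n",
		activeRanges, workers, activeRanges/workers)
}
```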
A customer performed a load test by truncating a table and then running a large number of INSERT statements (uniformly distributed). The newly truncated table started from a single range and split and rebalanced from there. The problem is that rebalancing is throttled (2MB/s by default until #34747) but the incoming write traffic is not. This is not an apples-to-apples comparison, but the SQL connections were receiving a total of 9MB/s of queries. Initially all of those writes went to the single range in the truncated table. The node(s) holding that range were receiving data faster than they could rebalance ranges away. That node grew until it eventually failed, even though there would have been plenty of capacity if the ranges had been more evenly distributed.
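A rough back-of-envelope, ignoring replication amplification, index fan-out, and compression (so the numbers are only illustrative): with ingest at ~9MB/s and rebalancing capped at 2MB/s, the hot node can only fall further behind.

```go
// Illustrative arithmetic only; real growth depends on replication factor,
// index fan-out, and compression.
package main

import "fmt"

func main() {
	const (
		ingestMBps    = 9.0 // aggregate SQL write traffic observed in the test
		rebalanceMBps = 2.0 // default snapshot rebalance rate at the time
	)
	net := ingestMBps - rebalanceMBps
	fmt.Printf("net growth on the hot node: ~%.0f MB/s (~%.0f GB/hour)\n",
		net, net*3600/1024)
}
```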
Note that pre-splitting wouldn't have helped without a working SCATTER command (#23358). Nodes were well-balanced in terms of number of ranges (from the pre-truncation table whose ranges were sticking around); there would be no pressure to move the split ranges until load began.
Ultimately it would have been better in this case to throttle writes for a while to allow the rebalancing to proceed. I'm not sure exactly how to tell when things have gotten bad enough that we should do this. It would be easier to buy some headroom by increasing the snapshot throttle cluster settings (on a case-by-case basis, or should we change the defaults? The defaults were set back in #14718). There's also #14768 for a more dynamic rate limit.
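One possible shape for the more dynamic limit, sketched with invented names and constants (not the actual proposal in #14768): scale the per-store snapshot budget up as that store's fullness diverges from the cluster mean.

```go
// Hypothetical sketch of a dynamic snapshot rate limit: the further a store's
// disk usage is above the cluster average, the more bandwidth it may spend
// shedding replicas. Names and constants are invented for illustration.
package storage

const (
	baseRateBytesPerSec = 2 << 20  // 2 MiB/s floor (the old default)
	maxRateBytesPerSec  = 64 << 20 // cap so snapshots can't starve foreground traffic
)

// dynamicSnapshotRate returns a per-store send rate. storeUsed and
// clusterMeanUsed are disk-used fractions in [0, 1].
func dynamicSnapshotRate(storeUsed, clusterMeanUsed float64) int64 {
	excess := storeUsed - clusterMeanUsed
	if excess <= 0 {
		return baseRateBytesPerSec
	}
	// Linear ramp: a store 10% above the mean gets double the base budget.
	rate := int64(float64(baseRateBytesPerSec) * (1 + excess*10))
	if rate > maxRateBytesPerSec {
		return maxRateBytesPerSec
	}
	return rate
}
```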