Unexpected behaviour in a split-brain scenario #20241
Without telling cockroach about your different zones, there is no way to know where the replicas will end up.
Depending on which ranges you're trying to talk to, different operations will fail. To better control replication, it is recommended to tell cockroach about your configuration using zone configurations. Note that in a two-datacenter scenario, severing the connection between the two will always cause problems in at least one datacenter.
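As a rough illustration of that recommendation (not taken from this thread), the sketch below assumes a `datacenter` locality tier and the 1.x-era `cockroach zone set` CLI; the hostnames, join list, store path, and `mydb.mytable` target are all placeholders.

```sh
# Sketch only: advertise each node's datacenter so replicas get spread across
# localities. Hostnames, join list, and store path are placeholders.
cockroach start --insecure --background \
  --host=roach-0 --locality=datacenter=dc1 \
  --join=roach-0,roach-4 --store=/cockroach/cockroach-data

# Optionally, a zone configuration can constrain where a table's replicas live
# (YAML piped to `cockroach zone set`, per the 1.x CLI; exact constraint syntax
# is version-dependent).
echo 'num_replicas: 3
constraints: [datacenter=dc1]' | cockroach zone set mydb.mytable --insecure -f -
```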
Hey @mberhault, thanks for the prompt reply. In our setup, we configured the […] Do you think that what happened was that the […]
One clear case that would fail is the following. Let's say your workload is against a table with primary keys going from a to z.
If you perform a transaction touching key […]

When system ranges become unavailable, all sorts of things can go out of whack. I'm not sure we've properly tested this for resiliency, but even something like writing timeseries (done by every node) would be impacted, resulting in an unresponsive admin UI.

Unfortunately, it's not possible to lose one datacenter out of two and still have things work properly. You could switch to a 3-datacenter setup, and losing a single datacenter will keep the other two happy (as long as replicas are distributed across all three). However, the single disconnected datacenter will obviously not be able to make progress.
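To make the cross-range transaction point concrete, here is an illustrative sketch (the `kv` table and key values are made up, not from this thread): if the range holding one of the keys has lost its quorum, the whole transaction blocks, even though the other key's range is perfectly healthy.

```sh
# Hypothetical table and keys, for illustration only.
cockroach sql --insecure --host=roach-0 -e "
BEGIN;
UPDATE kv SET v = v + 1 WHERE k = 'aardvark';  -- range with quorum: fine on its own
UPDATE kv SET v = v + 1 WHERE k = 'zebra';     -- range without quorum: blocks the txn
COMMIT;"
```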
Following your recommendation, I've set up a new environment with 3 data centers (2 roaches each), […] However, once I changed the environment to uneven cluster sizes (2-2-4), […] I haven't changed the zone configuration, but I did use the locality setting.
Thanks
It definitely isn't the behavior that we want. Thanks for reporting back that you were able to create a problem. I think this may be explainable by #17599 (for which a fix will be going in soon), but will try reproducing to verify.
It'll depend on what the root cause is. This is not something that you're supposed to have to do workarounds for.
@shay-stratoscale: I'm trying to reproduce this now, and I'm hitting a very clear problem, but I don't know whether it's the same problem you're hitting.
Whoops, I didn't mean to submit that comment. Here's the rest.

The problem I'm hitting is that the data isn't up-replicating off the first machine in the cluster. None of the other machines are getting any data stored on them. This is clear from the […]

The "Simulated allocator runs" debug page (http://localhost:8000/_status/allocator/node/local) shows that there are no valid stores to up-replicate to:
Running […]

Either my Docker for Mac volume is totally full, or cockroach isn't playing nicely with the filesystem being used. Continuing to look.
Yeah, that was just the Docker for Mac disk space being mostly full. Does your "Replicas per Store" graph look anything like this?

The part of the system that moves data around the cluster to improve balance has an issue (#17971) in 1.1 with how well it handles clusters like this, where the number of nodes in localities differs by a lot, which causes it to keep rebalancing data back and forth.

I assume that what's happened is that when you disconnected the largest of the three datacenters, it happened to have 2 of the 4 replicas of an important range, which meant the remaining 2 datacenters didn't have a quorum between them (because a range with 4 replicas requires 3 to be available for a quorum). In the steady state we maintain only an odd number of replicas for each range, but when rebalancing from one node to another we add the new node before removing the old, making for a period of fragility in which losing the datacenter could get rid of the quorum.

The latter problem is something we want to fix (#12768), but is quite a tricky project. The former problem has been fixed since the release of 1.1 (#18364), but has not been cherry-picked because we tend to only cherry-pick small, self-contained fixes and this problem required a larger change.

@shay-stratoscale If you think you're likely to be running datacenters with very different numbers of nodes before our next release in the spring, I can shrink that change down to something we can backport into our next 1.1 patch release.

cc @bdarnell for his thoughts on cherry-picking some form of #18364 as well.
#18364 doesn't look too bad to me. Does that apply cleanly or does it depend on other allocator changes? Have we tested it in production with imbalanced localities?
#18364 actually doesn't appear to be sufficient for this case on its own. I'm working on some additional improvements before seeing whether it applies cleanly.
No, not in production.
If the first target attempted was rejected due to the simulation claiming that it would be immediately removed, we would reuse the modified `rangeInfo.Desc.Replicas` that had the target added to it, messing with future iterations of the loop. Also, we weren't properly modifying the `candidates` slice, meaning that we could end up trying the same replica multiple times. I have a test for this, but it doesn't pass yet because the code in cockroachdb#18364 actually isn't quite sufficient for fixing cases like cockroachdb#20241. I'll send that out tomorrow once I have a fix done. Release note: None
Skipping the simulation when raftStatus.Progress is nil can make for undesirable thrashing of replicas, as seen when testing cockroachdb#20241. It's better to run the simulation without properly filtering replicas than to not run it at all. Release note: None
Opened up #20752 to more fully fix cases like this, and #20751 to track refactoring some of this code to align it better with how a human would think about it. @shay-stratoscale I'm definitely interested in whether this is an important deployment pattern for you guys. And thanks again for the nice Compose file!
Fixes cockroachdb#20241 Release note (bug fix): avoid rebalance thrashing when localities have very different numbers of nodes
Hey @a-robinson, we appreciate your help on this matter.
To make cases like cockroachdb#20241 easier to test against in the future. Both configs perform reasonably on master (although I didn't test with stats-based rebalancing enabled). Release note: None
Thanks for letting us know, @shay-stratoscale. I've opened #20934 as a possible cherry-pick of the necessary changes to the 1.1 release branch. v1.1.4 will be released early in January. If you're just doing testing for now, I'd recommend sticking with unstable releases in the meantime.
@shay-stratoscale this will be fixed in the v1.1.4 release, which is planned for January 8.
Many thanks @a-robinson
Hey, I updated cockroach-db to v1.1.4 and tried the split-brain test scenario again, […] I've attached the compose file & the logs.
@a-robinson, do you have any time to look at this again?
Huh, that's very weird. I'll take a look again soon.
I can't reproduce this every time, only sometimes. I assume that's true for you as well?

When I do reproduce it, the problem is that the nodes in the big datacenter (dc-2) came up before the nodes in the other two datacenters, and so the starting node considers it better to replicate to them, even if it's not great for diversity, than to stay with only 1 or 2 replicas. Once the other nodes […]

A way to avoid this is to make sure that all the nodes are up and running before the cluster starts […]

For example, I tested this out a few times using this file: […] I added a join flag to node 1 and added a couple of nodes to all the join flags and removed the dependency on the first node's health (because nodes don't register as healthy until the cluster has been initialized), but otherwise everything works the same. Once all the containers are running, I then simply run […]

Thanks for bringing this up, and I'm sorry that the obvious approach had an ugly roadblock for you. Let me know if I've misunderstood your issue or if using the […]
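The elided command above isn't recoverable from this page, but a minimal sketch of the kind of workflow being described (assuming the `cockroach init` command introduced in 1.1, and the official image's binary path) might look like this:

```sh
# Sketch only: every service in the compose file gets a --join list (including
# the first node) and no health-based depends_on entries, so all containers can
# come up before the cluster is initialized.
docker-compose up -d

# Once every container is running, initialize the cluster through any one node.
# (The binary path assumes the official cockroachdb/cockroach image layout.)
docker-compose exec roach-0 /cockroach/cockroach init --insecure
```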
Hey @a-robinson, I took your advice and refactored our tests & the compose file, and indeed the tests passed successfully. If this is indeed the recommended approach for bringing up new clusters, I would suggest updating this manual: https://www.cockroachlabs.com/docs/stable/start-a-local-cluster-in-docker.html

I do wonder whether or not this is a sufficient fix. When setting up a multi-datacenter deployment we cannot ensure that all the roaches in all data centers will be in a ready state before the first […]
In the case of extending an existing deployment after it's been initiated, the situation is different in one important way: you're (presumably) increasing from 3 (or more) datacenters to 4 (or more) datacenters. In such cases, your data should already be spread such that there's no more than one copy of a range in any given datacenter. Once in that state, v1.1 won't move your data in a way that will put more than one copy in a single datacenter.

This does still leave open potential problems when up-replicating from 1 datacenter (or 2 datacenters) to 3 datacenters. That case could potentially leave 2 copies of a range in the original datacenter in v1.1.

We're somewhat apprehensive about cherry-picking those changes into a v1.1.5 patch release, both because they're fairly large behavior changes and because they could cause a fair amount of data movement on upgrade, which most operators wouldn't expect from a patch release. If this is still a blocker for you, we'll reconsider, or you could build your own binary/image with those two PRs cherry-picked onto the release-1.1 branch.
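For anyone who did want to go the build-your-own route, a rough sketch (not an official procedure; the commit SHAs from those PRs have to be looked up on GitHub, and the repo's documented build prerequisites are assumed) could look like:

```sh
# Rough sketch, not an official procedure. SHA_FROM_18364 / SHA_FROM_20752 must
# be set to the merged commits of those PRs; conflicts may need hand-resolving.
git clone https://github.com/cockroachdb/cockroach.git
cd cockroach
git checkout release-1.1
git cherry-pick "$SHA_FROM_18364" "$SHA_FROM_20752"
make build   # assumes the repo's documented build prerequisites are installed
```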
Not currently, no.
It's currently scheduled for the first week in April.
There is a way to do this indirectly: for each node in the first datacenter (one at a time), run […]
Hey,
We're considering using cockroach-db as our product DB.
As part of our acceptance tests we've created a split-brain scenario to see how cockroach deals with it.
Test Setup
The test uses a cockroach-db cluster defined across two data centers:

- DC1: consists of 3 roaches - `roach-0`, `roach-1`, `roach-2`.
- DC2: consists of 2 roaches - `roach-4`, `roach-5`.

Each of the data centers has its own private network, named `dc1` & `dc2`, and a shared network named `shared`. The test starts with all of the roaches connected to the `shared` network and in a cluster healthy state. (See the attached `docker-compose.yaml`.)
)Test Flow
Results
In most cases the test completes successfully; however, in some cases all of the roaches seem to hang indefinitely while the shared network is disconnected.
I've attached the roaches log files and the docker-compose.yaml files.
Questions
docker-compose.txt
roach-0.log
roach-1.log
roach-2.log
roach-3.log
roach-4.log