Unexpected behaviour in a split-brain scenario #20241

Closed
sharbov opened this issue Nov 23, 2017 · 22 comments · Fixed by #20752
Labels
C-question A question rather than an issue. No code/spec/doc change needed. O-community Originated from the community

Comments

@sharbov commented Nov 23, 2017

Hey,

We're considering using CockroachDB as our product's database.
As part of our acceptance tests we've created a split-brain scenario to see how cockroach deals with it.

Test Setup

The test uses a cockroach-db cluster defined across two data centers:
DC1: consists of 3 roaches - roach-0, roach-1, roach-2.
DC2: consists of 2 roaches - roach-3, roach-4.
Each of the data centers has its own private network, named dc1 & dc2, and a shared network named shared.

The test starts with all of the roaches connected to the shared network and with the cluster in a healthy state.

(See the attached docker-compose.yaml)

Test Flow

  1. Run a sanity check using all of the roaches to validate that the cluster works properly.
  2. Disconnect all of the roaches from the shared network - effectively creating two separate groups.
  3. Try to perform read & write operations using each one of the roaches (a simplified example is shown after this list).

Expectation: because we're using the default replication settings (3 replicas),
one of the data-centers should hold a majority (2 replicas),
and its roaches should still be able to read from & write to the DB.

  4. Reconnect the roaches to the shared network.
  5. Run a sanity check using all of the roaches to validate that the cluster recovered properly.
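
For illustration, the read & write probe boils down to something like this (simplified; the table name is just a placeholder, not the exact test code):

  CREATE TABLE IF NOT EXISTS sanity_check (k INT PRIMARY KEY, v STRING);
  UPSERT INTO sanity_check VALUES (1, 'ping');   -- write probe
  SELECT v FROM sanity_check WHERE k = 1;        -- read probe

While the shared network is down, we expect these statements to keep succeeding against the roaches in the majority data-center and to fail (or time out) against the minority side.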

Results

In most cases the test completes successfully; however, in some cases all of the roaches seem to hang indefinitely while the shared network is disconnected.

I've attached the roaches' log files and the docker-compose.yaml file.

Questions

  1. Is there a problem with our test cluster configuration?
  2. Are my base assumptions about the expected behaviour wrong?
  3. Is this a bug?

docker-compose.txt
roach-0.log
roach-1.log
roach-2.log
roach-3.log
roach-4.log

@mberhault (Contributor)

Without telling cockroach about your different zones, there is no way to know where the replicas will end up.
For example, here are some scenarios of where replicas may end up and what happens when you sever the shared connection:

  • all three replicas in dc1: r/w from dc2 will all fail
  • two in dc1, one in dc2: r/w from dc2 will fail
  • one in dc1, two in dc2: r/w from dc1 will fail

Depending on which ranges you're trying to talk to, different operations will fail.
For example: if the system ranges (holding metadata about databases and tables) are unavailable, you will not be able to create a table. If some of a table's ranges are unavailable, you may be able to operate on parts of the key space, but not all of it.

To better control replication, it is recommended to tell cockroach about your configuration using zone configurations (a rough example follows below).

Note that in a two-datacenter scenario, severing the connection between the two will always cause problems in at least one datacenter.
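
For example, the zone config mechanism looks roughly like this (1.x-era cockroach zone CLI; the values and the .default target are illustrative, not a specific recommendation for this cluster):

  # zone.yaml - replication factor plus an example locality constraint
  num_replicas: 3
  constraints: [+datacenter=dc1]

  # apply it to the default zone (or to a specific database/table):
  cockroach zone set .default --insecure -f zone.yaml

The constraint line is only there to show the syntax; the important part is that replica placement becomes something you declare rather than something left to chance.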

@sharbov (Author) commented Nov 23, 2017

Hey @mberhault,

Thanks for the prompt reply,

In our setup, we configured the locality based on the recommendation specified here
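
For context, our locality setup boils down to a per-node start flag along these lines (paraphrased, not copied verbatim from the attached compose file):

  # each roach's start command advertises its data-center
  command: start --insecure --locality=datacenter=dc1 --join=roach-0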

Do you think that what happened was that the majority of the system ranges was in one DC while the table we tried to access had its majority in the other DC? Is there a way to prevent this, so that in a split-brain scenario at least one DC would remain functional?

@mberhault (Contributor) commented Nov 23, 2017

One clear case that would fail is the following. Let's say your workload is against a table with primary keys going from a to z.

  • range [a-k) has a majority of replicas in dc1
  • range [k-z] has a majority of replicas in dc2

If you perform a transaction touching both key a and key z, neither datacenter can complete it, as it won't be able to get a majority on both keys.
A similar scenario can happen if something in dc2 needs to touch something with majority replicas in dc1, including system ranges.
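
Concretely, a transaction like this one would be stuck (kv is a hypothetical table keyed a through z):

  BEGIN;
  UPDATE kv SET v = v + 1 WHERE k = 'a';  -- needs the [a-k) range, majority in dc1
  UPDATE kv SET v = v + 1 WHERE k = 'z';  -- needs the [k-z] range, majority in dc2
  COMMIT;  -- with the link severed, neither side can reach quorum on both ranges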

When system ranges become unavailable, all sorts of things can go out of whack. I'm not sure we've properly tested this for resiliency, but even something like writing timeseries (done by every node) would be impacted, resulting in an unresponsive admin UI.

Unfortunately, it's not possible to lose one datacenter out of two and still have things work properly. You could switch to a 3-datacenter setup, and losing a single datacenter will keep the other two happy (as long as replicas are distributed across all three). However, the single disconnected datacenter will obviously not be able to make progress.

@sharbov (Author) commented Dec 11, 2017

Following your recommendation, I've set up a new environment with 3 data-centers (2 roaches each),
and indeed the cluster functioned well: when one of the data-centers was disconnected, the other two remained operational.

However, once I changed the environment to uneven cluster sizes (2-2-4), the behavior changed:
when the bigger data-center was disconnected from the shared network,
all of the data-centers hung on read & write operations.

I haven't changed the zone configuration, but I did use the locality setting.
Here's the docker compose yaml I've used:
docker-compose.txt

  • Is this the expected behavior?
  • Is there a way to ensure the remaining data-centers will stay operational?

Thanks

@a-robinson (Contributor)

Is this the expected behavior?

It definitely isn't the behavior that we want. Thanks for reporting back that you were able to create a problem.

I think this may be explainable by #17599 (for which a fix will be going in soon), but will try reproducing to verify.

Is there a way to ensure the remaining data-centers will stay operational?

It'll depend on what the root cause is. This is not something that you're supposed to have to do workarounds for.

@a-robinson (Contributor)

@shay-stratoscale: I'm trying to reproduce this now, and I'm hitting a very clear problem, but I don't know whether it's the same problem you're hitting.

@a-robinson (Contributor)

Whoops, I didn't mean to submit that comment. Here's the rest.

The problem I'm hitting is that the data isn't up-replicating off the first machine in the cluster. None of the other machines are getting any data stored on them. This is clear from the Replicas per Store graph:

[screenshot: Replicas per Store graph]

The "Simulated allocator runs" debug page (http://localhost:8000/_status/allocator/node/local) shows that there are no valid stores to up-replicate to:

{
  "dryRuns": [
    {
      "rangeId": "1",
      "events": [
        {
          "time": "2017-12-13T16:54:19.847899115Z",
          "message": "[n1,status] AllocatorAdd - missing replica need=3, have=1, priority=10001.00"
        },
        {
          "time": "2017-12-13T16:54:19.847904109Z",
          "message": "[n1,status] adding a new replica"
        },
        {
          "time": "2017-12-13T16:54:19.847933175Z",
          "message": "[n1,status] allocate candidates: []"
        },
        {
          "time": "2017-12-13T16:54:19.847961441Z",
          "message": "[n1,status] error simulating allocator on replica [n1,s1,r1/1:/{Min-System/}]: 0 of 0 stores with all attributes matching []; likely not enough nodes in cluster"
        }
      ]
    },
...

Running docker exec -it roach-0 ./cockroach debug gossip-values --insecure shows that none of the stores think they have any available disk space, which is quite remarkable.

...
"store:1": {StoreID:1 Attrs: Node:{NodeID:1 Address:{NetworkField:tcp AddressField:c8c708215b86:26257} Attrs: Locality:datacenter=dc-0 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=53 MiB, logicalBytes=17 MiB), ranges=19, leases=17, writes=331.78, bytesPerReplica={p10=0.00 p25=0.00 p50=70.00 p75=10914.00 p90=126170.00}, writesPerReplica={p10=0.02 p25=0.02 p50=0.06 p75=0.36 p90=3.25}}
"store:2": {StoreID:2 Attrs: Node:{NodeID:2 Address:{NetworkField:tcp AddressField:619c968d97bc:26257} Attrs: Locality:datacenter=dc-0 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=64 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"store:3": {StoreID:3 Attrs: Node:{NodeID:3 Address:{NetworkField:tcp AddressField:b9a6bf89519e:26257} Attrs: Locality:datacenter=dc-2 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=62 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"store:4": {StoreID:4 Attrs: Node:{NodeID:4 Address:{NetworkField:tcp AddressField:2e06def002e3:26257} Attrs: Locality:datacenter=dc-2 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=64 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"store:5": {StoreID:5 Attrs: Node:{NodeID:5 Address:{NetworkField:tcp AddressField:4f2c0c151985:26257} Attrs: Locality:datacenter=dc-2 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=62 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"store:6": {StoreID:6 Attrs: Node:{NodeID:6 Address:{NetworkField:tcp AddressField:325be9795d64:26257} Attrs: Locality:datacenter=dc-1 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=62 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"store:7": {StoreID:7 Attrs: Node:{NodeID:7 Address:{NetworkField:tcp AddressField:144963fc18f7:26257} Attrs: Locality:datacenter=dc-1 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=62 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"store:8": {StoreID:8 Attrs: Node:{NodeID:8 Address:{NetworkField:tcp AddressField:ded58101ea81:26257} Attrs: Locality:datacenter=dc-2 ServerVersion:1.1} Capacity:disk (capacity=59 GiB, available=0 B, used=61 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
"system-db": omitted

Either my Docker for Mac volume is totally full, or cockroach isn't playing nicely with the filesystem being used. Continuing to look.

@a-robinson (Contributor)

Yeah, that was just the Docker for Mac disk space being mostly full. Does your "Replicas per Store" graph look anything like this?

[screenshot: Replicas per Store graph]

The part of the system that moves data around the cluster to improve balance has an issue (#17971) in 1.1 with clusters like this one, where the number of nodes per locality differs by a lot; it ends up rebalancing data back and forth.

I assume that what happened is that when you disconnected the largest of the three datacenters, it happened to hold 2 of the 4 replicas of an important range, which meant the remaining 2 datacenters didn't have a quorum between them (because a range with 4 replicas requires 3 to be available for quorum). In the steady state we maintain an odd number of replicas for each range, but when rebalancing from one node to another we add the new node before removing the old one, creating a period of fragility in which losing a datacenter can cost a range its quorum.
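
For reference, the quorum math behind that:

  quorum(n) = floor(n/2) + 1
  quorum(3) = 2   -- losing a datacenter that holds 1 of 3 replicas still leaves a majority
  quorum(4) = 3   -- losing a datacenter that holds 2 of 4 replicas leaves only 2, below quorum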

The latter problem is something we want to fix (#12768), but is quite a tricky project. The former problem has been fixed since the release of 1.1 (#18364), but has not been cherry-picked because we tend to only cherry-pick small, self-contained fixes and this problem required a larger change.

@shay-stratoscale If you think you're likely to be running datacenters with very different numbers of nodes before our next release in the spring, I can shrink that change down to something we can backport into our next 1.1 patch release. cc @bdarnell for his thoughts on cherry-picking some form of #18364 as well.

@bdarnell (Contributor)

#18364 doesn't look too bad to me. Does that apply cleanly or does it depend on other allocator changes?

Have we tested it in production with imbalanced localities?

@a-robinson (Contributor)

#18364 actually doesn't appear to be sufficient for this case on its own. I'm working on some additional improvements before seeing whether it applies cleanly.

Have we tested it in production with imbalanced localities?

No, not in production.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 14, 2017
If the first target attempted was rejected due to the simulation
claiming that it would be immediately removed, we would reuse the
modified `rangeInfo.Desc.Replicas` that had the target added to it,
messing with future iterations of the loop.

Also, we weren't properly modifying the `candidates` slice, meaning that
we could end up trying the same replica multiple times.

I have a test for this, but it doesn't pass yet because the code in cockroachdb#18364
actually isn't quite sufficient for fixing cases like cockroachdb#20241. I'll send
that out tomorrow once I have a fix done.

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 15, 2017
Skipping the simulation when raftStatus.Progress is nil can make
for undesirable thrashing of replicas, as seen when testing cockroachdb#20241.
It's better to run the simulation without properly filtering replicas
than to not run it at all.

Release note: None
@a-robinson (Contributor)

Opened up #20752 to more fully fix cases like this, and #20751 to track refactoring some of this code to align it better with how a human would think about it.

@shay-stratoscale I'm definitely interested in whether this is an important deployment pattern for you guys. And thanks again for the nice Compose file!

a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 15, 2017
Fixes cockroachdb#20241

Release note (bug fix): avoid rebalance thrashing when localities have
very different numbers of nodes
@sharbov (Author) commented Dec 17, 2017

Hey @a-robinson, we appreciate your help on this matter.
Unfortunately this is an important deployment pattern for us;
we can't ensure the number of nodes in each datacenter beforehand.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 18, 2017
Skipping the simulation when raftStatus.Progress is nil can make
for undesirable thrashing of replicas, as seen when testing cockroachdb#20241.
It's better to run the simulation without properly filtering replicas
than to not run it at all.

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 18, 2017
Fixes cockroachdb#20241

Release note (bug fix): avoid rebalance thrashing when localities have
very different numbers of nodes
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 19, 2017
To make cases like cockroachdb#20241 easier to test against in the future.
Both configs perform reasonably on master (although I didn't test with
stats-based rebalancing enabled).

Release note: None
@a-robinson (Contributor)

Thanks for letting us know, @shay-stratoscale. I've opened #20934 as a possible cherry-pick of the necessary changes to the 1.1 release branch. v1.1.4 will be released early in January. If you're just doing testing for now, I'd recommend sticking with unstable releases in the meantime.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 20, 2017
Skipping the simulation when raftStatus.Progress is nil can make
for undesirable thrashing of replicas, as seen when testing cockroachdb#20241.
It's better to run the simulation without properly filtering replicas
than to not run it at all.

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 20, 2017
Fixes cockroachdb#20241

Release note (bug fix): avoid rebalance thrashing when localities have
very different numbers of nodes
@a-robinson (Contributor)

@shay-stratoscale this will be fixed in the v1.1.4 release, which is planned for January 8.

@sharbov (Author) commented Dec 28, 2017

Many thanks @a-robinson

@sharbov (Author) commented Jan 21, 2018

Hey,

I updated CockroachDB to v1.1.4 and tried the split-brain test scenario again.
Unfortunately the failure was reproduced:
when the larger DC (2) was disconnected, the remaining DCs (0 & 1) malfunctioned.

I've attached the compose file & the logs
docker-compose.yaml.txt
roach-0.log
roach-1.log
roach-2.log
roach-3.log
roach-4.log
roach-5.log
roach-6.log
roach-7.log

@benesch reopened this Jan 22, 2018
@benesch (Contributor) commented Jan 22, 2018

@a-robinson, you have any time to look at this again?

@a-robinson (Contributor)

Huh, that's very weird. I'll take a look again soon.

@a-robinson (Contributor) commented Jan 23, 2018

I can't reproduce this every time, only sometimes. I assume that's true for you as well?

When I do reproduce it, the problem is that the nodes in the big datacenter (dc-2) came up before the nodes in the other two datacenters, and so the starting node considers it better to replicate to them, even if it's not great for diversity, than to stay with only 1 or 2 replicas. Once the other nodes come up, we'll move some of the replicas over to them, but only until the number of replicas on each node is roughly balanced. We don't continue moving replicas to improve the diversity of the remaining few ranges, because in v1.1 we don't consider improving diversity sufficient motivation to initiate a rebalance.

We've fixed this already for the upcoming v2.0 release (#19489). It's again something that could potentially be cherry-picked back to the v1.1 branch, but it would involve pulling back more code than last time (#19489 and #19486), and it is less necessary than last time, because there's an easy way to avoid it.

A way to avoid this is to make sure that all the nodes are up and running before the cluster starts replicating data around. To do this, you can give all the nodes, including the first node, a join flag. Once all the nodes are running, you then run the cockroach init command on any of the nodes in order to initialize things. This is documented at https://www.cockroachlabs.com/docs/stable/initialize-a-cluster.html, and is the recommended approach for bringing up new clusters.

For example, I tested this out a few times using this file:

docker-compose.yaml.txt

I added a join flag to node 1, added a couple of nodes to all the join flags, and removed the dependency on the first node's health (because nodes don't register as healthy until the cluster has been initialized); otherwise everything works the same. Once all the containers are running, I simply run docker exec -it roach-0 /cockroach/cockroach init --insecure to initialize the cluster.
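
In outline (paraphrased rather than copied from the attachment; the image tag and network names are just what this thread suggests), the relevant bits look like:

  roach-0:
    image: cockroachdb/cockroach:v1.1.4
    # every node, including the first, gets a --join flag listing several peers
    command: start --insecure --locality=datacenter=dc-0 --join=roach-0,roach-1,roach-2
    networks: [dc-0, shared]
  # ... the other roaches are defined the same way, each with its own locality ...

  # once all the containers are up, initialize the cluster exactly once:
  docker exec -it roach-0 /cockroach/cockroach init --insecure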

Thanks for bringing this up, and I'm sorry that the obvious approach had an ugly roadblock for you. Let me know if I've misunderstood your issue or if using the init command doesn't work for you.

@sharbov (Author) commented Jan 23, 2018

Hey @a-robinson,
Thanks for the prompt reply,

I took your advice and refactored our tests & the compose file, and indeed the tests passed successfully.

If this is indeed the recommended approach for bringing up new clusters, I would suggest updating this manual: https://www.cockroachlabs.com/docs/stable/start-a-local-cluster-in-docker.html

I do wonder whether or not this is a sufficient fix.

When setting up a multi-data-center deployment, we cannot ensure that all the roaches in all data centers will be in a ready state before the first init command. For example, one can imagine a scenario in which you want to extend an existing deployment after it has been initialized. Sticking with the original replica placement puts you in a risky position in which a single data-center failure might affect all of them.

  1. Is there a way to trigger a rebalance operation between the different data centers manually?
  2. When is v2 scheduled to be released?

@a-robinson (Contributor)

When setting up a multi-data-center deployment, we cannot ensure that all the roaches in all data centers will be in a ready state before the first init command. For example, one can imagine a scenario in which you want to extend an existing deployment after it has been initialized. Sticking with the original replica placement puts you in a risky position in which a single data-center failure might affect all of them.

In the case of extending an existing deployment after it's been initiated, the situation is different in one important way: you're (presumably) increasing from 3 (or more) datacenters to 4 (or more) datacenters. In such cases, your data should already be spread such that there's no more than one copy of a range in any given datacenter. Once in that state, v1.1 won't move your data in a way that will put more than one copy in a single datacenter.

This does still leave open potential problems when up-replicating from 1 datacenter (or 2 datacenters) to 3 datacenters. That case could potentially leave 2 copies of a range in the original datacenter in v1.1.

We're somewhat apprehensive about cherry-picking those changes into a v1.1.5 patch release, both since they're fairly large behavior changes and because they could cause a fair amount of data movement on upgrade, which most operators wouldn't expect from a patch release. If this is still a blocker for you, we'll reconsider, or you could build your own binary/image with those two PRs cherry-picked onto the release-1.1 branch.

  1. Is there a way to trigger a rebalance operation between the different data centers manually?

Not currently, no.

  2. When is v2 scheduled to be released?

It's currently scheduled for the first week in April.

@bdarnell (Contributor)

Is there a way to trigger a rebalance operation between the different data centers manually?

There is a way to do this indirectly: For each node in the first datacenter (one at a time), run cockroach node decommission --wait on it to drain all replicas off of it, then cockroach node recommission to bring it back to normal operation. This will force the rebalancer to take action and it should end up with everything distributed across the datacenters.
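
Roughly, for each node in that datacenter in turn (node ID 1 here is illustrative, and the exact flag spelling may differ between versions - check cockroach node decommission --help on your binary):

  cockroach node decommission 1 --wait=all --insecure --host=roach-0   # drain all replicas off node 1
  cockroach node recommission 1 --insecure --host=roach-0              # return it to normal duty
  # repeat with the next node once rebalancing has settled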
