Documentation: enumerate self-hosted etcd operator failure scenarios #257

philips · 2017-04-18T16:33:08Z

Currently we can bring up self-hosted etcd operator setups, but we don't document failure or recovery scenarios even though we handle many of them.

master cluster down and power-on
master cluster API server failure
disk loss failure of entire master cluster
pod checkpoint checkpoints bad versions

cc @jbeda @justinsb @luxas

philips · 2017-04-18T16:33:22Z

@justinsb what others am I missing?

philips · 2017-04-18T16:34:23Z

xref kubernetes-retired/bootkube#432

justinsb · 2017-04-27T06:51:01Z

That's a good start. A few more off-the-top-of-my-head:

power-off recovery during or soon after etcd2 -> etcd3 upgrade (or any major failure), where "bootstrap" version is older than local version
etcd running but not responding to queries (e.g. disk full?)
apiserver runs but does not make progress after etcd operation
k-c-m runs but does not make progress after etcd operation
kube-scheduler runs but does not make progress after etcd operation
kubelet upgrade runs but does not make progress after etcd operation

(These are the standard gotchas of self-hosting, but I guess there's a particularly likely scenario for etcd upgrades)

understanding recovery semantics in terms of data loss from catastrophic failure scenario, and giving users a choice as to whether they prefer downtime or data-loss, or at least define what choice has been made
recovery from SSL key expiry
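
For the "etcd running but not responding to queries" case, here is a rough sketch of how one might tell a member that is down apart from one that accepts connections but never answers. It assumes the member serves etcd's HTTP `/health` endpoint on a plain-HTTP client URL; a TLS-secured cluster would also need client certificates, which this sketch does not handle. The function name `probe_etcd` and the four status strings are made up for illustration.

```python
import json
import socket
import urllib.error
import urllib.request


def probe_etcd(endpoint, timeout=2.0):
    """Classify an etcd client endpoint.

    Returns one of the illustrative labels:
      'healthy'      - /health answered and reported health == true
      'unhealthy'    - /health answered with an error status or an unexpected body
      'unresponsive' - the connection or the read timed out
      'down'         - the connection failed outright (e.g. refused)
    """
    url = endpoint.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError:
        # The server answered, but with a non-2xx status.
        return "unhealthy"
    except (socket.timeout, TimeoutError):
        # Connected, but no response arrived within the timeout.
        return "unresponsive"
    except urllib.error.URLError as err:
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "unresponsive"
        return "down"
    except ValueError:
        # The body was not valid JSON.
        return "unhealthy"
    except OSError:
        return "down"
    if isinstance(body, dict) and str(body.get("health")).lower() == "true":
        return "healthy"
    return "unhealthy"
```

Calling `probe_etcd("http://127.0.0.1:2379")` against a local member would return one of the four labels. The "unresponsive" result is the interesting one here, since a simple liveness check on the process or the port would not catch it.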

philips · 2017-05-16T01:04:13Z

There are recovery tools being built in bootkube now: https://github.com/kubernetes-incubator/bootkube#recover-a-downed-cluster

philips · 2017-05-16T01:39:05Z

Quick notes on places to start on writing these docs:

master cluster down and power-on: uses pod checkpointing and network checkpointing
master cluster API server failure: if non-HA you need to recover the load balancer or pods via static pods
disk loss failure of entire master cluster: you need to recover from backups, see bootkube recovery
pod checkpoint checkpoints bad versions: you need to manually fix the static manifests checkpointed in /etc/kubernetes/inactive-manifests
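
As a starting point for the last item, a small sketch that lists which container images each checkpointed manifest refers to, so a bad version can be spotted before editing the file by hand. It assumes the checkpointed manifests are plain YAML or JSON files with an `image` field; the function name `list_checkpointed_images` is hypothetical, and only the directory path comes from the note above.

```python
import re
from pathlib import Path

# Directory named in the note above; pass another path when testing.
DEFAULT_DIR = Path("/etc/kubernetes/inactive-manifests")

# Matches `image: repo/name:tag` (YAML) and `"image": "repo/name:tag"` (JSON),
# but not longer keys such as `imagePullPolicy`.
IMAGE_RE = re.compile(r'\bimage"?\s*:\s*"?([^\s",]+)')


def list_checkpointed_images(manifest_dir=DEFAULT_DIR):
    """Return {file name: sorted list of image references found in that file}."""
    found = {}
    for path in sorted(Path(manifest_dir).glob("*")):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        found[path.name] = sorted(set(IMAGE_RE.findall(text)))
    return found
```

Printing the result of `list_checkpointed_images()` on a master node gives a quick per-file view of the checkpointed image tags. This is a text scan, not a real manifest parser, so treat the output as a hint about where to look.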

xiang90 · 2017-05-16T15:26:51Z

@philips

I agree it is a good idea to write docs about handling the failure cases for self-hosted etcd. The items you listed are a good start! I believe the doc should focus on the differences between self-hosted and external etcd, and highlight the potential risks self-hosted etcd might introduce and how we solve them.

@justinsb

A lot of the items you listed are not really specific to self-hosted etcd, in my opinion.

etcd running but not responding to queries (e.g. disk full?)
apiserver runs but does not make progress after etcd operation
k-c-m runs but does not make progress after etcd operation
kube-scheduler runs but does not make progress after etcd operation
kubelet upgrade runs but does not make progress after etcd operation
recovery from SSL key expiry

If you operate etcd manually, you might have these issues too. They are not introduced by self-hosted etcd.

power-off recovery during or soon after etcd2 -> etcd3 upgrade (or any major failure), where "bootstrap" version is older than local version

understanding recovery semantics in terms of data loss from catastrophic failure scenario, and giving users a choice as to whether they prefer downtime or data-loss, or at least define what choice has been made

These two are relevant. We will cover them when writing the doc. But I would suggest giving self-hosted etcd a try if you are interested, so we can discuss it in more depth.

zbwright · 2017-06-06T18:49:59Z

@radhikapc is working on etcd docs now. assigning to her, @xiang90

Quentin-M · 2017-06-19T15:49:57Z

Hi @radhikapc, any update on that one?

xiang90 · 2017-06-21T19:49:59Z

@Quentin-M hongchao or I need to write something similar to https://github.com/coreos/etcd/blob/master/Documentation/op-guide/failures.md. Then @radhikapc can start helping to clean things up. We will get started after finishing up the TLS thing.


justinsb · 2017-09-05T15:50:18Z

Where was this moved to?

s-urbaniak added the kind/documentation label Apr 25, 2017

sym3tri added this to the Overall cleanup and stability milestone May 2, 2017

justinsb mentioned this issue May 2, 2017

RFE: Boot-strapping etcd cluster + operator kubernetes/kubeadm#254

Closed

sym3tri assigned zbwright and xiang90 May 11, 2017

justinsb mentioned this issue May 13, 2017

Add etcd-operator to kubeadm kubernetes/kubernetes#45665

Closed

sym3tri modified the milestones: Sprint 2: Overall cleanup and stability, Sprint 3: Continued Test Automation May 23, 2017

jamiehannaford mentioned this issue May 24, 2017

Finding a solution for etcd kubernetes/kubeadm#277

Closed

zbwright assigned radhikapc and xiang90 and unassigned xiang90 and zbwright Jun 6, 2017

robszumski modified the milestones: Sprint 3: Stability & Test Automation, Sprint 4 Jun 19, 2017

radhikapc pushed a commit to radhikapc/tectonic-installer that referenced this issue Jun 28, 2017

Documentation/troubleshooting/etcd_op_recovery.md : Disaster recovery…

a8d1c1d

… documentation for etcd operator. ref: coreos#257

radhikapc mentioned this issue Jun 28, 2017

[Do Not Merge- Under Dev] Documentation/troubleshooting/etcd_op_recovery.md : Disaster recovery documentation for etcd operator #1221

Closed

radhikapc pushed a commit to radhikapc/tectonic-installer that referenced this issue Jun 28, 2017

Documentation/troubleshooting/etcd_op_recovery.md : Disaster recovery…

167c2cb

… documentation for etcd operator. ref: coreos#257

sym3tri modified the milestones: Sprint 4, Sprint 5 Jun 30, 2017

sym3tri removed this from the Sprint 5 milestone Aug 23, 2017

sym3tri added the migrate-issue label Aug 23, 2017

sym3tri closed this as completed Sep 5, 2017
