metadata: update policies #209

lucab · 2019-06-25T13:03:52Z

Once we start pushing out releases across all defined streams and architectures, we also have to define which upgrade edges (e.g. from FCOS version x to version y) are allowed, together with additional details on how the roll-out should be performed. There are a few points that need to be covered here.

Process definition: at some point in the release cycle (and later on, in case of further policing) a human should be able to follow a clearly documented process for specifying which existing OS versions can upgrade to the one currently being released (the same process would apply if we suddenly have to pause an edge or a whole release).

Upgradable streams: we set up multiple streams, some of which are meant to be experimental or fully-automated. To minimize the amount of human decision and manual actions, we have to define which streams support auto-upgrades.

Storage and versioning: update policies need to be consumed by the Cincinnati backend. As such, we should agree on a public format and a location in which to store them. It would be better if that system allowed us to track revisions and audit policy changes.

Policy rules: similar to the above, the Cincinnati backend can enforce any kind of complex policy rules. We should design and document the semantics of the policies we support, and how to configure them.

jlebon · 2019-06-26T13:48:44Z

we also have to define which upgrade edges (e.g. from FCOS version x to version y) are allowed

Yeah, I've been thinking about this a bit as well. If we make it strictly linear (i.e. each version x only has a single arrow outward to version x+1), that implies that e.g. a machine that was off for a few months would have to cycle through all the updates, right? This is less relevant to the clustered case than the single node case. Or maybe we can just say "don't do that".

We could also loosen up the policy so by default all nodes can upgrade to the latest release, except when we want a checkpoint (or "update barrier" to reuse the terminology from #83) through which all systems must go. The downside of course is that it makes upgrade testing much harder (or incomplete). Could also do a hybrid approach where we have a mandatory checkpoint every 8 releases or something to reduce the number of upgrade paths to test.
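To make the barrier idea concrete, here is a minimal sketch of how a next-hop could be picked under that model. It assumes releases in a stream can be numbered with plain integers; `next_target` and `periodic_barriers` are hypothetical names for illustration, not part of Cincinnati or any FCOS tooling.

```python
def next_target(current: int, latest: int, barriers: set[int]) -> int:
    """Return the release a node running `current` should upgrade to next.

    By default every node may jump straight to `latest`; a barrier that
    lies strictly between the two forces a stop there first.
    """
    pending = sorted(b for b in barriers if current < b < latest)
    return pending[0] if pending else latest


def periodic_barriers(latest: int, every: int = 8) -> set[int]:
    """Hybrid approach: a mandatory checkpoint every `every` releases."""
    return set(range(every, latest, every))
```

With barriers at 4 and 8 and a latest release of 10, a node on release 1 would go 1 -> 4 -> 8 -> 10, while a node already on release 8 would go straight to 10.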

lucab · 2019-06-27T12:26:41Z

Could also do a hybrid approach where we have a mandatory checkpoint every 8 releases or something to reduce the number of upgrade paths to test.

My initial approach would be to start from a linear chain, and augment as needed.
We can start from a simple chain, where every release marks the previous (sane) one in its stream as an edge source. Periodically, we introduce edges that allow bypassing many nodes at once. This also allows us to keep "checkpoint" releases as chokepoints.

It is easier (both for human sanity and tooling) to augment a simple graph, than to remove nodes/edges from an almost-fully-connected graph.
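As a rough illustration of "start linear, then augment", the sketch below builds a chain and adds one bypass edge. The release names and the helpers `linear_chain` and `shortest_path` are made up for this example and do not reflect how Cincinnati represents its graph.

```python
from collections import deque


def linear_chain(releases):
    """Each release lists the previous one in its stream as an edge source."""
    return {(a, b) for a, b in zip(releases, releases[1:])}


def shortest_path(edges, src, dst):
    """Breadth-first search for the upgrade path with the fewest hops."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(successors.get(path[-1], [])):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no upgrade path exists


releases = ["A", "B", "C", "D", "E"]
edges = linear_chain(releases)
print(shortest_path(edges, "A", "E"))  # walks the whole chain

# Augment: let nodes on A skip straight to D, which stays a chokepoint.
edges.add(("A", "D"))
print(shortest_path(edges, "A", "E"))  # now bypasses B and C
```

Because no added edge jumps over D, every path from A to E still passes through it, which is what keeps it usable as a checkpoint.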

lucab · 2019-06-27T12:41:09Z

Regarding storage and versioning, we have prior art from OpenShift here.

The approach is to attach policies (edges, throttling, paused releases, etc.) to each release node, as part of its metadata. Cincinnati is capable of parsing those special key-value pairs and of building/manipulating the graph on the fly based on them.

OpenShift is currently using Quay to host such metadata (as JSON objects embedded in container images), but we can simply re-use our bucket structure and store our policy in the "release metadata" document. It already contains a suitable updates section.

There are still open questions around how we version changes to that document, and how we sign it.

lucab · 2019-06-27T12:48:41Z

Regarding policy rules, Cincinnati currently implements a few and I'm thinking about adding a few more. All of them are designed to be stateless (on backend side). Policy list includes:

adding/removing oriented edges (by source, and by destination)
removing all incoming edges to a node
removing a node from the graph (and all associated edges)
fixed throttling, based on a client-provided threshold value and a constant value for each node
dynamic throttling, similar to the above but linearly increasing in a time-window defined by each node
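The two throttling rules in the list above could look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the [0, 1) range for the client value, the comparison direction, and the use of plain numeric timestamps are not taken from Cincinnati's implementation. The point is only that both checks are pure functions of the request and the node metadata, so the backend keeps no per-client state.

```python
def is_offered(client_threshold: float, node_value: float) -> bool:
    """Fixed throttling: compare the client-provided value to the node's.

    Assumed convention: the client sends a value in [0, 1) and the update
    is offered once the node's rollout value exceeds it.
    """
    return client_threshold < node_value


def dynamic_value(now: float, start: float, duration: float) -> float:
    """Dynamic throttling: node value grows linearly over a time window.

    Returns 0.0 before `start`, 1.0 after `start + duration`, and a
    linear interpolation in between.
    """
    if duration <= 0:
        return 1.0 if now >= start else 0.0
    progress = (now - start) / duration
    return min(1.0, max(0.0, progress))
```

For example, halfway through a rollout window `dynamic_value` yields 0.5, so a client that sent 0.3 would see the update while a client that sent 0.7 would have to wait.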

bgilbert · 2019-07-06T03:02:06Z

I'd been thinking that we'd allow upgrades from any older version in a stream to the current version within that stream (i.e., the Container Linux model). Update barriers could be implemented later when needed. Booting an old VM and immediately rebooting 5, 10, 20 times to get it current is terrible UX. I'm also not convinced it improves the testing matrix at all, because upgrade problems are path-dependent. Consider an upgrade path:

A -> B -> C -> D

The fact that each edge in that graph has been tested implies nothing about the success of the entire path. D may have a problem with state left behind by A, or with state created by A and then modified by C.

lucab · 2020-05-19T10:17:43Z

We implemented all the things mentioned here, so I'm closing this.
The metadata policy is described at https://github.com/coreos/fedora-coreos-tracker/blob/master/metadata/updates/specifications.md.
The current design supports the cases shown above, including non-linear update chains, barriers, phased rollouts and dead ends.
I have a set of slides covering all of this at a high level at https://speakerdeck.com/lucab/orchestrating-and-monitoring-fedora-coreos-auto-updates.

lucab added the kind/design label Jun 27, 2019

lucab closed this as completed May 19, 2020
