install: Add option to skip clearing partition table on error #168

bgilbert · 2020-02-22T03:38:13Z

If --preserve-on-error is specified, don't clear the destination partition table on error, as a debugging aid.
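The behavior described above can be sketched roughly as follows. This is not the installer's actual code: the `install` and `clear_partition_table` helpers, the generic destination type, and the 1 MiB `CLEAR_LEN` are assumptions made for illustration only.

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Assumed size of the region zeroed on failure (hypothetical value).
const CLEAR_LEN: usize = 1024 * 1024;

/// Zero the start of the destination, where the partition table lives.
fn clear_partition_table<D: Write + Seek>(dest: &mut D) -> io::Result<()> {
    dest.seek(SeekFrom::Start(0))?;
    dest.write_all(&vec![0u8; CLEAR_LEN])?;
    dest.flush()
}

/// Run `write_image`. If it fails, clear the partition table unless
/// `preserve_on_error` is set. The original error is always returned.
fn install<D, F>(dest: &mut D, preserve_on_error: bool, write_image: F) -> io::Result<()>
where
    D: Write + Seek,
    F: FnOnce(&mut D) -> io::Result<()>,
{
    let result = write_image(dest);
    if result.is_err() && !preserve_on_error {
        // Best effort: a cleanup failure must not mask the original error.
        if let Err(e) = clear_partition_table(dest) {
            eprintln!("failed to clear partition table: {}", e);
        }
    }
    result
}
```

With `preserve_on_error` set, whatever the failed write left on the destination stays in place for inspection.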

lucab · 2020-02-24T12:46:46Z

src/cmdline.rs

@@ -152,6 +153,11 @@ pub fn parse_args() -> Result<Config> {
                        .default_value(uname.machine())
                        .takes_value(true),
                )
+                .arg(
+                    Arg::with_name("preserve-on-error")
+                        .long("preserve-on-error")


Beware, bike-shedding ahead.

If this is mostly (only?) meant for partition-table debugging, should we maybe:

- mention table or part somewhere in the visible name
- mark as .hidden(true) XOR move to an undocumented env-flag

I think the option is useful for debugging generally, e.g. if there's a failure writing to something in the /boot partition. I thought about mentioning the partition table, but the fact that we clear it (rather than the whole disk) feels like an implementation detail we shouldn't commit to.

On balance I think there's also some value in discoverability. Otherwise users are more likely to have to ask us for help debugging their setups.

That said, I'm not completely happy with the option name and help text. Thoughts, preferences, and proposals welcome.

Fair. It seems like you'd want this to be generic and useful enough on its own.

I honestly don't have very good proposals. An alternative could be "--lazy-failures: Do not perform cleanups on install failures", but I'm not strongly convinced. Maybe @jlebon has something.

If we don't find something more convincing, I'm also fine ending the bikeshedding here.

I think "cleanups" might be misunderstood to mean just the temporary/cache files the installer needs during the install. I'd go with something more explicit, like "--skip-wipe-on-error: Don't wipe the device on install failure"?

@jlebon That's basically what I started with, but:

- Re option name, it seemed better to have the name describe what it does, rather than what it doesn't do.
- Re help text, as of #172 (Stop discarding disk contents) we don't wipe the entire device on failure, and implying otherwise might scare someone who wants to preserve a data partition. (Which undermines my point above about not mentioning the partition table...)

Thoughts?

> Re help text, as of #172 we don't wipe the entire device on failure, and implying otherwise might scare someone who wants to preserve a data partition. (Which undermines my point above about not mentioning the partition table...)

Hmm, so maybe a prerequisite before this patch is figuring out what exactly we should do on failure if we don't actually want to wipe the partition table. Maybe something like remember the original 1MiB block and restore it on failure?

I think we're doing the right thing already. Restoring the old partition table and boot sector implies that the old partitions are still usable, but we've probably overwritten several GB of the disk so that's not actually the case. We'll have deleted the user's data partitions, but we won't have overwritten their contents, so an Ignition config that recreates those partitions and conditionally formats their filesystems will work as expected.

In principle we could recreate only the unclobbered partitions on failure, but that seems like a lot of work. I could file an issue though.

Thoughts?

Hmm OK, so I think this also intersects with coreos/coreos-assembler#924 (comment) actually. Because while the disk might have fit before a certain partition offset in the original version, a newer version may not. I'm wondering now if we should do something like: if we detect that the disk already has FCOS installed, and it has an additional partition, and the new disk will not fit, then we error out and require a --force flag to proceed. This gives users a chance to move the partition forward before having it be clobbered.

I guess we could make this more generic so as to support not just FCOS. E.g. some kind of --preserve-partition flag which (1) ensures that we won't clobber it, and (2) recreates it as is after a successful install?

Anyway, don't mean to derail this PR too much. Keeping it as preserve-on-error WFM.
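For illustration, the pre-install check floated in this comment might look something like the sketch below. Nothing here exists in coreos-installer: `FitCheck`, `check_fit`, and the byte-offset inputs are hypothetical names, and detecting an existing FCOS install is left out entirely.

```rust
/// Outcome of the hypothetical pre-install check.
#[derive(Debug, PartialEq)]
enum FitCheck {
    /// The image does not reach any partition the user wants to keep.
    Proceed,
    /// The image would overwrite the partition starting at this byte offset.
    WouldClobber { partition_start: u64 },
}

/// `image_size`: bytes the new image occupies from the start of the disk.
/// `extra_partition_starts`: byte offsets of existing partitions to keep.
/// `force`: the user explicitly accepted clobbering (the proposed --force).
fn check_fit(image_size: u64, extra_partition_starts: &[u64], force: bool) -> FitCheck {
    if force {
        return FitCheck::Proceed;
    }
    // Report the lowest-offset partition that the image would overlap.
    let first_hit = extra_partition_starts
        .iter()
        .copied()
        .filter(|&start| start < image_size)
        .min();
    match first_hit {
        Some(partition_start) => FitCheck::WouldClobber { partition_start },
        None => FitCheck::Proceed,
    }
}
```

A caller would turn `WouldClobber` into an error that asks the user either to move the partition or to rerun with the force flag.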

cgwalters · 2020-03-10T17:52:09Z

This looks sane to me as is.

But another option: rather than clearing the partition table if something goes wrong, we could just neuter bootability, e.g. toggle off the MBR boot flag and change the ESP GPT type.

bgilbert · 2020-03-10T21:13:13Z

IIUC the MBR boot flag isn't relevant if GRUB is installed at the MBR level, but we could clear the stage 1 bootloader code. This more nuanced approach is arch-specific, though. I'm also worried that it could lead to confusion, e.g. users trying to fix bootability on an image that failed GPG verification.

cgwalters · 2020-03-10T21:46:59Z

Right. A slightly more elaborate idea is to have coreos-installer write an error message to /boot/coreos-installer.failed, and have our systemd units detect this, print it to the console on the subsequent boot, and fail.
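A minimal sketch of the marker-file idea, assuming the boot filesystem is already mounted at some directory. Only the file name comes from the comment above; the `write_failure_marker` and `read_failure_marker` helpers are hypothetical, and the systemd side is reduced to a function that reads the message back.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// File name taken from the proposal; everything else is illustrative.
const MARKER_NAME: &str = "coreos-installer.failed";

/// Installer side: record why the install failed.
fn write_failure_marker(boot_dir: &Path, message: &str) -> io::Result<PathBuf> {
    let path = boot_dir.join(MARKER_NAME);
    fs::write(&path, message)?;
    Ok(path)
}

/// Boot side: return the recorded message, or `None` if no marker exists.
fn read_failure_marker(boot_dir: &Path) -> io::Result<Option<String>> {
    match fs::read_to_string(boot_dir.join(MARKER_NAME)) {
        Ok(message) => Ok(Some(message)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

A boot-time unit would print the returned message to the console and then fail the boot when the result is `Some`.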

bgilbert · 2020-03-11T16:58:04Z

That's assuming we were successful enough to be able to do so. Also, if GPG verification failed, the image is untrusted and we shouldn't be mounting its filesystems.

Commit 6c5186b: install: Add option to skip clearing partition table on error

bgilbert mentioned this pull request Feb 22, 2020

couldn't find boot device for /dev/sda #165

Closed

lucab reviewed Feb 24, 2020

cgwalters approved these changes Mar 10, 2020

cgwalters merged commit 5e0a81f into coreos:master Mar 10, 2020

bgilbert deleted the preserve branch March 10, 2020 21:13

bgilbert commented Feb 22, 2020

lucab Feb 24, 2020

bgilbert Feb 24, 2020

lucab Feb 25, 2020

jlebon Feb 25, 2020

bgilbert Feb 26, 2020

jlebon Feb 27, 2020

bgilbert Feb 27, 2020

jlebon Feb 27, 2020

cgwalters commented Mar 10, 2020

bgilbert commented Mar 10, 2020

cgwalters commented Mar 10, 2020

bgilbert commented Mar 11, 2020
