Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

install: Add option to skip clearing partition table on error #168

Merged
merged 1 commit into from
Mar 10, 2020
Merged

install: Add option to skip clearing partition table on error #168

merged 1 commit into from
Mar 10, 2020

Conversation

bgilbert
Copy link
Contributor

If --preserve-on-error is specified, don't clear the destination partition table on error, as a debugging aid.

For #165 (comment).

If --preserve-on-error is specified, don't clear the destination
partition table on error, as a debugging aid.
@@ -152,6 +153,11 @@ pub fn parse_args() -> Result<Config> {
.default_value(uname.machine())
.takes_value(true),
)
.arg(
Arg::with_name("preserve-on-error")
.long("preserve-on-error")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware, bike-shedding ahead.

If this is mostly (only?) meant for partition-table debugging, should we maybe:

  • mention table or part somewhere in the visible name
  • mark as .hidden(true) XOR move to an undocumented env-flag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the option is useful for debugging generally, e.g. if there's a failure writing to something in the /boot partition. I thought about mentioning the partition table, but the fact that we clear it (rather than the whole disk) feels like an implementation detail we shouldn't commit to.

On balance I think there's also some value in discoverability. Otherwise users are more likely to have to ask us for help debugging their setups.

That said, I'm not completely happy with the option name and help text. Thoughts, preferences, and proposals welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. It seems like you'd want this to be generic and useful enough on its own.

I honestly don't have very good proposals. An alternative could be --lazy-failures: Do not perform cleanups on install failures but I'm not strongly convinced. Maybe @jlebon has something.

If we don't find something more convincing, I'm also fine ending the bikeshedding here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "cleanups" might be misunderstood to just mean temporary/cache files the installer needs during the install. I'd go with something more explicit like --skip-wipe-on-error: Don't wipe the device on install failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlebon That's basically what I started with, but:

  • Re option name, it seemed better to have the name describe what it does, rather than what it doesn't do.
  • Re help text, as of Stop discarding disk contents #172 we don't wipe the entire device on failure, and implying otherwise might scare someone who wants to preserve a data partition. (Which undermines my point above about not mentioning the partition table...)

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re help text, as of #172 we don't wipe the entire device on failure, and implying otherwise might scare someone who wants to preserve a data partition. (Which undermines my point above about not mentioning the partition table...)

Hmm, so maybe a prerequisite before this patch is figuring out what exactly we should do on failure if we don't actually want to wipe the partition table. Maybe something like remember the original 1MiB block and restore it on failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're doing the right thing already. Restoring the old partition table and boot sector implies that the old partitions are still usable, but we've probably overwritten several GB of the disk so that's not actually the case. We'll have deleted the user's data partitions, but we won't have overwritten their contents, so an Ignition config that recreates those partitions and conditionally formats their filesystems will work as expected.

In principle we could recreate only the unclobbered partitions on failure, but that seems like a lot of work. I could file an issue though.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK, so I think this also intersects with coreos/coreos-assembler#924 (comment) actually. Because while the disk might have fit before a certain partition offset in the original version, a newer version may not. I'm wondering now if we should do something like: if we detect that the disk already has FCOS installed, and it has an additional partition, and the new disk will not fit, then we error out and require a --force flag to proceed. This gives users a chance to move the partition forward before having it be clobbered.

I guess we could make this more generic so as to support not just FCOS. E.g. some kind of --preserve-partition flag which (1) ensures that we won't clobber it, and (2) recreates it as is after a successful install?

Anyway, don't mean to derail this PR too much. Keeping it as preserve-on-error WFM.

@cgwalters
Copy link
Member

This looks sane to me as is.

But another option: Rather than clearing the partition table if something goes wrong, we could just neuter bootability - toggle off the MBR boot flag and change the ESP GPT type e.g.

@cgwalters cgwalters merged commit 5e0a81f into coreos:master Mar 10, 2020
@bgilbert
Copy link
Contributor Author

IIUC the MBR boot flag isn't relevant if GRUB is installed at the MBR level, but we could clear the stage 1 bootloader code. This more nuanced approach is arch-specific, though. I'm also worried that it could lead to confusion, e.g. users trying to fix bootability on an image that failed GPG verification.

@bgilbert bgilbert deleted the preserve branch March 10, 2020 21:13
@cgwalters
Copy link
Member

Right. A bit more elaborate idea is to have coreos-installer write an error message to /boot/coreos-installer.failed and our systemd units detect this and print it to the console on the subsequent boot and fail.

@bgilbert
Copy link
Contributor Author

That's assuming we were successful enough to be able to do so. Also, if GPG verification failed, the image is untrusted and we shouldn't be mounting its filesystems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants