-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
install: Add option to skip clearing partition table on error #168
Conversation
If --preserve-on-error is specified, don't clear the destination partition table on error, as a debugging aid.
@@ -152,6 +153,11 @@ pub fn parse_args() -> Result<Config> { | |||
.default_value(uname.machine()) | |||
.takes_value(true), | |||
) | |||
.arg( | |||
Arg::with_name("preserve-on-error") | |||
.long("preserve-on-error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beware, bike-shedding ahead.
If this is mostly (only?) meant for partition-table debugging, should we maybe:
- mention
table
orpart
somewhere in the visible name - mark as
.hidden(true)
XOR move to an undocumented env-flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the option is useful for debugging generally, e.g. if there's a failure writing to something in the /boot
partition. I thought about mentioning the partition table, but the fact that we clear it (rather than the whole disk) feels like an implementation detail we shouldn't commit to.
On balance I think there's also some value in discoverability. Otherwise users are more likely to have to ask us for help debugging their setups.
That said, I'm not completely happy with the option name and help text. Thoughts, preferences, and proposals welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. It seems like you'd want this to be generic and useful enough on its own.
I honestly don't have very good proposals. An alternative could be --lazy-failures: Do not perform cleanups on install failures
but I'm not strongly convinced. Maybe @jlebon has something.
If we don't find something more convincing, I'm also fine ending the bikeshedding here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "cleanups" might be misunderstood to just mean temporary/cache files the installer needs during the install. I'd go with something more explicit like --skip-wipe-on-error: Don't wipe the device on install failure
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jlebon That's basically what I started with, but:
- Re option name, it seemed better to have the name describe what it does, rather than what it doesn't do.
- Re help text, as of Stop discarding disk contents #172 we don't wipe the entire device on failure, and implying otherwise might scare someone who wants to preserve a data partition. (Which undermines my point above about not mentioning the partition table...)
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re help text, as of #172 we don't wipe the entire device on failure, and implying otherwise might scare someone who wants to preserve a data partition. (Which undermines my point above about not mentioning the partition table...)
Hmm, so maybe a prerequisite before this patch is figuring out what exactly we should do on failure if we don't actually want to wipe the partition table. Maybe something like remember the original 1MiB block and restore it on failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're doing the right thing already. Restoring the old partition table and boot sector implies that the old partitions are still usable, but we've probably overwritten several GB of the disk so that's not actually the case. We'll have deleted the user's data partitions, but we won't have overwritten their contents, so an Ignition config that recreates those partitions and conditionally formats their filesystems will work as expected.
In principle we could recreate only the unclobbered partitions on failure, but that seems like a lot of work. I could file an issue though.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK, so I think this also intersects with coreos/coreos-assembler#924 (comment) actually. Because while the disk might have fit before a certain partition offset in the original version, a newer version may not. I'm wondering now if we should do something like: if we detect that the disk already has FCOS installed, and it has an additional partition, and the new disk will not fit, then we error out and require a --force
flag to proceed. This gives users a chance to move the partition forward before having it be clobbered.
I guess we could make this more generic so as to support not just FCOS. E.g. some kind of --preserve-partition
flag which (1) ensures that we won't clobber it, and (2) recreates it as is after a successful install?
Anyway, don't mean to derail this PR too much. Keeping it as preserve-on-error
WFM.
This looks sane to me as is. But another option: Rather than clearing the partition table if something goes wrong, we could just neuter bootability - toggle off the MBR boot flag and change the ESP GPT type e.g. |
IIUC the MBR boot flag isn't relevant if GRUB is installed at the MBR level, but we could clear the stage 1 bootloader code. This more nuanced approach is arch-specific, though. I'm also worried that it could lead to confusion, e.g. users trying to fix bootability on an image that failed GPG verification. |
Right. A bit more elaborate idea is to have coreos-installer write an error message to |
That's assuming we were successful enough to be able to do so. Also, if GPG verification failed, the image is untrusted and we shouldn't be mounting its filesystems. |
If
--preserve-on-error
is specified, don't clear the destination partition table on error, as a debugging aid.For #165 (comment).