-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
live-generator: Avoid tmpfs/overlayfs, add stronger deps #499
Conversation
Still debugging this. |
cat >>"${UNIT_DIR}/sysroot-etc-copy.service" <<EOF | ||
[Service] | ||
Type=oneshot | ||
RemainAfterExit=yes | ||
ExecStart=/bin/cp -a /sysroot/etc /writable/etc-copy | ||
ExecStart=/bin/sh -c 'mkdir -p /writable/etccopy && /bin/cp -a /sysroot/etc /writable/etccopy/etc && /sbin/setfiles -F -r /writable/etccopy /sysroot/etc/selinux/targeted/contexts/files/file_contexts /writable/etccopy' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about just adding another unit which runs after sysroot-etc.mount
and does ExecStart=/usr/bin/coreos-relabel /etc
? Feels cleaner since it's then part of the /sysroot
rootfs proper too instead of having to create one just for setfiles
. Not against doing it all in one unit though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we can be sure no other units ran trying to operate on /sysroot/etc
until the relabeling was done. Which...I think we can. I will investigate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guideline I try to follow for relabeling is that every component is responsible for relabeling what it touched. So ideally it shouldn't matter whether something needs to modify a file in /etc
before or after the mass relabeling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that relabeling can happen before or after files are changed, but it clearly is going to lead to races if we relabel concurrently with files changing right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, can you expand on this? If every service always makes sure to call coreos-relabel
after doing things in /etc
, then it shouldn't matter if e.g. one service is changing a file while it's being relabeled since it will also trigger a relabel once it's done, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess so. In the end though these services are now moved to be Before=initrd-root-fs.target
so nothing else (most notably ignition-files.service
should be touching them).
6f4f73f
to
1e28db4
Compare
OK so...
|
If I manually do:
Then I can see the labels are still Hum...I notice there's a zero |
And notably this seems to be common to the things done via
|
With strace I see this:
But...after doing the bits to load the policy so I can inspect the xattrs...it appears not to be set. With ftrace I can see that |
Reading some of the kernel code here...I think this may be a behavioral difference in the initramfs between When policy is loaded, the kernel walks over all the inodes from mounted filesystems and ensures they're initialized, and I think this is wiping out any changes we make to If so, then this would explain why this further code for rhel8 is only broken there - for FCOS we're doing overlayfs on the squashfs, and this is mostly going to work because it would only affect files we've modified. Yeah, getting more confident in this theory, look at the policy: |
Blah...maybe the simplest is to try to rework this so that rather than copying everything into Clearly in parallel we need to get the fhandle bits backported into RHEL8's squashfs so that one can create an overlayfs on top - they go together so well. |
OK I just tested, rather than |
Ahh but FCOS is also using a |
Ahh yeah, that's left over from when I copied this from the Ignition work where we feed the file list through Your comments about
Hmm good catch. I thought we had gotten rid of all post-switchroot relabeling in FCOS (but at least |
Yup, this is easy to reproduce on FCOS by booting with
|
OK so...I'll rework both paths to use the loopback-mounted xfs trick instead of (Now, I think we should fix |
Hmm, I'm torn. I think using |
1e28db4
to
aa1cb59
Compare
It's going to be a huge mess in the code to have two paths though. I think the real bug here is to fix |
I feel like this is making progress but now what I'm seeing...seems to be I am currently baffled by what would cause it to run again in the initramfs. It's not happening outside of live runs AFAICS. |
aa1cb59
to
f26e8a5
Compare
And that problem turns out to be that we really do need the Man, that was painful to figure out. I only stumbled into doing so because when initially testing changes for the race a week or two ago the first thing I tried was removing those |
YES
OK now I've introduced some other race in bootup - seeing But this is feeling close... |
f26e8a5
to
75b48d8
Compare
That seemed to somehow be related to the service starting too early. |
75b48d8
to
e8d267d
Compare
OK, lifting WIP on this! Tested with both FCOS/RHCOS, we have SELinux enforcing and haven't seen a boot race yet, but I'm working on some coreos-assembler patches to make testing that in an automated fashion easier. |
And now building on top of coreos/coreos-assembler#1571 I have this going:
So let's see if we can find any races... OK, did over 100 boots this time, no failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment about disk-space exhaustion and discards.
As is, this could be interesting. Consider:
- the
tmpfs
is capped at 50% of the RAM - the filesystem is allocated for the full amount of RAM
- writes could fail since the filesystem is bigger than the backing disk
- Removing files won't help since space is still allocated.
[Service] | ||
Type=oneshot | ||
RemainAfterExit=yes | ||
ExecStart=/bin/sh -c 'set -euo pipefail; mem=$$(grep "^MemTotal" /proc/meminfo | sed -e "s,^.*: *\\([0-9]*\\) .*,\\1,") && /bin/truncate -s $${mem}k /run/ephemeral.xfsloop' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about awk '/^MemTotal/{print$$2;exit;}' /proc/meminfo
? I think we already have awk
in the initramfs and its a bit easier to read.
And if I am gawking this correctly this allows for 100% of the total memory. I think that having a percentage of RAM would better.
edit: https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt says that it will be capped at half the RAM. So it won't OOM. But it could have write errors since the backing memory is smaller than the allocated space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, actually tmpfs
uses 50% of RAM by default (totally arbitrary and I think anyone using the Live ISO/PXE would want to reconfigure this in some situations). So what we're doing here is wrong in that we're creating a sparse file larger than could be written.
How about sizing the filesystem to the total available in /run
? Something like
[root@cosa-devsh ~]# echo $(($(stat -f -c '%b * %s / 1024' /run)))
495516
None of this is really right but I think for most use cases it doesn't matter, and for those it does (say a lot of container images in /var/lib/containers
), an admin can make that a tmpfs
mount point with a specific size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing files won't help since space is still allocated.
Keep in mind an admin can still invoke fstrim
on demand (and requiring them to do this is the current default) - discard
is just turning it on by default synchronously (which is totally fine in this case because this is all just in memory and so there's no latency concerns).
This has many things rolled into one commit because I just needed to test them together. The high level goal is SELinux enforcing for RHCOS and to mitigate a boot time race (still not fully debugged, ended up adding `After=systemd-udev-settle.service`) The SELinux enforcing one turned up a huge mess because as it happens SELinux will nuke all labels on `tmpfs` after policy is loaded. Which is very problematic in general (FCOS included) but FCOS was papering over this problem by using the systemd `relabel-extra.d` bits. But we don't really want to use that anymore IMO - we should have a clean model where files are always labeled correctly in the initramfs before we switch root. Anything else going to lead to pain. In order to work around the SELinux/`tmpfs` bug, instead make a loopback-mounted `xfs` filesystem (on `tmpfs`). When the kernel is fixed to retain labels in `tmpfs` we can drop that hack.
e8d267d
to
1f4a2e5
Compare
Updated per above discussion. I still want to try to diagnose exactly the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Good work @cgwalters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job debugging this! Just a couple of comments but LGTM overall.
[Unit] | ||
After=${isosrc_escaped} | ||
Requires=${isosrc_escaped} | ||
EOF | ||
cat >"${UNIT_DIR}/run-media-iso.mount" <<EOF | ||
# Automatically generated by live-generator | ||
|
||
[Unit] | ||
DefaultDependencies=false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not new, though since we're on the hunt for unnecessary DefaultDependencies=
, I think this is one of them since initrd-root-device.target
is after basic.target
.
cat >"${UNIT_DIR}/run-media-iso.mount" <<EOF | ||
# Automatically generated by live-generator | ||
|
||
[Unit] | ||
DefaultDependencies=false | ||
After=initrd-root-device.target | ||
# HACK for https://github.com/coreos/fedora-coreos-config/issues/437 | ||
After=systemd-udev-settle.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is still needed? systemd-udev-settle.service
is before sysinit.target
, which is before basic.target
(which is before initrd-root-device.target
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much ordering around initrd-root-device.target
:
sh-5.0# systemctl show initrd-root-device.target |grep -Ee 'Before|After'
Before=initrd.target
After=ignition-disks.service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm weird. I was going according to bootup(7)
: https://github.com/systemd/systemd/blob/c03ef420fa7157b8d4881636fe72596a06e08bb6/man/bootup.xml#L243-L251. But you're right, there's no obvious ordering that seems to implement that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think actually this may be not sufficient though, the idea is systemd-udev-settle.service
doesn't run by default, and so one should actually use both Wants=
and Requires=
or so. Ah right, multipathd.service
is:
sh-5.0# systemctl show systemd-udev-settle.service |grep WantedBy
WantedBy=multipathd.service multipathd-configure.service
[Unit] | ||
DefaultDependencies=false | ||
# Let's be sure we have basic devices, but other than that we | ||
# can run really early. | ||
After=systemd-tmpfiles-setup-dev.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is this for /run
? It's mounted by systemd itself on startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I thought so too but I was seeing weird failures until I added this, see #499 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm skeptical of units with absolutely no dependencies at all.
DefaultDependencies=false | ||
# Make sure our tmpfs is available | ||
Requires=sysroot-xfs-ephemeral-setup.service | ||
After=sysroot-xfs-ephemeral-setup.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also have Before=initrd-root-fs.target
since the etc
and var
mounts should be considered part of setting up the root fs too, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, no harm in adding that, but we already have that implicitly because sysroot-relabel.service
(the final service in the chain) has it, and is ordered after those units, so it forces the whole chain.
If we're not seeing anything seriously wrong I'd like to merge this (tested extensively as is) PR, and do other cleanups as a followup that can be tested independently. Iterating on this is painful! |
Testing this followup now:
|
Yup, WFM! |
This brings in at least coreos/fedora-coreos-config#499 for the Live ISO, and we want those fixes for all of the work happening on top of the Live ISO like the assisted installer, etc.
updates for OCP 4.8
This has many things rolled into one commit because I just
needed to test them together. The high level goal is SELinux
enforcing for RHCOS and to mitigate a boot time race (still
not fully debugged, ended up adding
After=systemd-udev-settle.service
)The SELinux enforcing one turned up a huge mess because as
it happens SELinux will nuke all labels on
tmpfs
after policyis loaded. Which is very problematic in general (FCOS included)
but FCOS was papering over this problem by using the systemd
relabel-extra.d
bits.
But we don't really want to use that anymore IMO - we should
have a clean model where files are always labeled correctly in the initramfs
before we switch root. Anything else going to lead to pain.
In order to work around the SELinux/
tmpfs
bug, instead make a loopback-mountedxfs
filesystem (ontmpfs
).When the kernel is fixed to retain labels in
tmpfs
we can drop thathack.