Container import from directory with sockets fails #892

Closed
BeHom opened this issue Aug 15, 2023 · 18 comments · Fixed by #1587

Comments

@BeHom

BeHom commented Aug 15, 2023

Version of Warewulf

What version of Warewulf are you using? Run

wwctl version:   4.4.1-1.git_d6f6fed
rpc version: apiPrefix:"rc1" apiVersion:"1" warewulfVersion:"4.4.1-1.git_d6f6fed"

Expected behavior

During development of a container definition file, I need to repeatedly import, delete, and re-import a container produced by the Apptainer build process. The import works the first time but fails the second time.
The expected behavior is that the import is possible at any time.

Actual behavior

Sequence of work

  • release the container under test from the actual configuration.
    wwctl profile set --yes --container rocky-8 "default"
  • build the new container sandbox based on the modified definition file
    apptainer build --sandbox /tmp/rocky-8-def ./rocky-8-def.def
  • delete the old container
    wwctl container delete rocky-def
  • import the new container from sandbox (here the problem will occur the 2nd time).
    wwctl container import /tmp/rocky-8-def rocky-def
  • activate the new container for some nodes
    wwctl profile set --yes --container rocky-def "default"
  • Test the running node
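
The sequence above can be sketched as a small shell function. This is only a sketch under assumptions: `reimport_cycle`, `run`, the argument order, and the `DRY_RUN` switch are hypothetical names for illustration and are not part of Warewulf; the `wwctl` and `apptainer` invocations are the ones listed in the steps.

```shell
#!/bin/sh
# Sketch of the delete/re-import cycle described above.
# reimport_cycle, run, and DRY_RUN are hypothetical names, not part of Warewulf.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: reimport_cycle DEF_FILE SANDBOX_DIR CONTAINER FALLBACK_CONTAINER PROFILE
reimport_cycle() {
    def_file=$1 sandbox=$2 name=$3 fallback=$4 profile=$5

    # 1. Release the container under test from the active profile.
    run wwctl profile set --yes --container "$fallback" "$profile" || return 1
    # 2. Build the new sandbox from the modified definition file.
    run apptainer build --sandbox "$sandbox" "$def_file" || return 1
    # 3. Delete the old container.
    run wwctl container delete "$name" || return 1
    # 4. Import the new sandbox (the step that fails in this report).
    run wwctl container import "$sandbox" "$name" || return 1
    # 5. Activate the new container for the profile.
    run wwctl profile set --yes --container "$name" "$profile"
}
```

With `DRY_RUN=1 reimport_cycle ./rocky-8-def.def /tmp/rocky-8-def rocky-def rocky-8 default` the function only prints the five commands, which makes it easy to confirm the order before running it for real.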

During the import step, the following error occurred.
wwctl container import /tmp/rocky-8-def rocky-def
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-def/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-def/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory

No matter if I restarted warewulf or reconfigured warewulf, the problem remains.
Only a reboot of the warewulf master server will fix it.

It looks like artifacts from the old container I deleted are preventing a new import.

While the error was occurring, importing under a new container name succeeded.
Only the import into a previously existing container name, in my case rocky-def, is not possible.

Steps to reproduce this behavior

See above.

How can others reproduce this issue/problem?

What OS/distro are you running

$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

How did you install Warewulf

dnf install ./warewulf-4.4.1-1.rpm

@BeHom
Author

BeHom commented Aug 16, 2023

The related command with --force also fails:
The container exists before the command and is gone afterwards.

wwctl container import --force /tmp/rocky-8-def rocky-def
Overwriting existing VNFS
Updating the container's /etc/resolv.conf
ERROR : Could not create destination file /var/lib/warewulf/chroots/rocky-def/rootfs/etc/resolv.conf: open /var/lib/warewulf/chroots/rocky-def/rootfs/etc/resolv.conf: no such file or directory
WARN : Could not copy /etc/resolv.conf into container: open /var/lib/warewulf/chroots/rocky-def/rootfs/etc/resolv.conf: no such file or directory
ERROR : error in user sync, fix error and run 'syncuser' manually: open /var/lib/warewulf/chroots/rocky-def/rootfs/etc/passwd: no such file or directory
Building container: rocky-def
ERROR : could not build container rocky-def: Container does not exist: rocky-def
ERROR: could not build container rocky-def: Container does not exist: rocky-def

@anderbubble
Collaborator

@BeHom thanks for reporting this. This is a known issue, and one I'm hoping to get fixed soon.

In my experience, if you try to import --force twice it works the second time, as a work-around.

@anderbubble anderbubble self-assigned this Aug 16, 2023
@anderbubble anderbubble added the backport:4.2.x and bug labels and removed the backport:4.2.x label Aug 16, 2023
@anderbubble anderbubble added this to the 4.5.0 milestone Aug 16, 2023
@BeHom
Author

BeHom commented Aug 17, 2023

Thanks @anderbubble for getting back.
The workaround does not fix it.
wwctl container import --force /tmp/rocky-8-def rocky-def
Overwriting existing VNFS
Updating the container's /etc/resolv.conf
ERROR : Could not create destination file /var/lib/warewulf/chroots/rocky-def/rootfs/etc/resolv.conf: open /var/lib/warewulf/chroots/rocky-def/rootfs/etc/resolv.conf: no such file or directory
WARN : Could not copy /etc/resolv.conf into container: open /var/lib/warewulf/chroots/rocky-def/rootfs/etc/resolv.conf: no such file or directory
ERROR : error in user sync, fix error and run 'syncuser' manually: open /var/lib/warewulf/chroots/rocky-def/rootfs/etc/passwd: no such file or directory
Building container: rocky-def
ERROR : could not build container rocky-def: Container does not exist: rocky-def
ERROR: could not build container rocky-def: Container does not exist: rocky-def

[root@martin tmp]# wwctl container import --force /tmp/rocky-8-def rocky-def
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-def/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-def/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
[root@martin tmp]# wwctl container import --force /tmp/rocky-8-def rocky-def
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-def/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-def/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory

Importing under a new name also fails.
[root@martin tmp]# wwctl container import --force /tmp/rocky-8-def rocky-karl
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-karl/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-karl/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory

Importing a newly built container sandbox also failed.
INFO: Build complete: /tmp/rocky-8-def2
[root@martin tmp]# wwctl container import /tmp/rocky-8-def2 rocky-karl
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-karl/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-karl/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory

Container import with --update does not update the chroot of the container either.
wwctl container import --update /tmp/rocky-8-def rocky-def
makes no changes; the intended usage is unclear.

@JasonYangShadow
Member

this PR should help fix this issue
#1015

@JasonYangShadow
Member

JasonYangShadow commented Jan 31, 2024

@anderbubble

I tested against the main branch,
and it looks like this issue is fixed.

[vagrant@localhost test]$ apptainer build --sandbox ./rockylinux-8/ docker://ghcr.io/hpcng/warewulf-rockylinux:8
INFO:    Starting build...
Getting image source signatures
Copying blob d4cdbc20b5d6 done  
Copying blob 8b7880b32c88 done  
Copying blob a49f4b3e1553 done  
Copying blob 49a072db3168 done  
Copying blob c43766916271 done  
Copying config de197a6d39 done  
Writing manifest to image destination
Storing signatures
2024/01/30 22:16:41  info unpack layer: sha256:a49f4b3e1553c4468c366b42fd1cde2a27729bd7ab13162ad061af2bd1ef9268
2024/01/30 22:16:43  info unpack layer: sha256:8b7880b32c88b97f7738d59c6d76a1f31624007c645be620a1c9720d766b6608
2024/01/30 22:16:47  warn rootless{usr/libexec/openssh/ssh-keysign} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/01/30 22:16:47  info unpack layer: sha256:49a072db31682b9f8e8ee50c7bb6f55901d60af154d7f63bed68565f3321f1f5
2024/01/30 22:16:47  info unpack layer: sha256:c43766916271959ec4cc6da5a0455c2c4f1784a7e13081a80461737a20e470e5
2024/01/30 22:16:47  info unpack layer: sha256:d4cdbc20b5d6a3211177be994a6946e43920c8fa826ea0fca0707b23c6179ddc
WARNING: The sandbox contain files/dirs that cannot be removed with 'rm'.
WARNING: Use 'chmod -R u+rwX' to set permissions that allow removal.
WARNING: Use the '--fix-perms' option to 'apptainer build' to modify permissions at build time.
INFO:    Creating sandbox directory...
INFO:    Build complete: ./rockylinux-8/
[vagrant@localhost test]$ sudo su
[root@localhost test]# wwctl container import ./rockylinux-8/ rockylinux-8
uid/gid not synced: run `wwctl container syncuser --write rockylinux-8`
[root@localhost test]# wwctl container syncuser --write rockylinux-8
uid/gid synced for container rockylinux-8
[root@localhost test]# wwctl container list
  CONTAINER NAME  NODES  KERNEL VERSION                 CREATION TIME        MODIFICATION TIME    SIZE     
  rockylinux-8    0      4.18.0-513.9.1.el8_9.aarch64   30 Jan 24 22:18 MST  31 Dec 69 17:00 MST  1.2 GiB  
  rockylinux-8.7  0      4.18.0-425.19.2.el8_7.aarch64  30 Jan 24 21:59 MST  31 Dec 69 17:00 MST  1.1 GiB  
[root@localhost test]# apptainer version
1.2.5-1.el9
[root@localhost test]# cat /etc/os-release 
NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:centos:centos:9"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"
[root@localhost test]# wwctl container import --force ./rockylinux-8/ rockylinux-8
Overwriting existing VNFS
uid/gid not synced: run `wwctl container syncuser --write rockylinux-8`
[root@localhost test]# wwctl container syncuser --write rockylinux-8
uid/gid synced for container rockylinux-8
[root@localhost test]# wwctl container list
  CONTAINER NAME  NODES  KERNEL VERSION                 CREATION TIME        MODIFICATION TIME    SIZE     
  rockylinux-8    0      4.18.0-513.9.1.el8_9.aarch64   30 Jan 24 22:19 MST  31 Dec 69 17:00 MST  1.2 GiB  
  rockylinux-8.7  0      4.18.0-425.19.2.el8_7.aarch64  30 Jan 24 21:59 MST  31 Dec 69 17:00 MST  1.1 GiB  
[root@localhost test]# wwctl version
wwctl version:   4.5.x-1.git_8b7586d2
rpc version: apiPrefix:"rc1" apiVersion:"1" warewulfVersion:"4.5.x-1.git_8b7586d2"

@anderbubble
Collaborator

Thanks for the verification, @JasonYangShadow!

@BeHom
Author

BeHom commented Feb 12, 2024

Sorry, I have to reopen the issue.
The problem is still there.
Current version:
wwctl version
wwctl version: 4.5.x-1
rpc version: apiPrefix:"rc1" apiVersion:"1" warewulfVersion:"4.5.x-1"

I just used “dnf install” for the packages (kept old Warewulf setup).

Build a container based on definition file:
INFO: Adding labels
INFO: Creating sandbox directory...
INFO: Build complete: /tmp/rocky-8-base-container
Mon Feb 12 13:00:46 CET 2024

Try to import the container:

wwctl container import /tmp/rocky-8-base-container rocky-8-def-05
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-8-def-05/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-8-def-05/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory

After a restart, at least the import is possible.

@anderbubble anderbubble reopened this Feb 12, 2024
@anderbubble
Collaborator

@BeHom thanks for the report. We'll investigate.

@anderbubble
Collaborator

@BeHom I think I see now that I was misunderstanding your report. We had previously also had a general issue with force-importing; but it appears to be working in a simple case now:

[janderson@rocky main]$ sudo wwctl container import /var/lib/warewulf/chroots/alpine/rootfs alpine2
WARN   : id(14) collision: host(ftp) container(postmaster)
WARN   : add postmaster to host to resolve conflict
ERROR  : error in user sync, fix error and run 'syncuser' manually: id(14) collision: host(ftp) container(postmaster)
ERROR: error in user sync, fix error and run 'syncuser' manually: id(14) collision: host(ftp) container(postmaster)

[janderson@rocky main]$ sudo wwctl container import /var/lib/warewulf/chroots/alpine/rootfs alpine2
ERROR  : VNFS Name exists, specify --force, --update, or choose a different name: alpine2
ERROR: VNFS Name exists, specify --force, --update, or choose a different name: alpine2

[janderson@rocky main]$ sudo wwctl container import --force /var/lib/warewulf/chroots/alpine/rootfs alpine2
Overwriting existing VNFS
WARN   : id(14) collision: host(ftp) container(postmaster)
WARN   : add postmaster to host to resolve conflict
ERROR  : error in user sync, fix error and run 'syncuser' manually: id(14) collision: host(ftp) container(postmaster)
ERROR: error in user sync, fix error and run 'syncuser' manually: id(14) collision: host(ftp) container(postmaster)

It also worked with a Rocky container.

[janderson@rocky main]$ sudo wwctl container import /var/lib/warewulf/chroots/rocky-8/rootfs/ rocky-8-reimport
uid/gid not synced: run `wwctl container syncuser --write rocky-8-reimport`
[janderson@rocky main]$ sudo wwctl container import /var/lib/warewulf/chroots/rocky-8/rootfs/ rocky-8-reimport
ERROR  : VNFS Name exists, specify --force, --update, or choose a different name: rocky-8-reimport
ERROR: VNFS Name exists, specify --force, --update, or choose a different name: rocky-8-reimport
[janderson@rocky main]$ sudo wwctl container import --force /var/lib/warewulf/chroots/rocky-8/rootfs/ rocky-8-reimport
Overwriting existing VNFS
uid/gid not synced: run `wwctl container syncuser --write rocky-8-reimport`

Can you share the specific .def you're using to create the sandbox?

@anderbubble anderbubble modified the milestones: v4.5.0, v4.5.1 Feb 17, 2024
@BeHom
Author

BeHom commented Feb 18, 2024

@anderbubble After your question about the def file, I delved a little deeper into the subject of container creation.
I found the difference between containers that can be imported and those that cannot be imported without a system reboot.
It looks like /run/user/0 is not cleaned up inside the container even after a successful build.

After the build, processes started for the sandbox container are still running, and their sockets remain in the sandbox, e.g.:
./rocky-8-NVIDIA-container/run/user/0/gnupg/d.mu5z3ywgt671eanykasyz8xb/S.gpg-agent
./rocky-8-NVIDIA-container/run/user/0/gnupg/d.mu5z3ywgt671eanykasyz8xb/S.gpg-agent.extra
./rocky-8-NVIDIA-container/run/user/0/gnupg/d.mu5z3ywgt671eanykasyz8xb/S.gpg-agent.browser
./rocky-8-NVIDIA-container/run/user/0/gnupg/d.mu5z3ywgt671eanykasyz8xb/S.gpg-agent.ssh

“ps -ef” on the host.
root 23999 1 0 12:17 ? 00:00:00 gpg-agent --homedir /var/cache/dnf/vscode-8194d3505cd295f0/pubring --use-standard-socket --daemon
The cause is two repositories that set a gpgkey:
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
gpgkey=https://packages.microsoft.com/keys/microsoft.asc

These processes will disappear after reboot and therefore release the /run/user/0 directories.
That’s why a reboot helps for the import.
Running “pkill gpg-agent” also helps, though needing it seems somewhat strange to me.
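
To check whether a freshly built sandbox is affected, one can look for leftover socket files and for `gpg-agent` processes that outlived the build. A minimal sketch, assuming a POSIX shell with `find` and `pgrep` available; `list_sockets` and `report_sandbox` are hypothetical helper names, not Warewulf or Apptainer commands.

```shell
#!/bin/sh
# list_sockets and report_sandbox are hypothetical helpers for illustration.

# Print every socket file below the given directory (same filesystem only).
list_sockets() {
    find "$1" -xdev -type s -print
}

# Report leftover sockets and any gpg-agent processes still running on the host.
report_sandbox() {
    sandbox=$1
    sockets=$(list_sockets "$sandbox")
    if [ -n "$sockets" ]; then
        echo "sockets left in $sandbox:"
        printf '%s\n' "$sockets"
    else
        echo "no sockets left in $sandbox"
    fi
    # pgrep exits non-zero when nothing matches; that is not an error here.
    pgrep -a gpg-agent || echo "no gpg-agent process running"
}
```

For example, `report_sandbox /tmp/rocky-8-minimal` would list entries such as the `S.gpg-agent` sockets shown above if they are still present.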

My thought was that once the apptainer sandbox build finishes, all related processes are stopped.
I used the command “apptainer build --force --fix-perms --sandbox /tmp/”
apptainer version 1.2.5-1.el8.

The error message from warewulf is somewhat misleading:
wwctl container import /tmp/rocky-8-minimal rocky-8.8-t2
ERROR : could not import image: lchown /var/lib/warewulf/chroots/rocky-8.8-t2/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory
ERROR: could not import image: lchown /var/lib/warewulf/chroots/rocky-8.8-t2/rootfs/run/user/0/gnupg/d.hkt3xk3ifea4rs471snc7nsb/S.gpg-agent: no such file or directory

Attached is a definition file (I did not check whether this one is bootable) that shows the basic problem.
####################################################################################
# Definition file for apptainer to build a bootable rocky container
# the container can be imported as image for warewulf boots of nodes
####################################################################################
BootStrap: docker
From: rockylinux:8.8
%files
/etc/yum.repos.d/Intel-One-API.repo
/etc/yum.repos.d/vscode.repo

####################################################################
# basic installation process for a bootable image
####################################################################
%post
dnf update -y ;
dnf install -y --allowerasing coreutils \
cpio \
dbus \
dhclient \
e2fsprogs \
ethtool \
findutils \
initscripts \
ipmitool \
iproute \
kernel-core \
libbpf \
net-tools \
network-scripts \
nfs-utils \
openssh-clients \
openssh-server \
pciutils \
psmisc \
rsyslog \
which

##########################################################################
# final setup
##########################################################################
dnf clean all

sed -i -e '/^account.*pam_unix\.so\s*$/s/\s*$/\ broken_shadow/' /etc/pam.d/system-auth
sed -i -e '/^account.*pam_unix\.so\s*$/s/\s*$/\ broken_shadow/' /etc/pam.d/password-auth

rm -f /etc/sysconfig/network-scripts/ifcfg-e*

systemctl unmask console-getty.service dev-hugepages.mount getty.target sys-fs-fuse-connections.mount systemd-logind.service systemd-remount-fs.service
systemctl enable network

touch /etc/sysconfig/disable-deprecation-warnings

mkdir -p /etc/warewulf
touch /etc/warewulf/excludes
touch /etc/warewulf/container_exit.sh
chmod +x /etc/warewulf/container_exit.sh

echo "#!/bin/sh" > /etc/warewulf/container_exit.sh
echo "set -x" >> /etc/warewulf/container_exit.sh
echo "LANG=C" >> /etc/warewulf/container_exit.sh
echo "LC_CTYPE=C" >> /etc/warewulf/container_exit.sh
echo "export LANG LC_CTYPE" >> /etc/warewulf/container_exit.sh
echo "dnf clean all" >> /etc/warewulf/container_exit.sh
echo "/boot/" > /etc/warewulf/excludes
echo "/usr/share/GeoIP" >> /etc/warewulf/excludes

%labels
Author Bernhard
Version 0.1.01
Description Rocky 8 Warewulf Container definition for HPC Cluster

@anderbubble
Collaborator

Thanks for all the new info, @BeHom! I'll try to replicate.

@anderbubble
Collaborator

@BeHom there are a few different things happening here at the same time.

  • Apptainer, when building into a sandbox, is allowing sockets in the sandbox to persist.
  • Warewulf, when encountering a socket in the sandbox, is failing, because the socket doesn't exist at the target, so it can't copy permissions from the source socket to the dest.

Ultimately, I think this is a bug in github.com/containers/storage/drivers/copy.DirCopy, which is what we use to copy the container directory.

I tried updating to the latest version of github.com/containers/storage, but the behavior persists; so to really resolve this, we'd either need to move to a different library or submit a fix upstream.

For now, I suggest the following workaround to remove sockets from a sandbox before import:

$ find image.sandbox -xdev -type s -exec rm {} +
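
The workaround can be wrapped so that the removed sockets are listed before the import runs. A minimal sketch; `remove_sockets` is a hypothetical helper name, `my-image` is a placeholder container name, and the commented `wwctl container import` call uses the form shown earlier in this thread.

```shell
#!/bin/sh
# remove_sockets is a hypothetical helper; it wraps the find command above.

# Delete every socket file below DIR and print each path that is removed.
# -xdev keeps find on the sandbox's own filesystem.
remove_sockets() {
    find "$1" -xdev -type s -print -exec rm -f -- {} +
}

# Example (not run here; my-image is a placeholder name):
#   remove_sockets image.sandbox && wwctl container import image.sandbox my-image
```

Only files of type socket are matched, so regular files, directories, and symlinks in the sandbox are left untouched.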

@anderbubble anderbubble changed the title Container image re-import failed Container import from directory with sockets fails Mar 17, 2024
@anderbubble anderbubble modified the milestones: v4.5.1, __future__ Mar 17, 2024
@anderbubble anderbubble added the discuss label and removed the needs confirmation label Mar 17, 2024
@anderbubble anderbubble modified the milestones: __future__, v4.5.6 Jul 15, 2024
@anderbubble anderbubble removed the discuss label Jul 15, 2024
@anderbubble
Collaborator

Also more recently reported during wwctl container copy, when a previous wwctl container shell has left sockets in the chroot:

Copying sources...
ERROR  : could not duplicate image: lchown /var/lib/warewulf/chroots/gpuslurm/rootfs/run/user/0/gnupg/d.kg8ijih5tq41ixoeag4p1qup/S.gpg-agent: no such file or directory
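
The same cleanup can be applied to an existing chroot before `wwctl container copy`. A minimal sketch, assuming the chroot layout visible in the error messages in this thread (`/var/lib/warewulf/chroots/NAME/rootfs`); `WW_CHROOTS` and `clean_chroot_sockets` are hypothetical names, `gpuslurm-copy` is a placeholder, and the chroot location may differ per installation.

```shell
#!/bin/sh
# WW_CHROOTS and clean_chroot_sockets are hypothetical names for illustration.
# The default path is taken from the error messages in this thread.
WW_CHROOTS=${WW_CHROOTS:-/var/lib/warewulf/chroots}

# Remove socket files from the rootfs of the named container.
clean_chroot_sockets() {
    rootfs="$WW_CHROOTS/$1/rootfs"
    if [ ! -d "$rootfs" ]; then
        echo "no such chroot: $rootfs" >&2
        return 1
    fi
    find "$rootfs" -xdev -type s -exec rm -f -- {} +
}

# Example (not run here; gpuslurm-copy is a placeholder name):
#   clean_chroot_sockets gpuslurm && wwctl container copy gpuslurm gpuslurm-copy
```

If a `wwctl container shell` session for that container is still open, removing its sockets may disrupt processes in that session, so it is safer to exit the shell first.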

@anderbubble anderbubble modified the milestones: v4.5.6, v4.5.7 Aug 3, 2024
@anderbubble anderbubble modified the milestones: v4.5.7, v4.5.x Sep 4, 2024
@anderbubble anderbubble modified the milestones: v4.5.x, v4.5.8 Sep 11, 2024
anderbubble added a commit to anderbubble/warewulf that referenced this issue Sep 27, 2024
anderbubble added a commit to anderbubble/warewulf that referenced this issue Sep 27, 2024
@anderbubble
Collaborator

I've submitted #1430 to at least document this issue and its workaround. I also submitted containers/storage#2113 to see if upstream might be willing to resolve this on their end.

@anderbubble anderbubble modified the milestones: v4.5.8, __future__ Sep 27, 2024
anderbubble added a commit to anderbubble/warewulf that referenced this issue Sep 27, 2024
@anderbubble
Collaborator

This is now fixed upstream at containers/storage#2117. I'm asking there about getting the fix backported to 1.55 so that we can use it here in Warewulf.

@anderbubble
Collaborator

anderbubble commented Nov 4, 2024

Backport PR at containers/storage#2159.

@anderbubble
Collaborator

Backport merged upstream. Now we just wait for release. Should be in v1.55.2.

containers/storage#2159 (comment)

@anderbubble
Collaborator

Thanks for the merge!

@anderbubble anderbubble removed the bug label Feb 6, 2025