Storage gets corrupted after podman pull is killed #14003

Closed
luluz66 opened this issue Apr 25, 2022 · 12 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@luluz66

luluz66 commented Apr 25, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Podman storage gets corrupted if Podman is killed when a layer is incomplete.

Steps to reproduce the issue:

  1. podman pull gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython@sha256:5102e2651975df6c131c4f0cb22454b81d509a7be2a3d98351a876d3f85ef2b8

  2. Kill the pull process while one of the layers is still incomplete, monitoring /var/lib/containers/storage/overlay-layers/layers.json (a sketch that automates this kill follows these steps):

watch 'cat /var/lib/containers/storage/overlay-layers/layers.json | grep incomplete'

  3. Run the following command in three different terminals:
term1$ podman pull gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython@sha256:5102e2651975df6c131c4f0cb22454b81d509a7be2a3d98351a876d3f85ef2b8 
term2$ podman pull gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython@sha256:5102e2651975df6c131c4f0cb22454b81d509a7be2a3d98351a876d3f85ef2b8 
term3$ podman pull gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython@sha256:5102e2651975df6c131c4f0cb22454b81d509a7be2a3d98351a876d3f85ef2b8 
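
For completeness, here is a minimal sketch of how steps 1 and 2 can be automated so the pull is killed as soon as an incomplete layer appears. The poll interval and the use of SIGKILL are assumptions for illustration, not part of the original report; the default root storage path is assumed.

#!/usr/bin/env bash
# Hypothetical reproduction helper: start the pull in the background, then kill it
# as soon as layers.json records an incomplete layer.
set -euo pipefail

IMAGE='gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython@sha256:5102e2651975df6c131c4f0cb22454b81d509a7be2a3d98351a876d3f85ef2b8'
LAYERS_JSON=/var/lib/containers/storage/overlay-layers/layers.json

podman pull "$IMAGE" &
PULL_PID=$!

# Poll layers.json until an incomplete layer shows up, then kill the pull mid-layer.
while kill -0 "$PULL_PID" 2>/dev/null; do
    if grep -q incomplete "$LAYERS_JSON" 2>/dev/null; then
        kill -9 "$PULL_PID"
        echo "killed podman pull ($PULL_PID) with an incomplete layer left on disk"
        break
    fi
    sleep 0.2
done
wait "$PULL_PID" || true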

Describe the results you received:
All three instances of podman pull returned the following error:

WARN[0136] Can't read link "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption.
WARN[0136] Can't stat lower layer "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. Going through storage to recreate the missing symlinks.
ERRO[0136] Unmounting /var/lib/containers/storage/overlay/a1b212349f0f80e2f88cbb35fe0b22792ee40bb7d9662c56b5bcaf3dc0941708/merged: invalid argument
Error: writing blob: adding layer with blob "sha256:eb957d5dbd82b0e05da5c583db9216df577fc3729e2353e9e04cab2963a71942": creating overlay mount to /var/lib/containers/storage/overlay/a1b212349f0f80e2f88cbb35fe0b22792ee40bb7d9662c56b5bcaf3dc0941708/merged, mount_data=",lowerdir=/var/lib/containers/storage/overlay/l/QXP5KZSENBI5UM6D4W26CEZZSC:/var/lib/containers/storage/overlay/l/XIRSG2LVRQH54TRV25LYRLNQSN:/var/lib/containers/storage/overlay/l/YTYLOHM45UDDY4ZEQAEXZMN22J:/var/lib/containers/storage/overlay/l/QB6N3I6PANPH4US42R5OLFPI7L:/var/lib/containers/storage/overlay/l/BMJKOW5HBXNZAWGHJXE2NUAXSN:/var/lib/containers/storage/overlay/l/SF5CO7EYA6M5UZOKUSTTXDNVYF:/var/lib/containers/storage/overlay/l/ZQS6IZ5T2XF2KB7JBDCQEL7EM3:/var/lib/containers/storage/overlay/l/LYOFE2V2SA4CN4NBEIHIIKTVXN:/var/lib/containers/storage/overlay/l/NR6LZN4ZIUGKVPOTOR72KBH54X:/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG:/var/lib/containers/storage/overlay/l/7A6QMKITLXCCE5OCRMU6USROT5:/var/lib/containers/storage/overlay/l/W2KWKL5KN57MNKLUZGULX7WJA7:/var/lib/containers/storage/overlay/l/XIUUN62UCNZ5LSOT4BXPVCTSVS:/var/lib/containers/storage/overlay/l/REG7KO3LB5KRFKDO4V7W53PU5S:/var/lib/containers/storage/overlay/l/RLQS2PCJ6IDAK7FUPOMFN7M5DH,upperdir=/var/lib/containers/storage/overlay/a1b212349f0f80e2f88cbb35fe0b22792ee40bb7d9662c56b5bcaf3dc0941708/diff,workdir=/var/lib/containers/storage/overlay/a1b212349f0f80e2f88cbb35fe0b22792ee40bb7d9662c56b5bcaf3dc0941708/work": no such file or directory

Note: since at this point the image has not been fully downloaded, running podman rmi does not recover the storage. Only podman system reset helps.

In addition, if after step 2 (killing podman pull while a layer is incomplete) I run only a single podman pull, that pull succeeds. However, podman image inspect then fails for this image with Error: layer not known, and podman run fails with a readlink error as described in containers/storage#1136.
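
For diagnosis, the leftover state can be inspected directly. The check below is a sketch that assumes the overlay driver's usual on-disk layout (a per-layer "link" file plus short-name symlinks under overlay/l/); that layout is an assumption about c/storage internals rather than something taken from this report.

# Same check the reproduction uses: is any layer still marked incomplete?
grep incomplete /var/lib/containers/storage/overlay-layers/layers.json

# Verify that the short link name recorded for each layer still exists under overlay/l/;
# a missing entry here matches the "Can't read link" warnings above.
STORE=/var/lib/containers/storage/overlay
for linkfile in "$STORE"/*/link; do
    name=$(cat "$linkfile")
    [ -e "$STORE/l/$name" ] || echo "missing symlink: $STORE/l/$name (recorded in $linkfile)"
done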

Describe the results you expected:
I expected the podman pulls above to succeed.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Client:       Podman Engine
Version:      4.0.3
API Version:  4.0.3
Go Version:   go1.18
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.24.3
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: bdb4f6e56cd193d40b75ffc9725d4b74a18cb33c'
  cpus: 8
  distribution:
    codename: bullseye
    distribution: debian
    version: "11"
  eventLogger: file
  hostname: executor-lulu-test-c56675bf5-5fckt
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.4.0-1061-gke
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 12879020032
  memTotal: 33673482240
  networkBackend: cni
  ociRuntime:
    name: runc
    package: 'containerd.io: /usr/bin/runc'
    path: /usr/bin/runc
    version: |-
      runc version 1.0.3
      commit: v1.0.3-0-gf46b6ba
      spec: 1.0.2-dev
      go: go1.17.8
      libseccomp: 2.5.1
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_AUDIT_WRITE,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_MKNOD,CAP_NET_BIND_SERVICE,CAP_NET_RAW,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.12
      commit: unknown
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.1
  swapFree: 0
  swapTotal: 0
  uptime: 18h 6m 38.07s (Approximately 0.75 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  docker.io:
    Blocked: false
    Insecure: false
    Location: docker.io
    MirrorByDigestOnly: false
    Mirrors:
    - Insecure: false
      Location: mirror.gcr.io
    Prefix: docker.io
  docker.io/library:
    Blocked: false
    Insecure: false
    Location: quay.io/libpod
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: docker.io/library
  search:
  - docker.io
  - quay.io
  - registry.fedoraproject.org
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.0.3
  Built: 0
  BuiltTime: Thu Jan  1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.18
  OsArch: linux/amd64
  Version: 4.0.3

Package info (e.g. output of rpm -q podman or apt list podman):

podman/unknown,now 100:4.0.3-1 amd64 [installed]

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

openshift-ci bot added the kind/bug label Apr 25, 2022
luluz66 changed the title from "Storage is corrupted after podman pull is killed" to "Storage gets corrupted after podman pull is killed" Apr 25, 2022
@giuseppe
Member

@mtrmac would your fixes in c/storage also address this issue?

@mtrmac
Collaborator

mtrmac commented Apr 28, 2022

The issue description links to containers/storage#1136, and seems consistent with that at a short glance (I didn’t try to reproduce). The two PRs that are waiting for review target inconsistent overlay driver state, and don’t fix this locking issue.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented May 31, 2022

Since those two PRs were merged and podman has revendored storage, I am assuming this is fixed. Reopen if I am mistaken.

@rhatdan rhatdan closed this as completed May 31, 2022
@mtrmac
Collaborator

mtrmac commented May 31, 2022

The two PRs that are waiting for review target inconsistent overlay driver state, and don’t fix this locking issue.

@mtrmac mtrmac reopened this May 31, 2022
@banool

banool commented Jun 29, 2022

I'm hitting this also; rebooting did not fix the issue.

Update: podman system reset fixed it.

@elasticdotventures

podman system reset

⚠️ does not warn: it WILL DELETE EVERY CONTAINER ON YOUR SYSTEM. Use with caution!

@github-actions

github-actions bot commented Aug 6, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 6, 2022

@mtrmac Should this still be opened?

@mtrmac
Collaborator

mtrmac commented Aug 9, 2022

AFAIK, containers/storage#1136 is still outstanding (and I’m not working on it).

As for whether Podman needs to track this separately from the c/storage issue, I don’t have a strong opinion.

@rhatdan
Member

rhatdan commented Aug 9, 2022

Ok I am going to close this issue, and we can follow it in containers/storage.

@rhatdan rhatdan closed this as completed Aug 9, 2022
mandre added a commit to shiftstack/machine-config-operator that referenced this issue Aug 31, 2022
In order to avoid a podman issue [1] causing a layer corruption when an
image pull is killed midway, let's move the image pull outside of the
timeout command.

The timeout was recently reduced to 20 seconds with [2] making the issue
more likely to happen.

[1] containers/podman#14003
[2] openshift#3271
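
In shell terms the workaround amounts to something like the sketch below; the image name, command, and flags are placeholders illustrating the commit message, not the actual machine-config-operator script.

# Placeholder image and command, only to illustrate the change described above.
IMAGE=registry.example.com/some/image:tag

# Before: the pull happens implicitly inside the timeout, so a slow pull can be
# killed mid-layer and leave the storage corrupted (this issue).
timeout 20 podman run --rm "$IMAGE" /usr/bin/some-command

# After: pull without a timeout first, then keep the 20-second timeout only around run.
podman pull "$IMAGE"
timeout 20 podman run --rm "$IMAGE" /usr/bin/some-command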
@Ramblurr

Ramblurr commented Oct 6, 2022

I am experiencing this bug on CentOS Stream 9. Is there a way to fix my podman host without wiping everything?

github-actions bot added the locked - please file new issue/PR label Sep 13, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 13, 2023