Flaky scm06 test? (Error (criu/sk-unix.c:1651): unix: Can't bind id 0x9 ino 433398 addr : Address already in use") #2537

Closed
carnil opened this issue Dec 6, 2024 · 4 comments · Fixed by #2546

Comments

@carnil
Contributor

carnil commented Dec 6, 2024

Hi

In the meantime, we run almost all CRIU tests in Debian, excluding only the apparmor_stacking and fd01 tests.

What we see is that the scm06 test occasionally fails:
https://ci.debian.net/data/autopkgtest/testing/amd64/c/criu/55087421/log.gz

560s ========================= Run zdtm/static/scm06 in uns =========================
560s Start test
560s Test is SUID
560s ./scm06 --pidfile=scm06.pid --outfile=scm06.out
560s Run criu dump
560s Run criu restore
560s =[log]=> dump/zdtm/static/scm06/63/1/restore.log
560s ------------------------ grep Error ------------------------
560s b'(00.021682)      1: No ipcns-sem-11.img image'
560s b'(00.023694)      1: net: Try to restore a link 10:1:lo'
560s b'(00.023700)      1: net: Restoring link lo type 1'
560s b'(00.024679)      1: net: \tRunning ip addr restore'
560s b'Error: ipv4: Address already assigned.'
560s b'Error: ipv6: address already assigned.'
560s ------------------------ ERROR OVER ------------------------
560s Send the 15 signal to  95
560s Wait for zdtm/static/scm06(95) to die for 0.100000
560s Removing dump/zdtm/static/scm06/63
560s ========================= Test zdtm/static/scm06 PASS ==========================
560s ========================== Run zdtm/static/scm06 in h ==========================
560s Start test
560s Test is SUID
560s ./scm06 --pidfile=scm06.pid --outfile=scm06.out
560s Run criu dump
560s Run criu restore
560s =[log]=> dump/zdtm/static/scm06/174/1/restore.log
560s ------------------------ grep Error ------------------------
560s b'(00.002721)    174: unix: Opening slave (stage 0 id 0x8 ino 433397 peer 433398)'
560s b'(00.002724)    174: unix: Opening master (stage 0 id 0x9 ino 433398 peer 433397)'
560s b'(00.002737)    174: \t\tCreate fd for 4'
560s b'(00.002739)    174: unix: bind id 0x9 ino 433398 addr'
560s b"(00.002750)    174: Error (criu/sk-unix.c:1651): unix: Can't bind id 0x9 ino 433398 addr : Address already in use"
560s b'(00.002753)    174: Error (criu/files.c:1213): Unable to open fd=5 id=0x9'
560s b'(00.003606) Error (criu/cr-restore.c:1256): 174 exited, status=1'
560s b'(00.003616) Error (criu/cr-restore.c:2313): Restoring FAILED.'
560s ------------------------ ERROR OVER ------------------------
560s ################# Test zdtm/static/scm06 FAIL at CRIU restore ##################
560s Test output: ================================
560s 
560s  <<< ================================
560s seccomp_filters is supported

Is this test known to be flaky, so that I should disable it, or does it indicate a real problem we need to address? I suspect a race condition, since the address is already in use in the case above.

@adrianreber
Member

@carnil For Fedora and RHEL we are excluding a couple of tests, but not this one:

https://gitlab.com/redhat/centos-stream/rpms/criu/-/blob/c10s/tests/run-zdtm.sh?ref_type=heads

That doesn't answer your question, but just as an additional data point.

@carnil
Contributor Author

carnil commented Dec 7, 2024

@adrianreber thanks. While it indeed does not answer the question, it gives us an idea for improving the test runs on our end (i.e. retry a failed test run to work around the failure and see whether it was just flaky).

Still, I wonder whether the failing scm06 test shows a real problem or is actually known to be flaky.

@adrianreber thanks a lot!
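
The retry idea mentioned above could look roughly like the sketch below. It is only an illustration: `run_with_retry` is a hypothetical helper, and the command passed to it is whatever invokes a single test in your setup (the one in the usage note is an assumption, not taken from the Debian packaging).

```python
import subprocess


def run_with_retry(cmd, attempts=2):
    """Run cmd up to `attempts` times.

    Returns (passed, tries_used). A test that fails once and then
    passes is likely flaky; one that fails every time is more likely
    a real problem.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True, attempt
    return False, attempts
```

For example, `run_with_retry(["./zdtm.py", "run", "-t", "zdtm/static/scm06"])` would rerun the test once after a failure, assuming that is how the test is invoked locally. Logging `tries_used` keeps flaky passes visible instead of hiding them.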

@avagin
Member

avagin commented Dec 13, 2024

@carnil I think it is a side effect of unix_gc in the Linux kernel. All dumped processes have been destroyed, but some sockets are destroyed asynchronously.
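
To illustrate the race described here: if the old socket is released a little after the processes exit, a bind to the same address can briefly fail with "Address already in use" and then succeed moments later. The sketch below shows one generic way to tolerate that window by retrying the bind for a short time. It is an assumption-laden illustration, not what CRIU or the referenced fix does; `bind_with_retry`, its `timeout`, and its `interval` are hypothetical names and values.

```python
import errno
import time


def bind_with_retry(sock, addr, timeout=1.0, interval=0.01):
    """Bind sock to addr, retrying while the address is still in use.

    Retries only on EADDRINUSE and only until `timeout` seconds have
    passed; any other error, or a timeout, is re-raised to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            sock.bind(addr)
            return
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise  # unrelated failure, do not mask it
            if time.monotonic() >= deadline:
                raise  # address never became free in time
            time.sleep(interval)
```

The helper works with any object that has a `bind` method, such as a `socket.socket` created with `AF_UNIX`. A bounded wait is used so that a genuinely occupied address still surfaces as an error.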

avagin added a commit to avagin/criu that referenced this issue Dec 13, 2024
The kernel releases a test socket asynchronously, so the restore can
fail if it is executed before the kernel actually destroys the socket.

Fixes checkpoint-restore#2537

Signed-off-by: Andrei Vagin <[email protected]>
avagin added a commit that referenced this issue Dec 15, 2024
The kernel releases a test socket asynchronously, so the restore can
fail if it is executed before the kernel actually destroys the socket.

Fixes #2537

Signed-off-by: Andrei Vagin <[email protected]>
@carnil
Contributor Author

carnil commented Dec 15, 2024

Thank you @avagin (and @adrianreber)
