Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libc/int: skip TestUpdateDevices* on i386 #4595

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kolyshkin
Copy link
Contributor

Fixes: #4594

@kolyshkin
Copy link
Contributor Author

Works as intended:

=== RUN   TestUpdateDevices
    update_test.go:20: flaky on i386, see https://github.com/opencontainers/runc/issues/4594
--- SKIP: TestUpdateDevices (0.00s)
=== RUN   TestUpdateDevicesSystemd
    update_test.go:20: flaky on i386, see https://github.com/opencontainers/runc/issues/4594
--- SKIP: TestUpdateDevicesSystemd (0.00s)
ok  	github.com/opencontainers/runc/libcontainer/integration	18.859s

@kolyshkin kolyshkin marked this pull request as ready for review January 16, 2025 21:10
@kolyshkin
Copy link
Contributor Author

Generally speaking, maybe we should just skip the entirety of libcontainer/integration as it's not really supposed to work, but since it works for now let's keep it as is.

@rata
Copy link
Member

rata commented Jan 21, 2025

If this is making CI fail, I think it is ok to merge it. But I don't understand:

  • Why is libcontainer/integration supposed to not work? There might be some context I'm missing. If it's simple to explain it to me, it would be great
  • The error in the linked issue seems like some corruption. Are we sure it is safe to skip the test?

@rata
Copy link
Member

rata commented Jan 24, 2025

I've been trying to run the "broken" tests locally and they seem to pass just fine. I'll do more stress tests next week, but this smells to me as something that only fails on CI?

@kolyshkin
Copy link
Contributor Author

If this is making CI fail, I think it is ok to merge it. But I don't understand:

  • Why is libcontainer/integration supposed to not work? There might be some context I'm missing. If it's simple to explain it to me, it would be great

Simply speaking, we don't support i386; it just happens to work. We added this job merely to ensure our code is 32bit-clean; see #2768 and this comment:

# We need to continue support for 32-bit ARM.
# However, we do not have 32-bit ARM CI, so we use i386 for testing 32bit stuff.
# We are not interested in providing official support for i386.
cross-i386:

AFAIK there's still no 32bit ARM support in GHA, so we are stuck with i386.

  • The error in the linked issue seems like some corruption. Are we sure it is safe to skip the test?

This started happening about two months ago, and is not correlated with any changes we made at that time. So I suspect there's something wrong with Azure Linux/i386 support, or some particular hardware, or perhaps some i386 libraries, or maybe Go i386 port.

Ideally, I'd love to get to the bottom of it, and there is a chance to find something interesting and useful. Practically, though, I don't have time for this, given how hard it is to reproduce. It does not look like anyone else has.

So I think the best decision is to skip this, allowing for better use of developers' time.

Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin LGTM. thanks for the link, I didn't know we added this as a proxy. Makes sense to skip it.

Below are some details about my findings, but I'm giving up too. It seems this is a regression outside of runc, and while I'd love to report it, it's so hard to repro that I'm giving up.

I could repro twice in a local ubuntu 22.04, that makes Azure quite unlikely the cause. I guess the most likely scenario is some package update in ubuntu that causes this sporadic failure, given that this started happening at a "random" point in time. But it could be a go issue (locally I'm trying with 1.23.5, CI is using 1.23.4).

The error seems to be when decoding a json, and we are not wrapping it with more text, so it is unclear where it comes from (here logs from when I hit it):

time="2025-01-27T16:14:02+01:00" level=info msg="removing old filter 1 from cgroup" id=7085 name= run_count=0 runtime=0s tag=fb6cb1c301453333 type=CGroupDevice
    update_test.go:67: unexpected error: unable to start container process: invalid character 'ÿ' looking for beginning of object key string
--- FAIL: TestUpdateDevices (0.68s)
=== RUN   TestUpdateDevicesSystemd
    update_test.go:41: unexpected error: unable to start container process: error during container init: invalid character 'ÿ' looking for beginning of object key string
    update_test.go:38: exec: Wait was already called
--- FAIL: TestUpdateDevicesSystemd (0.16s)

I've tried to run it 1400 times again with more info and I can't repro :(.

@kolyshkin
Copy link
Contributor Author

@rata thanks, this is quite interesting that you were able to repro this (I never managed to do that). Did you ran an isolated test (like go test -run TestUpdateDevices) or did you ran all the tests?

I think that increasing the number of iterations inside the test (currently 300, see libcontainer/integration/update_devices.go) might improve the chances of fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestUpdateDevices fails on i386
2 participants