
CI: parallelize unit tests #5954

Open
wants to merge 5 commits into base: main

Conversation

flouthoc
Collaborator

What type of PR is this?

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake
/kind other

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

Member

@Luap99 Luap99 left a comment

Looks like you are reverting the integration test changes.

Resolved review threads: tests/source.bats, tests/test_runner.sh, tests/bud.bats, tests/blobcache.bats
@flouthoc
Collaborator Author

Looks like you are reverting the integration test changes.

Yes, I did that by mistake; fixed now.

@flouthoc flouthoc force-pushed the parallel-unit-test branch 2 times, most recently from 5693588 to 8d1e570 Compare January 29, 2025 18:06
@flouthoc
Collaborator Author

flouthoc commented Jan 29, 2025

The following tests are not parallel and race when using t.Parallel; they are also large contributors to the unit-test time. It looks like I will have to change the tests so we can make them parallel:

TestPutNoChroot: 202.85 seconds
TestStatNoChroot: 166.13 seconds
TestCWConvertImage: 51.84 seconds
TestPutNoChroot/archive=regular,topdir=: 34.64 seconds
TestPutNoChroot/archive=regular,topdir=.: 29.35 seconds
TestPutNoChroot/archive=regular,rename=no-match-dir: 27.94 seconds
TestPutNoChroot/archive=regular,topdir=top: 27.89 seconds
TestPutNoChroot/archive=regular,rename=no-match-file: 27.77 seconds

@Luap99
Member

Luap99 commented Jan 30, 2025

[image: CI resource-usage graph]

They do not seem to benefit much from more cores/memory, so you likely need to look at other options or get them to run in parallel much better. I guess the early spikes are when we compile the Go code in the CI setup. That may also be worth looking into; it is not clear to me whether we actually need to compile anything else for the unit tests.

The other thing could be to split the unit-test target into three separate ones and then run them in parallel.
Another thing I see is that the tests use -cover and -race; I am not sure how much value these add for our testing.
race is known to be slow, https://go.dev/doc/articles/race_detector#Runtime_Overheads

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by 2-20x.
So maybe drop these flags and see how much faster they run.

@flouthoc flouthoc force-pushed the parallel-unit-test branch 2 times, most recently from 4d4cb56 to 845f109 Compare January 30, 2025 16:12
@flouthoc
Collaborator Author

[image: CI resource-usage graph]

They do not seem to benefit much from more cores/memory, so you likely need to look at other options or get them to run in parallel much better. I guess the early spikes are when we compile the Go code in the CI setup. That may also be worth looking into; it is not clear to me whether we actually need to compile anything else for the unit tests.

The other thing could be to split the unit-test target into three separate ones and then run them in parallel. Another thing I see is that the tests use -cover and -race; I am not sure how much value these add for our testing. race is known to be slow, https://go.dev/doc/articles/race_detector#Runtime_Overheads

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by 2-20x.
So maybe drop these flags and see how much faster they run.

Wow @Luap99!!! The unit tests completed in 8 minutes without -cover and -race; let me try removing only -race and keeping -cover.

@flouthoc
Collaborator Author

flouthoc commented Jan 30, 2025

Looks like removing -race does the trick; it brings all the unit tests down from around 48 minutes to 8 minutes.

Resolved review thread: buildah_test.go
@Luap99
Member

Luap99 commented Jan 30, 2025

@nalind Any objections to removing -race from the unit tests?

@nalind
Member

nalind commented Jan 30, 2025

@nalind Any objections to removing -race from the unit tests?

Other than that I'll probably forget to run them that way from time to time if it isn't automated, which isn't that important, no.

@flouthoc
Collaborator Author

flouthoc commented Jan 30, 2025

@nalind @Luap99 @containers/buildah-maintainers PTAL, ready for a review.

@flouthoc
Collaborator Author

Also, it looks like osh-diff-scan:fedora-rawhide-x86_64:buildah-fedora is the bottleneck now; it is the longest-running job.

@flouthoc flouthoc requested a review from Luap99 January 30, 2025 21:30
@nalind
Member

nalind commented Jan 30, 2025

@nalind Any objections to removing -race from the unit tests?

Other than that I'll probably forget to run them that way from time to time if it isn't automated, which isn't that important, no.

Come to think of it, can we conditionally keep that for the scheduled "cron" builds? I don't think anyone actually waits for them to finish.

@flouthoc
Collaborator Author

flouthoc commented Jan 30, 2025

@nalind Any objections to removing -race from the unit tests?

Other than that I'll probably forget to run them that way from time to time if it isn't automated, which isn't that important, no.

Come to think of it, can we conditionally keep that for the scheduled "cron" builds? I don't think anyone actually waits for them to finish.

Sounds good to me. Is there any way to test that when I open a PR, or will it only run on merge? I will open a separate PR to add an env variable for -race in the .github workflow.

@Luap99
Member

Luap99 commented Jan 31, 2025

Also, it looks like osh-diff-scan:fedora-rawhide-x86_64:buildah-fedora is the bottleneck now; it is the longest-running job.

None of the packit tests enforce anything, just ignore them for now.

Come to think of it, can we conditionally keep that for the scheduled "cron" builds? I don't think anyone actually waits for them to finish.

That is easy to do, sure, except that nobody looks at these results, so it is pretty pointless.

How important is -race really? Did you ever catch actual issues with it?

Sounds good to me. Is there any way to test that when I open a PR, or will it only run on merge? I will open a separate PR to add an env variable for -race in the .github workflow.

Well, anything post-merge cannot be tested on a PR, and this is not a GitHub workflow. This must be done in .cirrus.yml and would be easy: just check whether the CIRRUS_PR env variable is empty; if it is, we are running on a branch.

Member

@Luap99 Luap99 left a comment

LGTM

@Luap99
Member

Luap99 commented Jan 31, 2025

Total CI time reported by cirrus 38:01

| type | user | fs | times (d13 / f40 / f41, as reported) |
|------|------|----|--------------------------------------|
| Unit | root | vfs | 08:17 |
| Unit | root | overlay | 08:02 |
| Conformance | root | vfs | 23:49 |
| Conformance | root | overlay | 14:33 |
| Integration | root | vfs | 23:53 / 29:30 / 30:12 |
| Integration | root | overlay | 19:50 / 23:37 |
| Integration | rootless | overlay | 26:06 / 26:34 |

That seems quite good; vfs is noticeably slower, though, which I guess is expected due to the higher amount of I/O it requires.

@flouthoc
Collaborator Author

@nalind @TomSweeneyRedHat @rhatdan PTAL; if consensus is reached on the cron idea, I'll open a new PR for that.

@flouthoc flouthoc mentioned this pull request Feb 3, 2025
@flouthoc flouthoc requested a review from nalind February 3, 2025 16:27
// which eventually and indirectly accesses a global variable
// defined in `go-selinux`; this must be fixed in `go-selinux`,
// or the builder must use some kind of locking mechanism, i.e. if
// one routine is creating a Builder, others must wait for it.
Member

How will we know when it's safe? Is there an open issue?

Collaborator Author

I just created issue #5967 and referenced it in the comment and the commit message.

@flouthoc flouthoc requested a review from nalind February 3, 2025 20:56
flouthoc added a commit to flouthoc/buildah that referenced this pull request Feb 3, 2025
@nalind
Member

nalind commented Feb 3, 2025

@nalind @TomSweeneyRedHat @rhatdan PTAL; if consensus is reached on the cron idea, I'll open a new PR for that.

Why wait?

Member

@Luap99 Luap99 left a comment

I still don't think there is a point in having race enabled post-merge.
AFAICT nobody ever looks at any of these failures anyway; the ci-monitor emails for the cron jobs are ignored, and nobody ever looked at the post-merge test results either.

But maybe I am just wrong here? Who is looking at that?

Resolved review threads: Makefile, contrib/cirrus/test.sh
@nalind
Member

nalind commented Feb 4, 2025

But maybe I am just wrong here? Who is looking at that?

Anecdata: I do read the ones for buildah.

@Luap99
Member

Luap99 commented Feb 4, 2025

But maybe I am just wrong here? Who is looking at that?

Anecdata: I do read the ones for buildah.

Ok, if you can actually make use of the data, then sure, let's do that. I have just never seen anyone respond to these emails or post PRs to fix the issues found there. Maybe that just wasn't necessary on buildah so far, and/or I don't look at buildah PRs often enough.

@flouthoc flouthoc force-pushed the parallel-unit-test branch 4 times, most recently from 6000bd1 to ae100a5 Compare February 4, 2025 15:42
@flouthoc
Collaborator Author

flouthoc commented Feb 4, 2025

@nalind @Luap99 PTAL again.

@flouthoc flouthoc requested a review from Luap99 February 4, 2025 15:59
@@ -25,8 +25,7 @@ GO_GCFLAGS := $(shell if $(GO) version|grep -q gccgo; then echo "-gccgoflags"; e
NPROCS := $(shell nproc)
export GO_BUILD=$(GO) build
export GO_TEST=$(GO) test -parallel=$(NPROCS)
#RACEFLAGS := $(shell $(GO_TEST) -race ./pkg/dummy > /dev/null 2>&1 && echo -race)
RACEFLAGS :=
RACEFLAGS ?= $(shell $(GO_TEST) -race ./pkg/dummy > /dev/null 2>&1 && echo -race)
Member

Can you merge this one with the previous one?

Collaborator Author

Done.

@rhatdan
Member

rhatdan commented Feb 4, 2025

LGTM

Add `t.Parallel()` to unit tests wherever it is possible without races.

Signed-off-by: flouthoc <[email protected]>
@flouthoc flouthoc requested a review from rhatdan February 4, 2025 19:12
@rhatdan
Member

rhatdan commented Feb 4, 2025

@nalind @Luap99 PTAL

@@ -67,7 +67,13 @@ else
showrun make validate
;;
unit)
showrun make test-unit
race=
Member

Nit: the line indentation still seems wrong; it looks like you used one tab plus spaces on that line.

Collaborator Author

fixed.

Member

@Luap99 Luap99 left a comment

rest LGTM

Contributor

openshift-ci bot commented Feb 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: flouthoc, Luap99
Once this PR has been reviewed and has the lgtm label, please assign rhatdan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Ephemeral COPR build failed. @containers/packit-build please check.
