Cleanup exported repos during sync failures #3282
Conversation
Thanks, this seems like a nice fix! I am reviewing all of the open PRs to see which ones can be included in the next Flux v1 release. If you can rebase and also amend the commit with --signoff for the DCO, to satisfy DCObot, then I can try to incorporate your change. This seems like a good candidate to include as a bugfix. As Flux v2 approaches feature parity, we hope your needs can be met by the new version, but if there are outstanding issues that can still be solved easily in Flux v1, my aim is to help with that.
The branch was force-pushed from 1c55caf to 61c2944.
@nairb774 Can you check out the failing test on #3421? It looks like it is yours. I ran the CI multiple times and unfortunately I'm not quite adept enough at Go yet to see for myself why this test sometimes fails. Is there some nondeterministic behavior that can be made safe by blocking somewhere? (If you happen to know why it fails, can you say whether the failure is in the test itself, or in a shipping part of the daemon?)
The specific failure I'm asking you to look at is here: https://github.com/fluxcd/flux/pull/3421/checks?check_run_id=1874454916 (it only surfaces on about 30% of runs). That must be some kind of race condition, perhaps (a guess) a non-blocking filesystem operation?
(There is another failure in the link checker, but it's not related; it looks to be a remote service that is down.)
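If the flakiness really does come from the test asserting on a cleanup that happens asynchronously, one generic way to make it deterministic is to block and poll for the expected state with a deadline instead of asserting immediately. The sketch below is only an illustration of that idea under those assumptions; the helper name, package name, and timeout are hypothetical and are not part of the Flux test suite:

```go
package gitsync_test

import (
	"os"
	"testing"
	"time"
)

// waitForRemoval polls until dir no longer exists or the deadline passes.
// Asserting immediately after triggering a sync can race with cleanup that
// happens on another goroutine or only after the git process exits.
func waitForRemoval(t *testing.T, dir string, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(dir); os.IsNotExist(err) {
			return // cleanup finished
		}
		if time.Now().After(deadline) {
			t.Fatalf("directory %q still present after %s", dir, timeout)
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```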
This still seems like a valuable change, because it prevents Flux from eating up ephemeral storage and potentially being targeted for termination when the cluster node hosting Flux is under storage pressure. But with the flaky test failure as it stands, I won't be able to merge it. Hopefully you have been able to move on to Flux v2 and are no longer experiencing this issue. If you'd like to revisit this, we can reopen it. Thanks for your contribution. (Closing for now)
Hi, I think we're having this same issue with
Several functions which generate a clone of the repo can result in the repository being left behind when an error is triggered. This shores up some of those failure paths to prevent the storage leaks. Signed-off-by: Brian Atkinson <[email protected]>
The branch was force-pushed from eb25639 to 2cb0fd2.
Linking back to #2713, where discussion about this PR is ongoing now.
I haven't been able to reproduce the "flaky test" that I think I was describing in the previous round of reviews on this PR. I will try a few more times, as it would be really inconvenient to merge a flaky test. I also haven't read this change from end to end.
There it is.
We cannot merge a flaky test, as it will complicate future releases. A commenter in #2713 indicated that this issue is still a problem for them. @nairb774 I will see if I can find the issue in the test that causes it to sometimes fail, and fix it up so this can be merged for release. Do you have any idea what the issue might be?
This project is in migration and security support only, so unfortunately this PR won't be merged. We recommend users migrate to Flux 2 at their earliest convenience. More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/.
Several functions which generate a clone of the repo can result in the repository being left behind when an error is triggered. This shores up some of those failure paths to prevent the storage leaks.
One of our deployments of FluxCD was found to be eating up a bunch of space in the `/tmp` dir. There were many folders like `/tmp/flux-workingXXXXXXX` that seemed to be hanging around. This appeared to be the result of Flux being able to pull the repository while the path it was configured to read from had been removed: each failure caused a new copy of the repository to be generated on disk. This shores up some of the exit paths to do a better job of cleaning that data up.
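As a rough illustration of the pattern the PR describes, and not the actual Flux code (the function, package, and parameter names below are hypothetical), each error path after the temporary working directory is created removes it before returning, so failed syncs stop accumulating `/tmp/flux-working*` directories:

```go
package gitsync

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// cloneForSync creates a temporary working clone and checks that the
// configured path exists in it. Every failure path removes the directory
// before returning, so the clone is not left behind in /tmp.
func cloneForSync(repoURL, subPath string) (_ string, err error) {
	workDir, mkErr := os.MkdirTemp("", "flux-working")
	if mkErr != nil {
		return "", mkErr
	}
	// If anything below fails, delete the half-built clone instead of leaking it.
	defer func() {
		if err != nil {
			os.RemoveAll(workDir)
		}
	}()

	if out, cloneErr := exec.Command("git", "clone", repoURL, workDir).CombinedOutput(); cloneErr != nil {
		err = fmt.Errorf("clone failed: %v: %s", cloneErr, out)
		return "", err
	}
	// A configured path missing from the repo was one observed failure mode;
	// previously an error at this point left the clone on disk.
	if _, statErr := os.Stat(filepath.Join(workDir, subPath)); statErr != nil {
		err = fmt.Errorf("configured path %q not found in clone: %w", subPath, statErr)
		return "", err
	}
	return workDir, nil
}
```

Keying a deferred `os.RemoveAll` on the named error return keeps every early exit covered, which is the essence of "shoring up the exit paths" described above.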