arm64 builds hang on install-info a lot #62
I believe this is a bug in cygwin/msys2-runtime, because it also happened frequently with pacman when it would verify sync db signatures. The workaround I had for that was to append …
Add attempted workaround for install-info.exe hanging, by disabling the pacman hooks that call it. msys2/msys2-autobuild#62
I wonder whether this is still the case for v3.4.*...
This hang seems to be happening much more often with the new 2023 dev kit machine compared to the QC710. It is now even happening when validating signatures on packages, where it would usually only happen validating database signatures before.
The GIMP project and I have also seen this. There's a comment lost in a merge request (GitLab) somewhere about it. They decided to stop running pacman updates as part of the builds, and the runners now do this in a daily scheduled task with timeout/retry overnight :(
I gathered some information that may be helpful for analyzing this issue, and wrote it down here.
I wonder if it might possibly be https://cygwin.com/pipermail/cygwin/2024-February/255431.html. Maybe we can try backporting that patch (msys2/msys2-runtime@4e77fa9b8bf4) and see if the issues go away?
If anyone else wants to try, I built msys2-runtime and msys2-runtime-3.3 with that patch applied in https://github.com/jeremyd2019/MSYS2-packages/actions/runs/7921543265. I am planning to try some things with it and see what happens. UPDATE: that seems pretty broken. I'm guessing I didn't backport the fix correctly.
https://github.com/jeremyd2019/MSYS2-packages/actions/runs/7924206550 is at least not as immediately broken 😉. Will test that.
I built both 3.4 and 3.3, and 3.3 for 32-bit (which took some doing, because any binutils later than 2.40 resulted in a broken msys-2.0.dll). I then set both a Windows 11 VM on the Dev Kit and a Windows 10 install on a Raspberry Pi 4 running pacman in a loop (without disabling db signature checking). The Raspberry Pi did hang, but the debugger output looks different from what I remember. The dev kit VM was still going at last check.
I think the 32-bit build on the Raspberry Pi hung up in …
Looking back at the cygwin thread, it seems that patch was introduced after a report of a hang with 100% CPU usage, rather than the hang with 0% CPU usage that we see, so I'm not sure it's the same issue. I guess I'll keep looking into the …
With a debug build it hung somewhere different, but that doesn't make any more sense. This time it hung apparently during process teardown, having called …
The 64-bit MSYS2 on Windows 11 did eventually hang too.
When running `pacman` on Windows/ARM64, we frequently run into curious hangs (see msys2/msys2-autobuild#62 for more details). This commit aims to work around that by replacing the double-fork with a single-fork in `_gpgme_io_spawn()`.
Signed-off-by: Johannes Schindelin <[email protected]>
I finally have some good news. While I am not even close to a fix, I have a work-around: msys2/MSYS2-packages#4583

Here is a run of Git for Windows' …

By manually observing the hangs (RDPing into those self-hosted runners), I figured out that there were typically around half a dozen hanging processes whose command-lines were identical to their respective parent processes' command-lines. I've tracked that down to …

One thing that helped me tremendously while debugging this was the insight that calling that PowerShell script that runs …

So these are my thoughts on how to proceed from here: …
In that case, I wonder if there's a race between starting up the wait thread and shutting it down during process exit. Assuming the second fork is followed by an exec in the (grand)child, that could further complicate things, because I think there is some magic that shuffles around the process tree to try to make it look as though exec actually replaced the process instead of starting a new one. (I think that may even involve the wait thread.) I never did get a good understanding of the locking around this code, either.

This is why I was trying Interlocked operations, to see if maybe there was a race going on: I was seeing things in the debugger like handles that were NULL in the struct while the stack showed a non-NULL handle passed to functions like CloseHandle or TerminateThread. I think I satisfied myself that the code was moving the handle into a temp variable and nulling it in the struct before closing it, but it felt like it was trying to avoid a race in a not-horribly-effective manner.

As for a Windows bug, I couldn't see any good reason for TerminateThread to block. I was a little concerned that maybe terminating a thread could leave the emulation in a bad state.
I read the code, and I think I understand what it is trying to do. It has this comment: `/* Intermediate child to prevent zombie processes. */` As I recall, there is a "rule" on *nix that a parent must wait on a child process (or ignore SIGCHLD), or else the exited child lingers as a zombie.
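For reference, the zombie-avoidance trick being described is the classic *nix double-fork. Here is a minimal sketch in plain POSIX C; this is not the actual gpgme code, and the function name `spawn_detached` and its error handling are illustrative assumptions:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Double-fork: the intermediate child forks the real (grand)child and
 * exits immediately.  The parent reaps only the short-lived intermediate
 * child, and the grandchild is reparented to init, which reaps it when
 * it exits -- so no zombie is ever left behind. */
static int spawn_detached(const char *path, char *const argv[])
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Intermediate child: fork again, then exit right away. */
        pid_t grandchild = fork();
        if (grandchild == 0) {
            execv(path, argv);
            _exit(127);                    /* exec failed */
        }
        _exit(grandchild < 0 ? 1 : 0);
    }

    /* Parent: this wait should not block for long, because the
     * intermediate child exits immediately after its own fork(). */
    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```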
@jeremyd2019 I believe you're 100% correct, and TIL: http://stackoverflow.com/questions/10932592/why-fork-twice/16655124#16655124 And I think it's exactly the reason why it un-hangs if I manually kill the "right" pacman process (apparently the intermediate child). Which probably means that it's actually the intermediate child that hangs (maybe because the grandchild exits too soon?). Hope this information helps in diagnosing the root cause 🤞
That's probably not what you're trying to achieve, though: https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743 Seems your original solution with …
Yeah, it seems I do need GetThreadContext for the synchronization side effect. The yield does seem to be working, though probably less reliably than GetThreadContext would.
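For concreteness, here is a minimal Win32 sketch of the Suspend + GetContext idea being discussed (illustrative only, not the actual msys2-runtime change): `SuspendThread()` only requests the suspension, and a subsequent `GetThreadContext()` call does not return until the suspension has actually completed, which a sleep or yield cannot guarantee.

```c
#include <windows.h>

/* Suspend a thread and make sure it has actually stopped running.
 * SuspendThread() is asynchronous; GetThreadContext() blocks until the
 * target thread is really suspended, providing the synchronization
 * side effect mentioned above. */
static BOOL suspend_thread_synchronously(HANDLE thread)
{
    CONTEXT context;

    if (SuspendThread(thread) == (DWORD)-1)
        return FALSE;

    context.ContextFlags = CONTEXT_CONTROL;
    if (!GetThreadContext(thread, &context)) {
        /* Don't leave the thread suspended if we cannot synchronize. */
        ResumeThread(thread);
        return FALSE;
    }
    return TRUE;
}
```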
Yep, that's pretty much the last two paragraphs: …
The issue with …
Thanks. I thought I was just hacking around blindly trying to find something that helped, and here I'm getting code reviews and everything 😁. I'm going to test jeremyd2019/msys2-runtime@7863965 for a while, and if that works I'm going to try submitting jeremyd2019/msys2-runtime@8597665 to [email protected] and see what they say. (The only reason I'm testing one and submitting the other is that I already know the CancelSynchronousIo addition helps quite a bit, so it would make any potential reproduction of a hang less likely.)
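As a sketch of what a CancelSynchronousIo addition does conceptually (an assumed shape, not the actual patch): before waiting for a thread that may be blocked in a synchronous I/O call, ask the kernel to cancel that I/O so the thread can unwind on its own.

```c
#include <windows.h>

/* Nudge a possibly-blocked thread out of synchronous I/O before waiting
 * for it to finish.  ERROR_NOT_FOUND from CancelSynchronousIo() simply
 * means the thread had no synchronous I/O in flight, which is fine. */
static void unblock_and_join(HANDLE thread)
{
    if (!CancelSynchronousIo(thread) &&
        GetLastError() != ERROR_NOT_FOUND) {
        /* Unexpected failure; fall through and wait anyway. */
    }
    WaitForSingleObject(thread, INFINITE);
}
```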
@jeremyd2019 great work! May I suggest expanding the commit message a bit, though? Something like this: …
Thank you! I have a bias toward what I consider a nicer UI at https://inbox.sourceware.org/cygwin-patches/[email protected]/T/#u, maybe you like that one, too?
This patch applies cleanly to 3.3 as well, so I've got it built for i686 and am running my reproducer on Windows 10 on a Raspberry Pi 4, since that seemed to be a very good machine to reproduce the hang on (along with the QC710, which I have used for the prior tests of this patch, and which is still going).
I can help test too. What should I download, and what should I do? On my arm64 machine, pacman locks up a lot.
Test x86_64 binaries can be found here: …

From these zips, I just extract usr/bin/msys-2.0.dll and replace (after backing up!) the one in MSYS2.
Nicely done, @jeremyd2019. Yes, the whole point of my original suggestion of Suspend + GetContext was so you can make sure the exit from simulation is complete; Sleep/Yield would not give you that, and neither would any other API. I'll pull these and give it a go too. This issue usually reproduces almost instantaneously on my Qualcomm 8380 machine. Thanks for sharing the engineering privates with the fix.
This test seems to be holding up: …
This does suggest a change that I could potentially make to the emulator to protect cases such as this: always exit simulation before completing cross-thread termination. So far we haven't been considering it, because cross-thread termination is already such a game of Russian roulette that it didn't seem to warrant the effort to un-simulate first... but Cygwin is giving me a new perspective.
Since I could confirm in the git-for-windows/msys2-runtime PR that your patch does what we hoped it would, do you maybe want to open a PR for msys2/msys2-runtime@msys2-3.5.4...jeremyd2019:msys2-runtime:msys2-3.5.4-suspendthread? I think we're good to go on that.
msys2/msys2-runtime#234 (and msys2/msys2-runtime#235 for good measure)
I ran my fork reproducer for over 12 hours: 4x instances on the QC710 running x86_64/msys2-runtime-3.5.4, and 2x instances on the Raspberry Pi 4 8GB running i686/msys2-runtime-3.3.6. All instances are still going without deadlocks or crashes.
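The reproducer itself isn't shown in this thread; as a hypothetical sketch, a stress loop along these lines exercises the same fork/exit/wait path that was hanging:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork children in a tight loop and reap each one; the idea is that a
 * buggy runtime eventually deadlocks in one child's process teardown. */
int main(void)
{
    for (unsigned long i = 0;; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0)
            _exit(0);                  /* child exits immediately */
        if (waitpid(pid, NULL, 0) < 0) {
            perror("waitpid");
            return 1;
        }
        if (i % 1000 == 0)
            printf("%lu forks\n", i);  /* periodic progress marker */
    }
}
```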
Amazing, terrific work, everyone! ❤️ Thank you so much for this!
After msys2/msys2-runtime#234, we shouldn't hit msys2/msys2-autobuild#62 anymore, so remove the workarounds for it. Also remove the sed for enabling clangarm64 in pacman.conf, since it has been enabled by default for a couple of years now.
Thank you @jeremyd2019 and everyone!
On Windows/ARM64, running the 64-bit version of Git for Windows could infrequently cause deadlocked threads (see e.g. [this report](msys2/msys2-autobuild#62) or [this one](https://inbox.sourceware.org/cygwin-developers/[email protected]/)); [this was addressed](git-for-windows/msys2-runtime#73).
Signed-off-by: gitforwindowshelper[bot] <[email protected]>
Well. I just went to update my Raspberry Pi, which I left running 2x instances for however long it's been, and there was a totally different failure: …

The stackdump was empty.

```
C:\>net helpmsg 1455
The paging file is too small for this operation to complete.
```

Phew, maybe the scrollback in Windows Terminal exhausted the page file? The QC710 was still going without issue.
Is there maybe a memory leak somewhere? Or does your test app just keep on creating forks without exiting the parent process? I wonder if you could've exhausted the number of process handles, for example 🤔 It's a 32-bit app, after all.
In this case the 11 is the Cygwin errno, EAGAIN (resource temporarily unavailable).
I would indeed assume a process handle exhaustion in this case 🤔
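For reference, this is how a failing fork would typically surface that errno (a generic sketch, not the reproducer's actual code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        /* On Cygwin/MSYS2, errno 11 is EAGAIN: the system could not
         * create another process (e.g. memory, page file, or handle
         * quota exhausted) -- not a bug in the caller. */
        fprintf(stderr, "fork: %s (errno %d)\n", strerror(errno), errno);
        return 1;
    }
    if (pid == 0)
        _exit(0);
    return 0;
}
```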
Today GHA Windows runner images (all versions) deployed an upgrade (20250127.1.0 -> 20250203.1.0) that upgraded the default MSYS2, which now seems to feature the October 2024 issue that caused curl runtests run times to increase ~2.5x. It also causes test987 to fail and vcpkg jobs to hit their time limits and fail. Reliability also took a hit.

In October this issue came with a Git for Windows upgrade, and likely the MSYS2 runtime update within it. It affected vcpkg jobs only, and I mitigated it by switching them to use the default MSYS2 shell and runtime (at `C:\msys64`): 5f9411f #15380. After today's update this mitigation no longer works. The issue also affects `dl-mingw` jobs now, though to a lesser extent than the vcpkg ones. I tried switching back to Git for Windows, which has received several updates since October, but the performance issue is still present.

I managed to mitigate the slowdown in vcpkg by lowering test parallelism to `-j4` (from `-j8`), after which the jobs run at about *half the speed* of before, but fit their time limits. `dl-mingw` builds run slower by 1-1.5 minutes per job; they were already using `-j4`.

Example jobs:

Before (ALL GOOD):
- https://github.com/curl/curl/actions/runs/13167230443/job/36750175428 installed MSYS2, mingw (-j8): 3m50s (OK)
- https://github.com/curl/curl/actions/runs/13167230443/job/36750158662 default MSYS2, dl-mingw (-j4): 4m22s (OK)
- https://github.com/curl/curl/actions/runs/13167230443/job/36750163392 default MSYS2, vcpkg (-j8): 3m27s (OK)
- runner: https://github.com/actions/runner-images/blob/win22/20250127.1/images/windows/Windows2022-Readme.md
- `C:\msys64`: System: MSYS_NT-10.0-20348 fv-az1115-916 3.5.4-0bc1222b.x86_64 2024-12-05 09:27 UTC x86_64 Msys (msys2/msys2-runtime@0bc1222b)

After:
- https://github.com/curl/curl/actions/runs/13186498273/job/36809747078 installed MSYS2, mingw (-j8): 3m48s (OK)
- https://github.com/curl/curl/actions/runs/13186498273/job/36809728481 default MSYS2, dl-mingw (-j4): 5m56s (SLOW)
- https://github.com/curl/curl/actions/runs/13186498273/job/36809736429 default MSYS2, vcpkg (-j8): 9m1s (SLOW)
- runner: https://github.com/actions/runner-images/blob/win22/20250203.1/images/windows/Windows2022-Readme.md
- `C:\msys64`: System: MSYS_NT-10.0-20348 fv-az1115-498 3.5.7-2644508f.x86_64 2025-01-30 09:08 UTC x86_64 Msys (msys2/msys2-runtime@2644508f)
- windows-2025 image, `C:\msys64`: System: MSYS_NT-10.0-26100 fv-az2043-515 3.5.7-2644508f.x86_64 2025-01-30 09:08 UTC x86_64 Msys
- windows-2019 image, `C:\msys64`: System: MSYS_NT-10.0-17763 fv-az1434-677 3.5.7-2644508f.x86_64 2025-01-30 09:08 UTC x86_64 Msys

This PR:
- final: https://github.com/curl/curl/actions/runs/13186498273/job/36809736429 GfW, vcpkg (*-j4*): ~7m (SLOW)
- test: https://github.com/curl/curl/actions/runs/13187992987/job/36814644852?pr=16217 GfW, vcpkg (-j8): ~11m (SLOWER)

Before and after (unused) Git for Windows (SLOW as tested in this PR):
- `C:\Program Files\Git`: System: MINGW64_NT-10.0-20348 fv-az1760-186 3.5.4-395fda67.x86_64 2024-11-25 09:49 UTC x86_64 Msys (msys2/msys2-runtime@395fda67, fork)

Before and after (used) MSYS2 installed via msys2/setup-msys2 (OK):
- `D:\a\_temp\msys64`: System: MINGW64_NT-10.0-20348 fv-az836-378 3.5.4-0bc1222b.x86_64 2024-12-05 09:27 UTC x86_64 Msys

Perl pipe issue report from October, still open: msys2/msys2-runtime#230
ARM deadlock fixed by GfW 2.47.1(1), but for x86_64, on a quick glance: msys2/msys2-runtime@290bea9
Possibly interesting: msys2/msys2-autobuild#62
Closes #16217
Just so we have an issue to link to and discuss, maybe.