MueLu: errors during library creation in PR testing #4696
Comments
@trilinos/muelu
#awesomelinkerfail
Speculation:
If the issue is that the linker is limited by available memory, would breaking MueLu into separate libraries help?
@jhux2 I polished up some tools I used when Mutrino was very unstable, then used them to analyze a Trilinos build on Waterman. With a little more effort I could visualize the results. This may not help in general, but it is useful for 'spreading the load' at build time:
E.g., effectively both builds have the same number of 'CPUs', yet masking the hardware threads resulted in a 20% faster build. How does this relate to memory usage? Recommendation: for PR testing, you could also set the NUMA placement for the build (Waterman has multiple NUMA nodes); see the sketch below.
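A minimal sketch of one way to do that (the exact setting recommended above was lost in extraction; the numactl interleaving and the -j value are assumptions):

```sh
# Interleave page allocations across all NUMA nodes instead of letting every
# build process allocate on node 0; the job count (-j 80) is an example value.
numactl --interleave=all ninja -j 80
```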
Most likely, what is happening is that the builds are ganging up on NUMA node 0 (on all arches), and at random they hit their high watermarks together... and poof, you die. Memory HWM for the MueLu library:
I get this number by using the tools mentioned above (one way to capture such a number is sketched below). A 1 GB HWM doesn't seem obnoxious.
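A sketch of one common way to measure a per-process memory high-water mark, assuming GNU time is installed; the compile line is only a stand-in for whatever rule actually runs (the real command can be copied from `ninja -t commands <target>`):

```sh
# Wrap the heavy command in GNU time; the "Maximum resident set size" line of
# the -v report (in kB) is that process's memory high-water mark.
/usr/bin/time -v g++ -c MueLu_ParameterListInterpreter.cpp -o MueLu_ParameterListInterpreter.cpp.o \
  2>&1 | grep -i 'maximum resident set size'
```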
Regular Ninja:
A few more caveats: the Power9s (that we have, at least) enumerate their 'CPUs' differently than Mutrino. Take my wussy little desktop:
On a Power9 login node (login != compute nodes):
The marked difference between the two is how the numactl info is laid out. On Power, when you run make -j, I suspect the OS schedules on the first available core, which means you tend to overload Core0 (you schedule on all hardware threads of Core0 FIRST, then move to another core!). This isn't a Power-specific thing; different OSes enumerate the cores differently. If you tend to land on Core0 more often, then you also tend to allocate memory on NUMA node 0 more often. This is why running 'ninja -j 80' performed vastly faster when I 'hid' 2 of the hardware threads: it forced the OS to schedule on more cores, which distributed the load across the sockets. The timings I showed previously are reproducible (at least for me!), within about +/- 1 minute. IMO, let's look at tuning the Autotester a little... (a sketch of the thread-masking idea follows below)
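A minimal sketch of that masking idea, assuming a Power9 node with 40 cores x SMT4 (160 logical CPUs, consecutive IDs on the same core); the stride, CPU count, and -j value are assumptions, not the exact settings used above:

```sh
# Pin the build to one hardware thread per physical core (every 4th logical CPU
# under SMT4), so the scheduler spreads work across cores and sockets instead
# of filling up Core0's threads first.
taskset -c "$(seq -s, 0 4 156)" ninja -j 80
```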
I did a naive pass at this and just ran htop while linking the library (gcc 7.2.0 build). The linker had a high-water mark of 4.84G and produced a library 1.5G in size. The other file we see fail is MueLu_ParameterListInterpreter.cpp.o, and that is a paltry 65k and used almost no extra memory (about 0.5G).
I reconfigured for all packages instead of just MueLu, and this file skyrocketed to 4.77G and a compile time of almost 5 minutes.
4.77G is the size of the resulting *.o file?
No - the amount of memory the compiler used to compile this one *.o file.
A problem with Trilinos builds in general is that dependencies are enforced by package. This means we build the simple, low-memory files first, and then at the end we are building nothing but memory hogs, all at the same time. I think it would work to grab all... @jhux2, the stuff I sent via email can give you file-by-file HWM/size/time/etc.; I can help with that if you want.
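One way to keep those end-of-build memory hogs from running all at once (a sketch of a possible mitigation, not something proposed above; the pool size and source path are assumed example values) is CMake's job pools under the Ninja generator:

```sh
# Cap concurrent link jobs at 2 while compiles still use the full -j width.
cmake -G Ninja \
  -DCMAKE_JOB_POOLS="heavy=2" \
  -DCMAKE_JOB_POOL_LINK=heavy \
  /path/to/Trilinos
ninja -j 80
```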
@prwolfe Did that 4.77G figure happen with ETI ON?
@jjellio That's a good idea, though part of the reason for the high memory pressure near the end is our software design. We can work around it with randomization, but we'll always have these high-water marks unless we do factories better.
Yes it did.
@prwolfe Sierra turns on more Scalar (and probably more Node) types than typical Trilinos tests, so perhaps this motivates splitting up the MueLu library by Scalar type.
@mhoemmen wrote:
Eric P. did this recently for Stokhos; I was thinking MueLu might be able to do something similar.
However, I'm not sure that would help with the high-water mark that @prwolfe mentioned.
@jhux2: The idea to split based on Scalars is a good one irrespective of this issue. I suggest we do this at the Trilinos level.
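As a rough way to see how much each Scalar instantiation actually contributes to the library (a diagnostic sketch, not something run in this thread; the library path is an assumed example, and note the 'double' bucket also counts the complex<double> symbols):

```sh
# Sum the sizes (decimal bytes) of defined symbols whose demangled names
# mention each Scalar type, to see which instantiations dominate the library.
for s in 'double' 'std::complex<double>'; do
  bytes=$(nm -C --print-size --radix=d lib/libmuelu.a 2>/dev/null \
          | grep -F "$s" | awk '{ sum += $2 } END { print sum + 0 }')
  echo "$s: $bytes bytes of symbols"
done
```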
After discussion in the Framework meeting this morning, we are moving the per-core memory requirement up to 3G (see #4800) in an attempt to make this more stable until the MueLu team can resolve the underlying issue. This will also require a reduction in the number of PRs that can run at once.
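For reference, the per-core budget translates directly into a cap on parallelism; a small sketch of that arithmetic, assuming the 3G figure above and whatever RAM the build machine reports:

```sh
# Maximum parallel jobs = total RAM (GiB) / 3 GiB per job, rounded down.
total_gib=$(free -g | awk '/^Mem:/ { print $2 }')
echo "max -j for a 3G-per-core budget: $(( total_gib / 3 ))"
```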
@prwolfe Can this issue be closed?
It looks like upping the per-core memory and/or work on #4986 has helped.
Is there another issue being used to track progress on the size problem? The increased memory limit is, after all, a band-aid for the larger issue. I looked, and current library sizes are down to 1.1G from the previous 1.5G. Did we ever set a target for library size or memory use by a single process that we can/should enforce? At the least we should consider another issue targeting those questions; the way this one is worded makes me think it can be closed, as we do at least understand the issue at this point.
There is no hard number identified. The high-water library-size mark occurs with the Intel debug static build, so that's the measuring stick in #3137.
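If a size target is ever agreed on, enforcing it could be a simple post-build check; a sketch assuming a ~1.5G threshold and a conventional lib/ output directory, neither of which is specified here:

```sh
# Flag any MueLu library artifact larger than the (assumed) ~1.5 GiB cap.
big=$(find lib -name 'libmuelu*' -size +1500M)
if [ -n "$big" ]; then
  echo "Libraries over the size cap:"
  echo "$big"
  exit 1
fi
```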
I've seen or heard about a number of recent failures during PR testing that occur during the creation of the MueLu library. One example is here.
I'd like to understand more about what's going on.
@jwillenbring @prwolfe