MueLu: errors during library creation in PR testing #4696

Closed
jhux2 opened this issue Mar 22, 2019 · 25 comments

jhux2 (Member) commented Mar 22, 2019

I've seen or heard about a number of recent failures during PR testing that occur during the creation of the MueLu library. One example is here.

Error while building C++ shared library " packages/muelu/src/libmuelu.so.12.13" in target muelu

collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.

I'd like to understand more about what's going on.

@jwillenbring @prwolfe

jhux2 added the pkg: MueLu and impacting: configure or build labels Mar 22, 2019
jhux2 self-assigned this Mar 22, 2019
jhux2 (Member, Author) commented Mar 22, 2019

@trilinos/muelu

csiefer2 (Member) commented:

#awesomelinkerfail

mhoemmen (Contributor) commented:

Speculation:

  1. Insufficient memory for the linker: hopefully the move from 1.8G/core to 2.1G/core (#4659) will help.
  2. We missed upgrading some linker to the 64-bit version.

jhux2 (Member, Author) commented Mar 22, 2019

If the issue is that the linker is limited by available memory, would breaking MueLu into separate libraries help?

jjellio (Contributor) commented Mar 27, 2019

@jhux2 I polished up some tools I used when Mutrino was very unstable. I then used them to analyze a Trilinos build on Waterman. With a little more effort, I could visualize the results.

This may not help in general, but two things help 'spread the load' at build time:

  1. Mask out some hardware threads.
  2. If there are concurrent builds, restrict each build to its own NUMA region.

E.g., an optimized build on Waterman:

Build Technique      Ninja -j   Elapsed Wallclock Time (MM:SS.ms)
Regular              -j 80      28:13.06
Masked HW threads    -j 80      22:52.25

Effectively, both builds use the same number of 'CPUs', e.g., -j 80, but one build spreads that load over more NUMA domains.

Masking hardware threads resulted in a roughly 20% faster build.

How does this relate to memory usage?

Recommendation:
Restrict each concurrent build to a specific NUMA node. The numactl -C 0,1,... argument lists the CPUs you want; the layout comes from:
numactl -H

For PR testing, you could also pin builds explicitly (Waterman NUMA nodes shown):

# Choose NUMA node 0, and CPUs 0-79, for build A
numactl -m 0 -C $(seq -s, 0 2 79) ninja -j 40 ...

# Choose NUMA node 8, and CPUs 80-159, for build B
numactl -m 8 -C $(seq -s, 80 2 159) ninja -j 40 ...

Most likely what is happening is that the builds are ganging up on NUMA node 0 (on all arches), and at random they hit their high-water marks together... and poof, you die.

Memory HWM for libmuelu:

1,093,504 KB

I get this number by using /usr/bin/time around the link call and my scripts put this info into a CSV formatted file.
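
Roughly, the wrapping amounts to something like the following sketch (the ninja target path and CSV name here are illustrative, not the actual scripts):

# Re-run just the MueLu link step under /usr/bin/time and log its high-water mark.
/usr/bin/time -v ninja packages/muelu/src/libmuelu.so.12.13 2> link_time.log
hwm_kb=$(awk -F': ' '/Maximum resident set size/ {print $2}' link_time.log)
echo "libmuelu.so.12.13,${hwm_kb}" >> link_hwm.csv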

A 1GB HWM doesn't seem obnoxious.

The command below looks horrendous, but what it does is 'hide' 2 of the 4 hardware threads per core on Power9.

The real command run is:

/usr/bin/time -v numactl -C $(seq -s, 0 2 159) ninja -j $(seq -s, 0 2 159 | tr ',' '\n' | wc -l) PanzerMiniEM_BlockPrec.exe

Which turns into:

Command being timed: "numactl -C 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158 ninja -j 80 PanzerMiniEM_BlockPrec.exe"
	User time (seconds): 45372.47
	System time (seconds): 13534.02
	Percent of CPU this job got: 4292%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 22:52.25
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3158656
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 68098703
	Voluntary context switches: 21053620
	Involuntary context switches: 1050173
	Swaps: 0
	File system inputs: 3287168
	File system outputs: 6307968
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

Regular Ninja:

	Command being timed: "ninja -j 80 PanzerMiniEM_BlockPrec.exe"
	User time (seconds): 57055.80
	System time (seconds): 57898.56
	Percent of CPU this job got: 6789%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 28:13.06
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3158656
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 69494728
	Voluntary context switches: 8858406
	Involuntary context switches: 1067195
	Swaps: 0
	File system inputs: 3102208
	File system outputs: 6304128
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

jjellio (Contributor) commented Mar 27, 2019

A few more caveats:

The Power9 nodes (that we have, at least) enumerate their 'CPUs' differently than Mutrino does.

Take:
numactl -H

On my wussy little desktop:

[jjellio@s1001943 jjellio]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 **13 28** 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 65451 MB
node 0 free: 24832 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 65536 MB
node 1 free: 39323 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

On Power9 Login (login != compute nodes):

[jjellio@waterman11 build]$ numactl -H 
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 65311 MB
node 0 free: 1016 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 48975 MB
node 8 free: 724 MB
node distances:
node   0   8 
  0:  10  40 
  8:  40  10 

The marked difference between the two is:
Core0 thread 0 is CPU 0
Core0 thread 1 is CPU 1 (on Power)
Core0 thread 1 is CPU 28 (on my x86)

See how my numactl info goes 0,1,2...13,28,29,30...41.
I have these weird 14-core-per-socket x86 processors, with 2 HTs per core.
Core0 th0 = 0
Core0 th1 = 28
Core1 th0 = 1
Core1 th1 = 29

On Power,
Core0 th0 = 0
Core0 th1 = 1
Core0 th2 = 2
Core0 th3 = 3
Core1 th0 = 4
Core1 th1 = 5

When you run make -j, I suspect the OS schedules on the first available CPU, which on Power means you tend to overload Core0 (you schedule on all hardware threads of Core0 FIRST, then move to another core!).

This isn't a Power-specific thing; different OSes enumerate the cores differently. If you tend to land on Core0 more often, then you also tend to allocate memory in NUMA node 0 more often. This is why 'ninja -j 80' performed vastly faster when I 'hid' 2 of the hardware threads: it forced the OS to schedule on more cores, which distributed the load across the sockets more.
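
For example, a more portable way to build that "fewer hardware threads per core" CPU list, instead of hand-coding the stride, is a sketch like this (it assumes lscpu's -p=CPU,CORE output and keeps one thread per physical core):

# Keep only the first hardware thread of each physical core, then pin the build to it.
cpus=$(lscpu -p=CPU,CORE | grep -v '^#' | awk -F, '!seen[$2]++ {print $1}' | paste -sd, -)
numactl -C "$cpus" ninja -j "$(echo "$cpus" | tr ',' '\n' | wc -l)"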

The timings I showed previously are reproducible (at least for me!), within about +/- 1 minute.

IMO, let's look at tuning the Autotester a little...

prwolfe (Contributor) commented Mar 27, 2019

@mhoemmen wrote:

Speculation:

  1. Insufficient memory for the linker: hopefully the move from 1.8G/core to 2.1G/core (#4659) will help.
  2. We missed upgrading some linker to the 64-bit version.

My responses:

  1. I think this helped, but I also don't want to keep raising the memory requirement without some understanding of the underlying issue.
  2. Do we want to require that all our customers have 64-bit linkers? Is that actually needed to provide this functionality?

prwolfe (Contributor) commented Mar 27, 2019

I did a naive pass at this and just ran htop while linking the library (gcc 7.2.0 build). The linker had a high water mark of 4.84G and produced a library 1.5G in size.

The other file we see fail is MueLu_ParameterListInterpreter.cpp.o, and that is a paltry 65K and used almost no extra memory (about 0.5G).

prwolfe (Contributor) commented Mar 27, 2019

I reconfigured for all packages instead of just MueLu and this file skyrocketed to 4.77G and a compile time of almost 5 minutes
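
For reference, one way to get that kind of per-file number without watching htop would be to wrap every compiler invocation in /usr/bin/time via CMake's launcher hook. This is only a sketch (the job count and log name are arbitrary), not what was actually run here:

# Wrap each compile so the build log records a per-file memory high-water mark.
cmake -DCMAKE_CXX_COMPILER_LAUNCHER="/usr/bin/time;-v" <usual configure options> <source dir>
ninja -j 40 2>&1 | tee build.log
# Every object then logs "Command being timed: ..." and "Maximum resident set size (kbytes): ...".
grep 'Maximum resident set size' build.log | sort -t: -k2 -n | tail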

jhux2 (Member, Author) commented Mar 27, 2019

I reconfigured for all packages instead of just MueLu and this file skyrocketed to 4.77G and a compile time of almost 5 minutes

4.77G is the size of libmuelu.a?

prwolfe (Contributor) commented Mar 27, 2019

No, that's the amount of memory the compiler used to compile this one .o file.

jjellio (Contributor) commented Mar 27, 2019

A problem with Trilinos builds in general is that dependencies are enforced at the package level. This means that we build the simple, low-memory files first, and then at the end we are building nothing but memory hogs, all at the same time.

I think it would work to grab all the .o targets from build.ninja and then randomize them. Object files will never depend on other object files, but CMake enforces the rules this way because a target could use another target to generate header/cpp files. I don't think anyone in Trilinos does this; all of our generated files happen at configure time (I think). Effectively randomizing the build order, so that the late memory-hungry targets are spread among the earlier ones, would probably avoid the insane HWM pressure at the end; something like the sketch below.
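
A rough sketch of the idea, assuming a Ninja build directory (the chunk size and job count are arbitrary):

# Ask ninja for every object-file target, shuffle them, and build in that shuffled order.
ninja -t targets all | awk -F': ' '$1 ~ /\.o$/ {print $1}' | shuf > objects_random.txt
# Ninja still honors the real dependencies; this only spreads the memory-hungry objects
# across the whole build instead of leaving them all for the end.
xargs -a objects_random.txt -n 200 ninja -j 40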

@jhux2 the stuff I sent via email can give you file-by-file HWM/size/time/etc... I can help w/that if you want.

mhoemmen (Contributor) commented:

@prwolfe Did that 4.77G figure happen with ETI ON?

mhoemmen (Contributor) commented:

@jjellio That's a good idea, though part of the reason for the high memory pressure near the end is our software design. We can work around it with randomization, but we'll always have these high-water marks unless we do factories better.

prwolfe (Contributor) commented Mar 27, 2019 via email

mhoemmen (Contributor) commented:

@prwolfe Sierra turns on more Scalar (and probably more Node) types than typical Trilinos tests, so perhaps this may motivate splitting up the MueLu library by Scalar type.

jhux2 (Member, Author) commented Mar 27, 2019

@mhoemmen wrote:

Sierra turns on more Scalar (and probably more Node) types than typical Trilinos tests, so perhaps this may motivate splitting up the MueLu library by Scalar type.

Eric P. did this recently for Stokhos; I was thinking MueLu might be able to do something similar.

jhux2 (Member, Author) commented Mar 27, 2019

However, I'm not sure that would help with the high water mark that @prwolfe mentioned.

srajama1 (Contributor) commented:

@jhux2: The idea to split based on scalars is a good one irrespective of this issue. I suggest we do this at the Trilinos level.

prwolfe (Contributor) commented Apr 3, 2019

After discussion in the Framework meeting this morning, we are moving the per-core memory requirement up to 3G (see #4800) in an attempt to make this more stable until the MueLu team can resolve the underlying issue. This will also require a reduction in the number of PRs that can run at once.
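
For a single machine, the same budget translates into choosing the job count from available memory rather than from the core count; a sketch using the 3G-per-job figure above:

# Derive the parallel job count from a 3G-per-job memory budget, capped at the core count.
mem_gb=$(free -g | awk '/^Mem:/ {print $2}')
jobs=$(( mem_gb / 3 ))
(( jobs > $(nproc) )) && jobs=$(nproc)
ninja -j "$jobs"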

jhux2 (Member, Author) commented Jun 7, 2019

@prwolfe Can this issue be closed?

jhux2 (Member, Author) commented Jun 7, 2019

It looks like upping the per-core memory and/or work on #4986 has helped.

prwolfe (Contributor) commented Jun 11, 2019

@prwolfe Can this issue be closed?

Is there another issue being used to track progress on the size issue? The increased memory limit is, after all, a band-aid for the larger issue. I looked, and current library sizes are down to 1.1G from the previous 1.5G. Did we ever set a target for library size or memory use by a single process that we can/should enforce?

At the least, we should consider another issue targeting those concerns. The way this one is worded does make me think it can be closed, as we do at least understand the issue at this point.

jhux2 (Member, Author) commented Jun 11, 2019

Is there another issue being used to track progress on the size issue? The increased memory limit is, after all, a band-aid for the larger issue. I looked, and current library sizes are down to 1.1G from the previous 1.5G.

Yes. See #3137 and #4986.

Did we ever set a target for library size or memory use by a single process that we can/should enforce?

There is no hard number identified. The high-water mark for library size occurs with the Intel debug static build, so that's the measuring stick in #3137.

jhux2 (Member, Author) commented Aug 20, 2019

@prwolfe I am closing this issue, as it's being tracked in #3137 and #4986.

jhux2 closed this as completed Aug 20, 2019