MueLu: errors during library creation in PR testing #4696

Closed
jhux2 opened this issue Mar 22, 2019 · 25 comments

jhux2 (Member) commented Mar 22, 2019

I've seen or heard about a number of recent failures during PR testing that occur during the creation of the MueLu library. One example is here.

Error while building C++ shared library " packages/muelu/src/libmuelu.so.12.13" in target muelu

collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.

I'd like to understand more about what's going on.

@jwillenbring @prwolfe

jhux2 added the pkg: MueLu and impacting: configure or build labels Mar 22, 2019
jhux2 self-assigned this Mar 22, 2019
jhux2 (Member, Author) commented Mar 22, 2019

@trilinos/muelu

csiefer2 (Member) commented:

#awesomelinkerfail

mhoemmen (Contributor) commented:

Speculation:

  1. Insufficient memory for the linker: hopefully the move from 1.8G/core to 2.1G/core (#4659) will help.
  2. We missed upgrading some linker to the 64-bit version.

jhux2 (Member, Author) commented Mar 22, 2019

If the issue is that the linker is limited by available memory, would breaking MueLu into separate libraries help?

jjellio (Contributor) commented Mar 27, 2019

@jhux2 I polished up some tools I used when Mutrino was very unstable. I then used them to analyze a Trilinos build on Waterman. With a little more effort, I could visualize the results.

This may not help in general, but two things help 'spread the load' at build time:

  1. Mask out some hardware threads.
  2. If there are concurrent builds, restrict each build to its own NUMA region.

E.g., an optimized build on Waterman:

Build Technique      Ninja -j   Elapsed Wallclock Time (MM:SS.ms)
Regular              -j 80      28:13.06
Masked HW threads    -j 80      22:52.25

Effectively, both builds use the same number of 'CPUs', e.g., -j 80, but one build spreads that load over more NUMA domains.

Masking hardware threads resulted in a roughly 20% faster build.

How does this relate to memory usage?

Recommendation:
Restrict each concurrent build to a specific NUMA node. The numactl -C 0,1,... argument lists the CPUs you want; the layout comes from:
numactl -H

For PR testing, you could also pin builds explicitly (Waterman NUMA nodes shown):

# Choose NUMA node 0, and CPUs 0-79, for build A
numactl -m 0 -C $(seq -s, 0 2 79) ninja -j 40 ...

# Choose NUMA node 8, and CPUs 80-159, for build B
numactl -m 8 -C $(seq -s, 80 2 159) ninja -j 40 ...

Most likely what is happening is that the builds are ganging up on NUMA node 0 (on all arches), and at random they hit their high-water marks together... and poof, you die.

Memory HWM for libmuelu:

1,093,504 KB

I get this number by using /usr/bin/time around the link call and my scripts put this info into a CSV formatted file.
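
Roughly, the wrapping amounts to something like the following sketch (the ninja target path and CSV name here are illustrative, not the actual scripts):

# Re-run just the MueLu link step under /usr/bin/time and log its high-water mark.
/usr/bin/time -v ninja packages/muelu/src/libmuelu.so.12.13 2> link_time.log
hwm_kb=$(awk -F': ' '/Maximum resident set size/ {print $2}' link_time.log)
echo "libmuelu.so.12.13,${hwm_kb}" >> link_hwm.csv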

A 1GB HWM doesn't seem obnoxious.

The command below looks horrendous, but what it does is 'hide' 2 of the 4 hardware threads per core on Power9.

The real command run is:

/usr/bin/time -v numactl -C $(seq -s, 0 2 159) ninja -j $(seq -s, 0 2 159 | tr ',' '\n' | wc -l) PanzerMiniEM_BlockPrec.exe

Which turns into:

Command being timed: "numactl -C 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158 ninja -j 80 PanzerMiniEM_BlockPrec.exe"
	User time (seconds): 45372.47
	System time (seconds): 13534.02
	Percent of CPU this job got: 4292%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 22:52.25
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3158656
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 68098703
	Voluntary context switches: 21053620
	Involuntary context switches: 1050173
	Swaps: 0
	File system inputs: 3287168
	File system outputs: 6307968
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

Regular Ninja:

	Command being timed: "ninja -j 80 PanzerMiniEM_BlockPrec.exe"
	User time (seconds): 57055.80
	System time (seconds): 57898.56
	Percent of CPU this job got: 6789%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 28:13.06
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3158656
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 69494728
	Voluntary context switches: 8858406
	Involuntary context switches: 1067195
	Swaps: 0
	File system inputs: 3102208
	File system outputs: 6304128
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

jjellio (Contributor) commented Mar 27, 2019

A few more caveats:

The Power9 nodes (that we have, at least) enumerate their 'CPUs' differently than Mutrino does.

Take:
numactl -H

On my wussy little desktop:

[jjellio@s1001943 jjellio]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 **13 28** 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 65451 MB
node 0 free: 24832 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 65536 MB
node 1 free: 39323 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

On Power9 Login (login != compute nodes):

[jjellio@waterman11 build]$ numactl -H 
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 65311 MB
node 0 free: 1016 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 48975 MB
node 8 free: 724 MB
node distances:
node   0   8 
  0:  10  40 
  8:  40  10 

The marked difference between the two is:
Core0 thread 0 is CPU 0
Core0 thread 1 is CPU 1 (on Power)
Core0 thread 1 is CPU 28 (on my x86)

See how my numactl info goes 0,1,2...13,28,29,30...41.
I have these weird 14-core-per-socket x86 processors, with 2 HTs per core.
Core0 th0 = 0
Core0 th1 = 28
Core1 th0 = 1
Core1 th1 = 29

On Power,
Core0 th0 = 0
Core0 th1 = 1
Core0 th2 = 2
Core0 th3 = 3
Core1 th0 = 4
Core1 th1 = 5

When you run make -j, I suspect the OS schedules on the first available CPU, which on Power means you tend to overload Core0 (you schedule on all hardware threads of Core0 FIRST, then move to another core!).

This isn't a Power-specific thing; different OSes enumerate the cores differently. If you tend to land on Core0 more often, then you also tend to allocate memory in NUMA node 0 more often. This is why 'ninja -j 80' performed vastly faster when I 'hid' 2 of the hardware threads: it forced the OS to schedule on more cores, which distributed the load across the sockets more.
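
For example, a more portable way to build that "fewer hardware threads per core" CPU list, instead of hand-coding the stride, is a sketch like this (it assumes lscpu's -p=CPU,CORE output and keeps one thread per physical core):

# Keep only the first hardware thread of each physical core, then pin the build to it.
cpus=$(lscpu -p=CPU,CORE | grep -v '^#' | awk -F, '!seen[$2]++ {print $1}' | paste -sd, -)
numactl -C "$cpus" ninja -j "$(echo "$cpus" | tr ',' '\n' | wc -l)"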

The timings I showed previously are reproducible (at least for me!), within about +/- 1 minute.

IMO, let's look at tuning the Autotester a little...

prwolfe (Contributor) commented Mar 27, 2019

@mhoemmen wrote:

Speculation:

  1. Insufficient memory for the linker: hopefully the move from 1.8G/core to 2.1G/core (#4659) will help.
  2. We missed upgrading some linker to the 64-bit version.

My responses:

  1. I think this helped, but I also don't want to keep raising the memory requirement without some understanding of the underlying issue.
  2. Do we want to require that all our customers have 64-bit linkers? Is that actually needed to provide this functionality?

prwolfe (Contributor) commented Mar 27, 2019

I did a naive pass at this and just ran htop while linking the library (gcc 7.2.0 build). The linker had a high water mark of 4.84G and produced a library 1.5G in size.

The other file we see fail is MueLu_ParameterListInterpreter.cpp.o, and that is a paltry 65K and used almost no extra memory (about 0.5G).

prwolfe (Contributor) commented Mar 27, 2019

I reconfigured for all packages instead of just MueLu and this file skyrocketed to 4.77G and a compile time of almost 5 minutes
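
For reference, one way to get that kind of per-file number without watching htop would be to wrap every compiler invocation in /usr/bin/time via CMake's launcher hook. This is only a sketch (the job count and log name are arbitrary), not what was actually run here:

# Wrap each compile so the build log records a per-file memory high-water mark.
cmake -DCMAKE_CXX_COMPILER_LAUNCHER="/usr/bin/time;-v" <usual configure options> <source dir>
ninja -j 40 2>&1 | tee build.log
# Every object then logs "Command being timed: ..." and "Maximum resident set size (kbytes): ...".
grep 'Maximum resident set size' build.log | sort -t: -k2 -n | tail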

jhux2 (Member, Author) commented Mar 27, 2019

I reconfigured for all packages instead of just MueLu and this file skyrocketed to 4.77G and a compile time of almost 5 minutes

4.77G is the size of libmuelu.a?

prwolfe (Contributor) commented Mar 27, 2019

No, that's the amount of memory the compiler used to compile this one .o file.

jjellio (Contributor) commented Mar 27, 2019

A problem with Trilinos builds in general is that dependencies are enforced at the package level. This means that we build the simple, low-memory files first, and then at the end we are building nothing but memory hogs, all at the same time.

I think it would work to grab all the .o targets from build.ninja and then randomize them. Object files will never depend on other object files, but CMake enforces the rules this way because a target could use another target to generate header/cpp files. I don't think anyone in Trilinos does this; all of our generated files happen at configure time (I think). Effectively randomizing the build order, so that the late memory-hungry targets are spread among the earlier ones, would probably avoid the insane HWM pressure at the end; something like the sketch below.
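
A rough sketch of the idea, assuming a Ninja build directory (the chunk size and job count are arbitrary):

# Ask ninja for every object-file target, shuffle them, and build in that shuffled order.
ninja -t targets all | awk -F': ' '$1 ~ /\.o$/ {print $1}' | shuf > objects_random.txt
# Ninja still honors the real dependencies; this only spreads the memory-hungry objects
# across the whole build instead of leaving them all for the end.
xargs -a objects_random.txt -n 200 ninja -j 40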

@jhux2 the stuff I sent via email can give you file-by-file HWM/size/time/etc... I can help w/that if you want.

mhoemmen (Contributor) commented:

@prwolfe Did that 4.77G figure happen with ETI ON?

mhoemmen (Contributor) commented:

@jjellio That's a good idea, though part of the reason for the high memory pressure near the end is our software design. We can work around it with randomization, but we'll always have these high-water marks unless we do factories better.

prwolfe (Contributor) commented Mar 27, 2019 via email

mhoemmen (Contributor) commented:

@prwolfe Sierra turns on more Scalar (and probably more Node) types than typical Trilinos tests, so perhaps this may motivate splitting up the MueLu library by Scalar type.

jhux2 (Member, Author) commented Mar 27, 2019

@mhoemmen wrote:

Sierra turns on more Scalar (and probably more Node) types than typical Trilinos tests, so perhaps this may motivate splitting up the MueLu library by Scalar type.

Eric P. did this recently for Stokhos; I was thinking MueLu might be able to do something similar.

jhux2 (Member, Author) commented Mar 27, 2019

However, I'm not sure that would help with the high water mark that @prwolfe mentioned.

srajama1 (Contributor) commented:

@jhux2: The idea to split based on scalars is a good one irrespective of this issue. I suggest we do this at the Trilinos level.

prwolfe (Contributor) commented Apr 3, 2019

After discussion in the Framework meeting this morning, we are moving the per-core memory requirement up to 3G (see #4800) in an attempt to make this more stable until the MueLu team can resolve the underlying issue. This will also require a reduction in the number of PRs that can run at once.
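
For a single machine, the same budget translates into choosing the job count from available memory rather than from the core count; a sketch using the 3G-per-job figure above:

# Derive the parallel job count from a 3G-per-job memory budget, capped at the core count.
mem_gb=$(free -g | awk '/^Mem:/ {print $2}')
jobs=$(( mem_gb / 3 ))
(( jobs > $(nproc) )) && jobs=$(nproc)
ninja -j "$jobs"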

jhux2 (Member, Author) commented Jun 7, 2019

@prwolfe Can this issue be closed?

jhux2 (Member, Author) commented Jun 7, 2019

It looks like upping the per-core memory and/or work on #4986 has helped.

prwolfe (Contributor) commented Jun 11, 2019

@prwolfe Can this issue be closed?

Is there another issue being used to track progress on the size issue? The increased memory limit is, after all, a band-aid for the larger issue. I looked, and current library sizes are down to 1.1G from the previous 1.5G. Did we ever set a target for library size or memory use by a single process that we can/should enforce?

At the least, we should consider another issue targeting those concerns. The way this one is worded does make me think it can be closed, as we do at least understand the issue at this point.

jhux2 (Member, Author) commented Jun 11, 2019

Is there another issue being used to track progress on the size issue? The increased memory limit is, after all, a band-aid for the larger issue. I looked, and current library sizes are down to 1.1G from the previous 1.5G.

Yes. See #3137 and #4986.

Did we ever set a target for library size or memory use by a single process that we can/should enforce?

There is no hard number identified. The high-water mark for library size occurs with the Intel debug static build, so that's the measuring stick in #3137.

jhux2 (Member, Author) commented Aug 20, 2019

@prwolfe I am closing this issue, as it's being tracked in #3137 and #4986.

jhux2 closed this as completed Aug 20, 2019