Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

number of available cpus (in cpuset) #700

Merged
merged 29 commits into from
Nov 4, 2013
Merged

Conversation

wpoely86
Copy link
Member

On Linux, try to find out if we're a member of a cpuset and find the
number of CPUs in it. Use this number to set the parallelism. If we're not in a cpuset, continue with get_core_count().

This should also be done on BSD...

On Linux, try to find out if we're a member of a cpuset and find the
number of CPUs in it. If not, continue with the old behaviour.

This should also be done on BSD...
_log.info("In cpuset with %s CPUs" % numofcpus)
return numofcpus
except IOError, err:
_log.warning("Failed to read /proc/%s/status to determine the cpuset: %s" % (mypid, err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least add an else: with a warning log message that attempting get to determine number of available CPUs is not even attempted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

I think adding a pass is beter here? It's not a problem when this fails, as long as get_core_count() still works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I don't follow. When an exception is raised, it prints a warning and continues to get_core_count(). Where do you want an else:?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an else: on the same level as if os_type == LINUX:, that's run for e.g. OS X.

It's OK that it just falls through to use get_core_count, but it should be explicit in the log.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a pass is beter here? It's not a problem when this fails, as long as get_core_count() still works.

A simple pass is silent, a warning log message also passes through in the end, but at least there's something in the logs...

@wpoely86
Copy link
Member Author

All done and tested.

cpuset = re.match("^Cpus_allowed_list:\s*([0-9,-]+)", txt, re.M)
if cpuset is not None:
cpuset_list = cpuset.group(1).split(',')
numofcpus = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use num_of_cpus (improved readability)

@boegel
Copy link
Member

boegel commented Sep 30, 2013

@wpoely86: If you collapse the 7 lines of code into the (ugly ;-)) list comprehension line I proposed, then I consider this ready, but it'll have to wait until EB v1.9, since the feature freeze in in effect since last Friday.

Do you need this urgently for something particular?

As soon as I've branched off 1.8.x (soon) I can merge this into develop...

@wpoely86
Copy link
Member Author

I don't need it any time soon. The current code works, it only gives more process than cpu's.

@boegel
Copy link
Member

boegel commented Oct 1, 2013

Verified that unit tests work, this should be good to go in as soon as 1.8.x is branched off. Thanks @wpoely86!

@fgeorgatos
Copy link
Collaborator

Hi there,

how is this one being used?
I try to grok if it relates in any way to BC_CORES_PER_NODE from here:
http://hpcbios.readthedocs.org/en/latest/HPCBIOS_2012-98.html#hpcbios-2012-98

@boegel
Copy link
Member

boegel commented Oct 1, 2013

@fgeorgatos: it is used to determine the default setting for parallel, i.e. the default value for n in Make -j n.
Before this patch, we simply counted the number of available cores, which is of course wrong if your locked inside a cpuset.

@fgeorgatos
Copy link
Collaborator

OK! Could we "invent" some kind of BC_ variable that allows to override
this externally from EasyBuild,
since that would be a convenient handle in configuration management
systems, user startup files etc?

@wpoely86
Copy link
Member Author

wpoely86 commented Oct 1, 2013

I don't understand what you want to do. Putting EasyBuild in a cpuset is done externally?

You can set maxparallel in EasyConfig, if you want to control it?

@fgeorgatos
Copy link
Collaborator

I was considering the case of provisioning for something like
BC_MAXPARALLEL,
so that there is a way to control the automation externally.
(fi. we have a node with 16 decacore CPUs, this would allow for some
interesting combinations)

If you don't see the need for it, I'm fine to postpone for when the moment
makes it arise.

@wpoely86
Copy link
Member Author

wpoely86 commented Oct 2, 2013

Well, it's easy to do. We only need to check $BC_MAXPARALLEL in set_parallelism and compare it to maxparallel and take the minimum of those two.

I can add it if you find it usefull.

@boegel
Copy link
Member

boegel commented Oct 2, 2013

That should be fairly easy to support... But maybe this should be done in a more generic way, i.e. to allow that various easyconfig parameters can be easily overridden. I'm thinking $EB_EC_PARAM_MAXPARALLEL, $EB_EC_PARAM_PATCHES, $EB_EC_PARAM_IGNORETHROTTLING for ATLAS, etc.

Basically, the equivalent of the $EASYBUILD_X variables that can be set to override EasyBuild configuration parameters (installpath, buildpath, ...).

@wpoely86
Copy link
Member Author

wpoely86 commented Oct 2, 2013

Should we do it generally or eb specific? $EB_EC_PARAM_PATCHES or $EB_EC_PARAM_ATLAS_PATCHES. Things like $EB_EC_PARAM_PATCHES are only useful for a specific eb.

@boegel
Copy link
Member

boegel commented Oct 2, 2013

Hmm, good point @wpoely86... I think the $EB_EC_PARAM_ATLAS_PATCHES makes more sense, it gives more fine-grained control in case of large multi-easyconfig builds.

@fgeorgatos
Copy link
Collaborator

+1, really like the direction this is taking; giving control externally,
will require more easy tuning
via configuration management or modules, without forcing to hack the
defaults. Nice.

I'd suggest that the each time a variable is defined, the setting is
"enforced" (as opposed
to taking the minimum or such, in order to defer more control externally.
But I agree that
the builder may have a reason to check both and make a choice on that basis)

On Wed, Oct 2, 2013 at 11:06 AM, Kenneth Hoste [email protected]:

Hmm, good point @wpoely86 https://github.com/wpoely86... I think the
$EB_EC_PARAM_ATLAS_PATCHES makes more sense, it gives more fine-grained
control in case of large multi-easyconfig builds.


Reply to this email directly or view it on GitHubhttps://github.com//pull/700#issuecomment-25524054
.

@boegel
Copy link
Member

boegel commented Oct 2, 2013

The order of preference should be the same as with the EasyBuild configuration: i.e. command line (is this relevant in this case? --try-X doesn't count) has preference over environment has preference over easyconfig file.
Enforcing rather than combining values makes more sense to me, it certainly avoids potentially nasty surprises.

@wpoely86
Copy link
Member Author

wpoely86 commented Oct 2, 2013

command line is not relevant IMHO. That leaves environment > easyconfig.

I will have a look at where and how to code this. The init of the easyconfig class seems a logical place.

@boegel
Copy link
Member

boegel commented Oct 2, 2013

@wpoely86: Yes, this should be tackled in EasyConfig.__init__ in easybuild/framework/easyconfig/easyconfig.py, probably just after this (and before the validation part):

# parse easyconfig file
self.parse(path)

@boegel
Copy link
Member

boegel commented Oct 18, 2013

The required update for vsc.utils is available in wpoely86#4.

But, this seems to break the unit tests on OS X (and the get_(avail_)core_count() functionality), which is a regression...

So, maybe we should only use the sched_getaffinity approach on Linux, and fall back to the previous way on OS X?

@stdweird: How portable is sched_getaffinity supposed to be (on Linux, and on *nix)?

@wpoely86
Copy link
Member Author

According to https://developer.apple.com/library/mac/releasenotes/Performance/RN-AffinityAPI/ there is no equivalent of sched_getaffinity in darwin. Maybe we better stick to the old approach for darwin?

@boegel can you check what sysctl hw.logicalcpu tells us?

@boegel
Copy link
Member

boegel commented Oct 18, 2013

I would stick to the old approach on Darwin, yes. Getting this 100% right (w.r.t. taskset, cpusets, etc.) is more important on Linux, but we should break this on OS X (and not only because my Mac laptop is my main development platform).

On a MacBook Pro with (dual-core) Intel Core 2 Duo:

$ uname -a
Darwin Kenneths-MacBook-Pro.local 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64
$ sysctl hw.logicalcpu
hw.logicalcpu: 2

@boegel
Copy link
Member

boegel commented Nov 1, 2013

This is awaiting the merge of wpoely86#4, in which the open issues are fixed, and which makes this ready for merging in.

boegel and others added 2 commits November 2, 2013 15:00
long has unlimited precision. EB needs to be ready for systems with 64
or more cores ;-)
@wpoely86
Copy link
Member Author

wpoely86 commented Nov 3, 2013

Merged wpoely86#4

I've added one commit to use long instead of int. We want easybuild to be ready for >32 core systems afterall... 😉

@boegel
Copy link
Member

boegel commented Nov 4, 2013

@wpoely86: I see why you've added to extra commit, it makes sense, but it's not really needed.

With Python 2.6:

>>> long('fffffffffffffffffffffffffff', 16)
324518553658426726783156020576255L
>>> int('fffffffffffffffffffffffffff', 16)
324518553658426726783156020576255L

Nevertheless, I like the change, it makes it more explicit that it should be robust for > 32 cores.

This one is now ready to go in, thanks @wpoely86!

boegel added a commit that referenced this pull request Nov 4, 2013
number of available cpus (in cpuset)
@boegel boegel merged commit 4f18d16 into easybuilders:develop Nov 4, 2013
@wpoely86 wpoely86 deleted the numcpus branch November 4, 2013 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants