-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
restoring more hwloc --cpu-set behavior from OMPI 3.x #6755
Conversation
bot:ibm:retest |
bot:ibm:retest |
hwloc_topology_get_allowed_cpuset(topo) is only useful if hwloc is configured to list disallowed CPUs (those outside of our cgroup). AFAIK, that's not the case in OMPI anymore (HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM is never passed to hwloc_topology_set_flags()) |
bot:ibm:retest |
Oh gosh, I was about to ask how --report-bindings manages to display the affinity as a portion of the whole machine if WHOLE_SYSTEM isn't used... but it doesn't anymore. I guess that's a matter for a separate discussion what --report-bindings should show. I prefer the old way where the whole machine is there. Eg if I run in a tiny cgroup that contains cores 6,7,10,11 I'd prefer to see
than [BB/..][] I think I agree this makes hwloc_topology_get_allowed_cpuset() unnecessary, but the intersections with rdata->available would still be needed |
Enabling WHOLE_SYSTEM makes things much more complicated since you have to manually ignore disallowed objects when iterating, counting objects, etc. I don't remember the actual reason why it was disabled, but enabling it just for having a nicer --report-bindings would seem crazy to me. I don't see an easy way to bring back the feature. hwloc still has the info that some other PUs exist, but we don't know which cores/sockets they correspond to. If the transmitted XML contains disallowed PUs, one could load a secondary topology for --report-bindings with the WHOLE_SYSTEM flag. |
It was a very deliberate decision - the inclusion of disallowed PUs confused people (and software) that didn't realize they had to ignore them. As @bgoglin explained, it made things much easier and less confusing to users. Please don't bring it back 😄 |
I'll open a separate PR related to --report-bindings. For the purpose of this PR I like the decision to not use WHOLE_SYSTEM, and I'm curious if an analogous approach should be used with rdata->available. I'd consider --cpu-set to be an analogous concept, it's like specifying a cgroup minus the enforcement that comes with a real cgroup. It seems like the bitwise intersections with rdata->available are scattered around quite a bit in the 3.x code with lots of checks and skips in the loops like mentioned above. I'd rather see the whole topology modified early on to restrict it to rdata->available, close to topology_load time. I haven't gone down that path, but might consider it for the future if that sounds plausible to others. |
I was about to modify this PR to remove the calls to hwloc_topology_get_allowed_cpuset() since none of the topologies use WHOLE_SYSTEM anymore, but I heard there's discussion about whether the mappings are right under the new system and it might be coming back (?). I'm not clear on why they'd be wrong, empirically I'm able to put different cgroups on each host and mpirun is seeing a correct looking just-what's-available topology for each host and the assignments it makes are translating out correctly back at the ranks, at least for my runs. But anyway where does that put this PR? Here I'm restoring 3.x behavior for the --cpu-set option and currently still have code that intersects with hwloc_topology_get_allowed_cpuset() which shouldn't be harmful but might be unnecessary |
Afraid I don't know what you are talking about relative to this PR. Removing get_allowed_cpuset makes sense per the comments from @bgoglin . The other discussion related to your report-bindings PR which has some problems associated with it. |
Understood, but that doesn't have anything to do with this PR. Looking at it again, my question would be: if you remove the get_allowed_cpuset changes, is there anything left? |
The reason I brought up the teleconference conversation was to ask if the WHOLE_SYSTEM + get_allowed_cpuset() might be coming back due to the concerns about mapper output with the new non-WHOLE_SYSTEM topologies or if the former WHOLE_SYSTEM topologies are really staying gone. That question would affect whether to keep the get_allowed_cpuset() in this PR. But even if we keep the topologies as non-WHOLE_SYSTEM and remove the get_allowed_cpuset() here, there is still active code change in this PR. The |
This is also planned to be discussed on the 9 July 2019 webex. |
I reverted not just the current addition of 3.x code but also a previous checkin from april that had also added 3.x code for skipping over unavailable tree entries. And I switched to a hwloc_topology_restrict() call that makes --cpu-set get handled analogously to how a cgroup is handled. That approach always seemed more logical to me; I just hadn't bothered to see if it would work, figuring restoring 3.x behavior was safer than taking a new approach. But I bet cgroups get more attention and testing than --cpu-set, so piggybacking that approach ought to be good |
You can squash these if you like - probably need to do so to quiet the signed-off-by checker. I simply fixed some code style violations. Something to perhaps keep in mind for future PRs. I have no issue with these changes. |
Just a second - something just hit me. @bgoglin What happens if an application proc calls hwloc_topology_restrict on a shared memory topology tree owned by the daemon? What if it tries to add an object to the "userdata" field? |
@rhc54 It's not allowed, it's a read-only shared mapping. |
Well, I think it should be okay, but would like confirmation. The code we're concerned about is where I added a call at the bottom of
to shrink the topology in a manner that I expect is similar to what would happen if I had created an equivalent cgroup and had done a topology_load() from inside that cgroup. My two reasons for believing this is okay are
|
I repushed with Ralph's code violation fixes squashed into my commits. |
In bind_generic() there's a loop that picks a starting trg_obj and then walks through a loop of next = trg_obj->next_cousin until it has made total_cpus assignments. But the code doesn't accept that those assignments might not be adjacent objects. Example: % mpirun -np 2 --report-bindings --map-by ppr:2:node:pe=3 \ --cpu-set 4,5,7,8,9,11 -bind-to hwthread:overload-allowed > MCW 0 : [..../BB.B/..../....] > MCW 1 : [..../..../BB.B/....] It will want to assign 3 cpus and will loop through trg_obj 00001 (with ncpus 1) trg_obj 000001 (with ncpus 1) trg_obj 0000001 (with ncpus 0) trg_obj 000000011 (with ncpus 1) The original code on the third entry would see num_bound for the object become too high for its ncpus and think oversubscription was happening. I changed it to only ++num_bound eg to use that object if the object has cpus in its cpuset after intersected with the allowed/available masks. The error message from the original code (if you remove the overload-allowed) would be > A request was made to bind to that would result in binding more > processes than cpus on a resource: > Bind to: HWTHREAD > Node: ... > #processes: 1 > #cpus: 0 Signed-off-by: Mark Allen <[email protected]> [ This checkin also includes a squashed commit from Ralph Castain to fix code style violations. ]
I made an older checkin bdd92a7 that partially restored --cpu-set behavior using OMPI 3.x code, and I was about to restore more code from 3.x, but I'm taking a different approach here and reverting the previous addition of 3.x code too. For the (partial) revert: commit: bdd92a7. title: "-cpu-set as a constraint rather than as a binding" I'm reverting the functional changes to how the hwloc tree is iterated over, but keeping the error-detection changes modeled after 3.x. For the new approach to --cpu-set: The --cpu-set option is logically similar to running under a cgroup just without the OS-level enforcement that comes with a cgroup. For cgroups the new code loads the topology without the WHOLE_SYSTEM flag so the tree only contains what's in the cgroup. We can do the same thing with a hwloc_restrict_topology() call to constrain the topology to whatever --cpu-set the user enters. Then all the 3.x code that skips over unavailable tree entries becomes unnecessary. Here are the cases that were fixed in bdd92a7, and which are confirmed to still work with the new approach: hardware: [..../..../..../....] numbered sequentially 0-15 % mpirun -np 2 --report-bindings --bind-to hwthread \ --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0 [..../..B./..../....] > MCW rank 1 [..../...B/..../....] % mpirun -np 2 --report-bindings --bind-to hwthread \ --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0 [..../..BB/..../....] > MCW rank 1 [..../..../..../BB..] The additional case I was in progress to fix here and which is also handled by the new approach is % mpirun -np 2 --report-bindings --bind-to hwthread \ --map-by ppr:2:node:pe=3 --cpu-set 4,5,9,11,14,15 ./x > MCW rank 0 [..../BB../.B../....] > MCW rank 1 [..../..../...B/..BB] Signed-off-by: Mark Allen <[email protected]> [ This checkin also includes a squashed commit from Ralph Castain to fix code style violations. ]
Something may be off here - we have reports that cpu-set isn't working on v3.1.x, and so I'm not sure restoring it is the correct answer. Probably just need to start by looking at what the behavior actually is vs what we want it to be and fix it. I'll see what is going on. |
I think I've got the commit message updated to reflect what this commit is doing, although I guess the one-liner about "restoring behavior" is a little misleading. This PR isn't trying to restore old OMPI 3.1 code anymore. It's even reverting an older commit bdd92a7 where I had previously restored some pieces of the 3.1 code for --cpu-set behavior. Initially this PR was going to pull in even more 3.1 code, but after our discussions I've ended up reversing course and reverting most of bdd92a7 too. The new direction for supporting the old behavior is very analogous to what would happen if the user created a cgroup and all the affinity code was running inside of that. I feel like the new code should be quite a bit more reliable than the 3.1 code that had dozens of locations where it would be iterating over the hwloc tree and would have extra checks to see if the things it was iterating on were part of the rdata->available mask that comes from --cpu-set. I suspect cgroups are more frequently used and tested than -cpu-set, and they're logically doing the same thing, so I think it makes a lot of sense for them to be the same code path. |
@markalle I'm a bit confused: can you clarify what the problem is that we're solving here? I ask because I'm not entirely sure that we really know what we want our mapping/binding options to do. It feels like we have a huge sprawl of mapping/binding options, and there's decent-enough documentation of the basics in the man page, and they seem to work well ( E.g., @rhc54 and I were talking about #6966 and we realized that we didn't know the difference between It might be a good idea to have a webex to discuss: what exactly do we want? (perhaps in the context of v5.0.x) For example, maybe we could cut down/deprecate some of these older CLI options. It might be nice if everything could be done through |
I agree there's a lot of sprawl, and some options might be "one off" designs that aren't clear on their interactions with other options. My fixes are generally in response to our test team, which is probably trying out options based on the documentation more often than they're making tests based on customer usage. Running normal affinity options inside of a cgroup though is pretty fundamental (and works I think). And this option is loosely related to that. For cgroups you probably have a job scheduler who knows what resources are available, and maybe another job is on the first socket, and the current job is being given the second socket so the job scheduler creates a cgroup for the part of the machine you're allowed to be on. Then this option would be kind of a cheap mimicry for people without a job scheduler who are writing their own scripts to do things like that by hand. I wouldn't say it's the most important option in the world, but I still think it can be handy. |
Can one of the admins verify this patch? |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/a426a1c0813e015c139743456eeaf7d8 |
That's precisely the direction we went in PRRTE. I don't understand this PR at all as it (a) modifies a non-existent file in the non-existent ORTE tree and (b) modifies a file in hwloc/base that is no longer used (and should probably be removed from OMPI's master branch). Why does this PR even exist? |
I'd be willing to bet that this PR is from before we PRTE-ized. It's probably stale / moot now. |
As of 9/21 ompi call, it was decided this is too risky to take this late in the v4 series. Instead we will print a warning saying that this option is broken, as well as update the man page. |
Someone should check the behavior on master/v5.0.x to see if we need to fix the problem there. The concern with fixing the v4.0 and v4.1 branches was that it was a behavior change. In v5.0 we should do the right thing. |
There are two commits here
I made an older checkin bdd92a7
that partially restored --cpu-set behavior using OMPI 3.x code,
and I was about to restore more code from 3.x, but I'm taking a
different approach here and reverting the previous addition of
3.x code too.
For the (partial) revert:
commit: bdd92a7.
title: "-cpu-set as a constraint rather than as a binding"
I'm reverting the functional changes to how the hwloc tree is
iterated over, but keeping the error-detection changes modeled
after 3.x.
For the new approach to --cpu-set:
The --cpu-set option is logically similar to running under a cgroup
just without the OS-level enforcement that comes with a cgroup. For
cgroups the new code loads the topology without the WHOLE_SYSTEM
flag so the tree only contains what's in the cgroup. We can do the
same thing with a hwloc_topology_restrict() call to constrain the
topology to whatever --cpu-set the user enters. Then all the 3.x
code that skips over unavailable tree entries becomes unnecessary.
Here are the cases that were fixed in bdd92a7,
and which are confirmed to still work with the new approach:
hardware: [..../..../..../....] numbered sequentially 0-15
The additional case I was in progress to fix here and which
is also handled by the new approach is