
Support Windows Process Groups #723

Open
tdaede opened this issue Feb 7, 2020 · 4 comments

Comments

@tdaede

tdaede commented Feb 7, 2020

To use more than 64 threads on Windows, you have to use Process Groups:

https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups?redirectedfrom=MSDN

Each spawned thread must be assigned to a process group. I don't know if it makes sense to handle this in rayon or libstd or what.

@cuviper
Member

cuviper commented Feb 7, 2020

Do you know whether this is reflected in num_cpus?

If that will be limited to the 64-thread ceiling already, then I think we're in okay shape by default. We wouldn't want to oversubscribe too many rayon threads into a 64-cpu process group, but I think it's fine if we're limited to that and just don't use additional cpus.

Beyond that, I think it's in the realm of advanced tweaking that the user could deal with in ThreadPoolBuilder::spawn_handler.

See also #319 for general NUMA awareness.

@tdaede
Author

tdaede commented Feb 7, 2020

Looking at the source code of num_cpus, I believe it will max out at 64. You'd need to use GetActiveProcessorCount to get a higher number. So it will work fine as-is; it will just underutilize the machine.

Note that this is technically unrelated to NUMA - in particular the just-released Threadripper 3990X has 128 threads but only one NUMA node.

@shuffle2

the numa-related parts of chapter 4 in https://developer.amd.com/wp-content/resources/56782_1.0.pdf are relevant:

Since all the processors in a single-socket/128-logical-processor NUMA node cannot fit completely within a single Windows Processor Group, Windows creates a (virtual) secondary node to hold the additional processors.

Regardless of NPS settings, applications will need to be multi-group aware to take advantage of all the processors (otherwise their affinity will be to a single processor group).

i.e. the "NUMA node"/"processor group" Windows terminology is becoming blurred, as it is imposing limitations which don't reflect the hardware or the configuration of the hardware...

@shuffle2

shuffle2 commented Feb 13, 2020

...this means your code needs to manually move threads to other groups via something like

```cpp
#include <windows.h>

// Pin the current thread to the processor group and CPU implied by `id`.
// GROUP_AFFINITY is zero-initialized so its Reserved fields stay zero.
GROUP_AFFINITY affinity{}, affinity_prev{};
affinity.Group = id / MAXIMUM_PROC_PER_GROUP;
affinity.Mask = 1ull << (id % MAXIMUM_PROC_PER_GROUP);
SetThreadGroupAffinity(GetCurrentThread(), &affinity, &affinity_prev);
```

where the important part is .Group. AFAIK there is no "group mask" field to tell the scheduler to schedule a thread across a set of processor groups(?)
Meaning, just creating 128 threads would not be enough.
