Synchronization across GPUs
In any parallel environment, threads that synchronize with each other, either through point-to-point synchronization or through collectives with implied synchronization, either have to be running concurrently (ideally) or at least have “forward progress”, which means that no thread is blocked indefinitely. If thread A (in traditional SHMEM style) is spin-waiting for something, and that spinning indefinitely blocks the execution of the thread it is waiting for, then you have a deadlock. CPU scheduling is such that all threads get to run occasionally just by time slicing, so you rarely get deadlocks like this; you just get performance bugs. GPUs don’t have scheduling like this: they tend to run threads to completion, so at present the best way to ensure that synchronization and collectives work is to limit the size of kernels so that ALL threads get to run concurrently. The simplest idea is to require that all SHMEM threads are concurrent and not to care about the forward progress guarantees, or lack thereof.
Forward Progress and Collective Launch
SYCL Forward Progress
SYCL does not guarantee forward progress. What that means is that a particular kernel invocation is broken down into multiple work-groups, one or more of which run concurrently on the GPU hardware, but in general you cannot know which work-groups are running concurrently.
Consequently, code that tries to synchronize across multiple work groups is not guaranteed to work.
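To make the hazard concrete, here is a minimal SYCL sketch (my own illustration, not Intel SHMEM code) of a cross-work-group handshake that only works if both work-groups happen to be resident at the same time. If the runtime runs work-group 0 to completion before starting work-group 1, the spin loop never exits:

```c++
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int *flag = sycl::malloc_shared<int>(1, q);
  *flag = 0;

  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{2 * 64}, sycl::range<1>{64}},
                 [=](sycl::nd_item<1> it) {
    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::device> f(*flag);
    if (it.get_group(0) == 1 && it.get_local_id(0) == 0)
      f.store(1);                // producer: the second work-group sets the flag
    if (it.get_group(0) == 0)
      while (f.load() == 0) { }  // consumer: spins forever if work-group 1 never starts
  }).wait();

  sycl::free(flag, q);
  return 0;
}
```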
It is my understanding that all the work-items of a single work-group do run concurrently, and there are built-in operations like work-group barriers that depend on this.
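By contrast, synchronization inside a single work-group is a core SYCL 2020 feature. A minimal sketch using sycl::group_barrier:

```c++
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int *out = sycl::malloc_shared<int>(4, q);

  q.submit([&](sycl::handler &h) {
    sycl::local_accessor<int, 1> scratch(sycl::range<1>{64}, h);
    h.parallel_for(sycl::nd_range<1>{sycl::range<1>{4 * 64}, sycl::range<1>{64}},
                   [=](sycl::nd_item<1> it) {
      scratch[it.get_local_id(0)] = int(it.get_local_id(0));
      // Safe: all work-items of this work-group are guaranteed to run concurrently.
      sycl::group_barrier(it.get_group());
      if (it.get_local_id(0) == 0)
        out[it.get_group(0)] = scratch[63];  // reads a value written by another work-item
    });
  }).wait();

  sycl::free(out, q);
  return 0;
}
```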
This seems reasonable. You want to run the same code on different-size GPUs, and the smaller ones can’t run a gazillion work-items at once.
Now suppose you wish to run a SHMEM program that synchronizes across GPUs. You have to do two things:
Make sure the kernel that synchronizes is actually executing concurrently on all GPUs in the SHMEM team.
Make sure that the work-group or groups that synchronize are executing concurrently on all GPUs.
To make this work, you need to schedule the kernel collectively (launch it on all GPUs at the same time), and you need to make sure the resources required by the kernel are small enough that all the work-groups can run concurrently.
The tack taken by Intel SHMEM 1.1 (?) is to supply a “collective launch” facility which might have similar semantics to the nvSHMEM feature of the same name. The idea of the collective launch API is to let the programmer determine the maximum work group sizes that can run on all the GPUs in the program such that work groups in collectively launched kernels can run concurrently. If they do, then cross-GPU synchronization and SHMEM collectives will actually work.
To do this, the SHMEM collective launcher uses experimental SYCL features to determine work-group parameters that permit all the work-groups to run concurrently (which is what you need for kernel-wide barriers on a single GPU), and then uses a SHMEM reduction to arrive at a workable set of parameters across the whole job.
The second task of collective launch is to make sure that kernels which are going to do cross-device synchronization are launched at more or less the same time on all PEs.
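Here is a rough host-side sketch of those two steps: pick a work-group count that every GPU in the job can keep resident, then launch at the same time everywhere. This is my own illustration, not the actual ishmemx collective launch API; the occupancy estimate via max_compute_units is a stand-in for the experimental SYCL queries, and the host-side reduction and barrier are written with generic OpenSHMEM 1.5 calls (shmem_int_min_reduce, shmem_barrier_all):

```c++
#include <sycl/sycl.hpp>
#include <shmem.h>

// Global (symmetric) variables so the team reduction can access them on every PE.
static int local_max_wgs;
static int job_max_wgs;

int main() {
  shmem_init();
  sycl::queue q{sycl::gpu_selector_v};

  // Step 1 (per PE): estimate how many work-groups this GPU can keep resident.
  // max_compute_units is only a crude heuristic standing in for the real occupancy query.
  local_max_wgs = int(q.get_device().get_info<sycl::info::device::max_compute_units>());

  // Step 2: take the minimum across all PEs so the kernel fits on the smallest GPU.
  shmem_int_min_reduce(SHMEM_TEAM_WORLD, &job_max_wgs, &local_max_wgs, 1);

  // Step 3: launch at (roughly) the same time on every PE.
  shmem_barrier_all();
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{size_t(job_max_wgs) * 64},
                                   sycl::range<1>{64}},
                 [=](sycl::nd_item<1> it) {
    (void)it;  // ... kernel body that does cross-PE synchronization goes here ...
  }).wait();

  shmem_finalize();
  return 0;
}
```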
Regarding work-groups... In Intel SHMEM, psync structures used by collectives are tied to teams, and a particular team can have only one collective in flight at a time. If you write code that uses a work-group collective, then each concurrently executing work-group needs its own team.
If you use the WORLD team, then you can have only one collective running, so SYCL kernels you launch should have only one work-group.
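A sketch of the team-per-work-group rule: create one duplicate of the WORLD team per co-resident work-group, so each work-group's collective has its own psync state. The ishmem entry points, constants, and headers used here (ishmem_team_split_strided, ISHMEM_TEAM_WORLD, ishmemx_int_sum_reduce_work_group, ishmemx.h) are assumptions modeled on the OpenSHMEM teams API and on Intel SHMEM's work-group extensions; check the ishmem headers for the exact names and signatures:

```c++
#include <sycl/sycl.hpp>
#include <ishmem.h>
#include <ishmemx.h>  // assumed header for the work-group extensions

constexpr int NUM_WGS = 4;  // chosen small enough that all four work-groups are co-resident

int main() {
  ishmem_init();
  sycl::queue q{sycl::gpu_selector_v};

  // One team per concurrently executing work-group. Every team spans all PEs
  // (start 0, stride 1, size n_pes); they differ only so their collectives don't collide.
  ishmem_team_t wg_team[NUM_WGS];
  for (int i = 0; i < NUM_WGS; ++i)
    ishmem_team_split_strided(ISHMEM_TEAM_WORLD, 0, 1, ishmem_n_pes(),
                              nullptr, 0, &wg_team[i]);

  int *src = (int *)ishmem_malloc(sizeof(int));  // symmetric
  int *dst = (int *)ishmem_malloc(sizeof(int));  // symmetric

  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{NUM_WGS * 64}, sycl::range<1>{64}},
                 [=](sycl::nd_item<1> it) {
    // Each work-group indexes its own team, so its collective has private psync state.
    ishmem_team_t team = wg_team[it.get_group(0)];
    ishmemx_int_sum_reduce_work_group(team, dst, src, 1, it.get_group());
  }).wait();

  ishmem_free(src);
  ishmem_free(dst);
  ishmem_finalize();
  return 0;
}
```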