On a GPU with more cores than q points, the limiting speed for 1D models is the cost of the integration loop. For example, on the Radeon R9 Nano, the paracrystalline models each require about 400 ms to evaluate for nq up to 12000. Given that a typical curve has on the order of 120 points (e.g., the P123 example data sets), this suggests that only a few percent of the GPU is active for any given function evaluation (4096 cores / 128 points suggests about 3% usage).
To improve parallelism we could unroll the integration loop by evaluating the different (theta, phi) points in gauss_z x gauss_z in parallel, then summing the resulting grid in parallel. The 76x76x120-point 2D calculation for sc_paracrystal takes 7.5 ms vs 423 ms for the 1D calculation (a 56x speedup). For core-shell parallelepiped the speedup is only 5x. Even symmetric shapes such as barbell can benefit, with a 7x speedup for a 2D 76x120 pattern compared to a 1D loop over gauss_z. More speedup would be possible with specialized code, since some parts of the equation can be precomputed and shared for all points at a given q (the sphere form in the paracrystal example) or at a given theta (the C direction in the parallelepiped models).
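A minimal numpy sketch of the unrolled orientation integral, with broadcasting standing in for the per-point GPU work items; `oriented_Iq`, the angular ranges, and the sin(theta) weighting are illustrative assumptions, not the sasmodels kernel code:

```python
# Evaluate the kernel on the full (theta_i, phi_j, q_k) grid at once and
# reduce with the Gauss weights, instead of a serial loop over (theta, phi)
# inside the kernel for each q.
import numpy as np

def gauss76():
    """76-point Gauss-Legendre rule on [-1, 1] (gauss_z)."""
    return np.polynomial.legendre.leggauss(76)

def unrolled_orientation_average(oriented_Iq, q):
    z, w = gauss76()
    # Map the Gauss abscissae onto illustrative integration ranges:
    # theta in [0, pi/2], phi in [0, pi/2].
    theta = (z + 1) * (np.pi / 4)            # shape (76,)
    phi = (z + 1) * (np.pi / 4)              # shape (76,)
    # Evaluate every (theta, phi, q) combination "in parallel" by broadcasting.
    Iq = oriented_Iq(q[None, None, :],        # shape (1, 1, nq)
                     theta[:, None, None],    # shape (76, 1, 1)
                     phi[None, :, None])      # shape (1, 76, 1)
    # Parallel reduction: weighted sum over the 76 x 76 orientation grid.
    weight = np.sin(theta)[:, None] * w[:, None] * w[None, :]
    return np.einsum('ij,ijk->k', weight, Iq) / weight.sum()
```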
If we define the q points for the 2D calculator in polar coordinates, then larger rings at higher |q| could use more points, giving a simple form of adaptive integration (ticket #526).
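A sketch of what such a polar grid might look like, with the number of azimuthal points per ring growing with |q|; the ring spacing and the points-per-ring rule are arbitrary choices for illustration:

```python
import numpy as np

def polar_q_grid(q_max, n_rings=76, max_phi_points=120):
    """2D q points on concentric rings, denser azimuthally at high |q|."""
    qx, qy, weights = [], [], []
    radii = np.linspace(q_max / n_rings, q_max, n_rings)
    dq = radii[1] - radii[0]
    for r in radii:
        # Number of azimuthal points grows with the ring circumference,
        # so sampling density is roughly uniform in the qx-qy plane.
        n_phi = max(8, int(round(max_phi_points * r / q_max)))
        phi = np.linspace(0.0, 2 * np.pi, n_phi, endpoint=False)
        qx.append(r * np.cos(phi))
        qy.append(r * np.sin(phi))
        # Each point carries the annulus area it represents, for use as an
        # integration weight when averaging back down to a 1D I(q).
        weights.append(np.full(n_phi, 2 * np.pi * r * dq / n_phi))
    return np.concatenate(qx), np.concatenate(qy), np.concatenate(weights)
```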
As an alternative, we could use the same GPU in parallel from different fit processes. Testing with MPI and DREAM (4 fit parameters, 32 evaluations per step), this can give a 4x speedup for population fitters (DREAM and DE); see the sketch after the table below:
processes    time
        1    13.6
        2    13.2
        4     6.8
        8     3.8
       16     4.3
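A minimal mpi4py sketch of that setup: each fit process (MPI rank) evaluates its share of the population against the same GPU, so kernel launches from different ranks interleave and keep the device busier. `evaluate_on_gpu` and the population layout are stand-ins, not the bumps/DREAM driver:

```python
import numpy as np
from mpi4py import MPI

def evaluate_population(evaluate_on_gpu, population, q):
    """Evaluate a fitter population with every MPI rank sharing one GPU."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Round-robin split of the population (e.g. 32 evaluations per DREAM
    # step); each rank issues its own kernel launches against the GPU.
    mine = population[rank::size]
    local = [evaluate_on_gpu(pars, q) for pars in mine]
    # Collect the partial results on rank 0 (still in round-robin order).
    gathered = comm.gather(local, root=0)
    if rank != 0:
        return None
    # Re-interleave so results line up with the original population order.
    results = [None] * len(population)
    for r, chunk in enumerate(gathered):
        results[r::size] = chunk
    return results
```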
Note that turning on resolution slows the calculation down by 3x because the points below q_min and the points above q_max are computed in separate batches, each of which takes as long as the measured q set even though it may contain only one or two points. This will be fixed with ticket #839.
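One way the extra batches could be avoided (a sketch of the general idea, not necessarily what ticket #839 will implement) is to fold the extrapolated points into the same q vector as the measured points so a single kernel launch covers everything; the names below are hypothetical, not the sasmodels resolution API:

```python
import numpy as np

def smeared_Iq(kernel_Iq, q_meas, q_below, q_above, weight_matrix):
    # One combined q vector -> one kernel launch covering all points,
    # including the handful of extrapolated points outside [q_min, q_max].
    q_calc = np.concatenate([q_below, q_meas, q_above])
    Iq_calc = kernel_Iq(q_calc)
    # Resolution smearing maps the calculated points back onto q_meas;
    # weight_matrix has shape (len(q_meas), len(q_calc)).
    return weight_matrix @ Iq_calc
```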
Trac ticket metadata: reported by pkienzle on 2018-04-13; summary: improve parallelism for 1D integration models; type: enhancement; priority: major; component: SasView; work package: SasView Bug Fixing; milestone: SasView 4.3.0; status: new (last changed 2018-07-01).
Trac update at 2018/07/01 19:50:33: richardh commented:
Moving this ticket to the beta approximation project. It is not really part of that work, but Paul K would like to keep it in mind whilst making all the other changes - see also #526.
We can also unroll the polydispersity loop, allowing us to compute I(q) for different sets of shape parameters in parallel.
Depending on the number of q values, we can maintain k different I(q) computation streams, dispatching the next set of polydisperse parameters whenever a stream has completed its current result. When the entire parameter space has been explored and all streams are exhausted, sum the k partial I(q) results in parallel as the last step.
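A serial sketch of that k-stream dispatch: each "stream" accumulates weight * I(q) for the parameter sets it is handed, and a GPU would run the k streams concurrently. `kernel_Iq` and the weighted parameter sets are stand-ins for the real polydispersity machinery:

```python
import numpy as np
from itertools import cycle

def polydisperse_Iq(kernel_Iq, q, weighted_parameter_sets, k=8):
    streams = np.zeros((k, len(q)))
    norm = 0.0
    # Dispatch each (weight, pars) set to the next stream; round-robin here
    # stands in for "whichever stream has completed its current result".
    for stream, (weight, pars) in zip(cycle(range(k)), weighted_parameter_sets):
        streams[stream] += weight * kernel_Iq(q, **pars)
        norm += weight
    # Last step: combine the k partial sums (a small parallel reduction on
    # the GPU) and normalize by the total polydispersity weight.
    return streams.sum(axis=0) / norm
```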
We can additionally precompute values that are independent of q, such as the form volume, and send them to the kernel. Reparameterized models (see #211) can generate the underlying parameter values at this stage. Individual kernels may also have values that are worth precomputing, so split the kernel into a pre-calculation stage and a calculation stage. This is probably also a good time to pass the kernel arguments as a pointer to a structure rather than as a list of parameters.
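A sketch of the two-stage split using a simple sphere form factor as a stand-in model; the staging API and the `KernelCall` structure are assumptions for illustration, not the existing sasmodels kernel interface:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class KernelCall:
    radius: float
    contrast: float
    volume: float      # q-independent, filled in by the precompute stage

def precompute(radius, contrast):
    # Stage 1: everything independent of q, done once per parameter set
    # (form volume here; reparameterized models would also map their
    # user-facing parameters onto the underlying ones at this point).
    return KernelCall(radius, contrast, volume=4 / 3 * np.pi * radius**3)

def compute_Iq(q, call):
    # Stage 2: per-q work, reading its inputs from the single structure
    # rather than a long list of scalar arguments (assumes q > 0).
    qr = q * call.radius
    fq = 3 * (np.sin(qr) - qr * np.cos(qr)) / qr**3
    return (call.contrast * call.volume * fq) ** 2
```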
Migrated from http://trac.sasview.org/ticket/1091