On a GPU with more cores than q points, the limiting speed for 1D models is the cost of the integration loop. For example, on the Radeon R9 Nano, the paracrystalline models each require about 400 ms to evaluate for nq up to 12000. Given that a typical curve has on the order of 120 points (e.g., the P123 example data sets), this suggests that only a few percent of the GPU is active for any given function evaluation (4096 cores / 128 points suggests about 3% usage).
To improve parallelism we could unroll the integration loop by evaluating the different (theta, phi) points in gauss_z x gauss_z in parallel, then summing the resulting grid in parallel. The 76x76x120-point 2D calculation for sc_paracrystal takes 7.5 ms vs 423 ms for the 1D calculation (a 56x speedup). For core-shell parallelepiped the speedup is only 5x. Even symmetric shapes such as barbell can benefit, with a 7x speedup for a 2D 76x120 pattern compared to a 1D loop over gauss_z. More speedup would be possible with specialized code, since some parts of the equation can be precomputed and shared for all points at a given q (the sphere form in the paracrystal example) or at a given theta (the C direction in the parallelepiped models).
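A minimal numpy sketch of the unrolled orientation integral, with broadcasting standing in for the per-point GPU work items; `oriented_Iq`, the angular ranges, and the sin(theta) weighting are illustrative assumptions, not the sasmodels kernel code:

```python
# Evaluate the kernel on the full (theta_i, phi_j, q_k) grid at once and
# reduce with the Gauss weights, instead of a serial loop over (theta, phi)
# inside the kernel for each q.
import numpy as np

def gauss76():
    """76-point Gauss-Legendre rule on [-1, 1] (gauss_z)."""
    return np.polynomial.legendre.leggauss(76)

def unrolled_orientation_average(oriented_Iq, q):
    z, w = gauss76()
    # Map the Gauss abscissae onto illustrative integration ranges:
    # theta in [0, pi/2], phi in [0, pi/2].
    theta = (z + 1) * (np.pi / 4)            # shape (76,)
    phi = (z + 1) * (np.pi / 4)              # shape (76,)
    # Evaluate every (theta, phi, q) combination "in parallel" by broadcasting.
    Iq = oriented_Iq(q[None, None, :],        # shape (1, 1, nq)
                     theta[:, None, None],    # shape (76, 1, 1)
                     phi[None, :, None])      # shape (1, 76, 1)
    # Parallel reduction: weighted sum over the 76 x 76 orientation grid.
    weight = np.sin(theta)[:, None] * w[:, None] * w[None, :]
    return np.einsum('ij,ijk->k', weight, Iq) / weight.sum()
```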
If we define the q points for the 2D calculator in polar coordinates, then larger rings at higher |q| could use more points, giving a simple form of adaptive integration (ticket #526).
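A sketch of what such a polar grid might look like, with the number of azimuthal points per ring growing with |q|; the ring spacing and the points-per-ring rule are arbitrary choices for illustration:

```python
import numpy as np

def polar_q_grid(q_max, n_rings=76, max_phi_points=120):
    """2D q points on concentric rings, denser azimuthally at high |q|."""
    qx, qy, weights = [], [], []
    radii = np.linspace(q_max / n_rings, q_max, n_rings)
    dq = radii[1] - radii[0]
    for r in radii:
        # Number of azimuthal points grows with the ring circumference,
        # so sampling density is roughly uniform in the qx-qy plane.
        n_phi = max(8, int(round(max_phi_points * r / q_max)))
        phi = np.linspace(0.0, 2 * np.pi, n_phi, endpoint=False)
        qx.append(r * np.cos(phi))
        qy.append(r * np.sin(phi))
        # Each point carries the annulus area it represents, for use as an
        # integration weight when averaging back down to a 1D I(q).
        weights.append(np.full(n_phi, 2 * np.pi * r * dq / n_phi))
    return np.concatenate(qx), np.concatenate(qy), np.concatenate(weights)
```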
As an alternative, we could use the same GPU in parallel from different fit processes. Testing with MPI and DREAM (4 fit parameters, 32 evaluations per step), this can give a 4x speedup for population fitters (DREAM and DE); see the sketch after the table below:
processes    time
        1    13.6
        2    13.2
        4     6.8
        8     3.8
       16     4.3
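A minimal mpi4py sketch of that setup: each fit process (MPI rank) evaluates its share of the population against the same GPU, so kernel launches from different ranks interleave and keep the device busier. `evaluate_on_gpu` and the population layout are stand-ins, not the bumps/DREAM driver:

```python
import numpy as np
from mpi4py import MPI

def evaluate_population(evaluate_on_gpu, population, q):
    """Evaluate a fitter population with every MPI rank sharing one GPU."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Round-robin split of the population (e.g. 32 evaluations per DREAM
    # step); each rank issues its own kernel launches against the GPU.
    mine = population[rank::size]
    local = [evaluate_on_gpu(pars, q) for pars in mine]
    # Collect the partial results on rank 0 (still in round-robin order).
    gathered = comm.gather(local, root=0)
    if rank != 0:
        return None
    # Re-interleave so results line up with the original population order.
    results = [None] * len(population)
    for r, chunk in enumerate(gathered):
        results[r::size] = chunk
    return results
```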
Note that turning on resolution slows the calculation down by 3x because the points below q_min and the points above q_max are computed in separate batches, each of which takes as long as the measured q set even though it may contain only one or two points. This will be fixed with ticket #839.
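One way the extra batches could be avoided (a sketch of the general idea, not necessarily what ticket #839 will implement) is to fold the extrapolated points into the same q vector as the measured points so a single kernel launch covers everything; the names below are hypothetical, not the sasmodels resolution API:

```python
import numpy as np

def smeared_Iq(kernel_Iq, q_meas, q_below, q_above, weight_matrix):
    # One combined q vector -> one kernel launch covering all points,
    # including the handful of extrapolated points outside [q_min, q_max].
    q_calc = np.concatenate([q_below, q_meas, q_above])
    Iq_calc = kernel_Iq(q_calc)
    # Resolution smearing maps the calculated points back onto q_meas;
    # weight_matrix has shape (len(q_meas), len(q_calc)).
    return weight_matrix @ Iq_calc
```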
Trac ticket metadata: reported by pkienzle on 2018-04-13; summary: improve parallelism for 1D integration models; type: enhancement; priority: major; component: SasView; work package: SasView Bug Fixing; milestone: SasView 4.3.0; status: new (last changed 2018-07-01).
Trac update at 2018/07/01 19:50:33: richardh commented:
Moving this ticket to the beta approximation project. It is not really part of that work, but Paul K would like to keep it in mind whilst making all the other changes - see also #526.
We can also unroll the polydispersity loop, allowing us to compute I(q) for different sets of shape parameters in parallel.
Depending on the number of q values, we can maintain k different I(q) computation streams, dispatching the next set of polydisperse parameters whenever a stream has completed its current result. When the entire parameter space has been explored and all streams are exhausted, sum the k partial I(q) results in parallel as the last step.
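A serial sketch of that k-stream dispatch: each "stream" accumulates weight * I(q) for the parameter sets it is handed, and a GPU would run the k streams concurrently. `kernel_Iq` and the weighted parameter sets are stand-ins for the real polydispersity machinery:

```python
import numpy as np
from itertools import cycle

def polydisperse_Iq(kernel_Iq, q, weighted_parameter_sets, k=8):
    streams = np.zeros((k, len(q)))
    norm = 0.0
    # Dispatch each (weight, pars) set to the next stream; round-robin here
    # stands in for "whichever stream has completed its current result".
    for stream, (weight, pars) in zip(cycle(range(k)), weighted_parameter_sets):
        streams[stream] += weight * kernel_Iq(q, **pars)
        norm += weight
    # Last step: combine the k partial sums (a small parallel reduction on
    # the GPU) and normalize by the total polydispersity weight.
    return streams.sum(axis=0) / norm
```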
We can additionally precompute values that are independent of q, such as the form volume, and send them to the kernel. Reparameterized models (see #211) can generate the underlying parameter values at this stage. Individual kernels may also have values that are worth precomputing, so split the kernel into a pre-calculation stage and a calculation stage. This is probably also a good time to pass the kernel arguments as a pointer to a structure rather than as a list of parameters.
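A sketch of the two-stage split using a simple sphere form factor as a stand-in model; the staging API and the `KernelCall` structure are assumptions for illustration, not the existing sasmodels kernel interface:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class KernelCall:
    radius: float
    contrast: float
    volume: float      # q-independent, filled in by the precompute stage

def precompute(radius, contrast):
    # Stage 1: everything independent of q, done once per parameter set
    # (form volume here; reparameterized models would also map their
    # user-facing parameters onto the underlying ones at this point).
    return KernelCall(radius, contrast, volume=4 / 3 * np.pi * radius**3)

def compute_Iq(q, call):
    # Stage 2: per-q work, reading its inputs from the single structure
    # rather than a long list of scalar arguments (assumes q > 0).
    qr = q * call.radius
    fq = 3 * (np.sin(qr) - qr * np.cos(qr)) / qr**3
    return (call.contrast * call.volume * fq) ** 2
```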
Migrated from http://trac.sasview.org/ticket/1091