
Parallelization of integrand evaluations #80

Merged
merged 27 commits into master on Sep 20, 2023
Conversation

lxvm
Contributor

@lxvm lxvm commented May 25, 2023

This is a draft PR to parallelize quadgk and quadgk! over evaluation points. Following a procedure similar to the one discussed by Gladwell in "Vectorisation of One-Dimensional Quadrature Codes" (https://doi.org/10.1007/978-94-009-3889-2_24), I remove all the segments in the heap that need to be refined and then evaluate the integrand at all of the Kronrod points of the new segments in parallel. I went with a simple Threads.@threads approach, since I wasn't sure whether something more sophisticated, such as FLoops.jl, would provide much additional benefit. Here is an example showing a speedup on a simulated costly integrand:

julia> using QuadGK

julia> s1(x) = sin(13x)     # integrand requiring subdivisions
s1 (generic function with 1 method)

julia> s2(x) = (sleep(0.005); s1(x))  # costly integrand
s2 (generic function with 1 method)

julia> Threads.nthreads()  # show threads
6

julia> p = QuadGK.Parallel(Float64, Float64, Float64) # construct parallelization buffer

julia> @time quadgk(s1, 0, 1)        # very fast
  0.000007 seconds (2 allocations: 368 bytes)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s1, 0, 1, parallel=p)  # slower because of threading overhead
  0.000239 seconds (336 allocations: 34.984 KiB)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s2, 0, 1)      # slow when sequential and costly
  0.663973 seconds (531 allocations: 16.312 KiB)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s2, 0, 1, parallel=p)   # faster when parallel and costly
  0.056169 seconds (819 allocations: 50.031 KiB)
(0.007119478349984942, 8.85874706924028e-13)

TODO:

  1. Decide on multithreading framework
  2. Reduce allocations in parevalrule for InplaceIntegrand

@lxvm lxvm changed the title from "parallel" to "Parallelization of integrand evaluations" May 25, 2023
@codecov

codecov bot commented May 25, 2023

Codecov Report

Patch coverage: 99.31% and project coverage change: +0.28% 🎉

Comparison is base (cb745d4) 97.87% compared to head (d560001) 98.16%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   97.87%   98.16%   +0.28%     
==========================================
  Files           5        6       +1     
  Lines         471      599     +128     
==========================================
+ Hits          461      588     +127     
- Misses         10       11       +1     
Files Changed Coverage Δ
src/QuadGK.jl 100.00% <ø> (ø)
src/batch.jl 99.15% <99.15%> (ø)
src/adapt.jl 96.66% <100.00%> (+0.27%) ⬆️


@stevengj
Member

stevengj commented Jun 9, 2023

I think we just want to provide the caller with the array of points to evaluate and let them parallelize it (or not) using threads, the GPU, distributed-memory, or whatever.

One way to expose this API would be to define a wrapper around the integrand, e.g.:

struct BatchIntegrand{F,X,Y}
    f!::F # in-place function f!(y, x) that takes an array of x values and outputs an array of results in-place
    max_batch::Int # maximum number of x to supply in parallel (defaults to typemax(Int))
    x::Vector{X}
    y::Vector{Y}
end
BatchIntegrand(f!::F, ::Type{X}, ::Type{Y}; max_parallel::Integer=typemax(Int)) where {F,X,Y} =
    BatchIntegrand{F,X,Y}(f!, max_parallel, X[], Y[])
BatchIntegrand(f!::F, x::Vector{X}, y::Vector{Y}; max_parallel::Integer=typemax(Int)) where {F,X,Y} =
    BatchIntegrand{F,X,Y}(f!, max_parallel, x, y)

It's a little annoying that it doesn't get X from the type of the endpoints. One possibility would be to have a constructor:

BatchIntegrand(f!::F, ::Type{Y}; max_parallel::Integer=typemax(Int)) where {F,Y} =
    BatchIntegrand{F,Nothing,Y}(f!, max_parallel, Nothing[], Y[])

and then if the quadgk function sees a BatchIntegrand{F,Nothing,Y}, it can replace it with a BatchIntegrand{F,X,Y} where it gets X from the endpoint type.
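To illustrate the intent of this design, here is a sketch of how a caller could parallelize integrand evaluations themselves with threads. The constructor signature follows the proposal above and is an assumption, not a final API; the scalar integrand f is just an example.

```julia
using QuadGK

f(x) = sin(13x)   # example scalar integrand

# The caller-supplied f! fills y with f.(x) and chooses its own
# parallelization strategy; here, plain multithreading over the batch.
function f!(y, x)
    Threads.@threads for i in eachindex(x, y)
        y[i] = f(x[i])
    end
end

# Constructor per the sketch above (assumed, not the final merged API):
# BatchIntegrand(f!, ::Type{X}, ::Type{Y})
I, E = quadgk(BatchIntegrand(f!, Float64, Float64), 0, 1)
```

The same f! could instead dispatch the batch to a GPU kernel or a distributed worker pool, which is the point of leaving the parallelization to the caller.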

@lxvm
Contributor Author

lxvm commented Jul 5, 2023

I wrote a draft implementation of BatchIntegrand that can be tested with the following example integrand

julia> using QuadGK, Unitful

julia> f(x) = exp(-(x/(1.0u"m"))^2)
f (generic function with 1 method)

julia> quadgk(f, 0.0u"m", Inf*u"m")
(0.8862269254527579 m, 8.595134596701424e-9 m)

julia> f!(y, x) = y .= f.(x)
f! (generic function with 1 method)

julia> quadgk(QuadGK.BatchIntegrand(f!, Float64, typeof(0.0u"m")), 0.0u"m", Inf*u"m")
(0.8862269254527579 m, 8.595134596694648e-9 m)

Introducing BatchIntegrand also changes the default behavior of quadgk when using relative tolerances, since it can simultaneously adapt all of the segments in the heap that exceed the "current" tolerance. Previously, the "current" tolerance was updated after each segment was refined; I have now changed the default behavior to match BatchIntegrand. I expect the computational effort to be the same if the caller performs singularity separation on their integrand. Otherwise, for integrands with multiple peaks in an interval, previous versions sometimes resolved only one dominating peak, whereas now more or all of the peaks can be captured. For example:

julia> f(x)    = sin(x)/(cos(x)+im*1e-5)   # peaked "nice" integrand
f (generic function with 1 method)

julia> imf(x)  = imag(f(x))                # peaked difficult integrand
imf (generic function with 1 method)

julia> imf2(x) = imf(x)^2                  # even more peaked
imf2 (generic function with 1 method)

julia> x0 = 0.1    # arbitrary offset between peaks
0.1

julia> using QuadGK

julia> quadgk(x -> imf2(x) + imf2(x-x0), 0, 2pi, rtol = 1e-5)   # previous versions
(235619.45750214785, 0.8317222055900457)

julia> quadgk(x -> imf2(x) + imf2(x-x0), 0, 2pi, rtol = 1e-5)   # with this pr
(628318.5307008666, 3.1850136117936505)

The implementation of BatchIntegrand is a stateful iterator yielding segments, because the state lives in the buffer of integrand evaluations, which may contain points corresponding to less than or more than one segment. The heap also acts as a stateful iterator, popping off segments one at a time until the error drops below the tolerance. To reuse as much existing code as I could (which isn't much), I try to hide the fact that these iterators are stateful by collecting their results into tuples. I do this with recursion, but I believe it leads to a lot of code generation, and there is still a lot of type instability within the iterators, which I attribute to the variables I use for iterator state. I would appreciate any ideas for mitigating the type instability (and thus allocations), as well as for iterating only once over the stateful iterators so their results need not be stored in tuples.

@lxvm
Contributor Author

lxvm commented Jul 18, 2023

I simplified the implementation of BatchIntegrand by requiring that max_batch >= (4*order+2) and updating the integral estimate after each batch of GK rules is evaluated. So now, the batch integrator refines as many segments as need to be refined for the current tolerance up to the limit of how many fit into the batch buffer at once. Since this can affect numerical results obtained with a relative tolerance, I added a docstring example on the effect of max_batch.

When refactoring the original adaptation code, I also had to change the return type of adapt to handle type-unstable integrands, so I also put the final re-summation of the integral into its own routine.

If more simplifications are possible, then I can attempt implementing them. Otherwise, it might be possible to reduce allocations in the batching code, but some may be unavoidable due to resize! for the buffers or some reshuffling of elements in the heap (which could probably be done better).
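To make the max_batch behavior described above concrete, here is a hedged sketch of how a caller might set it. The constructor signature follows the earlier proposal in this thread and is an assumption; the bound 4*order + 2 assumes the default Gauss–Kronrod order of 7.

```julia
using QuadGK

# Batched version of the integrand sin(13x):
f!(y, x) = y .= sin.(13 .* x)

# max_batch caps how many points are handed to f! at once, and hence how
# large the evaluation buffers can grow. Per this comment it must satisfy
# max_batch >= 4*order + 2 (= 30 for the default order = 7).
bf = BatchIntegrand(f!, Float64, Float64; max_batch=30)

quadgk(bf, 0, 1, rtol=1e-8)
```

A small max_batch bounds memory use but, as noted above, can change how many segments are refined per pass and therefore the exact result obtained with a relative tolerance.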

@stevengj
Member

stevengj commented Jul 18, 2023

This is looking good! Thanks for working on this!

Note that we will also want to add a new section to the manual on this, giving some explanation/motivation and a short example.

@lxvm
Contributor Author

lxvm commented Jul 18, 2023

Thanks! I'll start writing a page in the manual and will follow up on the review in another commit.

@lxvm lxvm marked this pull request as ready for review July 19, 2023 02:00
@lxvm
Contributor Author

lxvm commented Jul 19, 2023

Examples were added to the manual and the constructors were modified as previously discussed. I also added BatchIntegrand to the exported API. Thank you for reviewing this PR!

@lxvm
Contributor Author

lxvm commented Jul 19, 2023

Perhaps the only thing this PR leaves to be desired is a batched, vector-valued integrand combining BatchIntegrand and InplaceIntegrand. It would be great if this could be done by adding additional dimensions to the output array y in f!(y, x) (e.g. with https://github.com/JuliaArrays/ElasticArrays.jl), but it makes sense to do that in a future PR.

@lxvm
Contributor Author

lxvm commented Jul 19, 2023

I was able to implement a batched, in-place integrand by adding additional dimensions to the y buffer. This mostly reuses the existing code by defining a method quadgk!(::BatchIntegrand, ...) that internally creates a BatchIntegrand(InplaceIntegrand(...)). I started this on a branch, https://github.com/lxvm/QuadGK.jl/tree/batchinplace, but perhaps it would be best to have it as a package extension, since it really only works with ElasticArrays.jl.

@stevengj
Copy link
Member

Upon reflection, I think it would be more useful to make max_batch act like max_evals: a rough limit on the batch size, but not a hard limit. That way, the user doesn't have to worry about accidentally making max_batch slightly too small.

(The main purpose of max_batch is to not allow the buffer allocation to be unbounded, and for this purpose I don't think the user cares if you allocate a slightly bigger buffer. Whereas having quadgk throw an exception if you accidentally forget to change max_batch when you change the order or the minimum number of segments is a huge pain for the user.)

@lxvm
Contributor Author

lxvm commented Jul 19, 2023

I agree that making max_batch behave as a soft limit is convenient. We can implement this by always evaluating the minimum number of segments needed for any call, which can be done by changing the inner loop in refine from

while len > nsegs && 2m*(nsegs+1) <= f.max_batch && E > tol && numevals < maxevals
    ...
end

to

while len > nsegs && 2m*nsegs < f.max_batch && E > tol && numevals < maxevals
    ...
end

and by removing the errors thrown for small values of max_batch.

@lxvm
Contributor Author

lxvm commented Aug 3, 2023

@stevengj I rebased this PR on master, and I think it is ready if it also looks good to you.

@lxvm
Contributor Author

lxvm commented Sep 17, 2023

@stevengj Do you think this is finished?

@lxvm
Contributor Author

lxvm commented Sep 20, 2023

@stevengj I updated the docs and made the changes you suggested, so it should be ready to go.

There is also a straightforward path to extending this PR to in-place, batched integrands. If the y buffer is a VectorOfSimilarArrays wrapping an ElasticArray with a last dimension for batching, then the behavior of BatchIntegrand is identical, except with mutable integrand values, i.e. array views. We could make this work with a quadgk!(f!::BatchIntegrand, result, a, b) method that internally creates a BatchIntegrand(InplaceIntegrand(f!, ...), ...), then add some if-statement branches to ensure in-place operations are used on the integrand values.
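As a hypothetical sketch of the buffer layout just described (not part of this PR), an ElasticArray whose last dimension indexes the batch allows the buffer to be resized cheaply as the batch size changes:

```julia
using ElasticArrays

# Vector-valued, batched integrand: column j of y holds the result for x[j].
# The two-component integrand here is only an illustration.
function f!(y, x)
    for (j, xj) in enumerate(x)
        y[1, j] = sin(xj)
        y[2, j] = cos(xj)
    end
end

# A 2×n buffer whose last (batch) dimension can grow in place:
y = ElasticArray{Float64}(undef, 2, 0)
resize!(y, 2, 4)                    # make room for a batch of 4 points
f!(y, range(0, 1, length=4))        # fill the buffer column by column
```

Each column view of y is then a mutable integrand value, which is what would let the in-place quadgk! machinery operate on batched results.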

@stevengj stevengj merged commit ed0fb5d into JuliaMath:master Sep 20, 2023
@stevengj
Member

Thanks for keeping at it!
