
Parallelization of integrand evaluations #80

Merged
merged 27 commits into master on Sep 20, 2023
Conversation

lxvm
Contributor

@lxvm lxvm commented May 25, 2023

This is a draft PR to parallelize quadgk and quadgk! over evaluation points. Following a procedure similar to the one discussed by Gladwell in "Vectorisation of One-Dimensional Quadrature Codes" (https://doi.org/10.1007/978-94-009-3889-2_24), I remove all the segments in the heap that need to be refined and then evaluate the integrand at all of the Kronrod points of the new segments in parallel. I went with a simple Threads.@threads approach, since I wasn't sure whether something more sophisticated, such as FLoops.jl, would provide much additional benefit. Here is an example showing a speedup on a simulated costly integrand:

julia> using QuadGK

julia> s1(x) = sin(13x)     # integrand requiring subdivisions
s1 (generic function with 1 method)

julia> s2(x) = (sleep(0.005); s1(x))  # costly integrand
s2 (generic function with 1 method)

julia> Threads.nthreads()  # show threads
6

julia> p = QuadGK.Parallel(Float64, Float64, Float64) # construct parallelization buffer

julia> @time quadgk(s1, 0, 1)        # very fast
  0.000007 seconds (2 allocations: 368 bytes)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s1, 0, 1, parallel=p)  # slower because of threading overhead
  0.000239 seconds (336 allocations: 34.984 KiB)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s2, 0, 1)      # slow when sequential and costly
  0.663973 seconds (531 allocations: 16.312 KiB)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s2, 0, 1, parallel=p)   # faster when parallel and costly
  0.056169 seconds (819 allocations: 50.031 KiB)
(0.007119478349984942, 8.85874706924028e-13)

TODO:

  1. Decide on multithreading framework
  2. Reduce allocations in parevalrule for InplaceIntegrand

@lxvm lxvm changed the title from "parallel" to "Parallelization of integrand evaluations" May 25, 2023
@codecov

codecov bot commented May 25, 2023

Codecov Report

Patch coverage: 99.31% and project coverage change: +0.28% 🎉

Comparison is base (cb745d4) 97.87% compared to head (d560001) 98.16%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   97.87%   98.16%   +0.28%     
==========================================
  Files           5        6       +1     
  Lines         471      599     +128     
==========================================
+ Hits          461      588     +127     
- Misses         10       11       +1     
Files Changed Coverage Δ
src/QuadGK.jl 100.00% <ø> (ø)
src/batch.jl 99.15% <99.15%> (ø)
src/adapt.jl 96.66% <100.00%> (+0.27%) ⬆️


@stevengj
Member

stevengj commented Jun 9, 2023

I think we just want to provide the caller with the array of points to evaluate and let them parallelize it (or not) using threads, the GPU, distributed-memory, or whatever.

One way to expose this API would be to define a wrapper around the integrand, e.g.:

struct BatchIntegrand{F,X,Y}
    f!::F # in-place function f!(y, x) that takes an array of x values and outputs an array of results in-place
    max_batch::Int # maximum number of x to supply in parallel (defaults to typemax(Int))
    x::Vector{X}
    y::Vector{Y}
end
BatchIntegrand(f!::F, ::Type{X}, ::Type{Y}; max_parallel::Integer=typemax(Int)) where {F,X,Y} =
    BatchIntegrand{F,X,Y}(f!, max_parallel, X[], Y[])
BatchIntegrand(f!::F, x::Vector{X}, y::Vector{Y}; max_parallel::Integer=typemax(Int)) where {F,X,Y} =
    BatchIntegrand{F,X,Y}(f!, max_parallel, x, y)

It's a little annoying that it doesn't get X from the type of the endpoints. One possibility would be to have a constructor:

BatchIntegrand(f!::F, ::Type{Y}; max_parallel::Integer=typemax(Int)) where {F,Y} =
    BatchIntegrand{F,Nothing,Y}(f!, max_parallel, Nothing[], Y[])

and then if the quadgk function sees a BatchIntegrand{F,Nothing,Y}, it can replace it with a BatchIntegrand{F,X,Y} where it gets X from the endpoint type.
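To illustrate the intent of this design, here is a sketch of how a caller could parallelize integrand evaluations themselves with threads. The constructor signature follows the proposal above and is an assumption, not a final API; the scalar integrand f is just an example.

```julia
using QuadGK

f(x) = sin(13x)   # example scalar integrand

# The caller-supplied f! fills y with f.(x) and chooses its own
# parallelization strategy; here, plain multithreading over the batch.
function f!(y, x)
    Threads.@threads for i in eachindex(x, y)
        y[i] = f(x[i])
    end
end

# Constructor per the sketch above (assumed, not the final merged API):
# BatchIntegrand(f!, ::Type{X}, ::Type{Y})
I, E = quadgk(BatchIntegrand(f!, Float64, Float64), 0, 1)
```

The same f! could instead dispatch the batch to a GPU kernel or a distributed worker pool, which is the point of leaving the parallelization to the caller.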

@lxvm
Contributor Author

lxvm commented Jul 5, 2023

I wrote a draft implementation of BatchIntegrand that can be tested with the following example integrand

julia> using QuadGK, Unitful

julia> f(x) = exp(-(x/(1.0u"m"))^2)
f (generic function with 1 method)

julia> quadgk(f, 0.0u"m", Inf*u"m")
(0.8862269254527579 m, 8.595134596701424e-9 m)

julia> f!(y, x) = y .= f.(x)
f! (generic function with 1 method)

julia> quadgk(QuadGK.BatchIntegrand(f!, Float64, typeof(0.0u"m")), 0.0u"m", Inf*u"m")
(0.8862269254527579 m, 8.595134596694648e-9 m)

Introducing BatchIntegrand also changes the default behavior of quadgk when using relative tolerances, since it can simultaneously adapt all of the segments in the heap that exceed the "current" tolerance. Previously, the "current" tolerance was updated after each segment was refined; I have now changed the default behavior to match BatchIntegrand. I expect the computational effort to be the same if the caller performs singularity separation on their integrand. Otherwise, for integrands with multiple peaks in an interval, previous versions sometimes resolved only one dominating peak, whereas now more or all of the peaks can be captured. For example:

julia> f(x)    = sin(x)/(cos(x)+im*1e-5)   # peaked "nice" integrand
f (generic function with 1 method)

julia> imf(x)  = imag(f(x))                # peaked difficult integrand
imf (generic function with 1 method)

julia> imf2(x) = imf(x)^2                  # even more peaked
imf2 (generic function with 1 method)

julia> x0 = 0.1    # arbitrary offset between peaks
0.1

julia> using QuadGK

julia> quadgk(x -> imf2(x) + imf2(x-x0), 0, 2pi, rtol = 1e-5)   # previous versions
(235619.45750214785, 0.8317222055900457)

julia> quadgk(x -> imf2(x) + imf2(x-x0), 0, 2pi, rtol = 1e-5)   # with this pr
(628318.5307008666, 3.1850136117936505)

The implementation of BatchIntegrand is a stateful iterator yielding segments, because the state lives in the buffer of integrand evaluations, which may contain points corresponding to less than or more than one segment. The heap also acts as a stateful iterator, popping off segments one at a time until the error drops below the tolerance. To reuse as much existing code as I could (which isn't much), I try to hide the fact that these iterators are stateful by collecting their results into tuples. I do this with recursion, but I believe it leads to a lot of code generation, and there is still a lot of type instability within the iterators, which I attribute to the variables I use for iterator state. I would appreciate any ideas for mitigating the type instability (and thus allocations), as well as for iterating only once over the stateful iterators so their results need not be stored in tuples.

@lxvm
Contributor Author

lxvm commented Jul 18, 2023

I simplified the implementation of BatchIntegrand by requiring that max_batch >= (4*order+2) and updating the integral estimate after each batch of GK rules is evaluated. So now, the batch integrator refines as many segments as need to be refined for the current tolerance up to the limit of how many fit into the batch buffer at once. Since this can affect numerical results obtained with a relative tolerance, I added a docstring example on the effect of max_batch.

When refactoring the original adaptation code, I also had to change the return type of adapt to handle type-unstable integrands, so I also put the final re-summation of the integral into its own routine.

If more simplifications are possible, then I can attempt implementing them. Otherwise, it might be possible to reduce allocations in the batching code, but some may be unavoidable due to resize! for the buffers or some reshuffling of elements in the heap (which could probably be done better).
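To make the max_batch behavior described above concrete, here is a hedged sketch of how a caller might set it. The constructor signature follows the earlier proposal in this thread and is an assumption; the bound 4*order + 2 assumes the default Gauss–Kronrod order of 7.

```julia
using QuadGK

# Batched version of the integrand sin(13x):
f!(y, x) = y .= sin.(13 .* x)

# max_batch caps how many points are handed to f! at once, and hence how
# large the evaluation buffers can grow. Per this comment it must satisfy
# max_batch >= 4*order + 2 (= 30 for the default order = 7).
bf = BatchIntegrand(f!, Float64, Float64; max_batch=30)

quadgk(bf, 0, 1, rtol=1e-8)
```

A small max_batch bounds memory use but, as noted above, can change how many segments are refined per pass and therefore the exact result obtained with a relative tolerance.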

@stevengj
Member

stevengj commented Jul 18, 2023

This is looking good! Thanks for working on this!

Note that we will also want to add a new section to the manual on this, giving some explanation/motivation and a short example.

@lxvm
Contributor Author

lxvm commented Jul 18, 2023

Thanks! I'll start writing a page in the manual and will follow up on the review in another commit.

@lxvm lxvm marked this pull request as ready for review July 19, 2023 02:00
@lxvm
Contributor Author

lxvm commented Jul 19, 2023

Examples were added to the manual and the constructors were modified as previously discussed. I also added BatchIntegrand to the exported API. Thank you for reviewing this PR!

@lxvm
Contributor Author

lxvm commented Jul 19, 2023

Perhaps the only thing this PR leaves to be desired is a batched, vector-valued integrand combining BatchIntegrand and InplaceIntegrand. It would be great if this could be done by adding additional dimensions to the output array y in f!(y, x) (e.g. with https://github.com/JuliaArrays/ElasticArrays.jl), but it makes sense to do that in a future PR.

@lxvm
Contributor Author

lxvm commented Jul 19, 2023

I was able to implement a batched, in-place integrand by adding additional dimensions to the y buffer. This mostly reuses the existing code by defining a method quadgk!(::BatchIntegrand, ...) that internally creates a BatchIntegrand(InplaceIntegrand(...)). I started this on a branch, https://github.com/lxvm/QuadGK.jl/tree/batchinplace, but perhaps it would be best to have it as a package extension, since it really only works with ElasticArrays.jl.

@stevengj
Copy link
Member

Upon reflection, I think it would be more useful to make max_batch act like max_evals: a rough limit on the batch size, but not a hard limit. That way, the user doesn't have to worry about accidentally making max_batch slightly too small.

(The main purpose of max_batch is to not allow the buffer allocation to be unbounded, and for this purpose I don't think the user cares if you allocate a slightly bigger buffer. Whereas having quadgk throw an exception if you accidentally forget to change max_batch when you change the order or the minimum number of segments is a huge pain for the user.)

@lxvm
Contributor Author

lxvm commented Jul 19, 2023

I agree that making max_batch behave as a soft limit is convenient. We can implement this by always evaluating the minimum number of segments needed for any call, which can be done by changing the inner loop in refine from

while len > nsegs && 2m*(nsegs+1) <= f.max_batch && E > tol && numevals < maxevals
    ...
end

to

while len > nsegs && 2m*nsegs < f.max_batch && E > tol && numevals < maxevals
    ...
end

and by removing the errors thrown for small values of max_batch.

@lxvm
Contributor Author

lxvm commented Aug 3, 2023

@stevengj I rebased this PR on master, and I think it is ready if it also looks good to you.

@lxvm
Contributor Author

lxvm commented Sep 17, 2023

@stevengj Do you think this is finished?

@lxvm
Contributor Author

lxvm commented Sep 20, 2023

@stevengj I updated the docs and made the changes you suggested, so it should be ready to go.

There is also a straightforward path to extending this PR to in-place, batched integrands. If the y buffer is a VectorOfSimilarArrays wrapping an ElasticArray with a last dimension for batching, then the behavior of BatchIntegrand is identical, except with mutable integrand values, i.e. array views. We could make this work with a quadgk!(f!::BatchIntegrand, result, a, b) method that internally creates a BatchIntegrand(InplaceIntegrand(f!, ...), ...), then add some if-statement branches to ensure in-place operations are used on the integrand values.
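As a hypothetical sketch of the buffer layout just described (not part of this PR), an ElasticArray whose last dimension indexes the batch allows the buffer to be resized cheaply as the batch size changes:

```julia
using ElasticArrays

# Vector-valued, batched integrand: column j of y holds the result for x[j].
# The two-component integrand here is only an illustration.
function f!(y, x)
    for (j, xj) in enumerate(x)
        y[1, j] = sin(xj)
        y[2, j] = cos(xj)
    end
end

# A 2×n buffer whose last (batch) dimension can grow in place:
y = ElasticArray{Float64}(undef, 2, 0)
resize!(y, 2, 4)                    # make room for a batch of 4 points
f!(y, range(0, 1, length=4))        # fill the buffer column by column
```

Each column view of y is then a mutable integrand value, which is what would let the in-place quadgk! machinery operate on batched results.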

@stevengj stevengj merged commit ed0fb5d into JuliaMath:master Sep 20, 2023
@stevengj
Member

Thanks for keeping at it!
