
Faster Dual Mesh Generation #71

Merged: 30 commits merged into main on May 28, 2024

Conversation

@GeorgeR227 (Contributor)

This pull request improves some of the dual mesh generation code so that it generates dual mesh information faster and with a lower memory footprint. The dual information is generated with the same functions, but a FastMesh() argument is now passed in to dispatch to the new code.

This code is kept separate from the original mesh generation code because it bypasses a lot of the ACSet functionality to achieve better performance. A better fix would improve the underlying ACSet accessing and setting.
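
For illustration, a minimal usage sketch of the new entry points (a sketch, not code from the PR; `Point2D` is defined here as in the benchmarks below, and the primal mesh construction is elided):

```julia
using CombinatorialSpaces
using StaticArrays: SVector

const Point2D = SVector{2, Float64}

# Given a primal mesh `s::EmbeddedDeltaSet2D` (construction elided), the dual
# complex is built with the same functions as before; passing FastMesh()
# dispatches to the new lower-allocation code path.
sd = EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s, FastMesh())
subdivide_duals!(sd, FastMesh(), Barycenter())

# The original path, for comparison:
sd₀ = EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s)
subdivide_duals!(sd₀, Barycenter())
```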

GeorgeR227 and others added 9 commits February 22, 2024 22:13
Was able to double the speed of subdivide_duals! for 2D meshes.
Also cleaned up individual functions. Fusing similar loops and adding multithreading would be amazing for very large meshes.
Also changed operator tests to use the faster generation. Added a PointType struct to grab mesh point data even when the EmbeddedDeltaDual structure is not guaranteed in a particular function.
@GeorgeR227 (Contributor, Author)

[screenshot: benchmark results]

Some benchmarks run on functions going from the primal mesh all the way to a fully filled-out dual mesh.

@GeorgeR227 requested a review from @lukem12345 on March 1, 2024.
@jpfairbanks (Member) left a comment


It looks like there are some places where you have an extra layer of dispatch (see line comments). Can we eliminate that layer?

Is the reason to bypass the ACSet machinery the need to put all the @inboundses and @views around? Can you open an upstream issue on ACSets and work with @olynch to fix that?

It looks like you aren't using incident anywhere here. Generally, when you have a graph algorithm, if you can write it in terms of edges instead of `for v in vertices; for n in neighbors(v)`, you can get better cache performance and better parallelism. Is that what is happening here?

Can you benchmark the two versions as part of the docs?

@GeorgeR227 (Contributor, Author)

As for the dispatch, there might be a cleaner way of writing this code, but the current version works, and I fear that cleaner versions would need deeper changes to the CombinatorialSpaces package, specifically to the way the type hierarchy is structured.

The main reason for bypassing the ACSet machinery, as with the fast operators, is that using that machinery always costs heap allocations, which in turn slows down code significantly and balloons the memory footprint, as shown in the benchmarks above. I will definitely make an issue on ACSets about improving this, but for now I personally feel that I've deviated from the GPU Decapodes work for long enough that I should return to that first.

Coincidentally, a lot of the FastDEC operators I wrote are improved based on that same idea: flowing down tri -> edge -> vertex instead of climbing up, which would require incident calls. However, this dual mesh code is essentially one-to-one with the original, just with more views and avoidance of indexing the ACSet directly, to reduce allocations.

I can for sure add a benchmark of the two different generations.

@jpfairbanks (Member)

That is great; I really like what you just wrote about flowing down Tri -> Edge -> Vertex being faster. Can you add a doc page noting that, with your benchmarks? You could write something like a Performance Tips page, and this can be the first tip: it is always better to avoid incident calls by working from the top-dimensional simplices down to the vertices rather than working bottom-up.

If we can purge enough of the operators code of incident, then we could actually disable the morphism indexes, which would really improve the memory performance. Any call to incident is going to be a locality problem, which is a cache/parallel/GPU performance trap. Definitely an issue worth opening for follow-up.
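
As a sketch of this tip (a hypothetical vertex-degree computation over a 1-D delta set, not code from this PR; `:∂v0`/`:∂v1` are the standard 1-D face maps):

```julia
using ACSets, CombinatorialSpaces

# Bottom-up: one incident() index lookup per vertex — the locality trap
# described above.
vertex_degree_incident(s) = map(vertices(s)) do v
  length(incident(s, v, :∂v0)) + length(incident(s, v, :∂v1))
end

# Top-down: a single linear pass over the edges — cache-friendly, easy to
# parallelize, and needs no morphism indexes at all.
function vertex_degree_edges(s)
  deg = zeros(Int, nv(s))
  for e in edges(s)
    deg[s[e, :∂v0]] += 1
    deg[s[e, :∂v1]] += 1
  end
  deg
end
```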

@GeorgeR227 (Contributor, Author)

Yeah, that Performance Tips page sounds like a good way to disseminate what I've learned about writing performant ACSet code. In terms of performance, I completely agree with you about incident: we should never rely on that kind of brute-force lookup of data.

Created a custom copy_primal instead, to reduce the memory footprint of copy_parts!
src/DiscreteExteriorCalculus.jl
for (i, e) in enumerate(e_range)
  tgt_v0[e] = src_v0[i]
  tgt_v1[e] = src_v1[i]
end
Member:

Can we do this with broadcasting? Making a view and then iterating over it seems unnecessary. I guess you could get slightly better cache performance by doing the copy in this order, but a broadcasting assignment should make a copy.

Contributor Author:

The views are necessary due to the extraneous allocations caused by ACSet indexing. I could broadcast over the view but I vaguely remember that not being as good for some reason.

Member:

Ugh, we need to fix that upstream.

Contributor Author:

These now use the ACSets directly instead of the views, but the iteration is still present. I prefer this for now, since vectorized setting in ACSets is still not as fast as setting individual elements in a loop. Once that is fully fixed, this can be quickly refactored or just replaced with the original copy_parts!.
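
For reference, a sketch of the two patterns discussed in this thread, using the names from the hunk above (assuming `tgt_v0`/`tgt_v1` and `src_v0`/`src_v1` are plain vector views of the edge columns; the broadcast form is the reviewer's suggestion, not code from the PR):

```julia
# Element-wise loop (current approach): scalar reads/writes keep memory
# use constant regardless of mesh size.
for (i, e) in enumerate(e_range)
  tgt_v0[e] = src_v0[i]
  tgt_v1[e] = src_v1[i]
end

# Broadcast form suggested in review: assigns in place over the views, but
# leans on vectorized access/setting being allocation-free upstream.
tgt_v0[e_range] .= src_v0
tgt_v1[e_range] .= src_v1
```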

@jpfairbanks (Member)

@epatters, a lot of this code is working around a lack of high-performance views in the ACSets code. Is this another consequence of adding variable attribute support?

Do we have the software engineering resources to dedicate to getting this fixed upstream?

@GeorgeR227 (Contributor, Author)

Backing up the need for this, here are the latest benchmarks. These are run on a triangulated grid with 1,502,001 dual vertices, 4,502,000 dual edges and 3,000,000 dual triangles.

[screenshot: benchmark results]

@GeorgeR227 (Contributor, Author)

Here are the functions used as well:

[screenshot: benchmark functions]

@lukem12345 (Member)

> Backing up the need for this, here are the latest benchmarks. These are run on a triangulated grid with 1,502,001 dual vertices, 4,502,000 dual edges and 3,000,000 dual triangles.

These should go in src/Benchmarks.jl, since we host similar comparisons there.

@lukem12345 added the enhancement (New feature or request) label on May 15, 2024.
@GeorgeR227 (Contributor, Author)

This is based on the ACSets v0.2.19 release.

Updated fast subdivide_duals! benchmarks. These improvements will be pulled into the original ACSets implementation:

julia> GC.gc(true)

julia> @time subdivide_duals!(sd, Barycenter())
  2.834950 seconds (30.39 M allocations: 2.828 GiB, 6.22% gc time)

julia> GC.gc(true)

julia> @time subdivide_duals!(sd, FastMesh(), Barycenter())
  0.044771 seconds (4 allocations: 240 bytes)

Benchmarks for EmbeddedDeltaDualComplex2D. The fast code has not been adapted, but this is meant to show that the performance of the original is nearing the fast code. As a result, the original code will probably remain as-is, since the fast code is less maintainable and no longer as clearly superior.

julia> GC.gc(true)

julia> @time EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s);
  0.536737 seconds (5.45 M allocations: 556.682 MiB, 10.63% gc time)

julia> @time EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s, FastMesh());
  0.244258 seconds (3.48 M allocations: 321.850 MiB)

@GeorgeR227 (Contributor, Author)

Switched the original subdivision code for the fast code and ran the benchmarks:

[ Info: Beginning Dual Mesh Generation Benchmarks
[ Info: Generated Primal Mesh
[ Info: Original Dual Mesh Generation
  0.435358 seconds (5.45 M allocations: 584.804 MiB, 11.34% gc time)
[ Info: New Dual Mesh Generation
  0.343352 seconds (3.48 M allocations: 352.267 MiB, 26.67% gc time)
[ Info: Done

@GeorgeR227 (Contributor, Author)

Comparing with the main CombinatorialSpaces branch with ACSets 0.2.17 active.

[ Info: Generated Primal Mesh
[ Info: Original Dual Mesh Generation
  8.935594 seconds (81.95 M allocations: 12.278 GiB, 11.48% gc time)

This is a specialized primal data copying function. I wanted to overload the original ```copy_parts!``` but I couldn't make it work with the way the code is currently written, by which I mean the nested functionality.
@GeorgeR227 (Contributor, Author)

Results of replacing copy_parts! with a more specialized primal data copying function. I feel like the gains, which push the original generation to compete with the new generation, mean this change is worth having. I've tried my best to keep the function short and safe, and to keep its style consistent with the rest of CombinatorialSpaces.

[ Info: Original Dual Mesh Generation
  0.289681 seconds (2.77 M allocations: 375.576 MiB, 18.82% gc time)
[ Info: New Dual Mesh Generation
  0.334895 seconds (3.48 M allocations: 351.342 MiB, 22.53% gc time)
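
A minimal sketch of what such a specialized copy might look like in the 1-D case (the body here is an assumption based on the discussion, not the PR's exact code; the real function would also copy attributes such as :point, and `copy_primal_1D!` appears in a hunk below):

```julia
using ACSets

# Hypothetical specialized primal copy for a 1-D complex: add the parts,
# then copy the face maps element-wise, sidestepping copy_parts! and the
# allocations of vectorized ACSet getting/setting.
function copy_primal_1D!(t, s)
  add_parts!(t, :V, nparts(s, :V))
  e_range = add_parts!(t, :E, nparts(s, :E))
  for (i, e) in enumerate(e_range)
    t[e, :∂v0] = s[i, :∂v0]
    t[e, :∂v1] = s[i, :∂v1]
  end
  t
end
```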

@lukem12345 (Member)

To be clear, this is because the ACSets.jl function copy_parts! is slower than a hand-written function that reads out the homs and attributes in a loop?

@GeorgeR227 (Contributor, Author)

I haven't tested the functions head-to-head, but since this is the only change between this commit and the previous one, copy_parts! is very likely the culprit here. I again imagine this comes down to bad constant propagation somewhere down the line, and this hand-written version helps a lot as a quick but solid fix.

I realize this is still a workaround, but I feel like it doesn't totally circumvent ACSets the way the older FastMesh code did, and I've written it to be as tolerable as possible in that respect.

This behavior is now in DiscreteExteriorCalculus. Benchmarks have also been improved.
return s

function (::Type{S})(s::AbstractDeltaSet1D) where S <: AbstractDeltaDualComplex1D
  t = S()
  copy_primal_1D!(t, s) # TODO: Revert to copy_parts! when performance is improved
Member:

Can you add an issue to ACSets.jl that you are having to circumvent copy_parts!?

Contributor Author:

Yes, I'll get one out once I've looked into the issue some more. I'm guessing this again stems from vectorized getting/setting plus other underlying issues.

test/Benchmarks.jl
for op in sort(collect(keys(results)))
  test = median(results[op])

for op in sort(collect(keys(dec_op_results)))
  test = median(dec_op_results[op])

  println("Operator: $op")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This repeated code should be a helper function.

@GeorgeR227 (Contributor, Author) commented May 24, 2024:

I thought about it, but I'm not sure how to abstract it, mainly because lots of formatting details change between suites: output text, order of output, units of output, etc.

Also added dual_edge_vertices.
point_arr[1] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
point_arr[2] = sd[sd[sd[t, :D_∂e2], :D_∂v0], :dual_point]
point_arr[3] = sd[sd[sd[t, :D_∂e0], :D_∂v0], :dual_point]
p1, p2, p3 = dual_triangle_vertices(sd, t)
Member:

Can you see whether the one-liner `point_arr[1:3] .= sd[[p1,p2,p3], :dual_point]` is as fast?

Contributor Author:

It won't be, since 1) you're using vectorized ACSet access, which isn't as fast right now, and 2) you've allocated an array of points by creating [p1,p2,p3].

Member:

dual_triangle_vertices(sd, t) returns an array. How much slower will the vectorized access be here? Is there type instability?

Contributor Author:

It returns an SVector, to be precise, which is better than an Array since it doesn't allocate. However, using vectorized access is significantly worse, since memory usage will scale with mesh size, whereas now it is totally constant. Essentially you'd be killing performance gains for some aesthetics.

Contributor Author:

Once vectorized access also drops to zero allocations, then this solution seems good to me.
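
Summarizing the thread, the two access patterns side by side (a sketch assuming `point_arr` is a preallocated length-3 buffer, as in the hunk above):

```julia
p1, p2, p3 = dual_triangle_vertices(sd, t)

# Scalar access (current code): three scalar reads, no allocation per call.
point_arr[1] = sd[p1, :dual_point]
point_arr[2] = sd[p2, :dual_point]
point_arr[3] = sd[p3, :dual_point]

# Vectorized access (the suggested one-liner): allocates [p1, p2, p3] and
# the looked-up result vector on every call, so total allocations scale
# with the number of triangles in the mesh.
point_arr[1:3] .= sd[[p1, p2, p3], :dual_point]
```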

@lukem12345 (Member) commented May 28, 2024 via email

@GeorgeR227 (Contributor, Author)

This is still part of AlgebraicJulia/ACSets.jl#135. The issue has been closed since the fix is good enough, but it is not totally solved.

@epatters (Member) commented May 28, 2024

If the fix isn't actually good enough, feel free to reopen the issue.

@lukem12345 merged commit 812ee67 into main on May 28, 2024. 7 checks passed.
KevinDCarlson pushed a commit that referenced this pull request on Jun 5, 2024.