
Faster Dual Mesh Generation #71

Merged: 30 commits merged into main on May 28, 2024

Conversation

@GeorgeR227 (Contributor)

This pull request improves some of the dual mesh generation code so that it generates dual mesh information faster and with a lower memory footprint. The dual information is generated with the same functions, but a FastMesh() argument is now passed in to dispatch to the new code.

This code is kept separate from the original mesh generation code because it bypasses a lot of the ACSet functionality to achieve better performance. A better fix would improve the underlying ACSet accessing and setting.
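
For illustration, a minimal usage sketch of the new entry points (a sketch, not code from the PR; `Point2D` is defined here as in the benchmarks below, and the primal mesh construction is elided):

```julia
using CombinatorialSpaces
using StaticArrays: SVector

const Point2D = SVector{2, Float64}

# Given a primal mesh `s::EmbeddedDeltaSet2D` (construction elided), the dual
# complex is built with the same functions as before; passing FastMesh()
# dispatches to the new lower-allocation code path.
sd = EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s, FastMesh())
subdivide_duals!(sd, FastMesh(), Barycenter())

# The original path, for comparison:
sd₀ = EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s)
subdivide_duals!(sd₀, Barycenter())
```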

GeorgeR227 and others added 9 commits February 22, 2024 22:13
Was able to double the speed of subdivide_duals! for 2D meshes.
Also cleaned up individual functions. Fusing similar loops and adding multithreading would be amazing for very large meshes.
Also changed operator tests to use the faster generation. Added a PointType struct to grab mesh point data even when the EmbeddedDeltaDual structure is not guaranteed in a particular function.
@GeorgeR227 (Contributor, Author)

[screenshot: benchmark results]

Some benchmarks run on functions going from the primal mesh all the way to a fully filled-out dual mesh.

@GeorgeR227 requested a review from @lukem12345 on March 1, 2024.
@jpfairbanks (Member) left a comment


It looks like there are some places where you have an extra layer of dispatch (see line comments). Can we eliminate that layer?

Is the reason to bypass the ACSet machinery the need to put all the @inboundses and @views around? Can you open an upstream issue on ACSets and work with @olynch to fix that?

It looks like you aren't using incident anywhere here. Generally, when you have a graph algorithm, if you can write it in terms of edges instead of `for v in vertices; for n in neighbors(v)`, you can get better cache performance and better parallelism. Is that what is happening here?

Can you benchmark the two versions as part of the docs?

@GeorgeR227 (Contributor, Author)

As for the dispatch, there might be a cleaner way of writing this code, but the current version works, and I fear that cleaner versions would need deeper changes to the CombinatorialSpaces package, specifically to the way the type hierarchy is structured.

The main reason for bypassing the ACSet machinery, as with the fast operators, is that using that machinery always costs heap allocations, which in turn slows down code significantly and balloons the memory footprint, as shown in the benchmarks above. I will definitely make an issue on ACSets about improving this, but for now I personally feel that I've deviated from the GPU Decapodes work for long enough that I should return to that first.

Coincidentally, a lot of the FastDEC operators I wrote are improved based on that same idea: flowing down tri -> edge -> vertex instead of climbing up, which would require incident calls. However, this dual mesh code is essentially one-to-one with the original, just with more views and avoidance of indexing the ACSet directly, to reduce allocations.

I can for sure add a benchmark of the two different generations.

@jpfairbanks (Member)

That is great; I really like what you just wrote about flowing down Tri -> Edge -> Vertex being faster. Can you add a doc page noting that, with your benchmarks? You could write something like a Performance Tips page, and this can be the first tip: it is always better to avoid incident calls by working from the top-dimensional simplices down to the vertices rather than working bottom-up.

If we can purge enough of the operators code of incident, then we could actually disable the morphism indexes, which would really improve the memory performance. Any call to incident is going to be a locality problem, which is a cache/parallel/GPU performance trap. Definitely an issue worth opening for follow-up.
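
As a sketch of this tip (a hypothetical vertex-degree computation over a 1-D delta set, not code from this PR; `:∂v0`/`:∂v1` are the standard 1-D face maps):

```julia
using ACSets, CombinatorialSpaces

# Bottom-up: one incident() index lookup per vertex — the locality trap
# described above.
vertex_degree_incident(s) = map(vertices(s)) do v
  length(incident(s, v, :∂v0)) + length(incident(s, v, :∂v1))
end

# Top-down: a single linear pass over the edges — cache-friendly, easy to
# parallelize, and needs no morphism indexes at all.
function vertex_degree_edges(s)
  deg = zeros(Int, nv(s))
  for e in edges(s)
    deg[s[e, :∂v0]] += 1
    deg[s[e, :∂v1]] += 1
  end
  deg
end
```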

@GeorgeR227 (Contributor, Author)

Yeah, that Performance Tips page sounds like a good way to disseminate what I've learned about writing performant ACSet code. In terms of performance, I completely agree with you about incident: we should never rely on that kind of brute-force lookup of data.

Created a custom copy_primal instead, to reduce the memory footprint of copy_parts!
src/DiscreteExteriorCalculus.jl
for (i, e) in enumerate(e_range)
  tgt_v0[e] = src_v0[i]
  tgt_v1[e] = src_v1[i]
end
Member:

Can we do this with broadcasting? Making a view and then iterating over it seems unnecessary. I guess you could get slightly better cache performance by doing the copy in this order, but a broadcasting assignment should make a copy.

Contributor Author:

The views are necessary due to the extraneous allocations caused by ACSet indexing. I could broadcast over the view but I vaguely remember that not being as good for some reason.

Member:

Ugh, we need to fix that upstream.

Contributor Author:

These now use the ACSets directly instead of the views, but the iteration is still present. I prefer this for now, since vectorized setting in ACSets is still not as fast as setting individual elements in a loop. Once that is fully fixed, this can be quickly refactored or just replaced with the original copy_parts!.
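
For reference, a sketch of the two patterns discussed in this thread, using the names from the hunk above (assuming `tgt_v0`/`tgt_v1` and `src_v0`/`src_v1` are plain vector views of the edge columns; the broadcast form is the reviewer's suggestion, not code from the PR):

```julia
# Element-wise loop (current approach): scalar reads/writes keep memory
# use constant regardless of mesh size.
for (i, e) in enumerate(e_range)
  tgt_v0[e] = src_v0[i]
  tgt_v1[e] = src_v1[i]
end

# Broadcast form suggested in review: assigns in place over the views, but
# leans on vectorized access/setting being allocation-free upstream.
tgt_v0[e_range] .= src_v0
tgt_v1[e_range] .= src_v1
```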

@jpfairbanks (Member)

@epatters, a lot of this code is working around a lack of high-performance views in the ACSets code. Is this another consequence of adding variable attribute support?

Do we have the software engineering resources to dedicate to getting this fixed upstream?

@GeorgeR227 (Contributor, Author)

Backing up the need for this, here are the latest benchmarks. These are run on a triangulated grid with 1,502,001 dual vertices, 4,502,000 dual edges and 3,000,000 dual triangles.

[screenshot: benchmark results]

@GeorgeR227 (Contributor, Author)

Here are the functions used as well:

[screenshot: benchmark functions]

@lukem12345 (Member)

> Backing up the need for this, here are the latest benchmarks. These are run on a triangulated grid with 1,502,001 dual vertices, 4,502,000 dual edges and 3,000,000 dual triangles.

These should go in src/Benchmarks.jl, since we host similar comparisons there.

@lukem12345 added the enhancement (New feature or request) label on May 15, 2024.
@GeorgeR227 (Contributor, Author)

This is based on the ACSets v0.2.19 release.

Updated fast subdivide_duals! benchmarks. These improvements will be pulled into the original ACSets implementation:

julia> GC.gc(true)

julia> @time subdivide_duals!(sd, Barycenter())
  2.834950 seconds (30.39 M allocations: 2.828 GiB, 6.22% gc time)

julia> GC.gc(true)

julia> @time subdivide_duals!(sd, FastMesh(), Barycenter())
  0.044771 seconds (4 allocations: 240 bytes)

Benchmarks for EmbeddedDeltaDualComplex2D. The fast code has not been adapted, but this is meant to show that the performance of the original is nearing the fast code. As a result, the original code will probably remain as-is, since the fast code is less maintainable and no longer as clearly superior.

julia> GC.gc(true)

julia> @time EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s);
  0.536737 seconds (5.45 M allocations: 556.682 MiB, 10.63% gc time)

julia> @time EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s, FastMesh());
  0.244258 seconds (3.48 M allocations: 321.850 MiB)

@GeorgeR227 (Contributor, Author)

Switched the original subdivision code for the fast code and ran the benchmarks:

[ Info: Beginning Dual Mesh Generation Benchmarks
[ Info: Generated Primal Mesh
[ Info: Original Dual Mesh Generation
  0.435358 seconds (5.45 M allocations: 584.804 MiB, 11.34% gc time)
[ Info: New Dual Mesh Generation
  0.343352 seconds (3.48 M allocations: 352.267 MiB, 26.67% gc time)
[ Info: Done

@GeorgeR227 (Contributor, Author)

Comparing with the main CombinatorialSpaces branch with ACSets 0.2.17 active.

[ Info: Generated Primal Mesh
[ Info: Original Dual Mesh Generation
  8.935594 seconds (81.95 M allocations: 12.278 GiB, 11.48% gc time)

This is a specialized primal data copying function. I wanted to overload the original ```copy_parts!``` but I couldn't make it work with the way the code is currently written, by which I mean the nested functionality.
@GeorgeR227 (Contributor, Author)

Results of replacing copy_parts! with a more specialized primal data copying function. I feel like the gains, which push the original generation to compete with the new generation, mean this change is worth having. I've tried my best to keep the function short and safe, and to keep its style consistent with the rest of CombinatorialSpaces.

[ Info: Original Dual Mesh Generation
  0.289681 seconds (2.77 M allocations: 375.576 MiB, 18.82% gc time)
[ Info: New Dual Mesh Generation
  0.334895 seconds (3.48 M allocations: 351.342 MiB, 22.53% gc time)
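
A minimal sketch of what such a specialized copy might look like in the 1-D case (the body here is an assumption based on the discussion, not the PR's exact code; the real function would also copy attributes such as :point, and `copy_primal_1D!` appears in a hunk below):

```julia
using ACSets

# Hypothetical specialized primal copy for a 1-D complex: add the parts,
# then copy the face maps element-wise, sidestepping copy_parts! and the
# allocations of vectorized ACSet getting/setting.
function copy_primal_1D!(t, s)
  add_parts!(t, :V, nparts(s, :V))
  e_range = add_parts!(t, :E, nparts(s, :E))
  for (i, e) in enumerate(e_range)
    t[e, :∂v0] = s[i, :∂v0]
    t[e, :∂v1] = s[i, :∂v1]
  end
  t
end
```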

@lukem12345 (Member)

To be clear, this is because the ACSets.jl function copy_parts! is slower than a hand-written function that reads out the homs and attributes in a loop?

@GeorgeR227 (Contributor, Author)

I haven't tested the functions head-to-head, but since this is the only change between this commit and the previous one, copy_parts! is very likely the culprit here. I again imagine this comes down to bad constant propagation somewhere down the line, and this hand-written version helps a lot as a quick but solid fix.

I realize this is still a workaround, but I feel like it doesn't totally circumvent ACSets the way the older FastMesh code did, and I've written it to be as tolerable as possible in that respect.

This behavior is now in DiscreteExteriorCalculus. Benchmarks have also been improved.
return s

function (::Type{S})(s::AbstractDeltaSet1D) where S <: AbstractDeltaDualComplex1D
  t = S()
  copy_primal_1D!(t, s) # TODO: Revert to copy_parts! when performance is improved
Member:

Can you add an issue to ACSets.jl that you are having to circumvent copy_parts!?

Contributor Author:

Yes, I'll get one out once I've looked into the issue some more. I'm guessing this again stems from vectorized getting/setting plus other underlying issues.

test/Benchmarks.jl
for op in sort(collect(keys(results)))
  test = median(results[op])

for op in sort(collect(keys(dec_op_results)))
  test = median(dec_op_results[op])

  println("Operator: $op")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This repeated code should be a helper function.

@GeorgeR227 (Contributor, Author) commented May 24, 2024:

I thought about it, but I'm not sure how to abstract it, mainly because lots of formatting details change between suites: output text, order of output, units of output, etc.

Also added dual_edge_vertices.
point_arr[1] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
point_arr[2] = sd[sd[sd[t, :D_∂e2], :D_∂v0], :dual_point]
point_arr[3] = sd[sd[sd[t, :D_∂e0], :D_∂v0], :dual_point]
p1, p2, p3 = dual_triangle_vertices(sd, t)
Member:

Can you see whether the one-liner `point_arr[1:3] .= sd[[p1,p2,p3], :dual_point]` is as fast?

Contributor Author:

It won't be, since 1) you're using vectorized ACSet access, which isn't as fast right now, and 2) you've allocated an array of points by creating [p1,p2,p3].

Member:

dual_triangle_vertices(sd, t) returns an array. How much slower will the vectorized access be here? Is there type instability?

Contributor Author:

It returns an SVector, to be precise, which is better than an Array since it doesn't allocate. However, using vectorized access is significantly worse, since memory usage will scale with mesh size, whereas now it is totally constant. Essentially you'd be killing performance gains for some aesthetics.

Contributor Author:

Once vectorized access also drops to zero allocations, then this solution seems good to me.
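
Summarizing the thread, the two access patterns side by side (a sketch assuming `point_arr` is a preallocated length-3 buffer, as in the hunk above):

```julia
p1, p2, p3 = dual_triangle_vertices(sd, t)

# Scalar access (current code): three scalar reads, no allocation per call.
point_arr[1] = sd[p1, :dual_point]
point_arr[2] = sd[p2, :dual_point]
point_arr[3] = sd[p3, :dual_point]

# Vectorized access (the suggested one-liner): allocates [p1, p2, p3] and
# the looked-up result vector on every call, so total allocations scale
# with the number of triangles in the mesh.
point_arr[1:3] .= sd[[p1, p2, p3], :dual_point]
```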

@lukem12345 (Member) commented May 28, 2024 via email

@GeorgeR227 (Contributor, Author)

This is still part of AlgebraicJulia/ACSets.jl#135. The issue has been closed since the fix is good enough, but it is not totally solved.

@epatters (Member) commented May 28, 2024

If the fix isn't actually good enough, feel free to reopen the issue.

@lukem12345 merged commit 812ee67 into main on May 28, 2024. 7 checks passed.
KevinDCarlson pushed a commit that referenced this pull request on Jun 5, 2024.