Faster Dual Mesh Generation #71
Was able to double the speed of `subdivide_duals!` for 2D meshes.
Also cleaned up individual functions. Fusing similar loops and adding multithreading would be amazing for very large meshes.
Also changed the operator tests to use the faster generation, and added a `PointType` struct to grab mesh point data even when the `EmbeddedDeltaDual` structure is not guaranteed in a particular function.
It looks like there are some places where you have an extra layer of dispatch (see line comments). Can we eliminate that layer?

Is the reason to bypass the ACSet machinery the need to put all the `@inbounds`es and `@view`s around? Can you open an upstream issue on ACSets and work with @olynch to fix that?

It looks like you aren't using `incident` anywhere here. Generally, when you have a graph algorithm, if you can write it in terms of edges instead of `for all v; for n in neighbors(v)`, you can get better cache performance and better parallelism. Is that what is happening here?

Can you benchmark the two versions as part of the docs?
As for the dispatch, there might be a cleaner way of writing this code, but the current version works, and I fear that cleaner versions would need deeper changes to the CombinatorialSpaces package, specifically to the way the type hierarchy is structured.

The main reason for bypassing the ACSet machinery, as with the fast operators, is that using that machinery always costs heap allocations, which in turn slows down code significantly and balloons the memory footprint, as shown in the benchmarks above. I will definitely make an issue on ACSets about improving this, but for now I feel that I've deviated from the GPU Decapodes work for long enough that I should return to that first. Coincidentally, a lot of the FastDEC operators I wrote are improved based on that same idea: flowing down tri -> edge -> vertex instead of climbing up, which would require `incident` calls.

I can for sure add a benchmark of the two different generations.
That is great; I really like what you just wrote about flowing down Tri -> Edge -> Vertex being faster. Can you add a doc page noting that, with your benchmarks? You could write something like a Performance Tips page, and this can be the first tip: it is always better to avoid `incident` calls by working from the top-dimensional simplices down to the vertices rather than working bottom up. If we can actually purge enough of the operators code of `incident`, then we could disable the morphism indexes, which would really improve the memory performance. Any call to `incident` is going to be a locality problem, which is a cache/parallelism/GPU performance trap. Definitely an issue worth opening for follow-up.
Yeah, that Performance Tips page sounds like a good idea to disseminate what I've learned about writing performant ACSet code. In terms of performance, I completely agree with you about `incident`.
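To make the tip concrete, here is a minimal sketch of the two traversal styles. This is illustrative only: `acc` and `weight` are hypothetical, and the column names stand in for whatever the actual schema uses.

```julia
# Bottom-up: for each vertex, look up the triangles touching it.
# Each `incident` call goes through a morphism index, allocates a result
# vector, and returns triangles scattered in memory.
for v in parts(sd, :V)
  for t in incident(sd, v, :∂v0)   # allocation + poor locality, per vertex
    acc[v] += weight(sd, t)
  end
end

# Top-down: one linear sweep over the triangles, reading their vertices
# directly. Column access is sequential, so it caches and parallelizes well,
# and no index lookups are needed.
for t in parts(sd, :Tri)
  v0, v1, v2 = triangle_vertices(sd, t)
  acc[v0] += weight(sd, t)
  acc[v1] += weight(sd, t)
  acc[v2] += weight(sd, t)
end
```

The second loop also streams over the triangle table in storage order, which is exactly the access pattern that disabling the morphism indexes would leave untouched.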
Created a custom `copy_primal` instead to reduce the memory footprint of `copy_parts!`
src/DiscreteExteriorCalculus.jl
Outdated
```julia
for (i, e) in enumerate(e_range)
  tgt_v0[e] = src_v0[i]
  tgt_v1[e] = src_v1[i]
end
```
Can we do this with broadcasting? Making a view and then iterating over it seems unnecessary. I guess you could get slightly better cache performance by doing the copy in this order, but a broadcasting assignment should make the copy just as well.
The views are necessary due to the extraneous allocations caused by ACSet indexing. I could broadcast over the view, but I vaguely remember that not being as good for some reason.
Ugh, we need to fix that upstream.
These now use the ACSets instead of the views, but the iteration is still present. I prefer this for now since vectorized setting in ACSets is still not quite as optimal as looping over individual setting. Once that is totally fixed, this can be quickly refactored or just replaced with the original `copy_parts!`.
@epatters, a lot of this code is working around a lack of high-performance views in the ACSets code. Is this another consequence of adding variable attribute support? Do we have the software engineering resources to dedicate to getting this fixed upstream?
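For reference, the two styles under discussion look roughly like this (a sketch only, reusing `e_range`, `tgt_v0`, and `src_v0` from the snippet above):

```julia
# Explicit loop over individual scalar assignments (the current approach):
for (i, e) in enumerate(e_range)
  tgt_v0[e] = src_v0[i]
end

# Broadcast assignment — semantically equivalent, but only allocation-free
# when the underlying columns support cheap, non-copying views:
@views tgt_v0[e_range] .= src_v0[1:length(e_range)]
```

Whether the broadcast version matches the loop depends on the ACSet columns behaving like plain arrays under `@views`, which is exactly the upstream question raised here.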
Tests non-orientability
This reverts commit e267b10.
These should go in …
This comes as a result of ACSets v0.2.19.
This is based off of the release of ACSets v0.2.19. Updated `subdivide_duals!` benchmarks:

```julia
julia> GC.gc(true)

julia> @time subdivide_duals!(sd, Barycenter())
  2.834950 seconds (30.39 M allocations: 2.828 GiB, 6.22% gc time)

julia> GC.gc(true)

julia> @time subdivide_duals!(sd, FastMesh(), Barycenter())
  0.044771 seconds (4 allocations: 240 bytes)
```

Benchmarks for dual complex generation:

```julia
julia> GC.gc(true)

julia> @time EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s);
  0.536737 seconds (5.45 M allocations: 556.682 MiB, 10.63% gc time)

julia> @time EmbeddedDeltaDualComplex2D{Bool, Float64, Point2D}(s, FastMesh());
  0.244258 seconds (3.48 M allocations: 321.850 MiB)
```
Switched the original subdivision code for the fast code and ran the benchmarks:

```
[ Info: Beginning Dual Mesh Generation Benchmarks
[ Info: Generated Primal Mesh
[ Info: Original Dual Mesh Generation
  0.435358 seconds (5.45 M allocations: 584.804 MiB, 11.34% gc time)
[ Info: New Dual Mesh Generation
  0.343352 seconds (3.48 M allocations: 352.267 MiB, 26.67% gc time)
[ Info: Done
```
Comparing with the main CombinatorialSpaces branch with ACSets 0.2.17 active:

```
[ Info: Generated Primal Mesh
[ Info: Original Dual Mesh Generation
  8.935594 seconds (81.95 M allocations: 12.278 GiB, 11.48% gc time)
```
This is a specialized primal data copying function. I wanted to overload the original `copy_parts!`, but I couldn't make it work with the way the code is written now, by which I mean the nested functionality.
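A minimal sketch of what such a specialized copy might look like in the 1D case. The body here is hypothetical; the actual `copy_primal_1D!` in this PR may differ.

```julia
# Hypothetical sketch: copy only the primal vertex and edge data of a
# 1D delta set, using plain scalar loops instead of the generic
# copy_parts!, to avoid the allocations of the vectorized ACSet path.
function copy_primal_1D!(t::HasDeltaSet1D, s::AbstractDeltaSet1D)
  add_parts!(t, :V, nparts(s, :V))
  e_range = add_parts!(t, :E, nparts(s, :E))
  @inbounds for (i, e) in enumerate(e_range)
    t[e, :∂v0] = s[i, :∂v0]
    t[e, :∂v1] = s[i, :∂v1]
  end
  t
end
```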
Results of replacing the `copy_parts!` call:

```
[ Info: Original Dual Mesh Generation
  0.289681 seconds (2.77 M allocations: 375.576 MiB, 18.82% gc time)
[ Info: New Dual Mesh Generation
  0.334895 seconds (3.48 M allocations: 351.342 MiB, 22.53% gc time)
```
To be clear, this is because the ACSets.jl function …
I haven't tested the functions themselves head-on, but this is the only change between this commit and the previous commit. I realize this is still a workaround, but I feel like it doesn't totally circumvent ACSets like the older FastMesh code did, and I've written it to be as tolerable as possible in that respect.
This behavior is now in DiscreteExteriorCalculus. Benchmarks have also been improved.
```julia
return s

function (::Type{S})(s::AbstractDeltaSet1D) where S <: AbstractDeltaDualComplex1D
  t = S()
  copy_primal_1D!(t, s) # TODO: Revert to copy_parts! when performance is improved
```
Can you add an issue to ACSets.jl that you are having to circumvent `copy_parts!`?
Yes, I'll get one out once I've looked into the issue some more. I'm guessing this again stems from vectorized getting/setting plus other underlying issues.
```julia
for op in sort(collect(keys(results)))
  test = median(results[op])
# ...
for op in sort(collect(keys(dec_op_results)))
  test = median(dec_op_results[op])
# ...
println("Operator: $op")
```
This repeated code should be a helper function.
I thought about it, but I'm not sure how to abstract it, mainly because lots of formatting details change between suites, like output text, order of output, units of output, etc.
Also added `dual_edge_vertices`.
```diff
- point_arr[1] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
- point_arr[2] = sd[sd[sd[t, :D_∂e2], :D_∂v0], :dual_point]
- point_arr[3] = sd[sd[sd[t, :D_∂e0], :D_∂v0], :dual_point]
+ p1, p2, p3 = dual_triangle_vertices(sd, t)
```
Can you see whether the one-liner `point_arr[1:3] .= sd[[p1,p2,p3], :dual_point]` is as fast?
It won't be, since 1) you're using vectorized ACSet accessing, which isn't as good right now, and 2) you've created an array of points by creating `[p1,p2,p3]`.
`dual_triangle_vertices(sd, t)` returns an array. How much slower will the vectorized access be here? Is there type instability?
It returns an `SVector`, to be precise, which would be better than an `Array` due to no allocations. However, using vectorized access is significantly worse, since memory usage will scale with mesh size, while now it is totally constant. Essentially, you'd be killing performance gains for some aesthetics.
Once vectorized access also drops to 0 allocations, then this solution seems good to me. Is there an issue for this on ACSets.jl?
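To spell out the allocation argument from this thread (a sketch, assuming the accessors used above and a preallocated `MVector` `point_arr`):

```julia
# Scalar access: three independent lookups written into a fixed-size
# MVector, so the per-triangle heap cost is zero.
p1, p2, p3 = dual_triangle_vertices(sd, t)
point_arr[1] = sd[p1, :dual_point]
point_arr[2] = sd[p2, :dual_point]
point_arr[3] = sd[p3, :dual_point]

# Vectorized access: `[p1, p2, p3]` allocates a Vector, and the vectorized
# getindex allocates another for its result — once per triangle, every call.
point_arr[1:3] .= sd[[p1, p2, p3], :dual_point]
```

Over a large mesh, the second form's allocations scale with the number of triangles, which is the performance trap described above.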
This is still part of AlgebraicJulia/ACSets.jl#135. The issue's been closed since the fix is good enough, but it is not totally solved.
If the fix isn't actually good enough, feel free to reopen the issue.
This reverts commit e6a9a63.
This pull request improves some of the dual mesh generation code, allowing it to generate dual mesh information faster and with a lower memory footprint. The dual information is generated with the same functions, but now a `FastMesh()` is passed in as an argument to dispatch to the new code.

This code is kept separate from the original mesh generation code because it bypasses a lot of the ACSet functionality to achieve better performance. A better fix would improve the underlying ACSet accessing and setting.