Optimizing Speed of Hull Algorithm #903
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #903      +/-   ##
==========================================
- Coverage   91.84%   88.46%   -3.38%
==========================================
  Files          37       62      +25
  Lines        4976     8685    +3709
  Branches        0     1056    +1056
==========================================
+ Hits         4570     7683    +3113
- Misses        406     1002     +596
This is just food for thought, and it's possible what I have explained isn't very clear, so let's discuss this in the next meeting.
Also @elalish, I know we discussed that we wanted to clean up the code a little, but I just wanted to verify whether I should do it in this PR itself; we can then rename it to "refactoring and optimizing". That won't be an issue, right?
@pca006132 knows more about atomics than I do, but my impression is that you mostly pay the atomic price when they collide (you're actually trying to modify the same address at the same time). I think if the number of faces is >= the number of disabled points, then atomics might not be too bad. It might be easier to open a second PR for readability improvements. What do you think?
Yeah, I was thinking the same, but the number of faces is much smaller, so I wasn't sure what we could do to optimize it in that situation; I just thought I'd share my thoughts so we can discuss them. I was thinking about opening a new PR as well; we can sync this later once we are satisfied with it. So I'll go ahead, start making changes in a new branch, and send a PR soon.
I think we can try to parallelize over the disabled points. I don't think we want to use the tbb vector; it is not as lightweight as a typical vector. Also, considering we have more than several dozen faces, it may be feasible to use a lock per face. If there is no contention, it can keep things simple while giving a nice performance improvement.
#904 (comment) @pca006132 Should I start breaking the Face struct according to this into three structs, or is there a better way to approach it?
Parallelizing things will likely be more beneficial.
I've already started work on it; I had some doubts I was hoping to ask you about in today's meeting, and I'll continue based on that.
@pca006132 I was wondering if we should use @elalish It seems to me like the bug of some verts being excluded is causing the change I made here to fail as well. I'll try to investigate the error again. But in the meanwhile, is it alright if I add
Just evaluate the two and see how much better it gets. We just care about actual performance (and correctness); the opinions I gave earlier are heuristics, not rules.
Yeah, that's why I tried to implement it, but I was getting errors despite the code seeming correct to me. Could you go through it once? I've added in comments the changes needed for
src/manifold/src/quickhull.cpp
Outdated
// return true;

// For ExecutionPolicy::Seq
pointMutex = 1;
What is this pointMutex thing?
It's just a variable I am using to check whether that particular point has been assigned to a face yet. I initially planned to have a lock for it and forgot to change the name; I'll rename it.
pointMutex = 1;

// Ensures atomic addition of point to face
f.faceMutex->lock();
if (!f.pointsOnPositiveSide) {
  f.pointsOnPositiveSide = getIndexVectorFromPool();
This call is not thread-safe.
The lock call isn't thread-safe?
I mean the getIndexVectorFromPool call.
Oh right, since the Pool is the same for all faces, I should use a lock specifically for the pool. Thanks for pointing that out! That helps a lot.
Do note that for every lock you are adding now, you will likely be thinking about how to remove them later...
Anyway, for now, the important thing is to make sure it works. We can gradually improve on performance.
Yes, I will keep that in mind. Also, the function call happens once per face, so it should not add much overhead.
src/manifold/src/quickhull.h
Outdated
@@ -126,6 +126,7 @@ class MeshBuilder {
  size_t visibilityCheckedOnIteration = 0;
  std::uint8_t isVisibleFaceOnCurrentIteration : 1;
  std::uint8_t inFaceStack : 1;
  std::recursive_mutex* faceMutex = new std::recursive_mutex();
And this probably should not be a recursive mutex.
I saw recursive_mutex being used in a lot of places, so I figured I could use it. I'll use std::mutex instead?
A recursive mutex is needed only when you may lock the same thing several times in the same thread. Yeah, you can probably just use std::mutex.
Btw, if you use
Oh, thanks for pointing that out!
@pca006132, I think I've got it working, and now I want to try to improve the performance. I will try both with and without parallelizing over the faces to see which gives better performance. Apart from that, I had a couple of questions.
Also, to get an idea of how a change affects average performance, I was thinking I could run it on the Thingi10k dataset, while to test larger cases where it should show improvement, I'll try the high-quality sphere and MengerSponge. Does that sound good to you?
Usually you either need a lock or an atomic, but ideally not both. I think of an atomic as a short hardware lock, and prefer them where possible.
It seems that the number of vertices output by the tictac hull is somewhat related to evaluation order?
@Kushal-Shah-03 Hi Kushal, do you plan to work on this anytime soon? We have some large refactoring recently so some code here may have to be changed to make things compile. |
Hello @pca006132, I'm sorry for the late reply; I have been really busy with my recovery and the end-of-semester exams coming up. I can try to modify the code so that it compiles, but I don't think I'd be able to work on it extensively for now. I can continue with the parallelization in December. We could possibly merge the other parallelization parts (the basic for loops we parallelized) for now, and I can work on the addPointToFace parallelization later in another PR.
@Kushal-Shah-03 This is fine. IIRC the parallelization implemented now is not giving much performance improvement? In that case, we probably don't want to merge it yet. I think we can close this PR for now, and you can open a new one when you revisit this later.
I've added tracy headers in some functions to observe the number of function calls and the time taken, to identify possible places to parallelize. I was wondering if there is a way to run just a specific test, or would I have to comment out the others?
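On running a single test: assuming the test suite is built with GoogleTest, its binaries accept a name filter, so there is no need to comment tests out. The binary name below is illustrative; substitute whatever the build produces.

```shell
# Run only the tests whose full Suite.Name matches the pattern.
./manifold_test --gtest_filter='Hull.*'

# List all registered test names first if unsure of the suite name.
./manifold_test --gtest_list_tests
```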