[multibody] Experiment using M frame for Inverse Dynamics, for timings/sanity check #22253
Preliminary performance results for M-frame Inverse Dynamics. TL;DR: switching from W to M frame resulted in a 25% speedup for ID. Combined with previous changes, ID is 2X faster than when we started. Details:
Cassie benchmark times on my old Puget Xeon [email protected], g++ 11.4. I'm still studying this to see where we can squeeze out more speed.
Since we're trying to match Pinocchio timings (we think about 2μs for Cassie-size ID), there are some further considerations to make an apples-to-apples comparison. We've been including position & velocity kinematics in ID timings. Possibly Pinocchio is leaving kinematics fixed and just measuring the ID time alone. Also, the above timings were with gcc 11.4, which optimizes poorly compared to clang 14.0.0. And the Pinocchio timings were presumably run on a faster machine than my 7-year-old Puget. Let's see how those factors affect things. TL;DR: this gets us to 2μs, and we're within 2X even with kinematics included.
Timed on my laptop: Xeon W-11855M CPU @ 3.20GHz, using clang 14.0.0
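To make the "kinematics included" versus "ID alone" distinction concrete, here is a minimal timing sketch. It is not the Cassie benchmark harness; `AverageNanosecondsPerCall` is a hypothetical helper introduced only for illustration, and it measures wall-clock time with `std::chrono::steady_clock`.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of Drake or the Cassie benchmark).
// Returns the average wall-clock time per call of `fn`, in nanoseconds.
template <typename Fn>
double AverageNanosecondsPerCall(Fn&& fn, std::int64_t iterations) {
  const auto start = std::chrono::steady_clock::now();
  for (std::int64_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  // Convert the elapsed time to a floating-point nanosecond count.
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

With a helper like this, one measurement would pass a callable that updates position and velocity kinematics and then runs ID, while the other would update kinematics once outside the loop and pass a callable that runs ID only. The difference between the two averages is the kinematics cost being discussed here.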
Minor update: After profiling, I've been experimenting with SIMD implementations for operations that stand out:
Although all of these can be done with only a few packed floating point operations, only the last one was better than optimized C++ (according to llvm-mca in Godbolt). That's because of the many instructions required to fill and reorder the 4-element ymm registers prior to executing the packed floating point operations. For short functions the loss of inlining is also likely a problem, though I couldn't analyze that in Godbolt.

In Cassie benchmark timing, the SIMD version of re-expressing a spatial vector provided only a 2% speedup overall, so it's not worth the extra complexity. My conclusion is that we can only get real SIMD speedups with more substantial operations. I'm not seeing good candidates in kinematics and ID, but will revisit this when I get to forward dynamics.

Interestingly (to me anyway), the compilers (g++ 11.4, clang 14) managed to do a little 2-wide SIMD when working with 3-vectors, using a double-wide xmm operation followed by a scalar operation. This required much less shuffling, so the overall performance was better than I could get after the contortions required to pack the 4-wide registers. This suggests to me that it will be futile to attempt to exploit the 8-wide zmm SIMD instructions in AVX512 for small data structures, though they will certainly be useful for large operations. Moving on now ...
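For reference, re-expressing a spatial vector amounts to applying one rotation matrix to both of its 3-vector halves. The sketch below shows a plain scalar version of that operation. The types `Vec3`, `Mat3`, and `SpatialVec` and the functions `Rotate` and `ReExpress` are hypothetical stand-ins for illustration, not Drake's classes or the code in this branch.

```cpp
#include <array>

// Hypothetical minimal types for illustration; these are not Drake's classes.
using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // Row-major: R[i] is row i.

struct SpatialVec {
  Vec3 rotational;     // e.g. angular velocity or torque
  Vec3 translational;  // e.g. linear velocity or force
};

// Computes R * v with plain scalar arithmetic (9 multiplies, 6 adds).
inline Vec3 Rotate(const Mat3& R, const Vec3& v) {
  Vec3 out{};
  for (int i = 0; i < 3; ++i) {
    out[i] = R[i][0] * v[0] + R[i][1] * v[1] + R[i][2] * v[2];
  }
  return out;
}

// Re-expresses a spatial vector given in frame M in frame W, using the
// rotation matrix R_WM. Both halves are rotated by the same matrix.
inline SpatialVec ReExpress(const Mat3& R_WM, const SpatialVec& V_M) {
  return {Rotate(R_WM, V_M.rotational), Rotate(R_WM, V_M.translational)};
}
```

Each output element is a 3-term dot product, so a hand-written version for 4-wide registers has to pad or shuffle the 3-element rows before the packed arithmetic can run, which is the overhead described above.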
WIP, not intended to merge, don't review
This branch will be an experiment to integrate Alejandro's M-frame inverse dynamics prototype into Drake to see what speedups we can get in real life switching from W to M.