[multibody] Experiment using M frame for Inverse Dynamics, for timings/sanity check #22253

Draft
wants to merge 3 commits into
base: master


sherm1 commented Dec 3, 2024

WIP, not intended to merge, don't review

This branch will be an experiment to integrate Alejandro's M-frame inverse dynamics prototype into Drake to see what speedups we can get in real life switching from W to M.




sherm1 commented Dec 5, 2024

Preliminary performance results for M-frame Inverse Dynamics (ID). TL;DR: switching from the W (World) frame to the M frame gave about a 25% speedup for ID. Combined with previous changes, ID is now nearly 2X faster than when we started (21.4 μs down to 11.6 μs). Details:

| | This PR (ID-M) | Master (ID-W) | Before speedup work |
|---|---|---|---|
| time μs | 11.6 | 15.2 | 21.4 |
| speedup | 24% | 26% | -- |
| total | 46% | 26% | -- |

Cassie benchmark times on my old Puget Xeon [email protected], g++ 11.4
(ID-W is the World frame version in master, ID-M is the new M-frame method)

I'm still studying this to see where we can squeeze out more speed.
cc @amcastro-tri


sherm1 commented Dec 6, 2024

Since we're trying to match Pinocchio timings (we think about 2μs for Cassie-size ID), a few further considerations are needed for an apples-to-apples comparison. First, we've been including position & velocity kinematics in our ID timings; Pinocchio may be holding kinematics fixed and measuring the ID time alone. Second, the timings above were with gcc 11.4, which optimizes poorly compared to clang 14.0.0. Third, the Pinocchio timings were presumably run on a faster machine than my 7-year-old Puget. Let's see how those factors affect things. TL;DR: this gets us to about 2μs for ID alone, and we're within 2X even with kinematics included.

| | P+V+ID-W | P+V+ID-M | Just ID-M |
|---|---|---|---|
| time μs | 4.95 | 3.74 | 2.14 |

Timed on my laptop: Xeon W-11855M CPU @ 3.20GHz, using clang 14.0.0
(ID-W is the World frame version in master, ID-M is the new M-frame method)
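
To make the difference between the "P+V+ID" and "Just ID" columns concrete, here is a minimal timing sketch. It is not the cassie benchmark from this PR; `FakeModel`, `UpdateKinematics`, and `RunInverseDynamics` are hypothetical stand-ins with trivial workloads, used only to show where the kinematics update sits relative to the timed loop in each case.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average wall-clock microseconds per call of `work` over `reps` calls.
template <typename Work>
double AverageMicros(Work&& work, std::size_t reps) {
  assert(reps > 0);
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < reps; ++i) work();
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(reps);
}

// Hypothetical stand-in for a multibody model (NOT a Drake class).
struct FakeModel {
  double q = 0.1;              // pretend generalized position
  double cache = 0.0;          // pretend kinematics cache
  volatile double sink = 0.0;  // keeps the optimizer from deleting the work

  // Stand-in for position + velocity kinematics ("P+V").
  void UpdateKinematics() {
    q += 1e-9;
    cache = q * q + 1.0;
  }
  // Stand-in for inverse dynamics, which only reads the cache.
  void RunInverseDynamics() { sink = sink + 0.5 * cache; }
};

// "P+V+ID": kinematics are recomputed inside the timed loop.
double TimeKinematicsPlusId(FakeModel& model, std::size_t reps) {
  return AverageMicros(
      [&model] {
        model.UpdateKinematics();
        model.RunInverseDynamics();
      },
      reps);
}

// "Just ID": kinematics are computed once, outside the timed loop.
double TimeIdOnly(FakeModel& model, std::size_t reps) {
  model.UpdateKinematics();
  return AverageMicros([&model] { model.RunInverseDynamics(); }, reps);
}
```

With a real model the two functions would report different per-call times, and the gap between them is the kinematics cost that an ID-only benchmark leaves out.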

@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 2 times, most recently from a1754dd to 5adfeef Compare December 16, 2024 20:02
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 2 times, most recently from fe6adf8 to 18faf6a Compare December 20, 2024 01:23
@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 4 times, most recently from 8df8522 to 0cda315 Compare January 13, 2025 22:26

sherm1 commented Jan 13, 2025

Minor update: After profiling, I've been experimenting with SIMD implementations for operations that stand out:

  • Symmetric 3x3 matrix times 3-vector
  • Cross product w × r
  • Double cross product w × (w × r)
  • Re-express spatial vector

Although all of these can be done with only a few packed floating-point operations, only the last one was better than optimized C++ (according to llvm-mca in Godbolt). That's because of the many instructions required to fill and reorder the 4-element ymm registers before executing the packed floating-point operations. For short functions the loss of inlining is also likely a problem, though I couldn't analyze that in Godbolt.
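
For reference, the four operations listed above can be written as plain scalar C++ along the following lines. This is only a sketch: `Vec3`, `Sym3`, `Mat3`, and `SpatialVec` are hypothetical minimal types chosen to keep the example self-contained, not the types Drake uses, and the functions are not the code in this branch. The double cross product uses the identity w × (w × r) = w (w·r) − r (w·w).

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<double, 9>;  // row-major 3x3 (assumed layout)

// Symmetric 3x3 matrix stored as its six unique elements.
struct Sym3 {
  double xx, yy, zz;  // diagonal
  double xy, xz, yz;  // off-diagonal
};

// Spatial vector: a rotational part stacked on a translational part.
struct SpatialVec {
  Vec3 rotational;
  Vec3 translational;
};

// y = S * v for symmetric S.
inline Vec3 SymTimesVec(const Sym3& S, const Vec3& v) {
  return {S.xx * v[0] + S.xy * v[1] + S.xz * v[2],
          S.xy * v[0] + S.yy * v[1] + S.yz * v[2],
          S.xz * v[0] + S.yz * v[1] + S.zz * v[2]};
}

// w x r
inline Vec3 Cross(const Vec3& w, const Vec3& r) {
  return {w[1] * r[2] - w[2] * r[1],
          w[2] * r[0] - w[0] * r[2],
          w[0] * r[1] - w[1] * r[0]};
}

// w x (w x r), computed as w (w.r) - r (w.w).
inline Vec3 DoubleCross(const Vec3& w, const Vec3& r) {
  const double wr = w[0] * r[0] + w[1] * r[1] + w[2] * r[2];
  const double ww = w[0] * w[0] + w[1] * w[1] + w[2] * w[2];
  return {w[0] * wr - r[0] * ww,
          w[1] * wr - r[1] * ww,
          w[2] * wr - r[2] * ww};
}

// R * v for a row-major rotation matrix R.
inline Vec3 Rotate(const Mat3& R, const Vec3& v) {
  return {R[0] * v[0] + R[1] * v[1] + R[2] * v[2],
          R[3] * v[0] + R[4] * v[1] + R[5] * v[2],
          R[6] * v[0] + R[7] * v[1] + R[8] * v[2]};
}

// Re-express a spatial vector in another frame: rotate both halves by R.
inline SpatialVec ReExpress(const Mat3& R, const SpatialVec& V) {
  return {Rotate(R, V.rotational), Rotate(R, V.translational)};
}
```

Each of the first three is only a handful of multiplies and adds on three doubles, which is why the cost of packing operands into wide registers can outweigh the arithmetic saved. Re-expressing a spatial vector applies the same rotation to two 3-vectors, so it has more work to amortize the packing over.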

Cassiebench timing with the re-express spatial vector SIMD only provided a 2% speedup overall so it's not worth the extra complexity. My conclusion is that we can only get real SIMD speedups with more substantial operations. I'm not seeing good candidates in kinematics and ID, but will revisit this when I get to forward dynamics.

Interestingly (to me anyway) the compilers (g++ 11.4, clang 14) managed to do a little 2-wide SIMD when working with 3-vectors, using a double wide xmm operation followed by a scalar operation. This required much less shuffling so the overall performance was better than I could get after the contortions required to pack the 4-wide registers. This suggests to me that it will be futile to attempt to exploit the 8-wide zmm SIMD instructions in AVX512 for small data structures -- they will certainly be useful for large operations though.
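
The "2-wide plus scalar" pattern described above can be imitated by hand with SSE2 intrinsics. The sketch below is an illustration of the idea only, not the compilers' actual output and not code from this branch; `CrossTwoPlusOne` and `Vec3` are hypothetical names. It computes the x and y components of w × r in one 128-bit xmm register and the z component as a scalar, with a plain scalar fallback when SSE2 is not available.

```cpp
#include <array>
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: __m128d and the _mm_*_pd intrinsics
#endif

using Vec3 = std::array<double, 3>;

// w x r, with x and y computed 2-wide and z computed as a scalar.
inline Vec3 CrossTwoPlusOne(const Vec3& w, const Vec3& r) {
#if defined(__SSE2__)
  // Contiguous loads: lane0 = element 1, lane1 = element 2.
  const __m128d w_yz = _mm_loadu_pd(&w[1]);  // (w1, w2)
  const __m128d r_yz = _mm_loadu_pd(&r[1]);  // (r1, r2)
  // _mm_set_pd(hi, lo): lane0 = element 2, lane1 = element 0.
  const __m128d w_zx = _mm_set_pd(w[0], w[2]);  // (w2, w0)
  const __m128d r_zx = _mm_set_pd(r[0], r[2]);  // (r2, r0)
  // lane0 = w1*r2 - w2*r1 (x), lane1 = w2*r0 - w0*r2 (y).
  const __m128d xy =
      _mm_sub_pd(_mm_mul_pd(w_yz, r_zx), _mm_mul_pd(w_zx, r_yz));
  Vec3 out;
  _mm_storeu_pd(out.data(), xy);      // writes out[0] and out[1]
  out[2] = w[0] * r[1] - w[1] * r[0];  // z stays scalar
  return out;
#else
  // Portable fallback when SSE2 is not available.
  return {w[1] * r[2] - w[2] * r[1],
          w[2] * r[0] - w[0] * r[2],
          w[0] * r[1] - w[1] * r[0]};
#endif
}
```

Two of the four operands load directly from contiguous memory and only the other two need rearranging, which is the smaller shuffling cost mentioned above compared with filling a 4-wide register from a 3-element vector.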

Moving on now ...

@sherm1 sherm1 force-pushed the better_inverse_dynamics branch 4 times, most recently from fcbae08 to bf578b2 Compare January 17, 2025 02:12