Replace instances of np.trace(np.dot(... when computing the trace inner product or Frobenius norm #430
Comments
This hurts me a bit in my soul. I'm generally supportive of this. I used `einsum` a bunch to accelerate traces of matrix products when re-implementing the germ selection algorithm, and indeed found it to be much faster.

One thing I did find back then when profiling: even though you can do the trace of the product in one fell swoop with `einsum`, it was faster to stop a bit short, compute the diagonal entries, and then sum those with `np.sum` (no idea why this was/is the case). You can see an example of where I used that pattern here: pyGSTi/pygsti/algorithms/germselection.py Line 3473 in ee21585
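For concreteness, here is a minimal sketch of the three patterns being compared; the matrices, sizes, and variable names are illustrative placeholders, not code from pyGSTi:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
B = rng.standard_normal((500, 500))

# Naive pattern: forms the full 500 x 500 product just to read off its trace.
t_naive = np.trace(np.dot(A, B))

# One-shot einsum: contracts straight down to the scalar trace(A @ B).
t_einsum = np.einsum('ij,ji->', A, B)

# "Stop short" variant: compute only the diagonal of A @ B, then sum it.
t_diag_sum = np.einsum('ij,ji->i', A, B).sum()

assert np.allclose(t_naive, t_einsum) and np.allclose(t_naive, t_diag_sum)
```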
Another observation I made at that time: `einsum` was much faster at performing this trace/diagonal calculation subroutine when acting on products of inputs of the form A and A^T than on general pairs of matrices A and B. If I recall correctly, I concluded this was related to some optimization that must have been happening under the hood due to the fact that A^T was a view into A (you're better equipped to understand exactly what would be happening here, but my guess at the time was better cache behavior). I concluded it was view-related in part by checking what happened if I instead passed in a copy of A^T, and found that the performance fell back in line with general pairs of matrices. I mention all of this because the trace(A @ A^T) pattern appears in a bunch of the cost functions we use in experiment design (and probably other places), due to its relation to something called A-optimal experiment design.

One last incoherent nugget extracted from the deeper confines of my memory: I have found that the aforementioned performance boost from using `einsum` for diagonals/traces of products of matrices of the form A and A^T is big enough that it in some instances justified doing some otherwise weird-looking additional calculations. For example, in germ selection there was one particular hotspot in the code related to evaluating an expression of the form trace(A @ C @ A^T), with A having shape (N, r), C having shape (r, r), and N >> r. You can do this with an `einsum` one-liner, but I found it significantly advantageous to first take a Cholesky decomposition (even with the additional cost that imposed) and then fold the square roots of C into A before doing the `einsum` (really just one factor, since we know we'll just be taking a transpose). Thanks for coming to my TED talk.
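To make that last trick concrete, here is a hedged sketch of folding a Cholesky factor of C into A before the `einsum`. It assumes C is symmetric positive definite (which Cholesky requires), and all names and shapes are illustrative rather than taken from germ selection:

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 10_000, 20
A = rng.standard_normal((N, r))
# Build a symmetric positive definite C so the Cholesky factorization exists.
M = rng.standard_normal((r, r))
C = M @ M.T + r * np.eye(r)

# Direct one-liner: trace(A @ C @ A.T) without forming the N x N product.
t_direct = np.einsum('ij,jk,ik->', A, C, A)

# Cholesky variant: fold one factor of C = L @ L.T into A, then take a
# squared Frobenius norm, since trace(A @ C @ A.T) = ||A @ L||_F^2.
L = np.linalg.cholesky(C)
AL = A @ L
t_chol = np.einsum('ij,ij->', AL, AL)

assert np.allclose(t_direct, t_chol)
```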
Whoops, just read your PR related to this. vdot is cool too.
I'm generally for this, and would suggest we add comments to the affected lines where it seems appropriate, since while `einsum` and other options are faster, they tend to be less readable. One of the frustrations of Python is that in many circumstances you can have either speed or readability, but not both at the same time :)
@coreyostrove and @enielse, I like vdot because it's fast, consistent (about how it always handles conjugation in the first argument), and readable (once you know what it's useful for).

@coreyostrove, regarding your observation about `einsum` on pairs of the form A and A^T: there is indeed something going on with A^T being a view into A rather than a copy. As for the trace(A @ C @ A^T) hotspot: the benefits of Cholesky here don't surprise me! It's cool that you figured this out :)
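For reference, a quick illustration of the `np.vdot` conjugation convention mentioned above, using random placeholder matrices (this is a sketch, not pyGSTi code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 1j * rng.standard_normal((50, 50))
B = rng.standard_normal((50, 50)) + 1j * rng.standard_normal((50, 50))

# np.vdot flattens both arguments and conjugates the FIRST one, so for
# matrices it returns trace(A^dagger @ B) without forming any matrix product.
assert np.allclose(np.vdot(A, B), np.trace(A.conj().T @ B))
```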
Closed with merged PR :)
(On the latest commit of develop, at time of writing. pyGSTi 0.9.12.)
The standard inner product on the space of real $m \times n$ matrices is $(A, B) \mapsto \text{trace}(A^T B)$. There are two candidates for the standard inner product on complex matrices: either $(A, B) \mapsto \text{trace}(A^\dagger B)$ or $(A, B) \mapsto \text{trace}(B^\dagger A)$, which differ by a complex conjugate.
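A one-line check of that conjugation relationship, using random placeholder matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# The two complex conventions agree up to a complex conjugate:
# trace(B^dagger @ A) == conj(trace(A^dagger @ B)).
assert np.allclose(np.trace(B.conj().T @ A),
                   np.conj(np.trace(A.conj().T @ B)))
```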
Several places in pyGSTi require computing these inner products, and explicitly compute a matrix-matrix product before taking a trace. Some places even do this as a way to compute (squared) Frobenius norms. The easy way to find these instances is to search for the text `_np.trace(_np.dot(` (I prepend with underscores since that's pyGSTi's import convention, but there are hits without the underscores as well).

We should replace these instances with a more efficient and consistent pattern. One way to do this is with `np.einsum`, along with a manual conjugation of one of the matrices. I suggest `np.vdot`, since that will handle conjugation consistently for us (see the sketch below).

Thoughts, @coreyostrove, @sserita?
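A hedged before/after sketch of the proposed replacement; the matrices here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)) + 1j * rng.standard_normal((100, 100))
B = rng.standard_normal((100, 100)) + 1j * rng.standard_normal((100, 100))

# Before: an explicit O(n^3) matrix product, just to read off the trace.
ip_old = np.trace(np.dot(A.conj().T, B))
# After: O(n^2), with vdot handling the conjugation convention for us.
ip_new = np.vdot(A, B)
assert np.allclose(ip_old, ip_new)

# The same replacement handles squared Frobenius norms.
fro_old = np.trace(np.dot(A.conj().T, A)).real
fro_new = np.vdot(A, A).real  # equals np.linalg.norm(A) ** 2
assert np.allclose(fro_old, fro_new)
```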