
BLAS/LAPACK calls people are interested in #9

Open
crtrott opened this issue Mar 6, 2017 · 39 comments

@crtrott
Member

crtrott commented Mar 6, 2017

This is just to collect stuff. I will update the first post if more comes in. This is not a promise of what is going to be there when; it's just to help us plan. I differentiate global, team, and thread kernels.

BLAS

Global:

  • AXPY
  • SCAL
  • GEMM
  • DOT

Team:

  • GEMM
  • GEMV
  • AXPY
  • DOT

Thread:

  • GEMM
  • GEMV
  • AXPY
  • DOT

LAPACK

Global:

  • SYEV
  • HEEVR

Team:

  • GETRF
  • GETRS
  • GETF2
  • GETRI

Thread:

  • GETRF
  • GETRS
@crtrott
Member Author

crtrott commented Mar 6, 2017

From Email:

Christian,

At our last CoPA telecon, you had asked which BLAS calls we were using in our QMD code and libs. We currently only use the following.
It sounded like there is a Kokkos wrapper for BLAS or something like that? We would be interested in looking at that in our ExaSP2 proxy work. Thanks.

- Sue

BLAS/LAPACK routines used:

  {s,d,c,z}gemm
  {s,d,c,z}scal
  {s,d,c,z}axpy

  {s,d}syev
  {c,z}heevr

@crtrott
Member Author

crtrott commented Mar 6, 2017

From a discussion with LANL:
Team level: GEMM, GETRF, GETRS, GEMV (for solving a block tridiagonal matrix)

@nmhamster
Contributor

@crtrott are these batched or plain or both?

@mhoemmen
Contributor

mhoemmen commented Mar 6, 2017

GETRI (global, team, and thread) would be useful too. That should actually reduce the priority of optimizing GETRS. (GETRI computes explicit inverses; use that in the setup phase, and you only need GEMV in the solve phase, instead of GETRS.)
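
A minimal sketch of that setup/solve split, assuming the setup phase has already stored the explicit inverse of each block; the naive loops below stand in for an optimized team-level GEMV and are illustrative only:

#include <Kokkos_Core.hpp>

// Sketch: once setup has computed Ainv for a block (e.g. GETRF followed by
// GETRI), the per-block solve is just y = Ainv * x at team level.
template <class TeamHandle, class MatView, class VecView>
KOKKOS_INLINE_FUNCTION
void apply_block_inverse(const TeamHandle& team, const MatView& Ainv,
                         const VecView& x, const VecView& y) {
  const int n = Ainv.extent(0);
  // Rows of the GEMV distributed over the threads of the team.
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n), [&](const int i) {
    typename VecView::non_const_value_type sum = 0;
    for (int j = 0; j < n; ++j) sum += Ainv(i, j) * x(j);
    y(i) = sum;
  });
}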

@crtrott
Member Author

crtrott commented Mar 6, 2017

For now the requested ones were just plain (i.e. non-batched). The team ones could obviously be used to build a batched one, but the actual use case we talked about builds the matrices on the fly for the solve.
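
For illustration, composing a batched GEMM from team-level work amounts to launching one team per batch entry; a rough sketch (the inner loops stand in for a proper team-level GEMM kernel):

#include <Kokkos_Core.hpp>

// Sketch: batched C_i += A_i * B_i, one team per batch entry.
using exec_space  = Kokkos::DefaultExecutionSpace;
using team_policy = Kokkos::TeamPolicy<exec_space>;
using member_type = team_policy::member_type;

void batched_gemm(Kokkos::View<double***> A, Kokkos::View<double***> B,
                  Kokkos::View<double***> C) {
  const int nbatch = A.extent(0);
  Kokkos::parallel_for("batched_gemm", team_policy(nbatch, Kokkos::AUTO),
    KOKKOS_LAMBDA(const member_type& team) {
      const int i = team.league_rank();
      const int m = C.extent(1), n = C.extent(2), k = A.extent(2);
      // Naive team-level GEMM: rows of C_i distributed over the team.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, m), [&](const int r) {
        for (int col = 0; col < n; ++col) {
          double sum = 0.0;
          for (int p = 0; p < k; ++p) sum += A(i, r, p) * B(i, p, col);
          C(i, r, col) += sum;
        }
      });
    });
}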

@mhoemmen
Contributor

mhoemmen commented Mar 6, 2017

@nmhamster BlockCrsMatrix could use either global batched or team.

@crtrott
Member Author

crtrott commented Mar 8, 2017

Matt wants:
@bathmatt

Team Level:
Blas

  • gemm (with transpose)

Lapack:

  • getf2
  • getri

Really cares about A_inverse from A. A is generated on the fly in the team.

@bathmatt

bathmatt commented Mar 8, 2017 via email

@kyungjoo-kim
Contributor

At the low level there is no need to distinguish batched from plain, because batched just means we use a parallel_for and put the team or serial interface inside. The global ones are simply wrappers, as we interface them to MKL and cuBLAS.

@bathmatt, I do not think it is necessary to specialize "h". For double/float, conjugate just returns the same value. At the implementation level it does not do anything different either, since conj(val) would return val if the value type is double.

I am kind of strongly against using characters as function arguments. The use of characters in BLAS and LAPACK is just inherited from Fortran, and we do not really need to follow that convention.
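
As an illustration of the alternative (a sketch in the KokkosBatched spirit, not the actual interface): the transpose options become compile-time tag types instead of the Fortran-style 'N'/'T'/'C' characters.

// Illustrative sketch: each (TransA, TransB) pair gets its own specialization,
// so there is no runtime branching on a character flag.
struct NoTranspose {};
struct Transpose {};
struct ConjTranspose {};

template <class TransA, class TransB>
struct Gemm;

template <>
struct Gemm<NoTranspose, NoTranspose> {
  // C = beta*C + alpha*A*B for 2-D views with (i,j) indexing.
  template <class ScalarType, class AView, class BView, class CView>
  static void invoke(ScalarType alpha, const AView& A, const BView& B,
                     ScalarType beta, const CView& C) {
    const int m = C.extent(0), n = C.extent(1), k = A.extent(1);
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j) {
        typename CView::non_const_value_type sum = 0;
        for (int p = 0; p < k; ++p) sum += A(i, p) * B(p, j);
        C(i, j) = beta * C(i, j) + alpha * sum;
      }
  }
};

// Usage: Gemm<NoTranspose, NoTranspose>::invoke(alpha, A, B, beta, C);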

@mhoemmen
Contributor

mhoemmen commented Mar 8, 2017

@kyungjoo-kim It's nice to specialize "h" even for real value types, so that people can write code that works for any value type.

@bathmatt

bathmatt commented Mar 8, 2017 via email

@mhoemmen
Contributor

mhoemmen commented Mar 8, 2017

So, do we want to have Kokkos::conj(v) work for a double then? That works for me.

It does work for a double. Users can always call T instead of H if they don't want to call conj on doubles.
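
For illustration, a minimal generic conj along these lines (a sketch, not the actual Kokkos implementation): it is the identity for real types, so generic code written against "H" does the right thing for double.

#include <Kokkos_Core.hpp>

// Sketch: identity for real scalars, sign flip on the imaginary part for
// Kokkos::complex, so the same generic code handles both.
template <class T>
KOKKOS_INLINE_FUNCTION T my_conj(const T& v) { return v; }

template <class T>
KOKKOS_INLINE_FUNCTION Kokkos::complex<T> my_conj(const Kokkos::complex<T>& v) {
  return Kokkos::complex<T>(v.real(), -v.imag());
}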

@kyungjoo-kim
Contributor

@mhoemmen When we work on this collection of functionality, we should care about code complexity and preferably not have many specializations. Since we mostly need serial- and team-level implementations (assuming they are decorated as Kokkos inline functions), conj(v) with double will be stripped out by the compiler anyway.

@lucbv
Contributor

lucbv commented Mar 24, 2017

Hi, I would be interested in dense GETRF, GETRS, and GEMV at thread level for small matrices, up to 3x3, in order to do linear interpolation between mesh nodes. This would look like solving the following:
x = sum_i [N_i(xi, eta, zeta) * x_i]
y = sum_i [N_i(xi, eta, zeta) * y_i]
z = sum_i [N_i(xi, eta, zeta) * z_i]
where my unknowns are xi, eta, and zeta, and a linearization is applied (the N_i functions have cross terms xi*eta*zeta).

These calls would be wrapped inside a functor used in a parallel_for that sweeps over my mesh, so they would need to work at thread level.
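
A sketch of that pattern: one small dense solve per mesh node inside a flat parallel_for. The in-place 3x3 Gaussian elimination (no pivoting, for brevity) stands in for a thread-level GETRF/GETRS call.

#include <Kokkos_Core.hpp>

// Sketch: A holds one 3x3 matrix per node, b holds the right-hand side and is
// overwritten with the solution.
void solve_per_node(Kokkos::View<double*[3][3]> A, Kokkos::View<double*[3]> b) {
  Kokkos::parallel_for("node_solves", A.extent(0), KOKKOS_LAMBDA(const int e) {
    double M[3][3], x[3];
    for (int i = 0; i < 3; ++i) {
      x[i] = b(e, i);
      for (int j = 0; j < 3; ++j) M[i][j] = A(e, i, j);
    }
    // Forward elimination (no pivoting).
    for (int k = 0; k < 3; ++k)
      for (int i = k + 1; i < 3; ++i) {
        const double f = M[i][k] / M[k][k];
        for (int j = k; j < 3; ++j) M[i][j] -= f * M[k][j];
        x[i] -= f * x[k];
      }
    // Back substitution.
    for (int i = 2; i >= 0; --i) {
      for (int j = i + 1; j < 3; ++j) x[i] -= M[i][j] * x[j];
      x[i] /= M[i][i];
    }
    for (int i = 0; i < 3; ++i) b(e, i) = x[i];
  });
}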

@tawiesn

tawiesn commented Apr 3, 2017

MueLu would furthermore need:

Team-level:

Lapack:

  • GEQRF
  • UNGQR

to do local QR decomposition (we need both R and Q built for the TentativePFactory).

So far, I have my own Kokkos implementation of a QR decomposition that does the job for us, but it might make sense to optimize it and adapt the interface to match the LAPACK calls...

@JanRobertE

We would need the following BLAS routines at Team and Thread level:

  • Dgemm
  • Dgemv
  • Daxpy
  • Ddot

@crtrott
Member Author

crtrott commented Sep 20, 2017

Jan: is there a particular problem size we should optimize for?

@JanRobertE

That would be ideal, I guess. Depending on the polynomial order of our elements the problem size can vary quite significantly.

@jeffhammond

jeffhammond commented Sep 20, 2017

@crtrott It would be really cool if Kokkos used LIBXSMM under the hood on supported platforms. LIBXSMM doesn't use threads so there should be no composition issues with Kokkos.

There are a bunch of nice demonstrations of LIBXSMM for spectral/finite element codes, so this would likely align with @JanRobertE's request.

@srajama1
Contributor

@jeffhammond : Our tests compare against LIBXSMM where available, but we don't give that as a callable option within a thread/team yet.

@crtrott
Member Author

crtrott commented Sep 20, 2017

I'd love you to help out with this ;-)

If you look at the BLAS interface you will see that we have a pretty generic way of plugging in TPLs. Putting LIBXSMM in is certainly an option we should consider. The basic idea for team-level / thread-level versions (and it looks like LIBXSMM is targeted only at the latter right now) would be something like this:

namespace Impl {

// Primary template: no TPL available for these view types.
template<class ViewA, class ViewB, class ViewC>
struct TeamGEMM_TPL_Avail {
  enum { value = 0 };
};

// Primary template: native (non-TPL) fallback implementation.
template<class ViewA, class ViewB, class ViewC,
         int TPL_AVAIL = TeamGEMM_TPL_Avail<ViewA, ViewB, ViewC>::value>
struct TeamGEMM {
  template<class TeamHandle>
  static inline
  void team_gemm(const TeamHandle& team, ViewA a, ViewB b, ViewC c) { /* ... */ }
};

} // namespace Impl

template<class TeamHandle, class ViewA, class ViewB, class ViewC>
void team_gemm(const TeamHandle& team, ViewA a, ViewB b, ViewC c) {
  Impl::TeamGEMM<ViewA, ViewB, ViewC>::team_gemm(team, a, b, c);
}

Plugging in another TPL would basically require you to specialize TeamGEMM_TPL_Avail and the TeamGEMM struct.
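
Following that pattern, wiring in LIBXSMM would look roughly like the sketch below. The HostMatrix view type is illustrative and the call into LIBXSMM itself is omitted; this is not a finished binding.

// Sketch: declare the TPL available for the view types it can handle, and
// provide the specialization that forwards to it.
using HostMatrix = Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace>;

namespace Impl {

template<>
struct TeamGEMM_TPL_Avail<HostMatrix, HostMatrix, HostMatrix> {
  enum { value = 1 };
};

template<>
struct TeamGEMM<HostMatrix, HostMatrix, HostMatrix, 1> {
  template<class TeamHandle>
  static inline
  void team_gemm(const TeamHandle& team, HostMatrix a, HostMatrix b, HostMatrix c) {
    // LIBXSMM is not thread-parallel itself, so run it from one thread of the team.
    Kokkos::single(Kokkos::PerTeam(team), [&]() {
      // forward to LIBXSMM's small-matrix GEMM here (omitted)
    });
  }
};

} // namespace Impl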

@jhux2

jhux2 commented Dec 19, 2017

GETRF and GETRS for applying the inverse of a sparse matrix's block diagonal. For the application I'm currently considering, blocks are 5x5. An application with more physics would cause these blocks to grow.

@mhoemmen
Contributor

@jhux2 mentioned yesterday wanting batch versions of those functions. Would all blocks in a batch have the same dimensions, or could they possibly differ?

@jhux2

jhux2 commented Dec 20, 2017

Would all blocks in a batch have the same dimensions, or could they possibly differ?

@mhoemmen For the case I'm currently considering, all the blocks would have the same dimensions.

@crtrott
Member Author

crtrott commented Dec 20, 2017

We still need a fundamental decision on whether to provide a LAPACK interface. My personal preference is yes, but only stuff people actually ask for, and (at least initially) only calling through to TPL LAPACK.

@egboman

egboman commented Dec 20, 2017

If you do LAPACK, I would be interested in xGESVD, xGESDD.

@mhoemmen
Contributor

@crtrott For the batch interface, will we assume that the LAPACK implementation can be called in a thread-parallel context? That's a bit risky if it's very old; it wasn't long ago that DLAMCH and friends had the equivalent of static variables in them that were set on first call to whatever routine needed them.

@egboman The above paragraph is directly relevant to you. Eigensolvers are hard, and the logical first parallel batch implementation of dense eigensolvers for non-GPU architectures would just call LAPACK in parallel.

@dholladay00

It seems likely that I will be moving away from LU and moving to QR decomposition, which means that instead of needing getr{s,f}, I will be using geqrf, ormqr, and trs{m,v} at the team level. Other requests remain unchanged.

I see that these have been requested by @tawiesn. I have a non-pivoted getr{f,s} so that I could run on the CUDA backend; this change will render the CUDA backend unusable again. Is there any chance the Kokkos team-level QR decomposition is available somewhere, and/or are there references on how it could be implemented? Info on this would be greatly appreciated.

@mhoemmen
Contributor

mhoemmen commented Nov 8, 2018

@dholladay00 The beginning of a hand-rolled QR lives in Trilinos/packages/tpetra/tsqr. I also have a strong interest in finishing this for Kokkos, and a bit of funding to do it, but time is short. I'm in all-day meetings this week, but if you want to talk more about it, just let me know.

@dholladay00

@mhoemmen I'll message you on the Kokkos users slack and continue this conversation there.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Nov 8, 2018

@dholladay00 Can you explain a little bit more about your use case? @tawiesn's use case is a bit different from the general use case of QR factorization: for him, updating the QR is more or less a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in the matrix and he uses QR in the conventional way, but updating is probably better in my opinion (@tawiesn, if you are still checking GitHub and find I am wrong, please correct me).

On the other hand, @dholladay00, it seems that in your use case you just want to invert a matrix or solve a problem. What is the range of your problem sizes? Variable-size batch or fixed-size batch? If you don't mind, could you also point me to your project? I am wondering how these functions end up being used.

@jhux2

jhux2 commented Nov 8, 2018

@tawiesn's use case is a bit different from the general use case of QR factorization: for him, updating the QR is more or less a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in the matrix and he uses QR in the conventional way, but updating is probably better in my opinion

@kyungjoo-kim You are correct, the use case for the prolongator is small blocks, many with the same repeated pattern. I'm not sure what you mean by updating.

@dholladay00

@kyungjoo-kim I solve a block tridiagonal matrix in which each block row can have a different size. The size of a given block matrix will generally range from 10x10 to 1000x1000. The solve is accomplished via the Thomas algorithm and requires applying the inverse (this was done by LU, and is now done by QR for increased stability on ill-conditioned problems).

Each thread team (Kokkos thread team) is in charge of a block tridiagonal matrix, so I would like to pass a team handle into these calls so that they can be carried out with the appropriate level of parallelism. With the OpenMP backend, I currently place the BLAS/LAPACK call inside a Kokkos::single(PerTeam(...), ...). On the CUDA backend (currently), the wrapper calls into a handwritten version that uses the correct level of parallelism (TeamThread and ThreadVector ranges), but I don't have a handwritten geqrf/ormqr.
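
A rough sketch of that OpenMP-backend wrapper pattern, with the actual serial BLAS/LAPACK call passed in as a callable (the call itself is elided here):

#include <Kokkos_Core.hpp>

// Sketch: funnel a serial call through one thread of the team, then barrier
// so the result is visible to the whole team.
template <class TeamHandle, class Callable>
KOKKOS_INLINE_FUNCTION
void call_serial_on_team(const TeamHandle& team, const Callable& serial_call) {
  Kokkos::single(Kokkos::PerTeam(team), [&]() { serial_call(); });
  team.team_barrier();
}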

@mhoemmen
Contributor

mhoemmen commented Nov 8, 2018

@dholladay00 wrote:

but I don't have handwritten geqrf/ormqr.

I have the start of that, but it's not at all parallel. I also need this. It's a "team collective" interface; the intent on CPU is to pick cache-block-sized chunks on which to work.

@srajama1
Contributor

srajama1 commented Nov 8, 2018

Is Tpetra planning to be the place for team-based, batched QR? What are the design decisions behind having this in Tpetra? @kyungjoo-kim is implementing the compact QR in Kokkos Kernels as well. It would be good to sync up.

@mhoemmen
Contributor

mhoemmen commented Nov 9, 2018

The point is not to put this in Tpetra; the point is that a partial implementation exists in Tpetra that implementers could use as a starting point.

@srajama1
Contributor

srajama1 commented Nov 9, 2018

Point taken. My concerns about scope creep remain. Plus, I think the use case here is not TS.

@kyungjoo-kim
Contributor

There is no notion of "compact QR". All I wrote under the KokkosBatched namespace is just generic serial and team implementations of a subset of BLAS and LAPACK; the compact approach via the KokkosBatched code is one use case. I want to note that different algorithms and optimizations are needed for different use cases: one may want to use the serial and team code in the KokkosBatched routines, but the developed code may not be suited for your use cases. I am writing an eigenvalue solver targeting problem sizes of O(10~100). I already have a Householder QR, which is a by-product of the eigenvalue solver. I will merge them later.

@wlruys

wlruys commented Sep 29, 2021

I have a use case where my matrices start small and grow larger than the currently supported team-level versions of these routines can handle efficiently.

I am interested in a unified interface so that Kokkos can also be called for the device-level calls, instead of dispatching directly to a device-specific cuSOLVER or similar where needed.

Device level:
Lapack:

  • geqrf (unpivoted QR)
  • unmqr (apply Q)

(Also referencing #567 which is another request for the same kernels)

e10harvey added a commit that referenced this issue Jan 18, 2022
Change EPS to 1e-2 for float