
BLAS/LAPACK calls people are interested in #9

Open
crtrott opened this issue Mar 6, 2017 · 39 comments

@crtrott
Member

crtrott commented Mar 6, 2017

This is just to collect stuff. I will update the first post if more comes in. This is not a promise of what is going to be there when; it's just to help us plan. I differentiate global, team, and thread kernels.

BLAS

Global:

  • AXPY
  • SCAL
  • GEMM
  • DOT

Team:

  • GEMM
  • GEMV
  • AXPY
  • DOT

Thread:

  • GEMM
  • GEMV
  • AXPY
  • DOT

LAPACK

Global:

  • SYEV
  • HEEVR

Team:

  • GETRF
  • GETRS
  • GETF2
  • GETRI

Thread:

  • GETRF
  • GETRS
@crtrott
Member Author

crtrott commented Mar 6, 2017

From Email:

Christian,

At our last CoPA telecon, you had asked which BLAS calls we were using in our QMD code and libs. We currently only use the following.
It sounded like there is a Kokkos wrapper for BLAS or something like that? We would be interested in looking at that in our ExaSP2 proxy work. Thanks.

- Sue

BLAS/LAPACK routines used:

  {s,d,c,z}gemm
  {s,d,c,z}scal
  {s,d,c,z}axpy

  {s,d}syev
  {c,z}heevr

@crtrott
Member Author

crtrott commented Mar 6, 2017

From a discussion with LANL:
Team level: GEMM, GETRF, GETRS, GEMV (for solving a block tridiagonal matrix)

@nmhamster
Contributor

@crtrott are these batched or plain or both?

@mhoemmen
Contributor

mhoemmen commented Mar 6, 2017

GETRI (global, team, and thread) would be useful too. That should actually reduce the priority of optimizing GETRS. (GETRI computes explicit inverses; use that in the setup phase, and you only need GEMV in the solve phase, instead of GETRS.)
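
A minimal sketch of that setup/solve split, assuming the setup phase has already stored the explicit inverse of each block; the naive loops below stand in for an optimized team-level GEMV and are illustrative only:

#include <Kokkos_Core.hpp>

// Sketch: once setup has computed Ainv for a block (e.g. GETRF followed by
// GETRI), the per-block solve is just y = Ainv * x at team level.
template <class TeamHandle, class MatView, class VecView>
KOKKOS_INLINE_FUNCTION
void apply_block_inverse(const TeamHandle& team, const MatView& Ainv,
                         const VecView& x, const VecView& y) {
  const int n = Ainv.extent(0);
  // Rows of the GEMV distributed over the threads of the team.
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n), [&](const int i) {
    typename VecView::non_const_value_type sum = 0;
    for (int j = 0; j < n; ++j) sum += Ainv(i, j) * x(j);
    y(i) = sum;
  });
}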

@crtrott
Member Author

crtrott commented Mar 6, 2017

For now the requested ones were just plain (i.e. non-batched). The team ones could obviously be used to build a batched one, but the actual use case we talked about builds the matrices on the fly for the solve.
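
For illustration, composing a batched GEMM from team-level work amounts to launching one team per batch entry; a rough sketch (the inner loops stand in for a proper team-level GEMM kernel):

#include <Kokkos_Core.hpp>

// Sketch: batched C_i += A_i * B_i, one team per batch entry.
using exec_space  = Kokkos::DefaultExecutionSpace;
using team_policy = Kokkos::TeamPolicy<exec_space>;
using member_type = team_policy::member_type;

void batched_gemm(Kokkos::View<double***> A, Kokkos::View<double***> B,
                  Kokkos::View<double***> C) {
  const int nbatch = A.extent(0);
  Kokkos::parallel_for("batched_gemm", team_policy(nbatch, Kokkos::AUTO),
    KOKKOS_LAMBDA(const member_type& team) {
      const int i = team.league_rank();
      const int m = C.extent(1), n = C.extent(2), k = A.extent(2);
      // Naive team-level GEMM: rows of C_i distributed over the team.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, m), [&](const int r) {
        for (int col = 0; col < n; ++col) {
          double sum = 0.0;
          for (int p = 0; p < k; ++p) sum += A(i, r, p) * B(i, p, col);
          C(i, r, col) += sum;
        }
      });
    });
}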

@mhoemmen
Contributor

mhoemmen commented Mar 6, 2017

@nmhamster BlockCrsMatrix could use either global batched or team.

@crtrott
Member Author

crtrott commented Mar 8, 2017

Matt wants:
@bathmatt

Team Level:
Blas

  • gemm (with transpose)

Lapack:

  • getf2
  • getri

Really cares about A_inverse from A. A is generated on the fly in the team.

@bathmatt

bathmatt commented Mar 8, 2017 via email

@kyungjoo-kim
Contributor

At the low level there is no need to distinguish batched from plain, because batched just means we use a parallel_for and put the team or serial interface inside. The global ones are simply wrappers, as we interface them to MKL and cuBLAS.

@bathmatt, I do not think it is necessary to specialize "h". For double/float, conjugate just returns the same value. At the implementation level it does not do anything different either, since conj(val) would return val if the value type is double.

I am kind of strongly against using characters as function arguments. The use of characters in BLAS and LAPACK is just inherited from Fortran, and we do not really need to follow that convention.
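
As an illustration of the alternative (a sketch in the KokkosBatched spirit, not the actual interface): the transpose options become compile-time tag types instead of the Fortran-style 'N'/'T'/'C' characters.

// Illustrative sketch: each (TransA, TransB) pair gets its own specialization,
// so there is no runtime branching on a character flag.
struct NoTranspose {};
struct Transpose {};
struct ConjTranspose {};

template <class TransA, class TransB>
struct Gemm;

template <>
struct Gemm<NoTranspose, NoTranspose> {
  // C = beta*C + alpha*A*B for 2-D views with (i,j) indexing.
  template <class ScalarType, class AView, class BView, class CView>
  static void invoke(ScalarType alpha, const AView& A, const BView& B,
                     ScalarType beta, const CView& C) {
    const int m = C.extent(0), n = C.extent(1), k = A.extent(1);
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j) {
        typename CView::non_const_value_type sum = 0;
        for (int p = 0; p < k; ++p) sum += A(i, p) * B(p, j);
        C(i, j) = beta * C(i, j) + alpha * sum;
      }
  }
};

// Usage: Gemm<NoTranspose, NoTranspose>::invoke(alpha, A, B, beta, C);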

@mhoemmen
Contributor

mhoemmen commented Mar 8, 2017

@kyungjoo-kim It's nice to specialize "h" even for real value types, so that people can write code that works for any value type.

@bathmatt

bathmatt commented Mar 8, 2017 via email

@mhoemmen
Contributor

mhoemmen commented Mar 8, 2017

So, do we want to have Kokkos::conj(v) work for a double then? That works for me.

It does work for a double. Users can always call T instead of H if they don't want to call conj on doubles.
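
For illustration, a minimal generic conj along these lines (a sketch, not the actual Kokkos implementation): it is the identity for real types, so generic code written against "H" does the right thing for double.

#include <Kokkos_Core.hpp>

// Sketch: identity for real scalars, sign flip on the imaginary part for
// Kokkos::complex, so the same generic code handles both.
template <class T>
KOKKOS_INLINE_FUNCTION T my_conj(const T& v) { return v; }

template <class T>
KOKKOS_INLINE_FUNCTION Kokkos::complex<T> my_conj(const Kokkos::complex<T>& v) {
  return Kokkos::complex<T>(v.real(), -v.imag());
}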

@kyungjoo-kim
Contributor

@mhoemmen When we work on this collection of functionality, we should care about code complexity and preferably not have many specializations. Since we mostly need serial- and team-level implementations (assuming they are decorated as Kokkos inline functions), conj(v) with double will be stripped out by the compiler anyway.

@lucbv
Contributor

lucbv commented Mar 24, 2017

Hi, I would be interested in dense GETRF, GETRS, and GEMV at thread level for small matrices, up to 3x3, in order to do linear interpolation between mesh nodes. This would look like solving the following:
x = sum_i [N_i(xi, eta, zeta) * x_i]
y = sum_i [N_i(xi, eta, zeta) * y_i]
z = sum_i [N_i(xi, eta, zeta) * z_i]
where my unknowns are xi, eta, and zeta, and a linearization is applied (the N_i functions have cross terms xi*eta*zeta).

These calls would be wrapped inside a functor used in a parallel_for that sweeps over my mesh, so they would need to work at thread level.
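
A sketch of that pattern: one small dense solve per mesh node inside a flat parallel_for. The in-place 3x3 Gaussian elimination (no pivoting, for brevity) stands in for a thread-level GETRF/GETRS call.

#include <Kokkos_Core.hpp>

// Sketch: A holds one 3x3 matrix per node, b holds the right-hand side and is
// overwritten with the solution.
void solve_per_node(Kokkos::View<double*[3][3]> A, Kokkos::View<double*[3]> b) {
  Kokkos::parallel_for("node_solves", A.extent(0), KOKKOS_LAMBDA(const int e) {
    double M[3][3], x[3];
    for (int i = 0; i < 3; ++i) {
      x[i] = b(e, i);
      for (int j = 0; j < 3; ++j) M[i][j] = A(e, i, j);
    }
    // Forward elimination (no pivoting).
    for (int k = 0; k < 3; ++k)
      for (int i = k + 1; i < 3; ++i) {
        const double f = M[i][k] / M[k][k];
        for (int j = k; j < 3; ++j) M[i][j] -= f * M[k][j];
        x[i] -= f * x[k];
      }
    // Back substitution.
    for (int i = 2; i >= 0; --i) {
      for (int j = i + 1; j < 3; ++j) x[i] -= M[i][j] * x[j];
      x[i] /= M[i][i];
    }
    for (int i = 0; i < 3; ++i) b(e, i) = x[i];
  });
}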

@tawiesn

tawiesn commented Apr 3, 2017

MueLu would furthermore need:

Team-level:

Lapack:

  • GEQRF
  • UNGQR

to do local QR decomposition (we need both R and Q built for the TentativePFactory).

So far, I have my own Kokkos implementation of a QR decomposition that does the job for us, but it might make sense to optimize it and adapt the interface to match the LAPACK calls...

@JanRobertE

We would need the following BLAS routines at Team and Thread level:

  • Dgemm
  • Dgemv
  • Daxpy
  • Ddot

@crtrott
Member Author

crtrott commented Sep 20, 2017

Jan: is there a particular problem size we should optimize for?

@JanRobertE

That would be ideal, I guess. Depending on the polynomial order of our elements the problem size can vary quite significantly.

@jeffhammond

jeffhammond commented Sep 20, 2017

@crtrott It would be really cool if Kokkos used LIBXSMM under the hood on supported platforms. LIBXSMM doesn't use threads so there should be no composition issues with Kokkos.

There are a bunch of nice demonstrations of LIBXSMM for spectral/finite element codes, so this would likely align with @JanRobertE's request.

@srajama1
Contributor

@jeffhammond : Our tests compare against LIBXSMM where available, but we don't give that as a callable option within a thread/team yet.

@crtrott
Member Author

crtrott commented Sep 20, 2017

I'd love you to help out with this ;-)

If you look at the BLAS interface you will see that we have a pretty generic way of plugging in TPLs. Putting LIBXSMM in is certainly an option we should consider. The basic idea for team-level / thread-level versions (and it looks like LIBXSMM is targeted only at the latter right now) would be something like this:

namespace Impl {

// Primary template: no TPL available for these view types.
template<class ViewA, class ViewB, class ViewC>
struct TeamGEMM_TPL_Avail {
  enum { value = 0 };
};

// Primary template: native (non-TPL) fallback implementation.
template<class ViewA, class ViewB, class ViewC,
         int TPL_AVAIL = TeamGEMM_TPL_Avail<ViewA, ViewB, ViewC>::value>
struct TeamGEMM {
  template<class TeamHandle>
  static inline
  void team_gemm(const TeamHandle& team, ViewA a, ViewB b, ViewC c) { /* ... */ }
};

} // namespace Impl

template<class TeamHandle, class ViewA, class ViewB, class ViewC>
void team_gemm(const TeamHandle& team, ViewA a, ViewB b, ViewC c) {
  Impl::TeamGEMM<ViewA, ViewB, ViewC>::team_gemm(team, a, b, c);
}

Plugging in another TPL would basically require you to specialize TeamGEMM_TPL_Avail and the TeamGEMM struct.
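
Following that pattern, wiring in LIBXSMM would look roughly like the sketch below. The HostMatrix view type is illustrative and the call into LIBXSMM itself is omitted; this is not a finished binding.

// Sketch: declare the TPL available for the view types it can handle, and
// provide the specialization that forwards to it.
using HostMatrix = Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace>;

namespace Impl {

template<>
struct TeamGEMM_TPL_Avail<HostMatrix, HostMatrix, HostMatrix> {
  enum { value = 1 };
};

template<>
struct TeamGEMM<HostMatrix, HostMatrix, HostMatrix, 1> {
  template<class TeamHandle>
  static inline
  void team_gemm(const TeamHandle& team, HostMatrix a, HostMatrix b, HostMatrix c) {
    // LIBXSMM is not thread-parallel itself, so run it from one thread of the team.
    Kokkos::single(Kokkos::PerTeam(team), [&]() {
      // forward to LIBXSMM's small-matrix GEMM here (omitted)
    });
  }
};

} // namespace Impl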

@jhux2

jhux2 commented Dec 19, 2017

GETRF and GETRS for applying the inverse of a sparse matrix's block diagonal. For the application I'm currently considering, blocks are 5x5. An application with more physics would cause these blocks to grow.

@mhoemmen
Contributor

@jhux2 mentioned yesterday wanting batch versions of those functions. Would all blocks in a batch have the same dimensions, or could they possibly differ?

@jhux2

jhux2 commented Dec 20, 2017

Would all blocks in a batch have the same dimensions, or could they possibly differ?

@mhoemmen For the case I'm currently considering, all the blocks would have the same dimensions.

@crtrott
Member Author

crtrott commented Dec 20, 2017

We still need a fundamental decision on whether to provide a LAPACK interface. My personal preference is yes, but only stuff people actually ask for, and (at least initially) only calling through to TPL LAPACK.

@egboman

egboman commented Dec 20, 2017

If you do LAPACK, I would be interested in xGESVD, xGESDD.

@mhoemmen
Contributor

@crtrott For the batch interface, will we assume that the LAPACK implementation can be called in a thread-parallel context? That's a bit risky if it's very old; it wasn't long ago that DLAMCH and friends had the equivalent of static variables in them that were set on first call to whatever routine needed them.

@egboman The above paragraph is directly relevant to you. Eigensolvers are hard, and the logical first parallel batch implementation of dense eigensolvers for non-GPU architectures would just call LAPACK in parallel.

@dholladay00

It seems likely that I will be moving away from LU and moving to QR decomposition, which means that instead of needing getr{s,f}, I will be using geqrf, ormqr, and trs{m,v} at the team level. Other requests remain unchanged.

I see that these have been requested by @tawiesn. I have a non-pivoted getr{f,s} so that I could run on the CUDA backend; this change will render the CUDA backend unusable again. Is there any chance the Kokkos team-level QR decomposition is available somewhere, and/or are there references on how it could be implemented? Info on this would be greatly appreciated.

@mhoemmen
Contributor

mhoemmen commented Nov 8, 2018

@dholladay00 The beginning of a hand-rolled QR lives in Trilinos/packages/tpetra/tsqr. I also have a strong interest in finishing this for Kokkos, and a bit of funding to do it, but time is short. I'm in all-day meetings this week, but if you want to talk more about it, just let me know.

@dholladay00

@mhoemmen I'll message you on the Kokkos users slack and continue this conversation there.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Nov 8, 2018

@dholladay00 Can you explain a little bit more about your use case? @tawiesn's use case is a bit different from the general use case of QR factorization: for him, updating the QR is more or less a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in the matrix and he uses QR in the conventional way, but updating is probably better in my opinion (@tawiesn, if you are still checking GitHub and find I am wrong, please correct me).

On the other hand, @dholladay00, it seems that in your use case you just want to invert a matrix or solve a problem. What is the range of your problem sizes? Variable-size batch or fixed-size batch? If you don't mind, could you also point me to your project? I am wondering how these functions end up being used.

@jhux2

jhux2 commented Nov 8, 2018

@tawiesn's use case is a bit different from the general use case of QR factorization: for him, updating the QR is more or less a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in the matrix and he uses QR in the conventional way, but updating is probably better in my opinion

@kyungjoo-kim You are correct, the use case for the prolongator is small blocks, many with the same repeated pattern. I'm not sure what you mean by updating.

@dholladay00

@kyungjoo-kim I solve a block tridiagonal matrix in which each block row can have a different size. The size of a given block matrix will generally range from 10x10 to 1000x1000. The solve is accomplished via the Thomas algorithm and requires applying the inverse (this was done by LU, and is now done by QR for increased stability on ill-conditioned problems).

Each thread team (Kokkos thread team) is in charge of a block tridiagonal matrix, so I would like to pass a team handle into these calls so that they can be carried out with the appropriate level of parallelism. With the OpenMP backend, I currently place the BLAS/LAPACK call inside a Kokkos::single(PerTeam(...), ...). On the CUDA backend (currently), the wrapper calls into a handwritten version that uses the correct level of parallelism (TeamThread and ThreadVector ranges), but I don't have a handwritten geqrf/ormqr.
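
A rough sketch of that OpenMP-backend wrapper pattern, with the actual serial BLAS/LAPACK call passed in as a callable (the call itself is elided here):

#include <Kokkos_Core.hpp>

// Sketch: funnel a serial call through one thread of the team, then barrier
// so the result is visible to the whole team.
template <class TeamHandle, class Callable>
KOKKOS_INLINE_FUNCTION
void call_serial_on_team(const TeamHandle& team, const Callable& serial_call) {
  Kokkos::single(Kokkos::PerTeam(team), [&]() { serial_call(); });
  team.team_barrier();
}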

@mhoemmen
Contributor

mhoemmen commented Nov 8, 2018

@dholladay00 wrote:

but I don't have handwritten geqrf/ormqr.

I have the start of that, but it's not at all parallel. I also need this. It's a "team collective" interface; the intent on CPU is to pick cache-block-sized chunks on which to work.

@srajama1
Contributor

srajama1 commented Nov 8, 2018

Is Tpetra planning to be the place for team-based, batched QR? What are the design decisions behind having this in Tpetra? @kyungjoo-kim is implementing the compact QR in Kokkos Kernels as well. It would be good to sync up.

@mhoemmen
Contributor

mhoemmen commented Nov 9, 2018

The point is not to put this in Tpetra; the point is that a partial implementation exists in Tpetra that implementers could use as a starting point.

@srajama1
Contributor

srajama1 commented Nov 9, 2018

Point taken. My concerns about scope creep remain. Plus, I think the use case here is not TS.

@kyungjoo-kim
Contributor

There is no notion of "compact QR". All I wrote under the KokkosBatched namespace is just generic serial and team implementations of a subset of BLAS and LAPACK; the compact approach via the KokkosBatched code is one use case. I want to note that different algorithms and optimizations are needed for different use cases: one may want to use the serial and team code in the KokkosBatched routines, but the developed code may not be suited for your use cases. I am writing an eigenvalue solver targeting problem sizes of O(10~100). I already have a Householder QR, which is a by-product of the eigenvalue solver. I will merge them later.

@wlruys

wlruys commented Sep 29, 2021

I have a use case where my matrices start small and grow larger than the currently supported team-level versions of these routines can handle efficiently.

I am interested in a unified interface so that Kokkos can also be called for the device-level calls, instead of dispatching directly to a device-specific cuSOLVER or similar where needed.

Device level:
Lapack:

  • geqrf (unpivoted QR)
  • unmqr (apply Q)

(Also referencing #567 which is another request for the same kernels)

e10harvey added a commit that referenced this issue Jan 18, 2022
Change EPS to 1e-2 for float