BLAS/LAPACK calls people are interested in #9
Comments
From Email:
|
From Discussion with LANL |
@crtrott are these batched or plain or both? |
GETRI (global, team, and thread) would be useful too. That should actually reduce the priority of optimizing GETRS. (GETRI does explicit inverses; use that in the setup phase, and then you only need GEMV in the solve phase, instead of GETRS.) |
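A minimal sketch of that setup/solve split, using the host LAPACKE/CBLAS interfaces purely as stand-ins for whatever global/team/thread versions land here; the block size and contents are illustrative only:
```cpp
#include <vector>
#include <lapacke.h>
#include <cblas.h>

int main() {
  const int n = 5;
  std::vector<double> A(n * n, 1.0);
  for (int i = 0; i < n; ++i) A[i * n + i] = 10.0;  // diagonally dominant example block
  std::vector<lapack_int> ipiv(n);

  // Setup phase (done once): GETRF + GETRI form the explicit inverse in place.
  LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A.data(), n, ipiv.data());
  LAPACKE_dgetri(LAPACK_ROW_MAJOR, n, A.data(), n, ipiv.data());

  // Solve phase (done many times): applying the inverse is now a single GEMV,
  // instead of the pair of triangular solves GETRS would do each time.
  std::vector<double> b(n, 1.0), x(n, 0.0);
  cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A.data(), n,
              b.data(), 1, 0.0, x.data(), 1);
  return 0;
}
```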
For now the requested ones were just plain (i.e., non-batched). The team ones could obviously be used to build a batched one, but the actual use case we talked about builds the matrices on the fly for the solve. |
@nmhamster BlockCrsMatrix could use either global batched or team. |
Matt wants (@bathmatt):
Team Level:
Blas:
- gemm (with transpose)
Lapack:
- getf2
- getri
Really cares about A_inverse from A. A is generated on the fly in the team. |
So, I'm making a gemm routine. The BLAS interface has two chars for the A
and B matrices: if a char is 'n' that means no modification to A/B, if
't' then do a transpose, and if 'h' do a conjugate transpose. Now 'h' is only
valid for the complex gemm (zgemm). Since we have the matrix types as template
parameters we no longer have separate zgemm/dgemm, just gemm, and we figure out
the scalar type from that. What is the best way to handle that?
I have the generic interface:
```cpp
template <class MemberType,
          typename Scalar,
          class AView,
          class BView,
          class CView>
KOKKOS_INLINE_FUNCTION void
GEMM(const MemberType& team_member, char transa, char transb, Scalar alpha,
     AView A, BView B, Scalar beta, CView C);
```
We could specialize on Scalar, and then throw errors if one passes in 'h'
to the general interface, and only allow Kokkos::complex<double>/<float> into
the special ones?
|
In the low-level concept, there is no need to distinguish batched from plain, because batched just means that we use a parallel_for and put the team or serial interface inside. The global ones are simply wrappers, as we interface them to MKL and cuBLAS. @bathmatt, I do not think that it is necessary to specialize "h". For double/float, conjugate just returns the same value. At the implementation level it does not do anything different either, as conj(val) would return val if the value type is double. I am kind of strongly against using characters as function arguments. The use of characters in BLAS and LAPACK is just inherited from Fortran and we do not really follow that convention. |
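A rough sketch of the tag-plus-generic-conj idea being discussed, to make the trade-off concrete. The tag and helper names below are illustrative rather than an existing interface, and std::complex stands in here for whatever in-kernel complex type would actually be used:
```cpp
#include <complex>

// Transpose-mode tags instead of the 'n'/'t'/'h' characters.
struct NoTranspose {};
struct Transpose {};
struct ConjTranspose {};

// Conjugation that is the identity for real scalars and the true conjugate
// for complex ones, so 'h'-style code is valid for every value type and
// costs nothing for double/float once inlined.
template <typename T>
T conj_val(const T& v) { return v; }
template <typename T>
std::complex<T> conj_val(const std::complex<T>& v) { return std::conj(v); }

// Element access to op(A), resolved at compile time from the tag.
template <class AView>
auto op_elem(NoTranspose, const AView& A, int i, int j)   { return A(i, j); }
template <class AView>
auto op_elem(Transpose, const AView& A, int i, int j)     { return A(j, i); }
template <class AView>
auto op_elem(ConjTranspose, const AView& A, int i, int j) { return conj_val(A(j, i)); }

// A GEMM taking tags rather than chars could then be declared roughly as:
// template <class TransA, class TransB, class MemberType, typename Scalar,
//           class AView, class BView, class CView>
// KOKKOS_INLINE_FUNCTION void
// GEMM(const MemberType& team_member, Scalar alpha, AView A, BView B,
//      Scalar beta, CView C);
```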
@kyungjoo-kim It's nice to specialize "h" even for real value types, so that people can write code for any value type. |
So, do we want to have Kokkos::conj(v) work for a double then? That works
for me.
|
It does work for a double. Users can always call T instead of H if they don't want to call conj on doubles. |
@mhoemmen When we work on this collection of functionalities, we should care about code complexity and had better not have many specializations. Since we mostly need serial- and team-level implementations (suppose that they are decorated as KOKKOS_INLINE_FUNCTION), conj(v) with a double will have the conj call stripped out anyway. |
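A tiny check of that point, using an illustrative conj_val helper like the one sketched earlier (not a library function): the real-valued path reduces to the identity at compile time, so there is nothing left to strip in optimized code.
```cpp
#include <cassert>
#include <complex>

template <typename T>
constexpr T conj_val(const T& v) { return v; }  // real types: identity
template <typename T>
std::complex<T> conj_val(const std::complex<T>& v) { return std::conj(v); }

int main() {
  static_assert(conj_val(3.0) == 3.0, "conj of a real value is the value itself");
  assert(conj_val(std::complex<double>(1.0, 2.0)) == std::complex<double>(1.0, -2.0));
  return 0;
}
```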
Hi, I would be interested in dense GETRF, GETRS, and GEMV at thread level for small matrices, up to 3x3, in order to do linear interpolation between mesh nodes. This amounts to solving a small dense linear system at each node. These calls would be wrapped inside a functor used in a parallel_for that sweeps over my mesh, so they would need to work at thread level. |
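A minimal sketch of that usage pattern: one small (3x3) solve per mesh node inside a flat parallel_for. A hand-rolled, non-pivoted Gaussian elimination stands in for the requested thread-level GETRF/GETRS; the view names, sizes, and fill values are illustrative only.
```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int num_nodes = 1000;
    Kokkos::View<double*[3][3]> A("A", num_nodes);  // one 3x3 matrix per node
    Kokkos::View<double*[3]>    b("b", num_nodes);  // one right-hand side per node

    // Fill with a trivially nonsingular (diagonally dominant) system per node
    // so the example runs standalone.
    Kokkos::parallel_for("fill", num_nodes, KOKKOS_LAMBDA(const int e) {
      for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) A(e, i, j) = (i == j) ? 4.0 : 1.0;
        b(e, i) = 1.0;
      }
    });

    // One small solve per mesh node; b is overwritten with the solution.
    Kokkos::parallel_for("solve_3x3", num_nodes, KOKKOS_LAMBDA(const int e) {
      // Forward elimination (no pivoting; acceptable for these example blocks).
      for (int k = 0; k < 2; ++k)
        for (int i = k + 1; i < 3; ++i) {
          const double m = A(e, i, k) / A(e, k, k);
          for (int j = k; j < 3; ++j) A(e, i, j) -= m * A(e, k, j);
          b(e, i) -= m * b(e, k);
        }
      // Back substitution.
      for (int i = 2; i >= 0; --i) {
        double s = b(e, i);
        for (int j = i + 1; j < 3; ++j) s -= A(e, i, j) * b(e, j);
        b(e, i) = s / A(e, i, i);
      }
    });
  }
  Kokkos::finalize();
  return 0;
}
```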
MueLu would furthermore need, at team level, LAPACK routines to do a local QR decomposition (we need both R and Q built for the TentativePFactory). So far, I have my own Kokkos implementation of a QR decomposition which does the job for us, but it might make sense to optimize it and adapt the interface to fit the LAPACK calls... |
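For reference, a rough serial sketch of a small-block QR that produces both Q and R, in the spirit of the hand-rolled implementation mentioned above. Modified Gram-Schmidt is used here only for brevity (a Householder variant is what a library-quality team-level routine would want), and all names are illustrative.
```cpp
#include <cmath>

// A is m x n (m >= n, full column rank), row-major with leading dimension lda.
// On exit Q (m x n) has orthonormal columns and R (n x n) is upper triangular
// with A = Q * R.
inline void small_qr_mgs(int m, int n, const double* A, int lda,
                         double* Q, int ldq, double* R, int ldr) {
  // Start from Q = A.
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j) Q[i * ldq + j] = A[i * lda + j];

  for (int k = 0; k < n; ++k) {
    // R(k,k) = ||q_k||, then normalize column k.
    double nrm = 0.0;
    for (int i = 0; i < m; ++i) nrm += Q[i * ldq + k] * Q[i * ldq + k];
    nrm = std::sqrt(nrm);
    R[k * ldr + k] = nrm;
    for (int i = 0; i < m; ++i) Q[i * ldq + k] /= nrm;

    // Orthogonalize the remaining columns against q_k, recording R(k,j).
    for (int j = k + 1; j < n; ++j) {
      double dot = 0.0;
      for (int i = 0; i < m; ++i) dot += Q[i * ldq + k] * Q[i * ldq + j];
      R[k * ldr + j] = dot;
      for (int i = 0; i < m; ++i) Q[i * ldq + j] -= dot * Q[i * ldq + k];
    }
    // Zero the strictly lower part of R's column k.
    for (int i = k + 1; i < n; ++i) R[i * ldr + k] = 0.0;
  }
}
```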
We would need the following BLAS routines at Team and Thread level:
|
Jan: is there a particular problem size we should optimize for? |
That would be ideal, I guess. Depending on the polynomial order of our elements the problem size can vary quite significantly. |
@crtrott It would be really cool if Kokkos used LIBXSMM under the hood on supported platforms. LIBXSMM doesn't use threads so there should be no composition issues with Kokkos. There are a bunch of nice demonstrations of LIBXSMM for spectral/finite element codes, so this would likely align with @JanRobertE's request. |
@jeffhammond : Our tests compare against LIBXSMM where available, but we don't give that as a callable option within a thread/team yet. |
I'd love you to help out with this ;-) If you look at the BLAS interface you will see that we have a pretty generic way of plugging in TPLs. Putting LIBXSMM in is certainly an option we should consider. The basic idea for team-level / thread-level versions (and it looks like LIBXSMM is targeted only at the latter right now) would be something like this:
```cpp
void team_gemm(team_handle& team, ViewA A, ViewB B, ViewC C) {
  Impl::TeamGEMM<ViewA, ViewB, ViewC>::team_gemm(team, A, B, C);
}

template <class ViewA, class ViewB, class ViewC>
struct TeamGEMM_TPL_Avail {
  enum { value = 0 };
};

template <class ViewA, class ViewB, class ViewC,
          int TPL_AVAIL = TeamGEMM_TPL_Avail<ViewA, ViewB, ViewC>::value>
struct TeamGEMM {
  static inline void team_gemm(team_handle& team, ViewA a, ViewB b, ViewC c) {}
};
```
Plugging in another TPL would basically require you to specialize the TeamGEMM_TPL_Avail and TeamGEMM structs. |
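Continuing that sketch, enabling a TPL for one concrete set of View types might then look roughly like this. The View aliases are chosen only for illustration, and libxsmm_team_gemm is a placeholder wrapper, not an actual LIBXSMM entry point:
```cpp
// Concrete View types the TPL can handle (illustrative choice).
using ViewA_t = Kokkos::View<const double**, Kokkos::LayoutRight>;
using ViewB_t = Kokkos::View<const double**, Kokkos::LayoutRight>;
using ViewC_t = Kokkos::View<double**, Kokkos::LayoutRight>;

template <>
struct TeamGEMM_TPL_Avail<ViewA_t, ViewB_t, ViewC_t> {
  enum { value = 1 };
};

template <>
struct TeamGEMM<ViewA_t, ViewB_t, ViewC_t, 1> {
  static KOKKOS_INLINE_FUNCTION
  void team_gemm(team_handle& team, ViewA_t a, ViewB_t b, ViewC_t c) {
    // LIBXSMM kernels are sequential, so letting one thread of the team hand
    // the block off composes cleanly with Kokkos teams.
    Kokkos::single(Kokkos::PerTeam(team), [&]() {
      libxsmm_team_gemm(a.data(), b.data(), c.data(),           // placeholder
                        a.extent(0), b.extent(1), a.extent(1)); // wrapper call
    });
  }
};
```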
GETRF and GETRS for applying the inverse of a sparse matrix's block diagonal. For the application I'm currently considering, blocks are 5x5. An application with more physics would cause these blocks to grow. |
@jhux2 mentioned yesterday wanting batch versions of those functions. Would all blocks in a batch have the same dimensions, or could they possibly differ? |
@mhoemmen For the case I'm currently considering, all the blocks would have the same dimensions. |
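If all blocks share their dimensions, one natural sketch of the batched layout and dispatch is a single rank-3 View with one block per league member. The batched_getrf_getrs call below is a placeholder for whatever team-level factor/solve interface lands here, not an existing function; sizes are illustrative.
```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int num_blocks = 10000, blk = 5;
    Kokkos::View<double***> blocks("blocks", num_blocks, blk, blk);  // batch of 5x5 blocks
    Kokkos::View<double**>  rhs("rhs", num_blocks, blk);             // one RHS per block

    using policy_type = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for("block_diag_solve",
                         policy_type(num_blocks, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const policy_type::member_type& team) {
      const int b = team.league_rank();
      auto A = Kokkos::subview(blocks, b, Kokkos::ALL(), Kokkos::ALL());
      auto x = Kokkos::subview(rhs, b, Kokkos::ALL());
      // batched_getrf_getrs(team, A, x);  // placeholder for the team-level factor + solve
      (void)A; (void)x;
    });
  }
  Kokkos::finalize();
  return 0;
}
```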
We still need a fundamental decision on whether to provide a LAPACK interface. My personal preference is yes, but only stuff people actually ask for, and (at least initially) only calling through to TPL LAPACK. |
If you do LAPACK, I would be interested in xGESVD, xGESDD. |
@crtrott For the batch interface, will we assume that the LAPACK implementation can be called in a thread-parallel context? That's a bit risky if it's very old; it wasn't long ago that LAPACK implementations were not reliably thread safe. @egboman The above paragraph is directly relevant to you. Eigensolvers are hard, and the logical first parallel batch implementation of dense eigensolvers for non-GPU architectures would just call LAPACK in parallel. |
It seems likely that I will be moving away from LU and toward QR decomposition, which changes the set of routines I need. I see that these have been requested by @tawiesn. I have non-pivoted |
@dholladay00 The beginning of a hand-rolled QR lives in |
@mhoemmen I'll message you on the Kokkos users slack and continue this conversation there. |
@dholladay00 Can you explain a little bit more about your use case? @tawiesn's use case is a bit different from the general use case of the QR factorization. His use case is more or less one where updating a QR factorization is a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in a matrix and he uses QR in a conventional way, but updating is probably better in my opinion (@tawiesn, if you are still checking github and find that I am wrong, please correct me). On the other hand, @dholladay00, your use case seems to be that you just want to invert a matrix or solve a problem. What is the range of your problem sizes? Variable batch, or fixed-size batch? If you don't mind, could you also point out your project? I am wondering how these developed functions would be used. |
@kyungjoo-kim You are correct, the use case for the prolongator is small blocks, many with the same repeated pattern. I'm not sure what you mean by updating. |
@kyungjoo-kim I solve a block tridiagonal matrix in which each block row can have a different size. The size of a given block matrix will generally be from 10x10 to 1000x1000. The solve is accomplished via the Thomas algorithm, and it requires applying the inverse (previously done by LU, now done by QR due to increased stability for ill-conditioned problems). Each thread team (Kokkos thread team) is in charge of a block tridiagonal matrix, so I would like to pass a team handle into these calls so that they can be accomplished with the appropriate level of parallelism. With the OpenMP backend, I currently place the blas/lapack call inside of a |
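A structural sketch of that dispatch, assuming for brevity a fixed number of block rows per system: one Kokkos team per block-tridiagonal matrix, with the Thomas sweep running sequentially over block rows inside the team. team_qr_factor and team_qr_apply_inverse are placeholders for the requested team-level routines, not existing functions.
```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_type = Kokkos::TeamPolicy<>;
    const int num_systems    = 64;   // one block-tridiagonal matrix per team
    const int num_block_rows = 32;   // illustrative; real sizes vary per row

    Kokkos::parallel_for("block_thomas",
                         policy_type(num_systems, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const policy_type::member_type& team) {
      // Forward sweep of the Thomas algorithm: each step factors the current
      // diagonal block and applies its inverse to the coupling block and RHS,
      // using the whole team for each block operation.
      for (int row = 0; row < num_block_rows; ++row) {
        // team_qr_factor(team, D_row);                  // placeholder
        // team_qr_apply_inverse(team, D_row, U_row);    // placeholder
        // team_qr_apply_inverse(team, D_row, rhs_row);  // placeholder
        team.team_barrier();  // block rows are processed one after another
      }
      // Backward substitution would follow the same per-team pattern.
    });
  }
  Kokkos::finalize();
  return 0;
}
```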
@dholladay00 wrote:
I have the start of that, but it's not at all parallel. I also need this. It's a "team collective" interface; the intent on CPU is to pick cache-block-sized chunks on which to work. |
Is Tpetra planning to be the place for team-based, batched QR? What are the design decisions in having this in Tpetra? @kyungjoo-kim is implementing the compact QR in Kokkos Kernels as well. It would be good to sync up. |
The point is not to put this in Tpetra; the point is that a partial implementation exists in Tpetra that implementers could use as a starting point. |
Point taken. My concerns about scope creep remain. Plus I think the use case here is not TS. |
There is no notion of compact QR. All I wrote under the KokkosBatched namespace is just generic implementations of the serial and team interfaces for a subset of BLAS and LAPACK. The compact approach via the KokkosBatched code is one use case. I want to note that different algorithms and optimizations are needed for different use cases. One may want to use the serial and team code in the KokkosBatched routines, but the developed code may not be suited for your use cases. I am writing an eigenvalue solver targeting problem sizes O(10~100). I already have a Householder QR which is a by-product of the eigenvalue solver. I will merge them later. |
I have a use case where my matrices start small and grow larger than the currently supported team-level versions of these routines would handle efficiently. I am also interested in a unified interface, so that Kokkos can be called for the device-level calls too instead of dispatching directly to a device-specific cuSOLVER or similar where needed. Device level:
(Also referencing #567 which is another request for the same kernels) |
This is just to collect stuff. I will update the first post if more comes in. This is not a promise of what is going to be there when; it's just to help us plan. I differentiate global, team, and thread kernels.
BLAS
Global:
Team:
Thread:
LAPACK
Global:
Team:
Thread: