Hybrid OpenMP-MPI implementation Strategy + Vectorization (SIMD) #789
Intro to SIMD

The ALUs of modern CPUs are capable of processing multiple elements of built-in types simultaneously by applying one instruction (e.g. an add) to a register holding several of those elements. Registers are at the very top of the memory hierarchy; for any computation to be performed, data needs to be in registers.
- Why should we care about SIMD?
- Relation with data structures
- Relation with algorithms
- What elements should we try to process simultaneously?

Intro to SPMD

This one is simpler: in a nutshell, multiple threads operate on the sub-domain of an MPI rank.
- Why should we care about SPMD?
- Relation with algorithms
None of these is without drawbacks.
Coloring is what one sees most in the literature, and yet I lean towards gather-to-scatter: fewer things can go wrong with it as it is easy to understand, and one gets the maximum amount of parallelism. I will now take two familiar routines, computing gradients (Green-Gauss) and limiters, vectorize / parallelize them in different ways, and measure relative performance to illustrate some of the key points introduced here. There will be C++ snippets and there will be some x86 assembly too :)
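To make the two ideas concrete before diving in, here is a minimal illustration (a sketch assuming OpenMP 4+ support; with AVX2 each 256-bit register holds 4 doubles, so the compiler can turn the first loop into vector adds, while the second loop splits the points of a rank among its threads):

// SIMD: one instruction processes several elements held in a register
void addArrays(size_t n, const double* a, const double* b, double* c)
{
  #pragma omp simd
  for(size_t i=0; i<n; ++i)
    c[i] = a[i] + b[i];
}
// SPMD: the threads of an MPI rank share the work on its sub-domain
void scaleField(size_t nPoint, double factor, double* phi)
{
  #pragma omp parallel for schedule(static)
  for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
    phi[iPoint] *= factor;
}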
Disclaimer

With that out of the way :) ...

Green-Gauss Gradients

This is the plain edge-loop version of the code, with boundary contributions omitted for simplicity:

void computeGradients(size_t nEdge,
size_t nPoint,
size_t nVar,
size_t nDim,
const vector<pair<size_t,size_t> >& connectivity,
const Matrix& area,
const vector<double>& volume,
const Matrix& phi,
VectorOfMatrix& grad)
{
grad.setZero();
for(size_t iEdge=0; iEdge<nEdge; ++iEdge)
{
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double phi_ave = 0.5*(phi(iPoint,iVar)+phi(jPoint,iVar));
for(size_t iDim=0; iDim<nDim; ++iDim)
{
double flux = phi_ave*area(iEdge,iDim);
grad(iPoint,iVar,iDim) += flux;
grad(jPoint,iVar,iDim) -= flux;
}
}
}
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
grad(iPoint,iVar,iDim) /= volume[iPoint];
}

This is more or less what SU2 does, with minor differences in how the edges are handled. Suppose now that, due to a perfect storm, the number of variables is 4; here is how, with a few pragmas, we get gcc to vectorize:

template<size_t nVar>
void computeGradients_impl(size_t nEdge,
size_t nPoint,
size_t nDim,
const vector<pair<size_t,size_t> >& connectivity,
const Matrix& area,
const vector<double>& volume,
const Matrix& phi,
VectorOfMatrix& grad)
{
grad.setZero();
for(size_t iEdge=0; iEdge<nEdge; ++iEdge)
{
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
double phi_ave[nVar];
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
phi_ave[iVar] = 0.5*(phi(iPoint,iVar)+phi(jPoint,iVar));
for(size_t iDim=0; iDim<nDim; ++iDim)
{
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double flux = phi_ave[iVar]*area(iEdge,iDim);
grad(iPoint,iVar,iDim) += flux;
grad(jPoint,iVar,iDim) -= flux;
}
}
}
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
for(size_t iDim=0; iDim<nDim; ++iDim)
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
grad(iPoint,iVar,iDim) /= volume[iPoint];
}

Well, it is not just a few pragmas: we need to make the number of variables known at compile time (via a template parameter) and we need to transpose how the gradient is stored, i.e. instead of {xyz, xyz, xyz, xyz} we need {xxxx, yyyy, zzzz}. This code gets a speed-up of 2.2. The code is generic, but the template needs to be instantiated for every possible number of variables and we need a way to dispatch to the right instantiation at run time (a simple switch on nVar will do; a sketch is given at the end of this post).

We can switch to a point-based loop and process multiple points in a SIMD way, which avoids these issues:

void computeGradients(size_t nPoint,
size_t nVar,
size_t nDim,
const Adjacency& adj,
const Matrix& area,
const vector<double>& volume,
const Matrix& phi,
VectorOfMatrix& grad)
{
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
{
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
grad(iPoint,iVar,iDim) = 0.0;
for(size_t iNeigh=0; iNeigh<adj.nNeighbor(iPoint); ++iNeigh)
{
size_t jPoint = adj.jPoint(iPoint,iNeigh);
size_t iEdge = adj.iEdge(iPoint,iNeigh);
double dir = adj.dir(iPoint,iNeigh);
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double phi_ave = 0.5*(phi(iPoint,iVar)+phi(jPoint,iVar));
for(size_t iDim=0; iDim<nDim; ++iDim)
grad(iPoint,iVar,iDim) += phi_ave*dir*area(iEdge,iDim);
}
}
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
grad(iPoint,iVar,iDim) /= volume[iPoint];
}
}

The Adjacency structure provides, for each point, its neighbours, the corresponding edge indices, and the direction (sign) with which each edge contributes. The SIMD version of this code is:

void computeGradients(size_t nPoint,
size_t nVar,
size_t nDim,
const Adjacency<4>& adj,
const Matrix& area,
const Vector& volume,
const Matrix& phi,
VectorOfMatrix& grad)
{
const size_t SIMDLEN = 4;
for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)
{
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
grad.setVec(iPoint,iVar,iDim,Array<double,SIMDLEN>(0.0));
for(size_t iNeigh=0; iNeigh<adj.nNeighbor_vec(iPoint); ++iNeigh)
{
auto jPoint = adj.jPoint_vec(iPoint,iNeigh);
auto iEdge = adj.iEdge_vec(iPoint,iNeigh);
auto dir = adj.dir_vec(iPoint,iNeigh);
for(size_t iVar=0; iVar<nVar; ++iVar)
{
auto phi_ave = (phi.getVec(iPoint,iVar)+
phi.getVec(jPoint,iVar))*0.5;
for(size_t iDim=0; iDim<nDim; ++iDim)
grad.addVec(iPoint,iVar,iDim,
phi_ave*dir*area.getVec(iEdge,iDim));
}
}
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
grad.setVec(iPoint,iVar,iDim,
grad.getVec(iPoint,iVar,iDim)/volume.getVec(iPoint));
}
}

I think this is just as readable, especially considering that in SU2 we always need to use some Set/Get/Add/Sub method to update a variable; the difference is that here those methods have overloads to operate on small fixed-size vectors. The speedup is 1.35 (i.e. 35% faster than the edge-based reference); note that the improvement relative to the scalar point-based version is only 1.6, those pesky gathers... The loop advances SIMDLEN points per iteration, using a small SIMD-friendly array type along these lines:

template<class T, size_t N>
class Array
{
#define FOREACH for(size_t k=0; k<N; ++k)
public:
enum : size_t {Size = N};
enum : size_t {Align = N*sizeof(T)};
private:
// fixed size and aligned array of internal data, naturally maps to a SIMD register
alignas(Align) T vals_[N];
/*
* Some helper methods go here
*/
public:
// **** CONSTRUCTORS **** //
// We want to be able to construct this type from single scalars,
// a memory location from which we LOAD data,
// or a memory location and some offsets from which we GATHER data.
// In addition to the "normal" constructors.
// scalar broadcasting ctor
STRONGINLINE Array(T x) {bcast(x);}
// loading ctor
STRONGINLINE Array(const T* ptr)
{
#pragma omp simd aligned(ptr:Align)
FOREACH vals_[k] = ptr[k];
}
// gathering ctor
template<class U>
STRONGINLINE Array(const T* base_ptr, const U& offsets)
{
#pragma omp simd
FOREACH vals_[k] = base_ptr[offsets[k]];
}
/*
* Other traditional constructors (default, copy-ctor, move-ctor, etc) go here
*/
// **** ACCESSORS **** //
STRONGINLINE T& operator[] (size_t k) {return vals_[k];}
STRONGINLINE T operator[] (size_t k) const {return vals_[k];}
// **** MATH OPERATORS **** //
STRONGINLINE Array& operator= (const Array& rhs)
{
#pragma omp simd
FOREACH vals_[k] = rhs.vals_[k];
return *this;
}
STRONGINLINE Array& operator+= (const Array& rhs)
{
#pragma omp simd
FOREACH vals_[k] += rhs.vals_[k];
return *this;
}
STRONGINLINE Array operator+ (const Array& rhs) const { return Array(*this)+=rhs; }
/*
* Many other operators go here.
*/
};
// Common math function overloads
template<class T>
STRONGINLINE T vmax(const T& a, const T& b)
{
T res;
#pragma omp simd
for(size_t k=0; k<T::Size; ++k)
res[k] = (a[k]>b[k])? a[k] : b[k];
return res;
}
#undef FOREACH

There are other (better) ways to do this, for example using x86 intrinsics. To pull this off we do not need to change how the data is stored; the containers simply get accessors that construct these small arrays on the fly:

// use the "pointer ctor" to return an array starting at "row0"
Array<double,4> Matrix<double>::getVec(size_t row0, size_t col) const {
return Array<double,4>(&data_[row0+col*rows_]);
}
// use the "gather ctor" to return an array with the indices in "rows"
template<class U>
Array<double,4> Matrix<double>::getVec(const U& rows, size_t col) const {
return Array<double,4>(&data_[col*rows_], rows);
}

After inlining, those copies get optimized away. But even within a SIMD-sized group points can have a different number of neighbours; this is handled by padding the adjacency of each group up to its largest neighbour count. This concept of padding is important for something else: you may have noticed that the SIMD point-loops I showed make no provisions for values of nPoint that are not multiples of SIMDLEN. That is because the containers already took care of that by rounding up the allocation, and so the seemingly out-of-bounds accesses are safe (ain't encapsulation great). Padding also aligns the start of each column, thus it is a generally good thing to have (on large dimensions) whether used or not. Here is a relative performance recap before we talk about parallelization.
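As an aside, a padded column-major container along these lines could look like the following sketch (the member names and SIMD length are assumptions, not the actual SU2 or prototype classes):

template<class T>
class Matrix
{
private:
  size_t rows_, cols_;  // rows_ is the leading dimension, rounded up to a multiple
                        // of the SIMD length so column starts are aligned and the
                        // "tail" loads of the point loops stay inside the allocation
  vector<T> data_;      // column-major: element (i,j) lives at data_[i + j*rows_]
public:
  Matrix(size_t nRows, size_t nCols, size_t simdLen = 4) :
    rows_(((nRows+simdLen-1)/simdLen)*simdLen), cols_(nCols),
    data_(rows_*cols_, T(0)) {}

  T& operator() (size_t i, size_t j)       { return data_[i + j*rows_]; }
  T  operator() (size_t i, size_t j) const { return data_[i + j*rows_]; }

  // the SIMD accessor shown above, safe near the end of a column thanks to the padding
  Array<T,4> getVec(size_t i0, size_t j) const { return Array<T,4>(&data_[i0 + j*rows_]); }
};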
Parallel execution

I will start at the end for this one: all it takes to parallelize the point loops with OpenMP is to take this:

for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)

And add some pixie dust:

#pragma omp parallel for schedule(dynamic,128)
for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)

This means each thread gets chunks of 128 loop iterations (512 points) to work on, assigned in a dynamic way; the 4-core speedup (still relative to our reference) is 3.8 for the SIMD code and 2.8 for the scalar code. Parallelizing the edge loops is a bit more intricate, as this:

for(size_t iEdge=0; iEdge<nEdge; ++iEdge)
{
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;

Becomes:

for(size_t color=0; color<colorStart.size()-1; ++color)
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
{
#if SORT_BY_COLOR==1
size_t iEdge = k;
#else
size_t iEdge = edgeIdx[k];
#endif
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;

Apologies for the macro, it is just to illustrate that if we re-sort the edge data after coloring the edge index is the loop index, otherwise the edge indices for each color need to be stored in a separate array. I mentioned in the introduction that coloring reduces locality and therefore performance; the effect of the color group size on the execution time of the scalar code on one thread confirms this. Running the edge-loop version on 4 cores (8192 group size + sorting) we get speedups (relative to reference) of 1.98 and 2.04 for the scalar and SIMD versions respectively (yes, I quadruple checked). This is the 4-core summary:
I think the point-based versions scale better because they are a bit less memory-bound, as they write to the gradient sequentially, and they have a bit more compute due to the duplicated computations.

Conclusion

Next I will talk about limiters; almost all concepts are introduced so it will be shorter (promise). As a little appetizer let me tell you we can recover the extra memory, and we could be looking at a 2.7x speedup for gradients + limiters.
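Regarding the run-time dispatch mentioned above for the templated kernels, a plain switch is enough. A sketch (the set of instantiated sizes and the name of the scalar fallback are assumptions):

void computeGradients(size_t nEdge, size_t nPoint, size_t nVar, size_t nDim,
                      const vector<pair<size_t,size_t> >& connectivity,
                      const Matrix& area, const vector<double>& volume,
                      const Matrix& phi, VectorOfMatrix& grad)
{
  // map the run-time number of variables onto a compile-time instantiation
  switch(nVar)
  {
    case 4: computeGradients_impl<4>(nEdge, nPoint, nDim, connectivity,
                                     area, volume, phi, grad); break;
    case 5: computeGradients_impl<5>(nEdge, nPoint, nDim, connectivity,
                                     area, volume, phi, grad); break;
    case 6: computeGradients_impl<6>(nEdge, nPoint, nDim, connectivity,
                                     area, volume, phi, grad); break;
    default: // hypothetical scalar fallback for sizes that were not instantiated
      computeGradients_scalar(nEdge, nPoint, nVar, nDim, connectivity,
                              area, volume, phi, grad); break;
  }
}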
Nice progress @pcarruscag!

I like the concept of your SIMD-friendly class that will take care of the data structure under the hood, coupled with a standard type of loop statement (w/ +SIMDLEN). This should make it pretty easy for folks to still modify the kernels without having to worry about the data alignment, and they can reuse the same simple 'for' construct repeatedly. Another reason to have our own lightweight class for this is that you can avoid dependence on OpenMP for SIMD (although that feature looks to have potential and wasn't available until somewhat recently) as well as the intrinsics. In my experience, the latter is especially bad for portability and readability (part of why we left the CaF work in a separate repo). It starts to become so specialized that compiling and modifying become difficult. W.r.t. OpenMP, another roadblock there a few years ago was making sure it is interoperable with CoDi for the adjoint, but I know this has been worked on and may be available by now.

Might keep an open mind about point vs. edge. In some places, we may be able to pump up the compute in our loops by fusing kernels, as previously discussed (and I am guessing you are working on this already with gradients/limiters). Could change the final performance numbers significantly.

Lastly, I know you are not there yet, but it is worth considering whether you can reuse anything you are developing in the kernels here for the linear solver routines. At some point, you will successfully reduce the cost of the residual kernels (RHS) to the bandwidth limit, and the majority of the iteration cost will be in the linear solver (it is already about 50% of the iteration cost before optimization, if I recall). Before making final decisions on strategy, you should consider if it will help in any of the linear solver routines too.
Thanks @economon!

Limiters

Scalar (reference) version of the code:

void computeLimiters(size_t nPoint,
size_t nVar,
size_t nDim,
const vector<pair<size_t,size_t> >& connectivity,
const Matrix& coords,
const Matrix& phi,
const VectorOfMatrix& grad,
Matrix& phiMax,
Matrix& phiMin,
Matrix& limiter)
{
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
{
for(size_t iVar=0; iVar<nVar; ++iVar)
{
phiMax(iPoint,iVar) = phi(iPoint,iVar);
phiMin(iPoint,iVar) = phi(iPoint,iVar);
limiter(iPoint,iVar) = 2.0;
}
}
for(auto edge : connectivity)
{
size_t iPoint = edge.first;
size_t jPoint = edge.second;
for(size_t iVar=0; iVar<nVar; ++iVar)
{
phiMax(iPoint,iVar) = max(phiMax(iPoint,iVar), phi(jPoint,iVar));
phiMin(iPoint,iVar) = min(phiMin(iPoint,iVar), phi(jPoint,iVar));
phiMax(jPoint,iVar) = max(phiMax(jPoint,iVar), phi(iPoint,iVar));
phiMin(jPoint,iVar) = min(phiMin(jPoint,iVar), phi(iPoint,iVar));
}
}
for(auto edge : connectivity)
{
size_t iPoint = edge.first;
size_t jPoint = edge.second;
double d_ij[3] = {0.0, 0.0, 0.0};
for(size_t iDim=0; iDim<nDim; ++iDim)
d_ij[iDim] = 0.5*(coords(jPoint,iDim)-coords(iPoint,iDim));
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double proj_i = 0.0, proj_j = 0.0;
for(size_t iDim=0; iDim<nDim; ++iDim)
{
proj_i += d_ij[iDim]*grad(iPoint,iVar,iDim);
proj_j -= d_ij[iDim]*grad(jPoint,iVar,iDim);
}
double lim_i = phiMax(iPoint,iVar);
double lim_j = phiMax(jPoint,iVar);
const double eps = numeric_limits<double>::epsilon();
if(proj_i <= 0.0)
{
lim_i = phiMin(iPoint,iVar);
proj_i = min(proj_i, -eps);
}
if(proj_j <= 0.0)
{
lim_j = phiMin(jPoint,iVar);
proj_j = min(proj_j, -eps);
}
lim_i = (lim_i-phi(iPoint,iVar))/proj_i;
limiter(iPoint,iVar) = min(limiter(iPoint,iVar), lim_i);
lim_j = (lim_j-phi(jPoint,iVar))/proj_j;
limiter(jPoint,iVar) = min(limiter(jPoint,iVar), lim_j);
}
}
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
{
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double lim = limiter(iPoint,iVar);
limiter(iPoint,iVar) = lim*(lim+2)/(lim*lim+lim+2);
}
}
}

Something in the code above is a bit different from the implementation in SU2, namely:

double lim_i = phiMax(iPoint,iVar);
if(proj_i <= 0.0) {
lim_i = phiMin(iPoint,iVar);
proj_i = min(proj_i, -eps);
}

This is the bit of code that selects the right delta based on the sign of the projection and avoids division by zero. This less readable version does the same with one branch instead of three; simplifying "if" statements is essential for vectorization, so to make the comparison fair I used the same strategy in the scalar code. To keep this post shorter I will show the SIMD and parallel version of the code right away. Trying to process multiple edges instead of multiple variables has all the problems I mentioned for the gradients, so again we use the trick of templating on the number of variables.

template<size_t nVar>
void computeLimiters_impl(size_t nPoint,
size_t nDim,
const vector<size_t>& colorStart,
const vector<size_t>& edgeIdx,
const vector<pair<size_t,size_t> >& connectivity,
const Matrix& coords,
const Matrix& phi,
const VectorOfMatrix& grad,
Matrix& phiMax,
Matrix& phiMin,
Matrix& limiter)
{
// initialize
#pragma omp parallel for schedule(dynamic,TARGET_CHUNK_SIZE)
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
{
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
phiMax(iPoint,iVar) = phi(iPoint,iVar);
phiMin(iPoint,iVar) = phi(iPoint,iVar);
limiter(iPoint,iVar) = 2.0;
}
}
// find min and max neighbor
for(size_t color=0; color<colorStart.size()-1; ++color)
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
{
#if SORT_BY_COLOR==1
size_t iEdge = k;
#else
size_t iEdge = edgeIdx[k];
#endif
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
// some hand-holding needed for simd min/max with gcc,
// one of the min/max operands needs to be on the stack
// (so the compiler knows the two do not overlap?)
double phi_i[nVar], phi_j[nVar];
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
phi_i[iVar] = phi(iPoint,iVar);
phi_j[iVar] = phi(jPoint,iVar);
}
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
phiMax(iPoint,iVar) = max(phiMax(iPoint,iVar), phi_j[iVar]);
phiMin(iPoint,iVar) = min(phiMin(iPoint,iVar), phi_j[iVar]);
phiMax(jPoint,iVar) = max(phiMax(jPoint,iVar), phi_i[iVar]);
phiMin(jPoint,iVar) = min(phiMin(jPoint,iVar), phi_i[iVar]);
}
}
for(size_t color=0; color<colorStart.size()-1; ++color)
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
{
#if SORT_BY_COLOR==1
size_t iEdge = k;
#else
size_t iEdge = edgeIdx[k];
#endif
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
// i to j vector
double d_ij[3] = {0.0, 0.0, 0.0};
for(size_t iDim=0; iDim<nDim; ++iDim)
d_ij[iDim] = 0.5*(coords(jPoint,iDim)-coords(iPoint,iDim));
// projections
double proj_i[nVar], proj_j[nVar];
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
proj_i[iVar] = proj_j[iVar] = 0.0;
for(size_t iDim=0; iDim<nDim; ++iDim)
{
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
proj_i[iVar] += d_ij[iDim]*grad(iPoint,iVar,iDim);
proj_j[iVar] -= d_ij[iDim]*grad(jPoint,iVar,iDim);
}
}
// choose the "right" delta based on sign of projection
// and avoid division by zero
double lim_i[nVar], lim_j[nVar];
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
lim_i[iVar] = phiMax(iPoint,iVar);
lim_j[iVar] = phiMax(jPoint,iVar);
}
const double eps = numeric_limits<double>::epsilon();
// very simple if's are required to get vectorization
// through vector comparisons and masked blends
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
if(proj_i[iVar] <= 0.0)
{
lim_i[iVar] = phiMin(iPoint,iVar);
proj_i[iVar] = min(proj_i[iVar], -eps);
}
if(proj_j[iVar] <= 0.0)
{
lim_j[iVar] = phiMin(jPoint,iVar);
proj_j[iVar] = min(proj_j[iVar], -eps);
}
}
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
lim_i[iVar] = (lim_i[iVar]-phi(iPoint,iVar))/proj_i[iVar];
limiter(iPoint,iVar) = min(limiter(iPoint,iVar), lim_i[iVar]);
lim_j[iVar] = (lim_j[iVar]-phi(jPoint,iVar))/proj_j[iVar];
limiter(jPoint,iVar) = min(limiter(jPoint,iVar), lim_j[iVar]);
}
}
#pragma omp parallel for schedule(dynamic,TARGET_CHUNK_SIZE)
for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
{
#pragma omp simd
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double lim = limiter(iPoint,iVar);
limiter(iPoint,iVar) = lim*(lim+2)/(lim*lim+lim+2);
}
}
}

Again to keep things short, here is the parallel and SIMD point-loop version (like for gradients, it is very similar to the scalar and sequential version).

void computeLimiters(size_t nPoint,
size_t nVar,
size_t nDim,
const Adjacency<4>& adj,
const Matrix& coords,
const Matrix& phi,
const VectorOfMatrix& grad,
Matrix& limiter)
{
const size_t SIMDLEN = 4;
using FltVec = Array<double,SIMDLEN>;
// working variables
FltVec phiMax[MAXNVAR], phiMin[MAXNVAR], prjMax[MAXNVAR], prjMin[MAXNVAR];
const double eps = numeric_limits<double>::epsilon();
#pragma omp parallel for schedule(dynamic,128) private(phiMax,phiMin,prjMax,prjMin)
for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)
{
for(size_t iVar=0; iVar<nVar; ++iVar)
{
phiMin[iVar] = phiMax[iVar] = phi.getVec(iPoint,iVar);
prjMax[iVar] = eps;
prjMin[iVar] = -eps;
}
for(size_t iNeigh=0; iNeigh<adj.nNeighbor(iPoint); ++iNeigh)
{
auto jPoint = adj.jPoint_vec(iPoint,iNeigh);
FltVec d_ij[3] = {FltVec(0.0), FltVec(0.0), FltVec(0.0)};
for(size_t iDim=0; iDim<nDim; ++iDim)
d_ij[iDim] = (coords.getVec(jPoint,iDim)-
coords.getVec(iPoint,iDim))*0.5;
for(size_t iVar=0; iVar<nVar; ++iVar)
{
FltVec prj = 0.0;
for(size_t iDim=0; iDim<nDim; ++iDim)
prj += d_ij[iDim]*grad.getVec(iPoint,iVar,iDim);
prjMax[iVar] = vmax(prjMax[iVar], prj);
prjMin[iVar] = vmin(prjMin[iVar], prj);
phiMax[iVar] = vmax(phiMax[iVar], phi.getVec(jPoint,iVar));
phiMin[iVar] = vmin(phiMin[iVar], phi.getVec(jPoint,iVar));
}
}
for(size_t iVar=0; iVar<nVar; ++iVar)
{
FltVec lim = vmin(FltVec(2.0), vmin(
(phiMax[iVar]-phi.getVec(iPoint,iVar))/prjMax[iVar],
(phiMin[iVar]-phi.getVec(iPoint,iVar))/prjMin[iVar]));
limiter.setVec(iPoint,iVar, lim*(lim+2.0)/(lim*lim+lim+2.0));
}
}
}

In terms of algorithm, for each point we find the min and max neighbour values and the min (negative) and max (positive) projections; those are then combined in a final loop over the variables.

Like @economon said, fusing the gradient kernel with the limiter kernel is trivial with these point loops, and I do not think it affects readability much since one can clearly tell "what is what" (I will not put it here but it really is a matter of copy paste). Including the boundaries could be a bit more challenging, but I will give performance numbers nevertheless.

Performance summary
The basic point version does not lose to the edge-based one because, contrary to gradients, it does not require duplication of computations, while benefiting from sequential access to the gradients. If we consider the combined cost of gradients and limiters, and compare the scalar "edge+edge" with the SIMD "point+point" and "fused point" versions, we get:
Fusing the point loops only gives a 14% improvement vs separate loops due to the difference in gathered data: only one gather is amortized and the remaining memory accesses are very efficient.
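Although the fused kernel is not shown above, its structure is roughly as follows. This is only a sketch reusing the containers and helpers from the earlier snippets (not the exact code that was timed), and it shows which gather is amortized: the phi values of the neighbours are reused by both kernels, while their coordinates still need a gather of their own.

void computeGradientsAndLimiters(size_t nPoint,
                                 size_t nVar,
                                 size_t nDim,
                                 const Adjacency<4>& adj,
                                 const Matrix& area,
                                 const Vector& volume,
                                 const Matrix& coords,
                                 const Matrix& phi,
                                 VectorOfMatrix& grad,
                                 Matrix& limiter)
{
  const size_t SIMDLEN = 4;
  using FltVec = Array<double,SIMDLEN>;
  const double eps = numeric_limits<double>::epsilon();
  #pragma omp parallel for schedule(dynamic,128)
  for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)
  {
    // working variables live on the stack of each thread
    FltVec gradI[MAXNVAR][MAXNDIM], phiMax[MAXNVAR], phiMin[MAXNVAR],
           prjMax[MAXNVAR], prjMin[MAXNVAR];
    for(size_t iVar=0; iVar<nVar; ++iVar)
    {
      phiMin[iVar] = phiMax[iVar] = phi.getVec(iPoint,iVar);
      prjMax[iVar] = eps;
      prjMin[iVar] = -eps;
      for(size_t iDim=0; iDim<nDim; ++iDim)
        gradI[iVar][iDim] = FltVec(0.0);
    }
    // first pass over the neighbors: gradient accumulation and min/max of phi,
    // the gather of phi(jPoint) is the one shared by the two kernels
    for(size_t iNeigh=0; iNeigh<adj.nNeighbor_vec(iPoint); ++iNeigh)
    {
      auto jPoint = adj.jPoint_vec(iPoint,iNeigh);
      auto iEdge = adj.iEdge_vec(iPoint,iNeigh);
      auto dir = adj.dir_vec(iPoint,iNeigh);
      for(size_t iVar=0; iVar<nVar; ++iVar)
      {
        auto phi_j = phi.getVec(jPoint,iVar);
        auto phi_ave = (phi.getVec(iPoint,iVar)+phi_j)*0.5;
        for(size_t iDim=0; iDim<nDim; ++iDim)
          gradI[iVar][iDim] += phi_ave*dir*area.getVec(iEdge,iDim);
        phiMax[iVar] = vmax(phiMax[iVar], phi_j);
        phiMin[iVar] = vmin(phiMin[iVar], phi_j);
      }
    }
    // close the gradient and store it
    for(size_t iVar=0; iVar<nVar; ++iVar)
      for(size_t iDim=0; iDim<nDim; ++iDim)
      {
        gradI[iVar][iDim] = gradI[iVar][iDim]/volume.getVec(iPoint);
        grad.setVec(iPoint,iVar,iDim, gradI[iVar][iDim]);
      }
    // second pass: projections use the local gradient, only coords are gathered
    for(size_t iNeigh=0; iNeigh<adj.nNeighbor_vec(iPoint); ++iNeigh)
    {
      auto jPoint = adj.jPoint_vec(iPoint,iNeigh);
      FltVec d_ij[3] = {FltVec(0.0), FltVec(0.0), FltVec(0.0)};
      for(size_t iDim=0; iDim<nDim; ++iDim)
        d_ij[iDim] = (coords.getVec(jPoint,iDim)-coords.getVec(iPoint,iDim))*0.5;
      for(size_t iVar=0; iVar<nVar; ++iVar)
      {
        FltVec prj = 0.0;
        for(size_t iDim=0; iDim<nDim; ++iDim)
          prj += d_ij[iDim]*gradI[iVar][iDim];
        prjMax[iVar] = vmax(prjMax[iVar], prj);
        prjMin[iVar] = vmin(prjMin[iVar], prj);
      }
    }
    // combine into the limiter exactly as in the separate kernel
    for(size_t iVar=0; iVar<nVar; ++iVar)
    {
      FltVec lim = vmin(FltVec(2.0), vmin(
          (phiMax[iVar]-phi.getVec(iPoint,iVar))/prjMax[iVar],
          (phiMin[iVar]-phi.getVec(iPoint,iVar))/prjMin[iVar]));
      limiter.setVec(iPoint,iVar, lim*(lim+2.0)/(lim*lim+lim+2.0));
    }
  }
}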
Thank you for the very rigorous work. Maybe it would be nice to keep the possibility to do async MPI-I/O later. For reference, people from Intel explain a clever strategy in a recently published paper, "Asynchronous and multithreaded communications on irregular ...".
Thanks @MicK7, I will have a look. My initial thought was to have a simple strategy where, within each MPI rank, parallelism is extracted via colouring or scatter-to-gather transformations and only one thread per rank participates in the message passing. I have no experience here though, so this might be a bad strategy, idk. Back to business:

Parallel strategy for flux computation

Because significant computation is required to obtain each edge's flux, it does not make sense to attempt a "point-loop" strategy (which would double the effort).

Matrix Updates

By this I mean the updates of the Jacobian blocks done from the edge loops; here is a dummy loop that performs only those updates:

void testLoop1(const vector<size_t>& colorStart,
const vector<size_t>& edgeIdx,
const vector<pair<size_t,size_t> >& connectivity,
double** blk_i, double** blk_j,
SparseMatrix& matrix)
{
matrix.setZero();
for(size_t color=0; color<colorStart.size()-1; ++color)
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
{
size_t iEdge = edgeIdx[k];
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
matrix.addBlock(iPoint, iPoint, blk_i);
matrix.addBlock(iPoint, jPoint, blk_j);
matrix.subBlock(jPoint, jPoint, blk_j);
matrix.subBlock(jPoint, iPoint, blk_i);
}
}

This, and a few more memory reads, is why we can't have nice things, i.e. massive speedups with vectorization. Believe it or not, this loop sets ~75% of the maximum speed at which the residual edge loop can run (bandwidth bottleneck).
Assuming these modifications (mapping each edge directly to the locations of its matrix blocks so that all four updates happen in one call), our dummy loop becomes:
const vector<size_t>& edgeIdx,
const vector<pair<size_t,size_t> >& connectivity,
const double* blk_i, const double* blk_j,
SparseMatrix& matrix)
{
matrix.setDiagZero();
for(size_t color=0; color<colorStart.size()-1; ++color)
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
{
size_t iEdge = edgeIdx[k];
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
matrix.updateBlocks(iEdge, iPoint, jPoint, blk_i, blk_j);
}
}

where

STRONGINLINE void SparseMatrix::updateBlocks(size_t edge,
size_t row, size_t col, const double* blk_i, const double* blk_j)
{
size_t bii = diagMap[row], bij = edgeMap[edge].first,
bjj = diagMap[col], bji = edgeMap[edge].second;
#pragma omp simd
for(size_t k=0; k<blkSz; ++k)
{
coeffs[bii+k] += blk_i[k]; coeffs[bij+k] = +blk_j[k];
coeffs[bji+k] = -blk_i[k]; coeffs[bjj+k] -= blk_j[k];
}
}

This is 47% faster, which for a memory-bound task is massive! We could also parallelize the matrix updates without colouring, by setting only the off-diagonal coefficients and then setting the diagonal entries to the column sum. Therefore colouring is the way to go. Note: with vectorized numerics we insert blocks for 4 or 8 edges into the matrix at a time; the data for those inserts will be in a slightly weird format, which will make the writes into CSysMatrix a bit weird (more on this below).

MUSCL Reconstruction

The MUSCL reconstruction, characteristic of upwind schemes, is the simplest building block to show the (negative) implications of storing the data as structures of arrays (SoA) on the performance of some operations.

void computeResidual(size_t nVar,
size_t nDim,
const vector<size_t>& colorStart,
const vector<size_t>& edgeIdx,
const vector<pair<size_t,size_t> >& connectivity,
const Matrix& coords,
const Matrix& phi,
const VectorOfMatrix& grad,
const Matrix& limiter,
Matrix& residual)
{
residual.setZero();
for(size_t color=0; color<colorStart.size()-1; ++color)
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
{
size_t iEdge = edgeIdx[k];
size_t iPoint = connectivity[iEdge].first;
size_t jPoint = connectivity[iEdge].second;
double d_ij[MAXNDIM];
for(size_t iDim=0; iDim<nDim; ++iDim)
d_ij[iDim] = 0.5*(coords(jPoint,iDim)-coords(iPoint,iDim));
for(size_t iVar=0; iVar<nVar; ++iVar)
{
double phiL = phi(iPoint,iVar);
double phiR = phi(jPoint,iVar);
for(size_t iDim=0; iDim<nDim; ++iDim)
{
phiL += limiter(iPoint,iVar)*grad(iPoint,iVar,iDim)*d_ij[iDim];
phiR -= limiter(jPoint,iVar)*grad(jPoint,iVar,iDim)*d_ij[iDim];
}
double flux = 0.5*(phiL+phiR);
residual(iPoint,iVar) += flux;
residual(jPoint,iVar) -= flux;
}
}
}

After vectorizing this to handle multiple edges simultaneously with the SIMD-friendly type, the core of the loop becomes:

using FltVec = Array<double,SIMDLEN>;
...
FltVec d_ij[MAXNDIM];
for(size_t iDim=0; iDim<nDim; ++iDim)
d_ij[iDim] = (coords.getVec(jPoint,iDim)-coords.getVec(iPoint,iDim))*0.5;
for(size_t iVar=0; iVar<nVar; ++iVar)
{
FltVec phiL = 0.0;
FltVec phiR = 0.0;
for(size_t iDim=0; iDim<nDim; ++iDim)
{
phiL += grad.getVec(iPoint,iVar,iDim)*d_ij[iDim];
phiR -= grad.getVec(jPoint,iVar,iDim)*d_ij[iDim];
}
phiL = phi.getVec(iPoint,iVar) + limiter.getVec(iPoint,iVar)*phiL;
phiR = phi.getVec(jPoint,iVar) + limiter.getVec(jPoint,iVar)*phiR;
FltVec flux = (phiL+phiR)*0.5;
for(size_t k=0; k<SIMDLEN; ++k) {
residual(iPoint[k],iVar) += flux[k];
residual(jPoint[k],iVar) -= flux[k];
}
}

Note that at the end of the loop we need to de-swizzle the flux to update the multiple indices referenced by iPoint and jPoint, which are now short arrays of integers (this operation can be moved to the container, akin to the getVec/setVec accessors).

With SoA (aka column-major storage) this code is 1.5 times slower than the scalar version. The reason for that is poor locality (of the spatial variety): as we loop through the number of variables and dimensions we are accessing the data in strides of nPoint, since the contiguous index is the first one, so that we can perform vector reads/writes when computing gradients and limiters. If we go back to arrays of structures (AoS, aka row-major storage, basically what we have in #753) performance is only 9% worse (the code is identical). Those 9% are mostly due to increased integer arithmetic in the accesses to the data; on each call to getVec the gather offsets need to be recomputed. For example this loop:

for(size_t iDim=0; iDim<nDim; ++iDim)
phiL += grad.getVec(iPoint,iVar,iDim)*d_ij[iDim];

gets compiled into this monstrosity:

.L13:
vpmuludq ymm0, ymm4, ymm1
vmovq xmm15, rax
vmovapd ymm6, ymm11
mov rdx, rax
vpbroadcastq ymm15, xmm15
sal rdx, 5
add rax, 1
vpaddq ymm0, ymm0, ymm2
vpsllq ymm0, ymm0, 32
vpaddq ymm0, ymm5, ymm0
vmovdqa YMMWORD PTR [rbp-240], ymm0
vpaddq ymm0, ymm3, ymm0
vmovdqa YMMWORD PTR [rbp-208], ymm0
vpaddq ymm0, ymm15, ymm0
vmovdqa YMMWORD PTR [rbp-176], ymm0
vgatherqpd ymm15, QWORD PTR [rdi+ymm0*8], ymm6
vmovapd ymm0, YMMWORD PTR [rsi+rdx]
vfmadd213pd ymm0, ymm15, YMMWORD PTR [rbp-336]
vmovapd YMMWORD PTR [rbp-336], ymm0
cmp rbx, rax
jne .L13

Compare that with the scalar code, the meat of which is:

.L15:
vmovsd xmm5, QWORD PTR [rsp-40+rax*8]
vfmadd231sd xmm0, xmm5, QWORD PTR [r15+rax*8]
add rax, 1
cmp rcx, rax
jne .L15

I think the reason for this is that there are plenty of integer registers (64 bit) to keep memory locations in (rsp, rax, r15 in the above) but there are only 16 ymm registers (256 bit). In any case we need to give the compiler a hand: the calculation of the gather offsets is simple enough (they just advance by a fixed stride) that a very silly iterator can take care of it:

template<size_t VecLen, size_t Incr = VecLen>
class GatherIterator
{
private:
using IntVec = Array<size_t,VecLen>;
using FltVec = Array<double,VecLen>;
IntVec offsets_;
const double* data_;
public:
GatherIterator() = delete;
GatherIterator(const double* data, IntVec offsets) : offsets_(offsets), data_(data) {}
STRONGINLINE FltVec operator()() const { return FltVec(data_,offsets_); }
STRONGINLINE FltVec operator++(int) {
auto ret = (*this)(); offsets_ += Incr; return ret;
}
};

So silly in fact, it only moves forward. We use it in our loop like so:

...
auto gradI = grad.getColIterator(iPoint);
auto gradJ = grad.getColIterator(jPoint);
for(size_t iVar=0; iVar<nVar; ++iVar)
{
FltVec phiL = 0.0;
FltVec phiR = 0.0;
for(size_t iDim=0; iDim<nDim; ++iDim)
{
phiL += (gradI++)*d_ij[iDim];
phiR -= (gradJ++)*d_ij[iDim];
}
...

to get better assembly:

.L7:
vmovapd ymm3, ymm13
vmovapd ymm2, YMMWORD PTR [rbp-400]
add rax, 32
vgatherqpd ymm0, QWORD PTR [rcx+ymm1*8], ymm3
vpaddq ymm1, ymm1, ymm11
vmovapd YMMWORD PTR [rbp-272], ymm0
vmovapd YMMWORD PTR [rbp-240], ymm0
vfmadd132pd ymm0, ymm2, YMMWORD PTR [rax-32]
vmovdqa YMMWORD PTR [rbp-208], ymm1
vmovapd YMMWORD PTR [rbp-400], ymm0
cmp rax, rbx
jne .L7

which makes the vectorized code perform just as well as the scalar code. Iterators could also be used for the other variables, but that would start to hurt readability without improving the performance much. Note: there is also a chance the compiler (gcc) is not doing this kind of optimization because of the way I wrote the code...

So we need AoS to avoid losing performance in lightweight numerics classes. Before we look into the impact of not using SoA in the gradient and limiter routines, let me tell you there is a way to have the best of both worlds: enter the array of structures of arrays, or as I like to call it, zig zag storage, aka a right mess (the indexing of such a layout is sketched at the end of this post). For these reasons I think we should sacrifice ultimate performance and keep node data in AoS storage.

The major impact on gradients and limiters is the way the code is written: to vectorize the computation we need to compute the gradient into a local variable and then "transpose" it when storing it, i.e.

FltVec phiI[MAXNVAR], gradI[MAXNVAR][MAXNDIM];
...
for(size_t iVar=0; iVar<nVar; ++iVar)
{
auto flux = weight*(phiI[iVar]+phi.getVec(jPoint,iVar));
for(size_t iDim=0; iDim<nDim; ++iDim)
gradI[iVar][iDim] += a_ij[iDim]*flux;
}
}
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
for(size_t k=0; k<SIMDLEN; ++k)
grad(iPoint+k,iVar,iDim) = gradI[iVar][iDim][k];
...

Similarly, where the gradient is used (e.g. in the limiters) we need to first fetch/transpose it to be able to vectorize the subsequent computations:

FltVec gradI[MAXNVAR][MAXNDIM];
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t iDim=0; iDim<nDim; ++iDim)
for(size_t k=0; k<SIMDLEN; ++k)
gradI[iVar][iDim][k] = grad(iPoint+k,iVar,iDim);
...

Performance-wise this is actually better than the SoA version (4% on gradients, 35% on limiters) as it also benefits from better locality, and it is only slightly (3%) worse than zig zag storage, especially when fusing limiters and gradients, as the transposition of the gradient into storage is then greatly amortised. We lose the ability to vectorize primitive variable updates efficiently with AoS, but currently that only accounts for 3% of the runtime, and being a memory-bound operation it would not gain much from vectorization anyway.

On the subject of de-swizzling data, remember I said the writes into CSysMatrix would be a bit weird; that is because each Jacobian contribution will be a "matrix of short arrays" that needs to be transformed into a short array of matrices. The result of that is code like the above that explicitly manipulates the lanes of our SIMD type. Such code can be completely hidden inside CSysMatrix, which is good, because a 4x4 vectorized transpose and matrix update looks like this:

// block j, subs from jj and goes to ij
T0 = blk_j[ k ].unpackLo(blk_j[k+1]); T1 = blk_j[ k ].unpackHi(blk_j[k+1]);
T2 = blk_j[k+2].unpackLo(blk_j[k+3]); T3 = blk_j[k+2].unpackHi(blk_j[k+3]);
C0 = T0.widePermuteLo(T2); C1 = T1.widePermuteLo(T3);
C2 = T0.widePermuteHi(T2); C3 = T1.widePermuteHi(T3);
(Array4d(&bjj[0][k])-C0).store(&bjj[0][k]);
(Array4d(&bjj[1][k])-C1).store(&bjj[1][k]);
(Array4d(&bjj[2][k])-C2).store(&bjj[2][k]);
(Array4d(&bjj[3][k])-C3).store(&bjj[3][k]);
C0.store(&bij[0][k]); C1.store(&bij[1][k]);
C2.store(&bij[2][k]); C3.store(&bij[3][k]);

I am showing this because it represents a readability worst case in terms of manipulating SIMD types; we might end up with one or two of these to get the best performance possible, but they will always be encapsulated and deep in kernel-type areas of SU2 that are almost never touched.

Conclusions
Next I will try to estimate how much we can gain for a "realistic" numerics class.
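For reference, the indexing of the "zig zag" (AoSoA) layout mentioned above would look roughly like this (a sketch, assuming chunks of SIMDLEN points):

// points are grouped in chunks of simdLen; within a chunk each variable is stored
// contiguously for the simdLen points, so point-contiguous SIMD loads are possible
// while consecutive variables of one point are still close together in memory
inline size_t zigZagIndex(size_t iPoint, size_t iVar, size_t nVar, size_t simdLen)
{
  return (iPoint/simdLen)*nVar*simdLen + iVar*simdLen + (iPoint%simdLen);
}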
This has been a long long exposition (nerd joke) but bear with me, I am almost done, and I will summarise the results in the form of a proposal (I'll probably put that at the top of the first post).

"Real" numerics

Real in the sense that the flop to byte ratio (amount of computation per amount of data) is comparable to a real numerics scheme, say Roe for example.

void computeResidual(size_t nVar,
size_t nDim,
const vector<Connectivity<SIMDLEN> >& connectivities,
const Matrix& coords,
const Matrix& phi,
const VectorOfMatrix& grad,
const Matrix& limiter,
RowMajorMatrix& residual,
SparseMatrix& matrix)
{
using FltVec = Array<double,SIMDLEN>;
residual.setZero();
matrix.setDiagZero();
size_t color = 0;
for(const auto& connectivity : connectivities)
{
#pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
for(size_t iEdge=0; iEdge<connectivity.size(); iEdge+=SIMDLEN)
{
auto iPoint = connectivity.first_vec(iEdge);
auto jPoint = connectivity.second_vec(iEdge);
FltVec d_ij[MAXNDIM];
for(size_t iDim=0; iDim<nDim; ++iDim)
d_ij[iDim] = (coords.getVec(jPoint,iDim)-coords.getVec(iPoint,iDim))*0.5;
FltVec phiL[MAXNVAR], phiR[MAXNVAR], flux[MAXNVAR],
blk_i[MAXNVAR*MAXNVAR],
blk_j[MAXNVAR*MAXNVAR];
for(size_t iVar=0; iVar<nVar; ++iVar)
{
// Reconstruction goes here
flux[iVar] = (phiL[iVar]+phiR[iVar])*0.5;
}
// some silly way to make the Jacobians depend on the reconstruction
for(size_t iVar=0; iVar<nVar; ++iVar)
for(size_t jVar=0; jVar<nVar; ++jVar)
blk_j[iVar*nVar+jVar] = (phiL[iVar]*phiR[jVar]-phiL[jVar]*phiR[iVar])*0.5;
// the matrix-matrix multiplications
for(size_t i=0; i<WORKITERS; ++i) {
// blk_i = blk_j * blk_j
for(size_t k=0; k<nVar*nVar; ++k) blk_j[k] = blk_i[k];
}
// something akin to a dissipation term
for(size_t iVar=0; iVar<nVar; ++iVar) {
FltVec sum = flux[iVar];
for(size_t kVar=0; kVar<nVar; ++kVar)
sum += blk_j[iVar*nVar+kVar]*(phiL[kVar]-phiR[kVar])*0.5;
// residuals for iPoint and jPoint updated here
matrix.updateBlocks_v(color, iEdge, iPoint, jPoint, blk_i, blk_j);
}
++color;
}
}

The more WORKITERS we have, the better the vectorized code is going to look; I used a conservative number, chosen so that the flop-to-byte ratio is comparable to a real scheme. The vectorized code is 1.5 times faster.

SpMv - Sparse matrix-vector multiplication

With all these speedups the linear solvers will start taking well over 50% of the time, and so it is desirable to make some improvements there too. I think I will do the estimated global speedup together with the summary/proposal.
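For context, the central operation here is the product of a block-sparse matrix with a vector; in scalar form it is roughly the following (a sketch of a generic block-CSR kernel, not the actual CSysMatrix code):

// y = A*x for a block-sparse matrix in CSR-like format,
// with nVar x nVar dense blocks stored contiguously per non-zero
void blockSpMv(size_t nPointDomain, size_t nVar,
               const vector<size_t>& rowPtr, const vector<size_t>& colInd,
               const vector<double>& blocks,
               const vector<double>& x, vector<double>& y)
{
  const size_t blkSz = nVar*nVar;
  #pragma omp parallel for schedule(static)
  for(size_t iPoint=0; iPoint<nPointDomain; ++iPoint)
  {
    for(size_t iVar=0; iVar<nVar; ++iVar) y[iPoint*nVar+iVar] = 0.0;
    for(size_t k=rowPtr[iPoint]; k<rowPtr[iPoint+1]; ++k)
    {
      const double* blk = &blocks[k*blkSz];
      const double* xj  = &x[colInd[k]*nVar];
      for(size_t iVar=0; iVar<nVar; ++iVar)
        for(size_t jVar=0; jVar<nVar; ++jVar)
          y[iPoint*nVar+iVar] += blk[iVar*nVar+jVar]*xj[jVar];
    }
  }
}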
I'll have the summary at the bottom, you guys read papers, you know to look for the conclusions.

Global speedup (Conclusions)

I took the expected speedups for each prototyped area of the code and applied them to the profile I measured for the benchmark case from #716 / #753 (which is not very linear solver intensive). Despite most of my posts being focused on SIMD, my main motivation is the hybrid parallelization, which will allow important algorithmic improvements when running on hundreds of cores; namely, the multigrid and additive linear preconditioners will retain their effectiveness at much larger core counts.

Proposed changes (Summary)

Hybrid parallel
SIMD
General improvements
Work items

This is a lot of work and some changes will be significant, so I will divide the work in steps; off the top of my head this order seems ok:
If you foresee conflicts with your ongoing work let's start talking sooner rather than later; I tried to order the items to delay major disruptions as much as possible.
Easter Egg - Mixed Precision

The work proposed above should have the solver run at the speed dictated by the available memory bandwidth (for implicit applications). The obvious next step is then to reduce the amount of data that needs to be moved, for example by storing part of it in single precision. This should be relatively easy to do cleanly since the relevant classes are already templated and have "mixing-type logic" for when they are used with AD. Except for central schemes, currently we would not gain much by doing it, as residual loops are compute-bound and the linear solvers do not use that much time.
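To make the idea concrete, here is a sketch of the kind of mixing that templates allow (names are hypothetical): the heavy coefficients can be stored in a narrower type while the vectors and the accumulation stay in double.

// sketch: blk points to an nVar x nVar block stored as MixedType (e.g. float),
// the vectors stay in double and so does the accumulation, only the storage narrows
template<class MixedType>
void blockMatVec(size_t nVar, const MixedType* blk, const double* x, double* y)
{
  for(size_t i=0; i<nVar; ++i)
  {
    double sum = 0.0;
    for(size_t j=0; j<nVar; ++j)
      sum += double(blk[i*nVar+j]) * x[j];
    y[i] += sum;
  }
}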
Ok, SIMD update: with #753, #959, and #966 we now have a unified storage type for the data we need in CNumerics. This means that we (I) only need to implement "SIMD accessor methods" (i.e. methods that return a SIMD type instead of a su2double) for one class (C2DContainer and co.). I think that to do SIMD right we need a new way of going about CNumerics; these are my design requirements for "CNewNumerics":
To achieve all this, the "CNewNumerics" will work as a template (obvs) decorator/visitor. The template machinery to support this is actually not too crazy:

#include <array>
#include <cmath>
// An example type to use instead of the container that stores solution data for all vertices.
struct SolutionContainer
{
std::array<double,3> velocity;
std::array<double,3> areaVector;
};
using ResultType = double;
// We want classes with this interface.
class VirtualInterface
{
public:
virtual ResultType Compute(const SolutionContainer&) const = 0;
};
// The Compute method is to be composed via an inheritance chain, to do this
// we allow each building block to inherit from any class. These classes should
// be function objects that have no member variables, all data used in the
// resulting Compute method will be on the stack.
template<typename Base>
class ComputeArea : Base
{
protected:
// Different template instantiations will be made for
// 2D/3D to allow perfect loop unrolling.
enum : int {nDim = Base::nDim};
// To share variables between building blocks we will pass
// down a struct which is also composed by inheritance
struct WorkVarsType : Base::WorkVarsType
{
double area; // add "area" to the variables of Base
};
// The final implementation of Compute will be a call down the chain.
// The final constructed WorkVarsType is not known at this stage,
// hence we also template the method.
template<typename WV>
void Compute(WV& wv, const SolutionContainer& sol) const
{
// Boilerplate, call base first. This is akin to the decorator design pattern
// without polymorphism. The working variables resemble Python's "self" which
// makes this solution reasonably idiomatic.
Base::Compute(wv, sol);
// Then do our specific job.
wv.area = 0.0;
for(int i=0; i<nDim; ++i)
wv.area += pow(sol.areaVector[i],2);
wv.area = sqrt(wv.area);
}
};
// Same mechanics as above
template<typename Base>
class ComputeFlux : Base
{
protected:
enum : int {nDim = Base::nDim};
struct WorkVarsType : Base::WorkVarsType
{
double flux; // ...add new member
};
template<typename WV>
void Compute(WV& wv, const SolutionContainer& sol) const
{
// ...call base
Base::Compute(wv,sol);
// ...do additional work
wv.flux = 0.0;
for(int i=0; i<nDim; ++i)
wv.flux += sol.velocity[i]*sol.areaVector[i];
}
};
// This class is used to terminate the chain, it makes the link
// with the interface and it is used to specify any fixed sizes.
template<int NDIM>
class Terminator : private VirtualInterface
{
protected:
enum : int {nDim = NDIM};
struct WorkVarsType {};
template<typename... Ts>
void Compute(Ts&...) const {}
};
// Finally we use the building blocks to implement Compute.
// The blocks can be reordered depending on application to
// help the compiler fuse loops or minimize register spillage,
// the resulting WorkVarsType definition will be equivalent.
class ComposedClass: public
ComputeFlux< ComputeArea< Terminator<3> > >
{
public:
ResultType Compute(const SolutionContainer& sol) const;
};
ResultType ComposedClass::Compute(const SolutionContainer& sol) const
{
// Create the working variables on the stack.
ComputeFlux::WorkVarsType wv;
// Pass down the working variables and whatever other arguments.
// If the convention was followed, all building blocks will run.
// Recall that all Compute's were templates, they will be
// instantiated here and we can force them to be inlined.
ComputeFlux::Compute(wv, sol);
// Do some additional work if needed and return result.
return wv.flux / wv.area;
} |
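For completeness, exercising the composed class could look like this (hypothetical values):

#include <iostream>
int main()
{
  SolutionContainer sol;
  sol.velocity   = {1.0, 2.0, 3.0}; // v.n = 6
  sol.areaVector = {0.0, 0.0, 2.0}; // |n| = 2
  ComposedClass numerics;
  std::cout << numerics.Compute(sol) << "\n"; // prints 3 (= flux/area)
  return 0;
}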
WIP @ #1022
Done, but no one wants to review #1022.
Preamble
I am moving the discussion about SIMD that started in #716 here and adding hybrid parallelization.
The two topics go hand in hand since both (SPMD and SIMD) consist of processing multiple data (MD) elements simultaneously, either by a single program (SP) that is run by multiple threads (generally with a shared view of memory), or by a single instruction (SI) run by a single core.
The reason SIMD came up in #716 is that, as I will demonstrate, vectorization needs to be supported by data structures. On the other hand, SPMD needs to be supported by algorithms designed to avoid race conditions, i.e. two or more threads modifying the same memory location.
Instead of continuing #716 I think it is better to let that become documentation for #753.
I did not add "without loss of readability" to the title because it is long as is, but that requirement is present nonetheless.
I open these issues in the hope that people participate (I am not a fast writer so this is actually a lot of work), and so far great comments and insights have come from those with experience in these topics (kudos to @economon and @vdweide). But please participate even if you have never heard of these topics; your opinion about readability and "developability" of the code is important! I think the code-style should be accessible to people starting a PhD (after they read a bit about C++...).
My (ambitious) goal with this work is to lay down an architecture for performance, i.e. not just to improve the performance of a few key numerical schemes but to create mechanisms applicable to all existing and future ones. Moreover I want that to be possible with minimal changes to the way those bits of code are currently written.