
Kernel merging #286

Merged
merged 159 commits into master, Jan 7, 2020

Conversation

@neworderofjamie (Contributor) commented Dec 19, 2019

So here it is - as forewarned - it's a bit of a beast, but I think a necessary one! I'm going to attempt to explain the logic with pointers into the code and samples of generated code here:

  1. A ModelSpecMerged is created from the ModelSpec in https://github.com/genn-team/genn/blob/kernel_merging/src/genn/genn/code_generator/generateAll.cc#L54. This calls methods like NeuronGroup::canBeMerged which, in turn, call methods like NeuronModels::Base::canBeMerged to determine which groups can be merged. There's a lot of nuance here, so I've added quite a few unit tests at both levels (hence the increase in test coverage). The ModelSpecMerged contains vectors of NeuronGroupMerged and SynapseGroupMerged for each kernel; each of these is a simple class containing an 'archetype' NeuronGroup/SynapseGroup used as the basis for code generation and a vector of the NeuronGroups/SynapseGroups which can be simulated with the same code.
  2. In generateRunner, structs are declared in https://github.com/genn-team/genn/blob/kernel_merging/src/genn/genn/code_generator/generateRunner.cc#L871-L982 for each merged group; these are used to pass the things that differ between the groups within a merged group. For a merged group of LIF neurons, they might look something like:
    struct MergedNeuronUpdateGroup0
     {
        unsigned int numNeurons;
        unsigned int *spkCnt;
        unsigned int *spk;
        scalar *V;
        scalar *RefracTime;
        float *inSynInSyn0;
        float *inSynInSyn1;    
    };
    As well as declaring the structs, the MergedStructGenerator class also builds an array of these structs, pointing to existing allocated arrays etc, something like:
    MergedNeuronUpdateGroup0 mergedNeuronUpdateGroup0[] =  {
            {2000, d_glbSpkCntI, d_glbSpkI, d_VI, d_RefracTimeI, d_inSynII, d_inSynEI, },
            {8000, d_glbSpkCntE, d_glbSpkE, d_VE, d_RefracTimeE, d_inSynIE, d_inSynEE, },
     };
  3. The backend then provides device-side arrays (in CUDA normally __device__ __constant__) and push functions (in CUDA using cudaMemcpyToSymbol) to copy these to device.
  4. The CPU backend simply loops through the groups in each merged group and passes the structs to the generated code. In CUDA, however, it's harder, as each thread needs to know which group it should be processing. The approach I've gone for (after several failed ideas) is to have an additional sorted array of starting IDs for each merged group:
    __device__ __constant__ unsigned int d_mergedNeuronUpdateGroupStartID0[] = {0, 2048, };
    Each thread then searches this array using a simple binary search (O(log n) in the number of groups), generated at https://github.com/genn-team/genn/blob/kernel_merging/include/genn/backends/cuda/backend.h#L307-L325. This should be pretty efficient: groups are still block-aligned, so all threads in a block follow the same path through the search (no divergence), and the indices are in constant memory, which should be very fast for this access pattern.
  5. In the kernels themselves, members of this struct are substituted in rather than the device variables (there are no more dd_ device symbols) e.g. in the start of a neuron kernel:
    if(lid < group.numNeurons) {
        scalar lV = group.V[lid];
        scalar lRefracTime = group.RefracTime[lid];
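
To make the start-ID lookup in step 4 concrete, here is a hypothetical host-side C++ sketch of the binary search each CUDA thread would perform over the sorted start-ID array (the function name and signature are invented for illustration; the real code is generated into the kernel):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: given a global thread/neuron id, find the index of
// the merged group containing it by binary-searching the sorted array of
// group start IDs (e.g. d_mergedNeuronUpdateGroupStartID0 = {0, 2048}).
unsigned int findMergedGroup(unsigned int id,
                             const unsigned int *startIDs,
                             size_t numGroups)
{
    size_t lo = 0, hi = numGroups;      // search the half-open range [lo, hi)
    while((hi - lo) > 1) {
        const size_t mid = (lo + hi) / 2;
        if(id < startIDs[mid]) {
            hi = mid;                   // id lies in an earlier group
        }
        else {
            lo = mid;                   // id lies in this or a later group
        }
    }
    return static_cast<unsigned int>(lo);
}
```

With the `{0, 2048}` example above, id 100 falls in group 0 and id 3000 in group 1; because groups are block-aligned, every thread in a block takes the same branch at each step.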

There is additional complexity around extra global parameters as they need updating within the structure but, more or less, the same basic system is used for all kernel types (including initialization).
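
As a rough illustration of the CPU-backend side of this system, here is a hypothetical C++ sketch of the "loop through the groups in each merged group" pattern from step 4 (the struct mirrors the example in step 2 with `scalar` assumed to be `float`; names and the placeholder update are invented, not the actual generated code):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of a generated merged struct (cf. step 2)
struct MergedNeuronUpdateGroup0 {
    unsigned int numNeurons;
    unsigned int *spkCnt;
    unsigned int *spk;
    float *V;           // 'scalar' assumed to be float for this sketch
    float *RefracTime;
};

// Sketch of the CPU backend's pattern: run the same generated update code
// on each struct in the merged group's array, using its pointers/sizes.
void updateNeurons(const MergedNeuronUpdateGroup0 *groups, size_t numGroups)
{
    for(size_t g = 0; g < numGroups; g++) {
        const MergedNeuronUpdateGroup0 &group = groups[g];
        for(unsigned int i = 0; i < group.numNeurons; i++) {
            // the generated neuron model code reads/writes group.V[i],
            // group.RefracTime[i] etc. here; placeholder dynamics only:
            group.RefracTime[i] -= 0.1f;
        }
    }
}
```

The same generated body serves every group in the array because everything group-specific is reached through the struct's members.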

I don't think the result is perfect yet but, rather than making this even more complex, I think getting it merged and fixing small things in separate pull requests is the answer:

  • Automatic decision about using constant vs global memory space
  • If merging results in merged groups with only one group inside, use hard-coded parameters
  • There are some places where we're now doing integer divides/modulos by constants (e.g. the number of neurons) that are provided via merged structures - this is not ideal and should be improved with some sort of classic fast divide-by-constant optimization
  • I've tried to keep the runner generation tidy but the code for creating merged structs has added quite a lot of complexity here
  • Passing non-pointer extra global parameters is not very efficient - all non-pointer extra global parameters get copied into the merged structs every timestep - some sort of double-buffering to detect which ones have changed would improve this
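
For reference, the divide-by-constant point above is the classic trick of replacing integer division by a divisor that is fixed at group-creation time with a precomputed multiply-high and shift. A minimal sketch (not GeNN code; uses the GCC/Clang `__int128` extension and requires the divisor to be at least 2):

```cpp
#include <cassert>
#include <cstdint>

// Fast unsigned division by a runtime constant d (2 <= d < 2^32):
// precompute magic = floor(2^64 / d) + 1 once, then
// q = floor(n * magic / 2^64) equals n / d for every 32-bit n.
// Note d == 1 would overflow the magic computation and must be special-cased.
struct FastDivide {
    uint64_t magic;
    explicit FastDivide(uint32_t d) : magic(~uint64_t{0} / d + 1) {}
    uint32_t operator()(uint32_t n) const {
        // multiply-high via 128-bit product, then take the top 64 bits
        return static_cast<uint32_t>(((unsigned __int128)n * magic) >> 64);
    }
};
```

The magic number would be computed once per merged group (e.g. when the struct array is built) so each kernel invocation pays only a multiply and shift instead of a hardware divide.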

Fixes #260

neworderofjamie and others added 30 commits November 12, 2019 18:00
@codecov bot commented Dec 19, 2019

Codecov Report

Merging #286 into master will increase coverage by 3.46%.
The diff coverage is 93.96%.


@@            Coverage Diff             @@
##           master     #286      +/-   ##
==========================================
+ Coverage   84.27%   87.74%   +3.46%     
==========================================
  Files          47       60      +13     
  Lines        7150     8331    +1181     
==========================================
+ Hits         6026     7310    +1284     
+ Misses       1124     1021     -103
Impacted Files Coverage Δ
include/genn/genn/weightUpdateModels.h 43.75% <ø> (ø) ⬆️
include/genn/genn/synapseGroupInternal.h 0% <ø> (ø) ⬆️
include/genn/genn/postsynapticModels.h 50% <ø> (+33.33%) ⬆️
include/genn/genn/currentSourceInternal.h 0% <ø> (ø) ⬆️
src/genn/backends/cuda/backend.cc 85.05% <ø> (+2.32%) ⬆️
include/genn/genn/currentSource.h 75% <ø> (ø) ⬆️
...ude/genn/backends/cuda/presynapticUpdateStrategy.h 100% <ø> (ø) ⬆️
include/genn/genn/synapseGroup.h 86.36% <ø> (ø) ⬆️
include/genn/genn/initVarSnippet.h 100% <ø> (ø) ⬆️
include/genn/genn/neuronModels.h 100% <ø> (+20%) ⬆️
... and 63 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0cd4110...73819f0.

@tnowotny (Member) left a comment:

I think I understand the rough design and it seems sensible, albeit the entire effort makes me feel ever so slightly uneasy because of the growing complexity of things. That being said, the results support your thinking and maybe it is simply unavoidable.
So, I will approve the pull request and hopefully we won't regret it later ;-)

@neworderofjamie (Contributor, Author) commented:
Glad you approve! I totally share your slight unease about the added complexity, but it does seem necessary going forwards.

# Conflicts:
#	src/genn/backends/cuda/backend.cc
#	src/genn/backends/single_threaded_cpu/backend.cc
#	src/genn/genn/code_generator/generateRunner.cc
Successfully merging this pull request may close these issues.

Error with too many extra global parameters