
Kernel merging #286

Merged
merged 159 commits into master, Jan 7, 2020

Conversation

@neworderofjamie (Contributor) commented Dec 19, 2019

So here it is - as forewarned - it's a bit of a beast, but I think a necessary one! I'm going to attempt to explain the logic with pointers into the code and samples of generated code here:

  1. A ModelSpecMerged is created from the ModelSpec in https://github.com/genn-team/genn/blob/kernel_merging/src/genn/genn/code_generator/generateAll.cc#L54. This calls methods like NeuronGroup::canBeMerged which, in turn, call methods like NeuronModels::Base::canBeMerged to determine which groups can be merged. There's a lot of nuance here, so I've added quite a few unit tests at both levels (hence the increase in test coverage). The ModelSpecMerged contains vectors of NeuronGroupMerged and SynapseGroupMerged for each kernel; each of these is a simple class containing an 'archetype' NeuronGroup/SynapseGroup used as the basis for code generation and a vector of the NeuronGroups/SynapseGroups which can be simulated with the same code.
  2. In generateRunner, structs are declared in https://github.com/genn-team/genn/blob/kernel_merging/src/genn/genn/code_generator/generateRunner.cc#L871-L982 for each merged group; these are used to pass the things that differ between the groups within a merged group. For a merged group of LIF neurons, they might look something like:
    struct MergedNeuronUpdateGroup0
     {
        unsigned int numNeurons;
        unsigned int *spkCnt;
        unsigned int *spk;
        scalar *V;
        scalar *RefracTime;
        float *inSynInSyn0;
        float *inSynInSyn1;    
    };
    As well as declaring the structs, the MergedStructGenerator class also builds an array of these structs, pointing to existing allocated arrays etc, something like:
    MergedNeuronUpdateGroup0 mergedNeuronUpdateGroup0[] =  {
            {2000, d_glbSpkCntI, d_glbSpkI, d_VI, d_RefracTimeI, d_inSynII, d_inSynEI, },
            {8000, d_glbSpkCntE, d_glbSpkE, d_VE, d_RefracTimeE, d_inSynIE, d_inSynEE, },
     };
  3. The backend then provides device-side arrays (in CUDA normally __device__ __constant__) and push functions (in CUDA using cudaMemcpyToSymbol) to copy these to device.
  4. The CPU backend simply loops through the groups in each merged group and passes the structs to the generated code. In CUDA, however, it's harder, as each thread needs to know which group it should be processing. The approach I've gone for (after several failed ideas) is to have an additional sorted array of starting IDs for each merged group:
    __device__ __constant__ unsigned int d_mergedNeuronUpdateGroupStartID0[] = {0, 2048, };
    Each thread then searches this array using a simple binary search (O(log n) in the number of groups), generated at https://github.com/genn-team/genn/blob/kernel_merging/include/genn/backends/cuda/backend.h#L307-L325. This should be pretty efficient: groups are still block-aligned, so all threads in a block follow the same path through the search (no divergence), and the indices are in constant memory, which should be very fast for this access pattern.
  5. In the kernels themselves, members of this struct are substituted in rather than the device variables (there are no more dd_ device symbols) e.g. in the start of a neuron kernel:
    if(lid < group.numNeurons) {
        scalar lV = group.V[lid];
        scalar lRefracTime = group.RefracTime[lid];
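
To make the start-ID lookup in step 4 concrete, here is a hypothetical host-side C++ sketch of the binary search each CUDA thread would perform over the sorted start-ID array (the function name and signature are invented for illustration; the real code is generated into the kernel):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: given a global thread/neuron id, find the index of
// the merged group containing it by binary-searching the sorted array of
// group start IDs (e.g. d_mergedNeuronUpdateGroupStartID0 = {0, 2048}).
unsigned int findMergedGroup(unsigned int id,
                             const unsigned int *startIDs,
                             size_t numGroups)
{
    size_t lo = 0, hi = numGroups;      // search the half-open range [lo, hi)
    while((hi - lo) > 1) {
        const size_t mid = (lo + hi) / 2;
        if(id < startIDs[mid]) {
            hi = mid;                   // id lies in an earlier group
        }
        else {
            lo = mid;                   // id lies in this or a later group
        }
    }
    return static_cast<unsigned int>(lo);
}
```

With the `{0, 2048}` example above, id 100 falls in group 0 and id 3000 in group 1; because groups are block-aligned, every thread in a block takes the same branch at each step.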

There is additional complexity around extra global parameters as they need updating within the structure but, more or less, the same basic system is used for all kernel types (including initialization).
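
As a rough illustration of the CPU-backend side of this system, here is a hypothetical C++ sketch of the "loop through the groups in each merged group" pattern from step 4 (the struct mirrors the example in step 2 with `scalar` assumed to be `float`; names and the placeholder update are invented, not the actual generated code):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of a generated merged struct (cf. step 2)
struct MergedNeuronUpdateGroup0 {
    unsigned int numNeurons;
    unsigned int *spkCnt;
    unsigned int *spk;
    float *V;           // 'scalar' assumed to be float for this sketch
    float *RefracTime;
};

// Sketch of the CPU backend's pattern: run the same generated update code
// on each struct in the merged group's array, using its pointers/sizes.
void updateNeurons(const MergedNeuronUpdateGroup0 *groups, size_t numGroups)
{
    for(size_t g = 0; g < numGroups; g++) {
        const MergedNeuronUpdateGroup0 &group = groups[g];
        for(unsigned int i = 0; i < group.numNeurons; i++) {
            // the generated neuron model code reads/writes group.V[i],
            // group.RefracTime[i] etc. here; placeholder dynamics only:
            group.RefracTime[i] -= 0.1f;
        }
    }
}
```

The same generated body serves every group in the array because everything group-specific is reached through the struct's members.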

I don't think the result is perfect yet but, rather than making this even more complex, I think getting it merged and fixing small things in separate pull requests is the answer:

  • Automatic decision about using constant vs global memory space
  • If merging results in merged groups with only one group inside, use hard-coded parameters
  • There are some places where we're now doing integer divides/modulos by constants (e.g. the number of neurons) that are provided via merged structures - this is not ideal and should be improved with some sort of classic fast divide-by-constant optimization
  • I've tried to keep the runner generation tidy but the code for creating merged structs has added quite a lot of complexity here
  • Passing non-pointer extra global parameters is not very efficient - all non-pointer extra global parameters get copied into the merged structs every timestep - some sort of double-buffering to detect which ones have changed would improve this
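
For reference, the divide-by-constant point above is the classic trick of replacing integer division by a divisor that is fixed at group-creation time with a precomputed multiply-high and shift. A minimal sketch (not GeNN code; uses the GCC/Clang `__int128` extension and requires the divisor to be at least 2):

```cpp
#include <cassert>
#include <cstdint>

// Fast unsigned division by a runtime constant d (2 <= d < 2^32):
// precompute magic = floor(2^64 / d) + 1 once, then
// q = floor(n * magic / 2^64) equals n / d for every 32-bit n.
// Note d == 1 would overflow the magic computation and must be special-cased.
struct FastDivide {
    uint64_t magic;
    explicit FastDivide(uint32_t d) : magic(~uint64_t{0} / d + 1) {}
    uint32_t operator()(uint32_t n) const {
        // multiply-high via 128-bit product, then take the top 64 bits
        return static_cast<uint32_t>(((unsigned __int128)n * magic) >> 64);
    }
};
```

The magic number would be computed once per merged group (e.g. when the struct array is built) so each kernel invocation pays only a multiply and shift instead of a hardware divide.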

Fixes #260

neworderofjamie and others added 30 commits November 12, 2019 18:00
@codecov bot commented Dec 19, 2019

Codecov Report

Merging #286 into master will increase coverage by 3.46%.
The diff coverage is 93.96%.


@@            Coverage Diff             @@
##           master     #286      +/-   ##
==========================================
+ Coverage   84.27%   87.74%   +3.46%     
==========================================
  Files          47       60      +13     
  Lines        7150     8331    +1181     
==========================================
+ Hits         6026     7310    +1284     
+ Misses       1124     1021     -103
Impacted Files Coverage Δ
include/genn/genn/weightUpdateModels.h 43.75% <ø> (ø) ⬆️
include/genn/genn/synapseGroupInternal.h 0% <ø> (ø) ⬆️
include/genn/genn/postsynapticModels.h 50% <ø> (+33.33%) ⬆️
include/genn/genn/currentSourceInternal.h 0% <ø> (ø) ⬆️
src/genn/backends/cuda/backend.cc 85.05% <ø> (+2.32%) ⬆️
include/genn/genn/currentSource.h 75% <ø> (ø) ⬆️
...ude/genn/backends/cuda/presynapticUpdateStrategy.h 100% <ø> (ø) ⬆️
include/genn/genn/synapseGroup.h 86.36% <ø> (ø) ⬆️
include/genn/genn/initVarSnippet.h 100% <ø> (ø) ⬆️
include/genn/genn/neuronModels.h 100% <ø> (+20%) ⬆️
... and 63 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0cd4110...73819f0.

@tnowotny (Member) left a comment:

I think I understand the rough design and it seems sensible, albeit the entire effort makes me feel ever so slightly uneasy because of the growing complexity of things. That being said, the results support your thinking and maybe it is simply unavoidable.
So, I will approve the pull request and hopefully we won't regret it later ;-)

@neworderofjamie (Contributor, Author) commented:
Glad you approve! I totally share your slight unease about the added complexity, but it does seem necessary going forwards.

# Conflicts:
#	src/genn/backends/cuda/backend.cc
#	src/genn/backends/single_threaded_cpu/backend.cc
#	src/genn/genn/code_generator/generateRunner.cc
Successfully merging this pull request may close these issues.

Error with too many extra global parameters