
Built in batching support #392

Merged
merged 67 commits into master from batching2
Feb 3, 2021

Conversation


@neworderofjamie neworderofjamie commented Jan 20, 2021

Playing with machine learning models has illustrated how important batching is to achieve decent performance on GPUs. However, the current way batching is implemented (e.g. in mlGeNN) using the system created in #323 has a lot of problems:

  • When you add lots of populations, merging keeps the sizes of the kernels under control, but the size of runner.cc can easily explode and take a long time to compile. If you're building your model manually you can take steps to counteract this (as I did aggressively in the multi-area model), but in the context of a higher-level library like mlGeNN this is trickier.
  • Copying to and from the GPU becomes more expensive, as copying a variable between device and host requires a separate cudaMemcpy for each batch.
  • While the cost of the binary search used within the kernel merging is minimal, adding more populations isn't free.
  • The kernel merging performs best when the structures fit into constant cache (which is optimised for many threads reading the same data), but multiplying the number of populations by the batch size reduces the size of model for which this is possible.
  • Ideally, trainable neuron parameters should also be shareable across batches, but this would require more ugly API in the style of ModelSpec::addSlaveSynapsePopulation.
  • For batch-learning rather than just inference, we need to be able to perform reductions across batches e.g. to sum gradients. However, using the master/slave system, this would be difficult as there's no link between the (non-shared) variables in the master and slave.

So, in this PR, inspired by TensorFlow, I've added support for batching to GeNN at a more fundamental level. Essentially this works by making all kernel launches 2D rather than 1D, with the second dimension indicating the batch. This means the binary thread search continues as before along the first dimension. The VarAccess enumeration (the third member of the struct used to define model variables) has been extended to include flags marking whether variables are shared or duplicated, and duplicated variables are allocated with an extra dimension (non-delayed variables become 2D and delayed variables 3D). This means that a single push or pull operates on a variable across all batches - much more efficient than multiple small cudaMemcpys. In C++ the resultant data structures are somewhat irritating but, in PyGeNN, batches are exposed as an additional numpy array dimension, which is efficient and TensorFlowish.
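To make the scheme concrete, here is a minimal CUDA sketch of how a 2D launch indexes duplicated versus shared variables. This is not the actual generated code; the kernel and variable names (`neuronUpdate`, `d_V`, `d_tau`, `numNeurons`) are purely illustrative.

```cuda
// Hypothetical sketch of the 2D launch scheme: gridDim.y spans the batches.
// d_V is a "duplicated" per-neuron state variable (one copy per batch),
// d_tau is a "shared" variable (a single copy used by every batch).
__global__ void neuronUpdate(float *d_V, const float *d_tau,
                             unsigned int numNeurons)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    const unsigned int batch = blockIdx.y;
    if(id < numNeurons) {
        // Duplicated variables gain an extra batch stride...
        float v = d_V[(batch * numNeurons) + id];

        // ...whereas shared variables are indexed exactly as before
        v *= d_tau[id];

        d_V[(batch * numNeurons) + id] = v;
    }
}

// Host side: a single launch covers all batches
// dim3 grid((numNeurons + 31) / 32, batchSize);
// neuronUpdate<<<grid, 32>>>(d_V, d_tau, numNeurons);
```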

Comparing build time of a VGG16 model in mlGeNN shows the effect on compile times:
[Chart: mlGeNN VGG16 build time comparison]

Comparing inference time on the "simple CNN" model (smaller models benefit more from batching and are more affected by the overheads caused by lots of memcpys):
[Chart: Simple CNN sim time for 100 samples]
Notable features are:

  • Building the model with batch size 100 on the GeForce 1050 Ti machine results in Visual C++ running out of memory due to the overly large runner.cc (unlike GCC, Visual C++ uses an internal fixed-size heap so doesn't just go off churning through all your RAM and swap).
  • Using "batching2" results in significant speedup when using both sparse or procedural connectivity.
  • On some configurations using the old batching system, batch size 100 is slower than batch size 50 - presumably because the overheads described above become larger than the diminishing gains of the larger batch size.

The "pull current spikes" style operations are the only operations which cannot be readily optimized in this way so they are implemented internally using multiple cudaMemcpy calls although I've used asynchronous and 2D versions where possible to hopefully improve performance. To maximize performance, the recording system described in #372 is the way forward.

In the process of doing all this, I ended up refactoring the index calculation code, as it was duplicated across the codebase and that would only be made worse by adding the extra complexity of indexing into the correct batch. This refactoring should be well-covered by the existing tests and I have added additional ones to cover batching.

The next step, to be done in a future PR, will be to use this as the basis for batch-learning by adding new VarAccess types e.g. READ_WRITE_SHARED_REDUCE_ADD which could be implemented on a single GPU as a variable which can only be written via atomic add and, across GPUs, using NCCL reduce operations (READ_WRITE_DUPLICATE_REDUCE_ADD would be an alternative that would require more memory and reduction kernels but, depending on the frequency of reductions vs updates, might be more efficient overall). This will be a key step in enabling GeNN to generate the sort of code I've been hacking together with custom kernels and macros for the eProp experiments.
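As a sketch of the single-GPU idea only (the VarAccess types above don't exist yet, and the kernel and variable names here are hypothetical), a shared reduce-add variable would only ever be written through atomicAdd, so threads from every batch accumulate into one copy:

```cuda
// Hypothetical sketch: d_gradient is shared across batches and only ever
// written via atomicAdd, so all batches accumulate into a single copy.
__global__ void weightGradientUpdate(float *d_gradient, const float *d_error,
                                      unsigned int numSynapses)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    const unsigned int batch = blockIdx.y;
    if(id < numSynapses) {
        // Per-batch error term (a duplicated, read-only variable)
        const float e = d_error[(batch * numSynapses) + id];

        // Accumulate into the single shared gradient copy
        atomicAdd(&d_gradient[id], e);
    }
}
```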

@neworderofjamie neworderofjamie added this to the GeNN 4.5.0 milestone Jan 22, 2021
* Moved index calculation logic down to BackendBase
* Removed unused methods
* Incorporated guts of several methods into index calculation logic, as it's now only in one place
* Updated single-threaded CPU backend to use more helpers
@neworderofjamie neworderofjamie marked this pull request as ready for review January 27, 2021 11:57

@tnowotny tnowotny left a comment


The overall design makes sense and the performance looks promising. The tests should have good coverage, so it should be ok (I was not able to check all details - the diffs became pretty massive with the indexing refactoring).

@neworderofjamie neworderofjamie merged commit 1a2b5dc into master Feb 3, 2021
@neworderofjamie neworderofjamie deleted the batching2 branch February 3, 2021 18:34
@neworderofjamie neworderofjamie mentioned this pull request Aug 5, 2021