Batch reductions #447
Conversation
# Conflicts:
#	include/genn/genn/currentSource.h
#	include/genn/genn/neuronGroup.h
#	src/genn/genn/synapseGroup.cc
…`` and ``CustomUpdateWUGroupMergedBase::getVarRefIndex``
* added test of error
…:isReduction`` so they are set irrespective of actual batch size of model
…lSpec::addCustomUpdate`` rather than only when finalizing model (always a good thing)
…into ``BackendBase``
…ppers to handle transposes involving custom WU update variables; add additional error to prevent reduction and transpose operations being attempted simultaneously
Codecov Report
@@            Coverage Diff             @@
##           master     #447      +/-   ##
==========================================
+ Coverage   88.00%   88.08%   +0.07%
==========================================
  Files           78       78
  Lines        16605    16824     +219
==========================================
+ Hits         14614    14820     +206
- Misses        1991     2004      +13
Continue to review full report at Codecov.
Other than the comment about not initialising reduction-type variables below, it all makes sense ...
// Loop through variable references
for(const auto &v : cm->getVarRefs()) {
    // If variable reference is a reduction target, define variable initialised to correct initial value for reduction
    // **NOTE** by not initialising this, compilers should emit a warning if user code doesn't set it to something
uhm ... I am not quite sure I understand what is going on here. Is this an old comment (you seem to be initialising below after all) or am I missing the entire plot?
Good spot - I've moved the comment to where it belongs
# Conflicts:
#	include/genn/genn/currentSource.h
#	include/genn/genn/neuronGroup.h
So the batching system (#392) lets you do parallel inference but, in order to do parallel training, you need to be able to sum up (reduce) what you're learning online across all elements in the batch and apply it to the (shared) weights. This PR implements this via some new ``VarAccess`` modes, ``REDUCE_SUM`` and ``REDUCE_MAX``, which signal that writes to these variables should be reductions. I went backwards and forwards a lot about the syntax for this one but ended up not really adding any new syntax, so a gradient reduce and zeroing custom update might look like this:
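As a rough sketch only (assuming GeNN's C++ custom update model interface; the class name, variable names and update code below are illustrative rather than copied from the PR), such a model could be declared along these lines:

```cpp
#include "modelSpec.h"

// Illustrative only: reduce a batch-duplicated gradient into a REDUCE_SUM
// variable and zero the per-batch copy ready for the next batch
class GradientBatchReduce : public CustomUpdateModels::Base
{
public:
    DECLARE_CUSTOM_UPDATE_MODEL(GradientBatchReduce, 0, 1, 1);

    // The write to reducedGradient is what the new access mode turns into
    // a sum across the batch
    SET_UPDATE_CODE(
        "$(reducedGradient) = $(gradient);\n"
        "$(gradient) = 0.0;\n");

    SET_VARS({{"reducedGradient", "scalar", VarAccess::REDUCE_SUM}});
    SET_VAR_REFS({{"gradient", "scalar", VarAccessMode::READ_WRITE}});
};
IMPLEMENT_MODEL(GradientBatchReduce);
```

Using ``VarAccess::REDUCE_MAX`` instead would request a max reduction rather than a sum.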
The nice thing with this lack of syntax is that you can do stuff like implement softmax with ``$(reducedGradient) = exp($(gradient));`` (although that doesn't make a lot of sense reducing across batches), and backends which don't support batching (like the single-threaded CPU backend) can basically just stick a write back to global memory after the generic code generation, which automatically turns into an (unnecessary) copy operation.

Because, typically, ``NUM_BATCHES << NUM_WEIGHTS``, this reduction is quite different from those typically talked about in the literature (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf), so I dug into the TF source to see how they implement reductions of this type and, for ``NUM_WEIGHTS > 4096``, they use this very simple algorithm:

* each thread handles one weight and loops over the batch dimension
* the reduction is accumulated in a register (with a little extra logic so the correct initial value for e.g. ``REDUCE_MAX`` can be established)
* the single reduced value is then written back to global memory

This makes sense as you get good coalescing of global memory reads and no need for atomics etc. and, as GeNN will fuse any compatible reductions together so they're run in parallel, I think any reasonable model will easily occupy the GPU (which I think the 4096 vaguely represents).
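For illustration only, here is a minimal CUDA sketch of that pattern for a sum reduction over a ``numBatches x numWeights`` gradient array (the kernel name and signature are made up; this is not GeNN's generated code):

```cuda
// One thread per weight, a sequential loop over the batch dimension,
// accumulation in a register and a single write at the end -
// no atomics or shared memory needed
__global__ void batchReduceSum(const float *gradients, float *reducedGradients,
                               unsigned int numWeights, unsigned int numBatches)
{
    const unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i < numWeights) {
        // For a REDUCE_MAX variant the accumulator would start at -FLT_MAX
        float sum = 0.0f;
        for(unsigned int b = 0; b < numBatches; b++) {
            // Consecutive threads read consecutive weights so reads coalesce
            sum += gradients[(b * numWeights) + i];
        }
        reducedGradients[i] = sum;
    }
}
```

Swapping the ``+=`` for ``fmax`` (and starting from ``-FLT_MAX``) would give the max-reduction variant.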
I've been using this to do parallel eProp where one of these reductions on the gradients is followed by an Adam optimizer custom update which applies the now-non-batched gradients to the shared weights (via #446). However, it's pretty flexible so you could actually use it with STDP or whatever - you'd apply your STDP rule to a deltaG variable which would be duplicated across the batches, reduce these and add them to the (shared) weights.
On the Titan V, increasing the batch size decreases the effective time to train a single stimulus by around 4.5x, as shown in the attached plot.