Built in batching support #392
Merged
Conversation
Commits:
* Merge (conflicts: src/genn/genn/code_generator/generateRunner.cc)
* Merge (conflicts: src/genn/genn/code_generator/backendSIMT.cc, src/genn/genn/code_generator/generateInit.cc)
* …a VarAccess is formed from an access mode e.g. READ_ONLY or READ_WRITE and a duplication mode determining whether variables should be duplicated across batches or shared
* …macros conflicting with new enums
* …s from type-safe enums
* …separate function
* Revert of commit 944326d (conflicts: src/genn/backends/cuda/backend.cc, src/genn/genn/code_generator/backendSIMT.cc, src/genn/genn/code_generator/generateNeuronUpdate.cc)
* …dels with batch size > 1 are buil
* …``genCurrentVariablePull``
* Moved index calculation logic down to BackendBase; removed unused methods; incorporated the guts of several methods into the index calculation logic, as it's now only in one place; updated the single-threaded CPU backend to use more helpers
tnowotny approved these changes on Feb 3, 2021:
The overall design makes sense and the performance looks promising. Tests should have some good coverage, so should be ok (not able to check all details - diffs became pretty massive with the indexing refactoring).
Playing with machine learning models has illustrated how important batching is to achieving decent performance on GPUs. However, the current way batching is implemented (e.g. in mlGeNN), using the system created in #323, has a lot of problems:
* pushing or pulling a variable requires a separate ``cudaMemcpy`` for each batch.
* the model has to be duplicated across batches using ``ModelSpec::addSlaveSynapsePopulation``.
So, in this PR, inspired by TensorFlow, I've added support for batching to GeNN at a more fundamental level. Essentially, this works by making all kernel launches 2D rather than 1D, with the second dimension indicating the batch; the binary thread search continues as before along the first dimension. The ``VarAccess`` enumeration (the third member of the struct used to define model variables) has been extended to include flags marking whether variables are shared or duplicated, and duplicated variables are allocated with an extra dimension (non-delayed variables become 2D and delayed variables become 3D). This means that a single push or pull operates on a variable across all batches - much more efficient than multiple small ``cudaMemcpy`` calls. In C++ the resultant data structures are somewhat irritating but, in PyGeNN, batches are exposed as an additional numpy array dimension, which is efficient and TensorFlow-like.
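As a rough illustration of the two ideas above, here is a minimal sketch; the enum names, bit values and kernel are hypothetical and simplified, not GeNN's actual headers or generated code:

```cpp
// Sketch: a VarAccess combines an access mode with a duplication mode,
// so the code generator knows whether a variable needs a per-batch copy.
enum class VarAccessMode : unsigned int
{
    READ_WRITE = (1 << 0),
    READ_ONLY  = (1 << 1),
};

enum class VarAccessDuplication : unsigned int
{
    DUPLICATE = (1 << 2),   // one copy of the variable per batch
    SHARED    = (1 << 3),   // a single copy shared across all batches
};

enum class VarAccess : unsigned int
{
    READ_WRITE = static_cast<unsigned int>(VarAccessMode::READ_WRITE) | static_cast<unsigned int>(VarAccessDuplication::DUPLICATE),
    READ_ONLY  = static_cast<unsigned int>(VarAccessMode::READ_ONLY)  | static_cast<unsigned int>(VarAccessDuplication::SHARED),
};

// Sketch: kernel launches become 2D, with the batch in the second grid
// dimension. A duplicated, non-delayed variable is batchSize * numNeurons long.
__global__ void neuronUpdateKernel(float *v, unsigned int numNeurons)
{
    const unsigned int neuron = (blockIdx.x * blockDim.x) + threadIdx.x;    // 1st dimension: neuron, as before
    const unsigned int batch = blockIdx.y;                                  // 2nd dimension: batch
    if(neuron < numNeurons) {
        v[(batch * numNeurons) + neuron] += 1.0f;   // index into this batch's slice
    }
}
// Hypothetical launch: dim3 grid((numNeurons + 31) / 32, batchSize);
//                      neuronUpdateKernel<<<grid, 32>>>(d_v, numNeurons);
```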
Comparing the build time of a VGG16 model in mlGeNN shows the effect on compile times:

Comparing inference time on the "simple CNN" model (smaller models benefit more from batching and are more affected by the overheads caused by lots of small memcpy calls):

Notable features are:
The "pull current spikes" style operations are the only operations which cannot be readily optimized in this way so they are implemented internally using multiple
cudaMemcpy
calls although I've used asynchronous and 2D versions where possible to hopefully improve performance. To maximize performance, the recording system described in #372 is the way forward.In the process of doing all this, I ended up refactoring the index calculation code as it was duplicated across the codebase rather and that would only be made worse by adding the extra complexity of indexing into the correct batch. This refactoring should be well-covered by the existing tests and I have added additional ones to cover batching.
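To make the per-batch copies concrete, here is a minimal sketch of the pattern described in the bullet above; the function and variable names are hypothetical, not GeNN's generated pull functions:

```cpp
#include <cuda_runtime.h>

// Sketch: pulling "current spikes" across batches. The spike counts for all
// batches can be fetched in a single copy, but because each batch has a
// different number of spikes, the spike indices need one (asynchronous)
// cudaMemcpy per batch. Each batch's spikes start at a stride of numNeurons.
void pullCurrentSpikes(unsigned int *hostSpkCnt, unsigned int *hostSpk,
                       const unsigned int *devSpkCnt, const unsigned int *devSpk,
                       unsigned int numNeurons, unsigned int batchSize)
{
    // One copy for every batch's spike count
    cudaMemcpy(hostSpkCnt, devSpkCnt, batchSize * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    // Per-batch asynchronous copies of just the spiked indices
    // (these only overlap if the host buffers are pinned)
    for(unsigned int b = 0; b < batchSize; b++) {
        cudaMemcpyAsync(&hostSpk[b * numNeurons], &devSpk[b * numNeurons],
                        hostSpkCnt[b] * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost);
    }
    cudaDeviceSynchronize();
}
```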
The next step, to be done in a future PR, will be to use this as the basis for batch learning by adding new ``VarAccess`` types, e.g. ``READ_WRITE_SHARED_REDUCE_ADD``, which could be implemented on a single GPU as a variable that can only be written via atomic add and, across GPUs, using NCCL reduce operations (``READ_WRITE_DUPLICATE_REDUCE_ADD`` would be an alternative that would require more memory and reduction kernels but, depending on the frequency of reductions vs updates, might be more efficient overall). This will be a key step in enabling GeNN to generate the sort of code I've been hacking together with custom kernels and macros for the eProp experiments.
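For illustration, a shared reduce-add variable of this kind might, on a single GPU, look something like the following sketch; the names are hypothetical and this is not code GeNN generates today:

```cpp
// Sketch: a READ_WRITE_SHARED_REDUCE_ADD-style variable - a single copy of
// the gradient is shared across batches and can only be written via
// atomicAdd, so per-batch updates accumulate without racing.
__global__ void gradientAccumulateKernel(float *sharedGradient, const float *batchDelta,
                                         unsigned int numSynapses)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    const unsigned int batch = blockIdx.y;  // 2nd grid dimension, as elsewhere
    if(id < numSynapses) {
        // batchDelta is duplicated (one slice per batch); sharedGradient is not
        atomicAdd(&sharedGradient[id], batchDelta[(batch * numSynapses) + id]);
    }
}
```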