
Built in batching support #392

Merged
merged 67 commits into master from batching2
Feb 3, 2021

Conversation


@neworderofjamie neworderofjamie commented Jan 20, 2021

Playing with machine learning models has illustrated how important batching is to achieve decent performance on GPUs. However, the current way batching is implemented (e.g. in mlGeNN) using the system created in #323 has a lot of problems:

  • When you add lots of populations, merging keeps the sizes of the kernels under control, but the size of runner.cc can easily explode and take a long time to compile. If you're building your model manually you can take steps to counteract this (as I did aggressively in the multi-area model), but in the context of a higher-level library like mlGeNN this is trickier.
  • Copying to and from the GPU becomes more expensive, as copying a variable between device and host requires a separate cudaMemcpy for each batch.
  • While the cost of the binary search used within the kernel merging is minimal, adding more populations isn't free.
  • The kernel merging performs best when the structures fit into constant cache (which is optimised for many threads reading the same data), but multiplying the number of populations by the batch size reduces the size of model for which this is possible.
  • Ideally, trainable neuron parameters should also be shareable across batches, but this would require more ugly API in the style of ModelSpec::addSlaveSynapsePopulation.
  • For batch-learning rather than just inference, we need to be able to perform reductions across batches e.g. to sum gradients. However, using the master/slave system, this would be difficult as there's no link between the (non-shared) variables in the master and slave.

So, in this PR, inspired by TensorFlow, I've added support for batching to GeNN at a more fundamental level. Essentially this works by making all kernel launches 2D rather than 1D, with the second dimension indicating the batch. This means the binary thread search continues as before along the first dimension. The VarAccess enumeration (the third member of the struct used to define model variables) has been extended to include flags marking whether variables are shared or duplicated, and duplicated variables are allocated with an extra dimension (non-delayed variables become 2D and delayed variables 3D). This means that a single push or pull operates on a variable across all batches - much more efficient than multiple small cudaMemcpys. In C++ the resultant data structures are somewhat irritating but, in PyGeNN, batches are exposed as an additional numpy array dimension, which is efficient and TensorFlowish.
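To make the scheme concrete, here is a minimal CUDA sketch of how a 2D launch indexes duplicated versus shared variables. This is not the actual generated code; the kernel and variable names (`neuronUpdate`, `d_V`, `d_tau`, `numNeurons`) are purely illustrative.

```cuda
// Hypothetical sketch of the 2D launch scheme: gridDim.y spans the batches.
// d_V is a "duplicated" per-neuron state variable (one copy per batch),
// d_tau is a "shared" variable (a single copy used by every batch).
__global__ void neuronUpdate(float *d_V, const float *d_tau,
                             unsigned int numNeurons)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    const unsigned int batch = blockIdx.y;
    if(id < numNeurons) {
        // Duplicated variables gain an extra batch stride...
        float v = d_V[(batch * numNeurons) + id];

        // ...whereas shared variables are indexed exactly as before
        v *= d_tau[id];

        d_V[(batch * numNeurons) + id] = v;
    }
}

// Host side: a single launch covers all batches
// dim3 grid((numNeurons + 31) / 32, batchSize);
// neuronUpdate<<<grid, 32>>>(d_V, d_tau, numNeurons);
```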

Comparing build time of a VGG16 model in mlGeNN shows the effect on compile times:
[Chart: mlGeNN VGG16 build time comparison]

Comparing inference time on the "simple CNN" model (smaller models benefit more from batching and are more affected by the overheads caused by lots of memcpys):
[Chart: Simple CNN sim time for 100 samples]
Notable features are:

  • Building the model with batch size 100 on the GeForce 1050 Ti machine results in Visual C++ running out of memory due to the overly large runner.cc (unlike GCC, Visual C++ uses an internal fixed-size heap so doesn't just go off churning through all your RAM and swap).
  • Using "batching2" results in significant speedup when using both sparse or procedural connectivity.
  • On some configurations using the old batching system, batch size 100 is slower than batch size 50 - presumably because the overheads described above become larger than the diminishing gains of the larger batch size.

The "pull current spikes" style operations are the only operations which cannot be readily optimized in this way so they are implemented internally using multiple cudaMemcpy calls although I've used asynchronous and 2D versions where possible to hopefully improve performance. To maximize performance, the recording system described in #372 is the way forward.

In the process of doing all this, I ended up refactoring the index calculation code, as it was duplicated across the codebase and that would only be made worse by adding the extra complexity of indexing into the correct batch. This refactoring should be well-covered by the existing tests and I have added additional ones to cover batching.

The next step, to be done in a future PR, will be to use this as the basis for batch-learning by adding new VarAccess types e.g. READ_WRITE_SHARED_REDUCE_ADD which could be implemented on a single GPU as a variable which can only be written via atomic add and, across GPUs, using NCCL reduce operations (READ_WRITE_DUPLICATE_REDUCE_ADD would be an alternative that would require more memory and reduction kernels but, depending on the frequency of reductions vs updates, might be more efficient overall). This will be a key step in enabling GeNN to generate the sort of code I've been hacking together with custom kernels and macros for the eProp experiments.
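As a sketch of the single-GPU idea only (the VarAccess types above don't exist yet, and the kernel and variable names here are hypothetical), a shared reduce-add variable would only ever be written through atomicAdd, so threads from every batch accumulate into one copy:

```cuda
// Hypothetical sketch: d_gradient is shared across batches and only ever
// written via atomicAdd, so all batches accumulate into a single copy.
__global__ void weightGradientUpdate(float *d_gradient, const float *d_error,
                                      unsigned int numSynapses)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    const unsigned int batch = blockIdx.y;
    if(id < numSynapses) {
        // Per-batch error term (a duplicated, read-only variable)
        const float e = d_error[(batch * numSynapses) + id];

        // Accumulate into the single shared gradient copy
        atomicAdd(&d_gradient[id], e);
    }
}
```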

@neworderofjamie neworderofjamie added this to the GeNN 4.5.0 milestone Jan 22, 2021
* Moved index calculation logic down to BackendBase
* Removed unused methods
* Incorporated guts of several methods into index calculation logic, as it's now only in one place
* Updated single-threaded CPU backend to use more helpers
@neworderofjamie neworderofjamie marked this pull request as ready for review January 27, 2021 11:57

@tnowotny tnowotny left a comment


The overall design makes sense and the performance looks promising. The tests should have good coverage, so it should be ok (I was not able to check all details - the diffs became pretty massive with the indexing refactoring).

@neworderofjamie neworderofjamie merged commit 1a2b5dc into master Feb 3, 2021
@neworderofjamie neworderofjamie deleted the batching2 branch February 3, 2021 18:34
@neworderofjamie neworderofjamie mentioned this pull request Aug 5, 2021