
[WIP] possible 10x speed improvement #231

Closed
wants to merge 13 commits

Conversation

lukasheinrich
Contributor

@lukasheinrich commented Sep 5, 2018

Description

The hardest bottleneck is that we compute the expected Poisson rate for each sample in each channel separately (these can have many bins and that computation is vectorized, but especially in SUSY analyses it's often only 1 bin anyway, so that doesn't help much).

The solution is to vectorize the computation across channels and samples. But the problem is that the number of samples in each channel is not the same (somewhat similar to @jpivarski's awkward-arrays).

But we can still vectorize this computation by adding some padding and constructing a cube of shape (nchannels, nsamples, nbins)

where nsamples and nbins are the maximum number of samples and bins observed in the spec.

Then the approach is to (a minimal sketch follows the list):

  1. create the cube and, for each modifier, an index that keeps track of which cells in the cube it affects (via multi-indices)
  2. loop over all modifiers and apply each one to just the cells indicated by its multi-index
    each modifier creates a "factor field" of the same shape as the cube (for histosys we need a special case)
  3. sum over all samples (now the cube has shape (nchannels, nbins))
  4. linearize the data via .ravel(), so the expected_actualdata has shape (nchannels*nbins,)
  5. remove the padding fields (via a multi-index as well)
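
A minimal NumPy-only sketch of the idea (sizes, indices, and variable names are illustrative, not the actual pyhf implementation):

import numpy as np

# illustrative sizes: 2 channels, at most 3 samples, at most 2 bins
nchannels, nsamples, nbins = 2, 3, 2

# padded cube of expected rates; padded cells stay at zero so they
# drop out of the per-channel sums
cube = np.zeros((nchannels, nsamples, nbins))
cube[0, :3, :1] = [[5.0], [2.0], [1.0]]       # channel 0: 3 samples, 1 bin
cube[1, :2, :2] = [[4.0, 3.0], [0.5, 0.5]]    # channel 1: 2 samples, 2 bins

# a modifier only touches the cells listed in its multi-index
mod_index = (np.array([0, 0]), np.array([0, 1]), np.array([0, 0]))  # (channel, sample, bin)
factor_field = np.ones_like(cube)
factor_field[mod_index] = 1.1                 # e.g. a 10% normalisation factor

cube = cube * factor_field                    # apply the modifier in one vectorized op
rates = cube.sum(axis=1)                      # sum samples -> shape (nchannels, nbins)
flat = rates.ravel()                          # shape (nchannels * nbins,)

real_bins = np.array([0, 2, 3])               # drop the padded bins via another index
expected_actualdata = flat[real_bins]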

Some simple benchmarking shows some promise:

[screenshot: benchmark timings]

@kratsg @matthewfeickert this is not yet passing all tests, and step 4 is missing. Right now I have only benchmarked this on a test case where no padding is needed.

  • Tests are passing
  • "WIP" removed from the title of the pull request

@lukasheinrich
Contributor Author

lukasheinrich commented Sep 5, 2018

@kratsg can you confirm that this is the scale of the MBJ workspace (30 channels, 7 samples per channel, 5 modifiers per sample, 1 bin per sample)?

[screenshot]

@matthewfeickert
Member

@lukasheinrich This is fantastic news! Looking forward to checking this out more tonight.

@kratsg
Contributor

kratsg commented Sep 5, 2018

I have some code changes locally to do something like this, but I never got it to work/pass tests -- the math confused me a little bit for combining everything. I'll need some time to look through this and see if I understand it.

@jpivarski
Member

Just to consider it as an option, you could use awkward-array itself.

awkward-array will always be minimal-dependency, as it is intended as a layer under packages like this, which can use it for basic problems like "sum within groups." While the base library only uses Numpy, optimizations for specialized hardware will be implemented as external add-ons. If pyhf depends on awkward-array, then it could benefit from the add-ons.

I just finished a study of the problem you describe: adding items in variable-sized groups (as a function of Poisson average group size). The vectorized algorithm Jaydeep developed this summer is a clear improvement over the sequential for loop you get from Numpy:
[plot: sum_rates_logy]
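
For illustration, a basic NumPy way to sum within variable-sized groups without an explicit Python loop (just the generic idea, not the optimized awkward-array kernel from the plot):

import numpy as np

# values belonging to variable-sized groups, laid out contiguously,
# plus the offset at which each group starts
values = np.array([5.0, 2.0, 1.0, 4.0, 0.5, 3.0])
starts = np.array([0, 3, 5])          # group sizes 3, 2, 1

group_sums = np.add.reduceat(values, starts)
print(group_sums)                     # [8.  4.5 3. ]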

@lukasheinrich
Contributor Author

lukasheinrich commented Sep 5, 2018

@jpivarski yes, this was my first instinct. However, in pyhf we want to support multiple tensor backends such as TensorFlow and PyTorch that all more or less implement the NumPy tensor ops. This way we can make easy use of hardware acceleration.

Would this also be a goal of awkward-array, or would you want to keep it NumPy-only?

@lukasheinrich
Contributor Author

This is the old way of computing this, where we (see the sketch after the list):

  1. first compute each sample in each channel (leftmost column)
  2. sum up the samples for each channel (2nd column)
  3. concatenate the channels (3rd column)
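
Roughly, in NumPy-flavoured pseudocode (illustrative shapes and names only):

import numpy as np

# ragged structure: per-channel lists of per-sample rate arrays
channels = {
    'SR': [np.array([5.0]), np.array([2.0]), np.array([1.0])],
    'CR': [np.array([4.0, 3.0]), np.array([0.5, 0.5])],
}

# 2. sum the samples within each channel (one call per channel)
per_channel = [np.sum(samples, axis=0) for samples in channels.values()]
# 3. concatenate the channels into the expected data
expected_actualdata = np.concatenate(per_channel)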

[screenshot]

@jpivarski
Member

@lukasheinrich The base awkward-array package will support any library that has a Numpy interface. All access to Numpy is passed through awkward.util.numpy, which will allow one library named "numpy" to be swapped out for another with the same functions. The main one I have in mind is CuPy. I know that PyTorch wraps arrays that can also be wrapped by numpy.frombuffer, but TensorFlow keeps its arrays hidden somehow.

As long as the library either implements the Numpy API or can be made to do so (e.g. with shims translating every TensorFlow function into its Numpy equivalent), then awkward-array supports that. This is an assumption that's broken by the optimizers (like in that last plot); it's why we always have to have a basic version that only asks for the arrays to quack like Numpy.
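
Schematically, that indirection could look like the following (a sketch of the idea, not awkward-array's actual code; the names np_backend, use, and sum_rates are made up here):

import numpy

np_backend = numpy            # the default "numpy"

def use(module):
    # swap in any module that quacks like NumPy (e.g. cupy)
    global np_backend
    np_backend = module

# library code never imports numpy directly; it goes through the shim
def sum_rates(x):
    return np_backend.sum(x, axis=-1)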

@jpivarski
Member

@lukasheinrich TensorFlow eager tensors have a .numpy() function to convert to Numpy. I don't know if that's a view or a copy.

@lukasheinrich
Contributor Author

@jpivarski yeah we are more or less trying to have these shims here

https://github.com/diana-hep/pyhf/tree/master/pyhf/tensor

and there is https://github.com/tensorly/tensorly, which seems to try to do something similar
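
The pattern is roughly one small class per backend exposing the same handful of tensor ops (an illustrative sketch, not the actual pyhf backend API):

import numpy as np

class numpy_shim:
    # minimal illustration of a backend shim; a torch/tensorflow version
    # would implement the same method names on top of its own ops
    def astensor(self, data):
        return np.asarray(data, dtype=np.float64)

    def sum(self, tensor, axis=None):
        return np.sum(tensor, axis=axis)

    def reshape(self, tensor, shape):
        return np.reshape(tensor, shape)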

@kratsg
Contributor

kratsg commented Sep 5, 2018

@lukasheinrich TensorFlow eager tensors have a .numpy() function to convert to Numpy. I don't know if that's a view or a copy.

Isn't it a numpy array under the hood - by view?

print(type(tf.Session().run(tf.constant([1, 2, 3]))))  # <class 'numpy.ndarray'>

@jpivarski
Member

If so, that would be good. I think that ML frameworks like PyTorch and TensorFlow implement their own array classes so that they can freely move them from CPU to GPU and/or ignore the distinction between eager and lazy evaluation. Surely they all have methods to move them to the CPU and eagerly evaluate, the corner of these four options where Numpy lives.
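
For reference, the usual "move to the CPU and evaluate eagerly" calls (assuming a recent PyTorch, and TensorFlow with eager execution enabled; whether the TF result is a view or a copy is the open question above):

import torch
t = torch.tensor([1.0, 2.0, 3.0])
a = t.cpu().numpy()            # PyTorch: NumPy view of the CPU tensor's memory

import tensorflow as tf        # with eager execution enabled
e = tf.constant([1.0, 2.0, 3.0])
b = e.numpy()                  # TensorFlow eager tensor -> NumPy array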

@kratsg
Contributor

kratsg commented Sep 5, 2018

Hrmm, this is generating a cube and vectorizing that portion of it -- but we'd still want a step before this that deals with meta-modifiers first, because that reduces the dimensionality of the cube we need at the end. No?

In most cases, the largest dimensionality is the number of modifiers (unless we're CMS).

@lukasheinrich
Contributor Author

@kratsg the meta-modifiers touch a different portion of the code (the computation of the constraint term in the pdf). Actually I started with this as well, and it's still in this PR but commented out:

https://github.com/diana-hep/pyhf/pull/231/files#diff-0e8e9106451dbaea56a5ff43a27335edR287

but I didn't really see any improvement

@kratsg
Contributor

kratsg commented Sep 5, 2018

but I didn't really see any improvement

Damn. Separate idea: I should update the _ModelConfig to provide channels, samples, and modifiers options which are just lists of strings. Then you can generate an appropriate index by calling something like _ModelConfig.samples.index('ttbar') to get one dimension of that index efficiently(?).

See diana-hep/pyhf@5d7e4a8 as an example.

>>> spec = {  ... }
>>> model = pyhf.Model(spec)
>>> model.channels
['firstchannel']
>>> model.modifiers
['mu', 'stat_firstchannel']
>>> model.samples
['mu', 'bkg1', 'bkg2', 'bkg3']
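
A hypothetical follow-on, continuing the session above, to turn names into cube coordinates (indices taken from the example lists above):

>>> channel_idx = model.channels.index('firstchannel')
>>> sample_idx = model.samples.index('bkg2')
>>> (channel_idx, sample_idx)
(0, 2)

Together with a bin index, this would address one slice of the (nchannels, nsamples, nbins) cube, e.g. cube[channel_idx, sample_idx, :].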

@kratsg mentioned this pull request Sep 5, 2018
@lukasheinrich
Contributor Author

I'll rebase after #236.

@lukasheinrich force-pushed the performance/factorfields branch from 9f0e61c to edf0cc9 on September 6, 2018 00:30
@lukasheinrich
Contributor Author

OK, got this to pass the tests for the numpy backend. For the ML backends the issue is that tensor assignment doesn't work, but at least for TF there is tf.assign -- need to check.

Also @kratsg, I actually added the op_codes to the modifiers; I missed that in the review of #236.

@lukasheinrich
Contributor Author

So, good and bad news. These are the profiles of the MBJ execution:

prof2_fields.txt
prof2_master.txt

Note that both show roughly the same total on the top-level line (which is the bad news).

Most of the time is spent here, in the interpolation (as we knew):

     1574    0.015    0.000    0.101    0.000 interpolate.py:19(_hfinterp_code1)
     1574    0.014    0.000    0.098    0.000 interpolate.py:19(_hfinterp_code1)

But the new cube version should allow us to vectorize that computation more easily.

Here are the 1574 interpolations:

>>> sum([len(v['indices']) for k, v in p.modindex.items() if 'normsys' in k])
1574

But the number of cubes that are actually computed is only 89, so if we can vectorize the interpolation such that it interpolates multiple slices of the cube at once, we can improve.

@lukasheinrich
Contributor Author

I.e., instead of this Python loop

https://github.com/diana-hep/pyhf/pull/231/files#diff-0e8e9106451dbaea56a5ff43a27335edR148

we would want something tensor-native
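
A hedged sketch of what a batched, code1-style (exponential) interpolation could look like, evaluating all affected slices in one call instead of once per modifier/sample (NumPy only; illustrative, not the actual pyhf interpolator signature):

import numpy as np

def batched_code1(down, nom, up, alphas):
    # down/nom/up: shape (nslices, nbins); alphas: shape (nslices, 1)
    ratio = np.where(alphas >= 0, up / nom, down / nom)
    return np.power(ratio, np.abs(alphas))

down = np.array([[0.9], [0.8]])
nom = np.array([[1.0], [1.0]])
up = np.array([[1.1], [1.25]])
alphas = np.array([[0.5], [-1.0]])
print(batched_code1(down, nom, up, alphas))   # one vectorized call for all slices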

@kratsg force-pushed the performance/factorfields branch from 311d664 to 1aec654 on September 7, 2018 20:56
@kratsg
Contributor

kratsg commented Sep 7, 2018

Looking at this comment, https://github.com/diana-hep/pyhf/pull/231#issuecomment-418817390 -- I tried to reimplement the same thing (using the makespec definition in #219) and I do not see an improvement in the loops.

[screenshot 2018-09-07 14.14.52]

@lukasheinrich
Contributor Author

@kratsg did you push your code to a separate branch? Is there a diff between the first commit in this branch and yours? Maybe we can figure out what the bottleneck is.

@kratsg
Contributor

kratsg commented Sep 7, 2018

@kratsg did you push your code to a separate branch? Is there a diff between the first commit in this branch and yours? Maybe we can figure out what the bottleneck is.

I can push the notebook and the utility function into this branch if you're OK with it. I already rebased your branch.

@lukasheinrich
Contributor Author

@kratsg do you still see the ~1k s on the MBJ example after the rebase? I'm fine with working together on this branch, but we should make sure we don't regress.

@kratsg
Contributor

kratsg commented Sep 8, 2018

@kratsg do you still see the ~1k s on the MBJ example after the rebase? I'm fine with working together on this branch, but we should make sure we don't regress.

Ok, I'll re-run it on the full MBJ with the default run.

@kratsg
Contributor

kratsg commented Sep 9, 2018

New interpolation codes will be added in #251.

@kratsg
Contributor

kratsg commented Sep 28, 2018

Closing in favor of #285.

@kratsg closed this Sep 28, 2018
@kratsg deleted the performance/factorfields branch October 7, 2018 21:25
Labels: feat/enhancement (New feature or request)