
[FIX] Restructure Peaks2MapsKernel to operate like other kernels #410

Merged
@tsalo merged 13 commits into neurostuff:master from fix-peaks2maps on Dec 10, 2020

Conversation

@tsalo (Member) commented Nov 26, 2020

Closes #347.

Changes proposed in this pull request:

  • Restructure Peaks2MapsKernel to follow the KernelTransformer convention.
  • Rename compute_ma to compute_ale_ma to match convention.
  • Rename peaks2maps to compute_p2m_ma to match convention.
  • Do not test Peaks2MapsKernel in test_estimator_performance.py (see #417: Peaks2MapsKernel+ALE is too memory-intensive for tests).

@tsalo (Member Author) commented Nov 30, 2020

The jobs are being killed, so I need to take a look at them and see whether I can reduce the computational load.

@jdkent (Member) commented Dec 1, 2020

Looks like _compute_null_analytic for the ALE estimator is a source of the increased memory consumption.
Specifically here:

ale_idx = np.where(ale_hist > 0)[0]
exp_idx = np.where(exp_hist > 0)[0]
# Compute output MA values, ale_hist indices, and probabilities
# The outer product allocates a len(exp_idx) * len(ale_idx) array.
ale_scores = 1 - np.outer(1 - hist_bins[exp_idx], 1 - hist_bins[ale_idx]).ravel()

ALE has hundreds of non-zero histogram bins, whereas p2m has thousands, so the variables ale_scores, score_idx, and probabilities end up with tens of millions of entries; see the rough estimate below.
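As a rough back-of-the-envelope check (the bin counts here are assumptions chosen to be consistent with the ~38.5-million-entry array reported below, not values from the actual run), the allocation grows with the product of the two non-zero-bin counts:

import numpy as np

# Assumed non-zero bin counts for a p2m-sized histogram (hypothetical values).
n_exp_bins = 6000
n_ale_bins = 6500

n_entries = n_exp_bins * n_ale_bins
bytes_per_array = n_entries * np.dtype(np.float64).itemsize

print(f"{n_entries:,} entries")                     # 39,000,000 entries
print(f"{bytes_per_array / 1e9:.2f} GB per array")  # ~0.31 GB each for ale_scores, score_idx, and probabilities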

A potential solution could be to save those variables in single precision (i.e., float32) rather than double precision (float64) when there are more than, say, 2000 entries in exp_idx or ale_idx. However, there appears to be meaningful precision loss when computing score_idx from ale_scores as float32 versus float64 in my example:

ale_scores32 = 1 - np.outer(1 - hist_bins[exp_idx], 1 - hist_bins[ale_idx]).ravel().astype(np.float32)
ale_scores64 = 1 - np.outer(1 - hist_bins[exp_idx], 1 - hist_bins[ale_idx]).ravel()

ale_scores32.size  # 38529432

score_idx32 = np.floor(ale_scores32 * inv_step_size).astype(int)
score_idx64 = np.floor(ale_scores64 * inv_step_size).astype(int)

(score_idx64 == score_idx32).sum()  # 38506719
(score_idx64 != score_idx32).sum()  # 22713

so this may not be a good solution.

Another potential solution is to turn off/skip the tests for ALE (analytic null) + Peaks2MapsKernel, since they take too much RAM.

@tsalo (Member Author) commented Dec 2, 2020

I think the precision of ale_scores should be determined by the precision of our histogram bins. The smallest difference we need to be able to track in ale_scores corresponds to our histogram bin resolution (1e-5 in #411) squared, i.e., 1e-10. We should be able to determine the lowest-resolution float we can use from our histogram bins with numpy.min_scalar_type, and then convert our values ahead of time!
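A minimal sketch of the idea (the 1e-5 step size comes from #411; the dtype-selection usage is an assumption for illustration, not code from this PR):

import numpy as np

# Assumed histogram bin resolution (see #411).
step_size = 1e-5
smallest_diff = step_size ** 2  # smallest difference we need to track: 1e-10

# Ask numpy for the narrowest scalar type that can hold that value,
# then cast the large intermediate arrays to it up front, e.g.:
target_dtype = np.min_scalar_type(smallest_diff)
# ale_scores = (1 - np.outer(1 - hist_bins[exp_idx], 1 - hist_bins[ale_idx])).ravel().astype(target_dtype)
#
# NOTE: min_scalar_type only guards against overflow, not precision loss,
# which is exactly the problem described in the EDIT below.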

I can't fully explain the differences, but if anything I think that float32 is more accurate. If we have a step_size of 1e-5, then np.float64(step_size ** 2) is 1.0000000000000002e-10, while np.float32(step_size ** 2) is 1e-10. Basically, I'm thinking the differences could come down to a Python precision issue that shows itself more in float64 than float32.
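For example, reproducing those values in a quick session (assuming step_size = 1e-5):

import numpy as np

step_size = 1e-5
print(np.float64(step_size ** 2))  # 1.0000000000000002e-10
print(np.float32(step_size ** 2))  # 1e-10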

EDIT: Unfortunately, numpy's float classes have precision issues that throw the results of numpy.min_scalar_type into question.

import numpy as np

np.min_scalar_type(0.9999)
# dtype('float16')

1 - 0.9999 == 0
# False (correct)

np.float16(1 - 0.9999) == 0
# False (correct)

0.9999 == 1
# False (correct)

np.float16(0.9999) == 1
# True (wrong)

Since numpy.float16 can represent 1 - 0.9999 as nonzero but rounds 0.9999 up to 1, even though numpy.min_scalar_type says that 0.9999 should work with numpy.float16, we can't use it.

@tsalo (Member Author) commented Dec 8, 2020

I'm just going to excise all precision-related changes and drop Peaks2MapsKernel from the performance tests.

We should open a new issue about the problem, though.

@tsalo (Member Author) commented Dec 10, 2020

The docs build is failing, but I don't think that's tied to changes in this PR (see #418), so I am going to merge if all other tests pass.

@tsalo marked this pull request as ready for review on December 10, 2020 16:14
@@ -384,6 +384,7 @@ def download_peaks2maps_model(data_dir=None, overwrite=False, verbose=1):
url = "https://zenodo.org/record/1257721/files/ohbm2018_model.tar.xz?download=1"

temp_dataset_name = "peaks2maps_model_ohbm2018__temp"
data_dir = _get_dataset_dir("", data_dir=data_dir, verbose=verbose)
@tsalo (Member Author) commented on this diff:
This will probably help with #367 and #368.

@tsalo merged commit 10ff10b into neurostuff:master on Dec 10, 2020
@tsalo deleted the fix-peaks2maps branch on December 10, 2020 16:18
Successfully merging this pull request may close these issues: Peaks2MapsKernel does not work with any CBMA algorithms.