CategoricalMatrix A.Tb reproducibility. #348
Conversation
```
@@ -10,24 +10,29 @@ void _transpose_matvec_${dropfirst}(
    F* res,
    Int res_size
) {
    #pragma omp parallel
    int num_threads = omp_get_max_threads();
    std::vector<F> all_res(num_threads * res_size, 0.0);
```
Am I understanding correctly that we are using ~N times the memory compared to before (N being the number of threads)?
For transpose matvec this can be a problem because the result is as large as the input. Therefore, if the user has a very large categorical matrix as the main part of X, it will require N times more memory than X.
Wasn't a new vector `restemp` allocated per thread before as well (see https://github.com/Quantco/tabmat/blob/main/src/tabmat/ext/cat_split_helpers-tmpl.cpp#L15)?
You are right. Then let's merge this. Thanks for the contribution!
Looks good.
The bit reproducibility of `CategoricalMatrix`'s `transpose_matvec` can't be guaranteed, since different threads can increment the same entry in the output vector; because floating-point addition is not associative, the output will not necessarily be identical from run to run. We can get this level of reproducibility if each slice of the input has one thread associated with it, whose partial results are then aggregated deterministically (parallelising over the number of categories when aggregating into the output). This is not entirely free (result below on my MacBook, which spawns 12 threads), so I'll leave it to regular users to determine whether this makes more sense behind a flag or is an acceptable trade-off:
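To make the two-phase scheme above concrete, here is a minimal, self-contained sketch (not tabmat's actual templated code, which uses OpenMP and `${dropfirst}` template substitution): each thread accumulates into its own private buffer, and the buffers are then reduced in a fixed thread order, so the floating-point additions happen in the same order on every run. The function name, `std::thread` usage, and index layout are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch of a deterministic transpose-matvec for a
// categorical matrix: indices[i] is the category (column) of row i,
// so A^T b accumulates b[i] into res[indices[i]].
std::vector<double> transpose_matvec_deterministic(
    const std::vector<std::size_t>& indices,
    const std::vector<double>& b,
    std::size_t n_categories,
    std::size_t num_threads
) {
    // Phase 1: each thread writes only to its own slice of all_res,
    // so no two threads ever touch the same entry concurrently.
    std::vector<double> all_res(num_threads * n_categories, 0.0);
    std::vector<std::thread> workers;
    const std::size_t n = indices.size();
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            double* local = all_res.data() + t * n_categories;
            const std::size_t begin = t * n / num_threads;
            const std::size_t end = (t + 1) * n / num_threads;
            for (std::size_t i = begin; i < end; ++i) {
                local[indices[i]] += b[i];
            }
        });
    }
    for (auto& w : workers) w.join();

    // Phase 2: reduce the per-thread buffers in a fixed order
    // (thread 0, 1, ..., N-1) for every output entry. The summation
    // order no longer depends on scheduling, so the result is
    // bit-reproducible across runs.
    std::vector<double> res(n_categories, 0.0);
    for (std::size_t j = 0; j < n_categories; ++j) {
        for (std::size_t t = 0; t < num_threads; ++t) {
            res[j] += all_res[t * n_categories + j];
        }
    }
    return res;
}
```

The phase-2 loop over `j` is the part that can itself be parallelised over categories without losing determinism, since each output entry is still summed in the same fixed thread order. The cost relative to a racy atomic-update version is the `num_threads * n_categories` scratch buffer plus the extra reduction pass, which matches the memory concern raised earlier in this thread.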
Checklist
- CHANGELOG.rst entry