Parallelization: switch from multiprocessing to joblib #137
Conversation
Sorry for being late, but I made two comments in the code. At least the one on the hard-breaking change would be good to resolve, imho.
@@ -86,6 +87,10 @@ class GWR(GLM):
     name_x  : list of strings
               Names of independent variables for use in output

     n_jobs  : integer
               The number of jobs (default 1) to run in parallel. -1 means using all processors.
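For readers unfamiliar with the joblib convention this docstring follows, here is a minimal, self-contained sketch of what n_jobs controls. The `square` function is a toy stand-in, not mgwr code:

```python
from joblib import Parallel, delayed

def square(x):
    # Toy stand-in for a per-location fitting task.
    return x * x

# n_jobs=1 runs the tasks sequentially in one process;
# n_jobs=-1 dispatches them across all available processors.
sequential = Parallel(n_jobs=1)(delayed(square)(i) for i in range(8))
parallel = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(8))
assert sequential == parallel == [0, 1, 4, 9, 16, 25, 36, 49]
```

Either way the results come back in input order, so changing n_jobs only changes execution, not output.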
I know I am coming late to the party, but would you consider using -1 as the default? That is quite common across the ML world and it is what users generally expect.
I was curious what others' opinions were on this. I tend to default to -1 personally, but joblib itself [indirectly] defaults to 1, and in esda we do both (join counts are conservative, G defaults to -1, etc.).
I think scikit-learn usually defaults to 1, so I don't think there's a standard expectation in the ML world.
I don't mind either as long as it is documented (which it is). But for heavily parallelisable code like this one, I tend to prefer parallel execution by default.
Yeah, I was only looking at scikit-learn and adapted to what they have. I actually personally prefer -1 as the default.
@@ -285,7 +291,7 @@ def _local_fit(self, i):
         return influ, resid, predy, betas.reshape(-1), w, Si, tr_STS_i, CCT

     def fit(self, ini_params=None, tol=1.0e-5, max_iter=20, solve='iwls',
             lite=False, pool=None):
This is a hard-breaking change we should avoid. I suggest keeping the keyword and warning when it is not None.
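A minimal sketch of the suggested soft deprecation, written as a free function with assumed argument names rather than the actual GWR.fit method:

```python
import warnings

def fit(ini_params=None, tol=1.0e-5, max_iter=20, solve='iwls',
        lite=False, pool=None, n_jobs=1):
    # Keep the old 'pool' keyword so existing call sites do not break,
    # but warn that it is now ignored in favor of n_jobs.
    if pool is not None:
        warnings.warn("'pool' is deprecated and has no effect; "
                      "use 'n_jobs' instead.",
                      DeprecationWarning, stacklevel=2)
    return n_jobs  # placeholder for the real fitting logic
```

With this shape, code that still passes pool= gets a DeprecationWarning instead of a TypeError, and the keyword can be removed for real in a later release.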
Another note, looking at the notebook. The section
Nice catch on the typo. Will fix it. I think the curve looks like this because (1) the data is still small and (2) the linear part of the computation limits further scaling. Here the curves just show that the parallelization works; the specific scalability depends on many factors.
Only because of the number of available cores, no? If you ran the same code on a 32-core CPU, the minimum would be at 31-32, not 8 like here.
I forgot to mention I have 12 physical cores on the machine. Would you suggest adding a note to the notebook saying this is based on a 12-core machine?
Yes, that is what I meant. It is interesting that it does not scale linearly to 12 cores...
This PR replaces the previous multiprocessing.Pool-based parallelization with a joblib-based one. Users only need to specify the n_jobs parameter instead of passing a pool.
Example use:
GWR
MGWR
The test sets are updated to reflect the new interface. The notebook example is also updated and the effectiveness can be compared here: joblib-based (new) vs. mp-based (old)
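The before/after pattern of the switch can be sketched as follows. `local_fit` is an illustrative stand-in for the per-location fit, and a thread pool stands in for multiprocessing.Pool so the snippet runs anywhere without a main guard:

```python
from multiprocessing.pool import ThreadPool
from joblib import Parallel, delayed

def local_fit(i):
    # Stand-in for fitting the local model at location i.
    return i * 2

indices = range(4)

# Old interface: the caller constructs and passes a pool object.
with ThreadPool(2) as pool:
    old = pool.map(local_fit, indices)

# New interface: the caller only specifies n_jobs and the library
# builds the joblib.Parallel executor internally.
new = Parallel(n_jobs=2)(delayed(local_fit)(i) for i in indices)

assert old == new == [0, 2, 4, 6]
```

The user-facing win is that pool construction, teardown, and platform quirks move inside the library; the caller's API surface shrinks to one integer.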