API: Incremental search improvements #370
Conversation
My main motivation for not using the CV suffix was that it doesn't support the …
Hm... My motivation for changing to … I've made some other changes too, summarized in #370 (comment).
I think that this is probably the sort of thing we should discuss before anyone invests significant work in the renaming. My guess is that we might cycle through a dozen names before finding the right one. I agree that incremental isn't great, but I also think that decay probably isn't ideal either.
On the topic of the name, Adaptive is certainly descriptive, but is it too general? About CV, I don't have a strong opinion either way. The thing that pushed me to remove it was:

```
In [26]: KFold(1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-f1793b6f886b> in <module>()
----> 1 KFold(1)

~/Envs/pandas-dev/lib/python3.6/site-packages/scikit-learn/sklearn/model_selection/_split.py in __init__(self, n_splits, shuffle, random_state)
    417     def __init__(self, n_splits=3, shuffle=False,
    418                  random_state=None):
--> 419         super(KFold, self).__init__(n_splits, shuffle, random_state)
    420
    421     def _iter_test_indices(self, X, y=None, groups=None):

~/Envs/pandas-dev/lib/python3.6/site-packages/scikit-learn/sklearn/model_selection/_split.py in __init__(self, n_splits, shuffle, random_state)
    282                 "k-fold cross-validation requires at least one"
    283                 " train/test split by setting n_splits=2 or more,"
--> 284                 " got n_splits={0}.".format(n_splits))
    285
    286         if not isinstance(shuffle, bool):

ValueError: k-fold cross-validation requires at least one train/test split by setting n_splits=2 or more, got n_splits=1.
```
(force-pushed from 86c2182 to 8531fc2)
I've updated this (and the notes at #370 (comment)):
I've added … This required some changes to always get the most recent score (the one with the highest partial fit call) from … This PR certainly needs feedback and another review before merge. I've added the WIP label pending another review by me.
Sorry, missed this notification earlier.
I think that was already being done.
idle question: do we know that partial_fit_score is monotonically increasing for a given model through history? I don't think we can assume that later = higher (not saying that you do. Haven't looked at the code yet).
My reasons for not implementing cv_results originally were …

So going back to me not wanting …
```python
best_index += len(v)
if k == best_model_id:
    break
for h in hist2:
```
Should this just be done by `_fit`?
It should be but isn't (only surviving `model_id`s are in `info`). I'll make that change.
I've made the change, which makes `info` have the same data as `history`, but in a different format. This breaks a test:
```python
assert len(models) == len(info) == 1
```
The relevant bits of my diff are:

```diff
- assert len(info) == len(models) == 1
+ assert len(info) > len(models) == 1
+ assert set(models.keys()).issubset(set(info.keys()))
+
+ calls = {k: [h["partial_fit_calls"] for h in hist] for k, hist in info.items()}
+ for k, call in calls.items():
+     assert (np.diff(call) >= 1).all()
```
This change provides some ease-of-use and another test. Do we want to make this change?
Can you post what `info` looks like now? Breaking tests is fine for now, if we think it's a better data format.
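For reference, here is a hypothetical illustration of the proposed `info` format implied by the discussion above (the keys and values are assumptions, not the exact output):

```python
# Hypothetical `info`: every model_id maps to its full history (a list of
# dicts), not just the latest entry. Keys follow the discussion in this PR.
info = {
    0: [
        {"model_id": 0, "partial_fit_calls": 1, "score": 0.61, "params": {"alpha": 0.1}},
        {"model_id": 0, "partial_fit_calls": 2, "score": 0.64, "params": {"alpha": 0.1}},
    ],
    1: [
        {"model_id": 1, "partial_fit_calls": 1, "score": 0.58, "params": {"alpha": 1.0}},
    ],
}
```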
""" list of dicts to dict of lists. Assumes same keys in all dicts. | ||
https://stackoverflow.com/questions/5558418/list-of-dicts-to-from-dict-of-lists | ||
""" | ||
return {k: [dic[k] for dic in LD] for k in LD[0]} |
style nit: easier to inline below, especially if we're using it only once.
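A minimal sketch of what inlining might look like, assuming `LD` is a non-empty list of dicts with identical keys (the variable names here are hypothetical):

```python
LD = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

# Inline the list-of-dicts -> dict-of-lists conversion at the single call site.
dict_of_lists = {k: [d[k] for d in LD] for k in LD[0]}
# {'a': [1, 3], 'b': [2, 4]}
```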
I do not think … Something like the following would give `cv`-style behavior:

```python
search = AdaptiveSearchCV()
datasets = [train_test_split(X, y) for _ in range(cv)]
scores = [search.fit(X1, y1).score(X2, y2)
          for X1, X2, y1, y2 in datasets]
```

though I don't think this should be implemented right now (one train/test set is used in practice).
Correct – monotonically increasing scores is a very bad assumption (though that's almost the case when …). I think we should choose …
The protection against … Some of this will be fixed by #370 (comment), but not all of it.
Agreed, though it's good to know that this could be done in the future.

IIUC, at that point …
@TomAugspurger and I talked about this PR, and generated a couple more TODOs: …
What was the motivation behind this name? Were there any alternatives proposed?
Many :) Some considered were IncrementalSearch (the same), DecaySearch, ExponentialSearch, ExponentialDecaySearch, TimeInverseDecaySearch, and probably others. We want to convey the most important aspect of this strategy: the decaying number of models trained as we move through time. TimeDecaySearch seemed to strike a reasonable balance between accuracy and length. As for CV, the train-test split we do internally is a form of cross validation, so it's best to include the suffix.
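As a rough illustration of that internal split (a sketch only; the estimator's actual splitting logic and default test fraction may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.uniform(size=(100, 4))
y = np.random.randint(2, size=100)

# One internal hold-out split (a basic form of cross validation); 0.15 is illustrative.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
# Each candidate model is trained with partial_fit on (X_train, y_train) and
# scored on (X_test, y_test).
```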
Note that this is also currently what backs RandomSearch, which doesn't significantly decrease the number of models over time (except when they seem to plateau).
(force-pushed from 2007c95 to 8eb1582)
```
The required level of improvement to consider stopping training on
that model. The most recent score must be at most ``tol`` better
than all of the previous ``patience`` scores for that model.
Increasing ``tol`` will tend to reduce training time, at the cost
of worse models.
```
```
Setting this value to be negative allows for some noise tolerance in …
```
I think this doesn't sit with the previous paragraph (maybe the previous paragraph is wrong).

It sounds like `tol` is influencing the question "should we stop?", something like `if current_score - previous_score > tol: stop`. Setting this to negative would mean we'll consider stopping on a model that performed worse than the previous one, while with a positive value the final model must always be better than the previous `patience` scores (until we hit `max_iter`).

I also can't tell what we mean by "more accurate estimator", or why that would be. I can imagine it being because we're not overfitting as much, but I can also imagine other explanations.

Looking at the code, we say `if all(current_score < old + tol for old in patience)`. So a negative tol makes that condition more likely to be true, i.e. we continue training even when our score has dipped. So I think the first paragraph explaining `tol` is incorrect?
I set it negative to stop training when the scores start to decrease. I'd expect to stop when performance starts to degrade, not when performance doesn't increase enough.

I guess my expectations are wrong; PyTorch's `ReduceLROnPlateau` performs its action when `all(old > best + threshold for old in plateau)` for some positive `threshold`. That is, its action is performed when models haven't improved by `threshold` in `patience` epochs.
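A minimal sketch of the stopping rule being discussed, based on my reading of the condition quoted above (names and structure are illustrative, not the library's exact code):

```python
def should_stop(scores, patience, tol):
    """Stop a model when none of its last `patience` scores beat the
    current score by more than `tol`."""
    if len(scores) < patience + 1:
        return False
    current = scores[-1]
    previous = scores[-patience - 1:-1]  # the `patience` scores before the current one
    return all(current < old + tol for old in previous)

# With tol=0 and patience=2, training stops once the latest score fails to
# beat either of the two scores before it.
print(should_stop([0.60, 0.62, 0.61, 0.605], patience=2, tol=0.0))  # True
```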
TODO:

- default `tol=1e-4` (PyTorch's `ReduceLROnPlateau` `threshold`).
```
estimators and hyperparameter combinations) and repeatedly calls the underlying estimator's
``partial_fit`` method with batches of data.

These model selection algorithms are *adaptive*: they decide to keep training
```
I think we should split these two ideas a bit more cleanly in the docs:

- Adaptive: using previous fit information to decide which models to train next
- Incremental: training on large datasets by incrementally passing batches to `partial_fit`.

As written, this paragraph implies that both of these are necessary, when you could imagine Incremental without Adaptive, and Adaptive without Incremental.

I'm going back a bit on what we discussed yesterday. Our implementation requires Incremental (it only works with underlying estimators that implement `partial_fit`). So to me that's the most important of the two, and if we're only going to highlight one (e.g. the section title) it should be incremental.
I agree with you; I'm also rethinking naming. I agree that Incremental is more important to highlight than Adaptive; all adaptive algorithms must be incremental, but not all incremental algorithms are adaptive.

TODO:

- rename `BaseAdaptiveSearchCV` => `BaseIncrementalSearchCV`
I've done some more thinking on naming. `TimeDecaySearchCV` does two things:

- stop training on plateau (one very basic adaptive scheme)
- a fancier adaptive scheme (i.e., `inverse`/`topk`).

I also think we should rename `TimeDecay*`; that really only conveys that fewer models are trained as time progresses, which is also conveyed with `Adaptive`.

TODO:

- separate "stop on plateau" and adaptive methods in `TimeDecaySearchCV`
- rename `TimeDecaySearchCV` => `InverseTimeDecaySearchCV`
```
* ``model_id``
* ``partial_fit_calls``
* ``params``
* ``param_{key}`` for every ``key`` in the parameters for that model.
```
Rather than "parameters for that model", should it be "hyperparameters in params
"?
Or "where key
is every key in params
"?
Scikit-learn gives an example `cv_results_` in their docs, and doesn't address this generally.
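For illustration, a hypothetical `cv_results_` built from the keys listed above plus the single `test_score` key mentioned elsewhere in this PR (the values and array types here are assumptions):

```python
import numpy as np

cv_results_ = {
    "model_id": np.array([0, 1]),
    "params": np.array([{"alpha": 0.1}, {"alpha": 1.0}]),
    "param_alpha": np.array([0.1, 1.0]),
    "partial_fit_calls": np.array([8, 3]),
    "test_score": np.array([0.72, 0.55]),  # one train/test split, so no mean_test_score
}
```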
```
best_score_ : float
    Best score achieved on the hold out data, where "best" means "highest
    score after a models final partial fit call".
```
missing apostrophe in "model's".
Is the explanation "highest final score on the validation set" clearer?
I avoided that because I wanted to define "final". I've changed it to:

> where "best" means "the highest score on the hold out set after a model's last partial fit call."
```python
bad = set(info) - set(best)
self._all_models.update(set(info.keys()))
instructions = self._adapt(info)
bad = set(self._all_models) - set(instructions)
```
I'm getting a bug with this sometimes. This commit attempts to resolve it. On occasion, I get a `KeyError` in `_incremental.py` with `speculative.pop(ident)` at

```python
model = speculative.pop(ident)
```

which I take to mean some models are being killed early. This happens when `bad == set()` at

```python
bad = set(models) - set(instructions)
```
@mrocklin any thoughts here? I've seen this occasionally too, but haven't been able to debug it successfully.
No thoughts off-hand. Is anyone able to produce a minimal example by any chance?
```
@@ -256,6 +256,11 @@ def get_futures(partial_fit_calls):

models = {k: client.submit(operator.getitem, v, 0) for k, v in models.items()}
yield wait(models)
info = {}
```
Slight preference for:

```python
info = defaultdict(list)
for h in history:
    info[h["model_id"]].append(h)
info = dict(info)
```
```diff
@@ -599,20 +614,94 @@ def score(self, X, y=None):
     return self.scorer_(self.best_estimator_, X, y)


-class IncrementalSearch(BaseIncrementalSearch):
+class AdaptiveStopOnPlateauSearchCV(BaseIncrementalSearchCV):
```
Inconsistent with the name in the `__init__.py`.
-1 on this name. I think that it is far too long and technical. I think that it would scare non-technical users.
`AdaptiveStopOnPlateauSearchCV` is private (or at least that's what I intend). It is intended for use by other developers who have an adaptive algorithm they'd like to implement and also want stop on plateau (e.g., #221).

Here's the structure I intended (see the sketch below):

- `BaseIncrementalSearchCV`: private.
  - Use: base class for all incremental searches
- `AdaptiveStopOnPlateauSearchCV`: private, inherits from `BaseIncrementalSearchCV`.
  - Use: allow other devs to easily create adaptive algorithms that stop on plateau (by overriding `_adapt`)
- `InverseTimeDecaySearchCV`: public, inherits from `AdaptiveStopOnPlateauSearchCV`.
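A minimal sketch of that intended hierarchy (the `_adapt` hook's signature and return value are assumptions for illustration, not the final API):

```python
class BaseIncrementalSearchCV:
    """Private: base class for all incremental searches."""


class AdaptiveStopOnPlateauSearchCV(BaseIncrementalSearchCV):
    """Private: adds stop-on-plateau; other adaptive algorithms override ``_adapt``."""

    def _adapt(self, info):
        # Hypothetical default: request one more partial_fit call for every
        # surviving model; the stop-on-plateau logic would then filter this.
        return {model_id: 1 for model_id in info}


class InverseTimeDecaySearchCV(AdaptiveStopOnPlateauSearchCV):
    """Public: trains fewer models as partial_fit calls accumulate."""

    def _adapt(self, info):
        # A fancier, time-decaying adaptive scheme would go here (sketch only).
        ...
```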
I think that something like the implementation of `IncrementalSearch` in master should be public. I think that `InverseTimeDecaySearchCV` is also probably an overly-technical name.

I recommend that we open an issue where we can discuss name changes to what is in master today and future restructurings. In general I recommend doing this before investing much work.
Indeed. The changes to `info` and `cv_results_`, etc. are useful in their own right. Those shouldn't be held up by a naming discussion.
```
@@ -15,7 +15,7 @@ The underlying estimator will need to be able to train on each cross-validation
See :ref:`hyperparameter.drop-in` for more.

If your data is large and the underlying estimator implements ``partial_fit``, you can
Dask-ML's :ref:`*incremental* hyperparameter optimizers <hyperparameter.incremental>`.
```
I'm starting to doubt the "two kinds of hyperparameter optimization estimators" I laid out on line 6. After our conversation the other day, it sounds like there are (at least) 3.

- Static search: All combinations to be tried are specified up front. The full dataset (or a CV split of the full dataset) is used on each training call.
- Adaptive search: Regardless of exactly which data is used (CV split of the full dataset, or a batch), the salient feature of adaptive search is that models are prioritized according to past performance.
- Incremental search: Regardless of how models are chosen for training (static or adaptive), the salient feature of incremental search is that models are trained on batches of data passed to `partial_fit`. The underlying estimator must implement `partial_fit`.

Our current implementation is a mix of 2 and 3. Does that framing accord with others' thoughts? Can we think of a clear way to explain that? The docs below do a decent job I think. This line, though, I think emphasizes adaptive too much given the framing above (you have a large dataset).
Maybe a clearer way of explaining it: there are two dimensions to searches:

- Does the search use `partial_fit` or `fit`?
- Is the search passive or adaptive? That is, does it use previous evaluations to select which models to train further?

Your categories fall pretty nicely into this (maybe renaming "static" to "passive").
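A rough sketch of how existing and proposed searchers might fall along those two dimensions (the placements are my reading of this discussion, not a settled taxonomy):

```python
# (passive/adaptive, fit/partial_fit) -> example searcher
search_styles = {
    ("passive", "fit"): "sklearn GridSearchCV / RandomizedSearchCV",
    ("passive", "partial_fit"): "IncrementalSearchCV(..., decay_rate=0)",
    ("adaptive", "partial_fit"): "IncrementalSearchCV(..., decay_rate=1); Hyperband (#221)",
    ("adaptive", "fit"): "successive-halving-style searches (none here yet)",
}
```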
FWIW, that would require force pushing commits, which makes reviewing somewhat harder since it breaks the “new changes” UI.
> how painful would it be for you to separate this PR into a few PRs, each of which addresses only a single issue?
>
> I think it'd be easier for me to have separate commits, one for each issue (e.g., "API: add cv_results_ to IncrementalSearchCV"). I think this would be easy to review: the diffs would be small, and would have a good message. I think that'd also work better with GitHub's UI (which defaults to showing diff from master?).
>
> Creating separate commits will be required for both choices, so let's see how it works then move to separate PRs if need be.
(force-pushed from bf4df4f to 1cba727)
Whoops, already force pushed. Is this a problem on your end? I'm not sure what "new changes UI" refers to. I could open another PR with these changes, and revert this PR back to its previous state; would that help?
```python
not self.patience
or next_time_step - current_time_step < self.scores_per_fit
):
    next_time_step += 1
```
Thanks for catching this. It looks like this could probably use a test.
```diff
-search = IncrementalSearchCV(model, params, n_initial_parameters=20, max_iter=10)
+search = IncrementalSearchCV(
+    model, params, n_initial_parameters=20, max_iter=10, decay_rate=0
+)
```
Why this change? Why not also engage decay rate here?
Because for this dataset and model, I had to make a change to the tests if I included `decay_rate=1`:

```diff
- assert (
-     search.cv_results_["test_score"][search.best_index_]
-     >= search.cv_results_["test_score"]
- ).all()
- search.cv_results_["rank_test_score"][search.best_index_] == 1
+ assert (
+     search.cv_results_["test_score"][search.best_index_]
+     >= np.percentile(search.cv_results_["test_score"], 90)
+ )
+ search.cv_results_["rank_test_score"][search.best_index_] <= 3
```

(same change as in #370 (comment))
I interpret this change as meaning "random searches work best", which makes sense because the model/dataset are simple and not performant. Plus, this test is aimed at testing basic elements of `IncrementalSearchCV` but also relies on the adaptive algorithm working well.
That is, without `decay_rate=0`, this search wouldn't return the highest scoring model (with this small dataset and simple model, the search is too adaptive).
```python
if k == best_model_id:
    break

return results.models[best_model_id], best_index
```
I'm a bit confused about the need for `_get_best` generally. Why do we need to search for the best result throughout history? Is this information that `_fit` could just give us directly? I suspect that we're spending time copying how sklearn does things when it might not be the best approach overall. Thoughts?
> Why do we need to search for the best result throughout history

We don't. The current implementation more or less does

```python
score = {k: v[-1]["score"] for k, v in results.info.items()}
best_model_id = max(score, key=score.get)
```

The previous implementation picked the best model ID before searching through history.

> Is this information that _fit could just give us directly?

I refactored `_fit` to return information that's very useful in calculating that information, but not that information (the "best" model ID?) directly.

In this PR, `_fit` gives us model history separated by model ID for every model (not only the final models `_additional_calls` selects) in `results.info`. This is really useful: it allows getting the final scores for every model with

```python
info, models, history = fit(...)
final_scores = {k: v[-1]["score"] for k, v in info.items()}
best_model_id = max(final_scores, key=final_scores.get)
```
and change it in the docs too

* Rename history_results_ => history_
* Provide complete model history, and make it public (otherwise boilerplate is needed to formulate model_history_ from history_, looping over items in history and putting them in a dict, {model_id: hist})

This mirrors scikit-learn's cv_results_, with one important distinction: this implementation only tests on 1 training set. This means that there's a `test_score` key, not `mean_test_score` or `test_score0`.

Before, BaseIncrementalSearchCV assumed _additional_calls returned one model and returned that to the user. Now, BaseIncrementalSearchCV chooses the model with the highest score returned by _additional_calls. This matters if a random search is desired, or if `max_iter` is hit.

* MAINT: cleaner separation with _adapt and _stop_on_plateau functions (separates the complex adaptive algorithm from stopping on plateau, and allows overwriting _adapt for other adaptive algorithms that want to stop on plateau)
* TST: implement tests for patience and tolerance parameters
* MAINT: define "patience" to be the number of partial_fit calls, not the number of score calls

DOC: Change warning note to "this class", not "IncrementalSearch"
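The "boilerplate" the commit above refers to is roughly the following (a sketch only; `search` and the record keys are assumptions based on this PR):

```python
from collections import defaultdict

# Rebuild a per-model history, {model_id: [record, ...]}, from the flat
# ``history_`` list; the commit makes this grouping available directly.
model_history = defaultdict(list)
for record in search.history_:
    model_history[record["model_id"]].append(record)
model_history_ = dict(model_history)
```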
```python
instructions = self._adapt(info)

out = self._stop_on_plateau(info, instructions)
return out
```
I think that I would prefer to keep things as a single method until a concrete need arises to separate them.

While I appreciate the desire to split things apart into smaller chunks, I find this adds a cost of indirection, which makes code a bit harder to review and maintain in the future. I find that code that is finely split often feels better when first writing it, but ages poorly when you have to quickly understand what is going on after a year of not looking at it.

There is obviously a length limit to this (functions of several hundred lines are unpleasant), but I don't think that the `_additional_calls` function is yet there.

I haven't yet reviewed the other parts of this commit that change the behavior of `_additional_calls`, partly because it's hard to separate those changes from the refactor. If possible I'd prefer that we remove the refactor so that it's easier to review and discuss the algorithmic changes.
> I think that I would prefer to keep things as a single method until a concrete need arises to separate them.

I have a need for `_adapt` in #221. I'll make the change there, and move it back to one function in this commit.
I appreciate it. I think that including this change over there will probably help to motivate it.
```
@@ -817,6 +818,24 @@ class IncrementalSearchCV(BaseIncrementalSearchCV):
>>> search = IncrementalSearchCV(model, params, random_state=0,
...                              n_initial_parameters=1000,
...                              patience=20, max_iter=100)

``tol`` and ``patience`` are for stopping training on a "plateau", or when
```
Is the term `plateau` likely to be well understood by novice users? If yes then we should remove the quotes. If no then we might want to consider avoiding the term and instead just describe it as you do after using it. I might even start off this paragraph with:

> Often when training we get to a situation where additional work leads to little to no gain. In these cases the `tol` and `patience` parameters can help us stop work early. ...
```
``tol`` and ``patience`` are for stopping training on a "plateau", or when
the hold-out score flattens and/or falls. For example, setting ``tol=0``
and ``patience=2`` will dictate that scores will always increase until
```
We're not dictating that scores will increase (we don't determine that). We might need to change the language here.
(force-pushed from eb9c608 to 8d7478f)
This is where commit-by-commit reviewing becomes challenging. I'm going to ask that you rebase to keep reviewable chunks atomic. (Separate PRs would also be welcome, of course.)
I see. I'll try to split this into 3 separate PRs: …

How does that sound? Should the PRs be finer grained?
Superseded by #404, #405, #406. Dependency tree: …
This adds "CV" to
BaseIncrementalSearch
and it's children.BaseIncrementalSearchCV
does do some (very basic) cross validation.This PR
API: renameIncrementalSearch
toTimeDecaySearchCV
*Search
to*SearchCV
history_results_
tohistory_
cv_results_
toBaseAdaptiveSearchCV
._fit
, add all model history toinfo
IncrementalSearchCV
passive by default (or default todecay_rate=0
)MAINT: adds privateAdaptiveStopOnPlateauSearchCV
to all easier implementation of adaptive algorithm that want to stop on plateau (e.g., ENH: Hyperband implementation #221).DOC: clarify use of "adaptive" and "incremental" in docsI also tried to come up with a better name forIncrementalSearchCV
: it seems like a very general name (but I didn't think about it too hard).
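Finally, a hedged sketch of the usage this PR moves toward (class and parameter names follow the PR description and may change; the estimator, data, and hyperparameter grid are placeholders):

```python
import numpy as np
from dask.distributed import Client
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import IncrementalSearchCV  # the name under discussion

client = Client()  # incremental searches run on a Dask scheduler

X = np.random.uniform(size=(1000, 5))
y = np.random.randint(2, size=1000)
model = SGDClassifier(tol=1e-3)
params = {"alpha": np.logspace(-4, 0, num=10)}

# decay_rate=0 keeps the search passive (random-search-like);
# decay_rate > 0 trains fewer and fewer models as time goes on.
search = IncrementalSearchCV(model, params, n_initial_parameters=20, decay_rate=0)
search.fit(X, y, classes=[0, 1])
print(search.best_score_)
print(search.cv_results_["test_score"])
```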