
MAINT, BUG, TST: incremental API cleaning #406

Merged
17 commits merged into dask:master on Oct 19, 2018

Conversation

@stsievert (Member) commented Oct 15, 2018

This PR

  • resolves a bug where IncrementalSearchCV would get stuck in an infinite loop when decay_rate=0.
  • defines patience to be the number of calls to partial_fit (see the sketch below).
  • implements a test for tol.
  • adds to the patience test.

This PR depends on #404. A clean diff can be found at stsievert#2.
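
A minimal sketch of the new patience semantics (a hypothetical helper, not dask-ml's actual code): patience now counts partial_fit calls, so the plateau window covers the scores observed over the last patience partial_fit calls, and a model stops once none of them improves on the preceding score by at least tol.

def _plateau(scores, patience, tol, calls_per_score=1):
    # `patience` counts partial_fit calls; each score entry follows
    # `calls_per_score` partial_fit calls, so the plateau window covers
    # the last `patience // calls_per_score` scores.
    window = patience // calls_per_score
    if len(scores) <= window:
        return False  # not enough history yet to detect a plateau
    baseline = scores[-window - 1]
    return max(scores[-window:]) < baseline + tol

With the default of one score call per partial_fit call, this coincides with the old definition (patience as the number of score calls); the two only diverge when scoring happens less often than fitting.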

* Rename history_results_ => history_
* Provide the complete model history, and make it public
  (otherwise boilerplate is needed to build model_history_ from history_
  by looping over the records and grouping them into a dict, {model_id: hist};
  see the sketch after this list)
This mirrors scikit-learn's cv_results_, with one important distinction: this
implementation only tests on one training set. This means there's a
`test_score` key, not `mean_test_score` or `test_score0`.
Before, BaseIncrementalSearchCV assumed _additional_calls
returned one model and returned that to the user.

Now, BaseIncrementalSearchCV chooses the model with the highest
score returned by _additional_calls.

This matters when doing a random search, or when `max_iter` is hit (see the sketch after this list).
* MAINT: cleaner separation with _adapt and _stop_on_plateau functions
  (separates the complex adaptive algorithm from stopping on a plateau,
   and allows overriding _adapt for other adaptive algorithms
   that want to stop on plateau)
* TST: implement tests for patience and tolerance parameters
* MAINT: define "patience" to be the number of partial_fit calls, not
  the number of score calls
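
For illustration, here is roughly the boilerplate that a public model_history_ removes, plus a sketch of the new selection behavior (the `search` object and the exact record keys are assumptions based on the description above, not the library's exact code):

from collections import defaultdict

# history_ is a flat list with one record per partial_fit/score call;
# group it by model to recover each model's trajectory:
model_history = defaultdict(list)
for record in search.history_:
    model_history[record["model_id"]].append(record)

# BaseIncrementalSearchCV now picks the model whose final score is
# highest, instead of assuming _additional_calls returns one survivor:
best_id = max(model_history, key=lambda m: model_history[m][-1]["score"])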
@TomAugspurger (Member)

Merged master.

@TomAugspurger (Member)

How should we proceed here? I'd like to do a release sooner rather than later, but I'm not sure how much time we'll have to review (let alone work on) this PR in the short term.

Personally, I think that if we do decide to change the default search strategy, we'll be able to do it with a deprecation cycle (if we deem that necessary). So my main concern with releasing master as-is is that the default in master might be "bad" and turn people off this estimator. I'm not especially worried about that, though, so my vote would be to release soon and update this after the release.

@stsievert (Member, Author)

(if we deem that necessary).

There aren't many user-facing changes in this PR, so I think this update could be a point release. It improves the implementation: decay_rate=0 is now allowed, and patience is defined to be what the user expects (though the two definitions are currently the same thing with default params).

The largest user-facing change is the bug this PR fixes: it removes an infinite loop when decay_rate=0.

@TomAugspurger (Member)

I think that infinite loop was fixed in #373

@stsievert (Member, Author)

I think that infinite loop was fixed in #373

I don't think so. This bug is specific to decay_rate, and involves the property that 1 / x**0 == 1 for all x > 0.
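
Concretely, a sketch of the schedule's shape (not dask-ml's exact code): each round the number of surviving models is scaled by something like 1 / time_step ** decay_rate, so with decay_rate=0 that factor is identically 1, the candidate set never shrinks, and the fitting loop has no way to terminate.

decay_rate = 0
n_models = 10
for time_step in range(1, 6):
    factor = 1 / time_step ** decay_rate  # == 1 for every time_step > 0
    n_models = max(1, int(n_models * factor))
assert n_models == 10  # nothing was ever culled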

@TomAugspurger (Member) commented Oct 16, 2018 via email

@TomAugspurger (Member)

I don't think this PR fixes the decay_rate=0 bug. AFAICT, test_search_basic isn't really being run now, since the test exits as soon as the test helper _test_search_basic is called. I believe you need to await it.

diff --git a/tests/model_selection/test_incremental.py b/tests/model_selection/test_incremental.py
index 5d828b7..c52b06a 100644
--- a/tests/model_selection/test_incremental.py
+++ b/tests/model_selection/test_incremental.py
@@ -196,9 +196,10 @@ def test_explicit(c, s, a, b):
 @gen_cluster(client=True)
 def test_search_basic(c, s, a, b):
     for decay_rate in {0, 1}:
-        _test_search_basic(decay_rate, c, s, a, b)
+        yield _test_search_basic(decay_rate, c, s, a, b)
 
 
+@gen.coroutine
 def _test_search_basic(decay_rate, c, s, a, b):
     X, y = make_classification(n_samples=1000, n_features=5, chunks=(100, 5))
     model = SGDClassifier(tol=1e-3, loss="log", penalty="elasticnet")
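
For context, a minimal tornado sketch (names are illustrative) of why the yield matters: calling a coroutine helper without yielding it just returns a Future that nothing ever drives to completion, so any code past the helper's first yield point never runs.

from tornado import gen

@gen.coroutine
def _helper():
    yield gen.sleep(0.1)          # stand-in for async cluster work
    raise gen.Return("finished")  # only reached if the caller yields

@gen.coroutine
def test_body():
    _helper()                 # BUG: Future discarded; helper never finishes
    result = yield _helper()  # correct: runs the coroutine to completion
    assert result == "finished"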

@TomAugspurger (Member)

Alrighty, I had a look and I think these changes look good. I like thinking about patience in terms of the number of partial_fit calls.

Planning to merge this evening.

@TomAugspurger merged commit 0860ae6 into dask:master on Oct 19, 2018
@TomAugspurger (Member)

Thanks @stsievert!
