-
-
Notifications
You must be signed in to change notification settings - Fork 556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tree splitters public #488
Conversation
Those changes bring joy to my heart 😄 However, this PR is just a stub yet. Suggestions are welcomed. @MaxHalford and @jacobmontiel, thanks for the suggestions in #476. Is that what you had in mind? |
Looking good! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments regarding my own changes. Some of them are reminders to myself, others are questions for further discussion. As I see it, the splitters will need a boost in documentation to facilitate their usage by the users.
Since #471 is not ready yet, I think we can merge this PR when we are happy with the changes and wait for the Again, I intend to provide some benchmarks of the splitters using synthetic datasets. What do you think, @jacobmontiel and @MaxHalford? |
Hey @MaxHalford and @jacobmontiel, I added a page in the user guides that is a walkthrough in the tree module. I talked about the different tree models available, model inspection, memory management, and splitters. Hope that can help the users. Your feedback is welcomed.
|
The failing tests are related to classifier chains. I believe they were recently updated. Pinging @MaxHalford. |
All set. If you require more changes before merging, please let me know, @jacobmontiel and @MaxHalford. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, I mean it! It looks like you put in a lot of work. The code looks much nicer IMO. And the fact that the parameters are not duplicated everywhere in the docs feels better too.
river/tree/__init__.py
Outdated
@@ -7,11 +9,13 @@ | |||
from .label_combination_hoeffding_tree import LabelCombinationHoeffdingTreeClassifier | |||
|
|||
__all__ = [ | |||
"splitter", | |||
"HoeffdingTree", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we want to expose this base class, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of tricky. It has some important documentation, that's the only reason it's public right now. I'm not sure this is the best approach, but it was a way to avoid repetition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% but I think we can play with the __doc__
attribute of classes. So you could have some string that contains the doc, and just add it the __doc__
of the class you want to add the documentation to. I've never done it but I've seen it done in (serious) projects :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the documentation is important. However, it does not justify exposing a class that is not intended to be used as it would confuse the users and they could end up trying to use it directly. On the other hand, it is unlikely that people will find or search for the documentation in a different class. We need to define places for generic documentation/guidance to point the user to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Keep in mind, however, that we do have some other abstract classes that are exposed in the documentation. For example, optim.Optimizer
. I am not sure whether or not this was intended.
On the other hand, it is unlikely that people will find or search for the documentation in a different class.
That is a valid point. Currently, we point to tree.HoeffdingTree
from the other trees. Something like: "for more information check tree.HoeffdingTree
". We also state that HoeffdingTree
is the base class, but I agree that this might not be enough.
Again, I would rather keep the plethora of somewhat obscure (yet so useful) parameters documented in the base class than exposing them everywhere and confusing the user with too many choices to pick from. A page in the documentation could do the work, but I find it strange to document parameters outside of a class docstring. If we opt for the first option and replicate the parameter information everywhere, we can follow Max's suggestion.
matplotlib
also relies a lot on **kwargs
. Maybe we could provide some pieces of information about what the user can do and then point to the base class. Hence, we balance the amount of the "data overhead" for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxHalford, would it be possible to add this portion of the documentation at the "index" page of the tree module? We do not have that now, but it could be a viable solution.
I am also convinced that HoeffdingTree
should not be public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can and should documentation in the __init__.py
file if that's what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a plan. Probably I will not use the same notation, (the Parameters
section, for instance) but it will do the work!
Codecov Report
@@ Coverage Diff @@
## master #488 +/- ##
==========================================
+ Coverage 83.57% 84.37% +0.80%
==========================================
Files 285 290 +5
Lines 14467 14153 -314
==========================================
- Hits 12091 11942 -149
+ Misses 2376 2211 -165
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work @smastelini
I really like how the new naming and structure tidies-up the code. Nice catch on those unnecessary parts of the code. The QO splitter refactoring is also nice. Not sure if it is listed in the PR description.
I left a couple of comments and changes. In general, the PR looks good.
Let me know if you need further information.
river/tree/__init__.py
Outdated
@@ -7,11 +9,13 @@ | |||
from .label_combination_hoeffding_tree import LabelCombinationHoeffdingTreeClassifier | |||
|
|||
__all__ = [ | |||
"splitter", | |||
"HoeffdingTree", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the documentation is important. However, it does not justify exposing a class that is not intended to be used as it would confuse the users and they could end up trying to use it directly. On the other hand, it is unlikely that people will find or search for the documentation in a different class. We need to define places for generic documentation/guidance to point the user to.
I updated the documentation in the last split to make Some screenshots: |
Disclaimer: the failing test is unrelated to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the last changes @smastelini
Issues found will be addressed in another PR.
Thanks for all your help and support, @jacobmontiel and @MaxHalford! |
As discussed in #476, currently the attribute observers, a.k.a. splitters, are picked by passing strings to the trees. Their parameters are set with dictionaries. This choice is not optimal and makes it difficult to document and add new splitters in the long run.
This PR makes the splitter selection similar to what is done with linear models' optimizers.
In summary, this PR brings to the table:
tree
module, e.g., fixmypy
issues and standardize some variable naming patterns.tree.splitter
and possible mentions to splitters in the trees' docstrings.What still needs to be done:
Create a page in the documentation explaining and comparing different splits in standalone trees and ensembles thereof.progressive_val_score
is a good option, of course, but we could also use Trackers (Benchmarks refactoring #471).tree
module should be refactored/streamlined.