Make tree splitters public #488

Merged: smastelini merged 32 commits into master from splitters on Mar 16, 2021

smastelini
Member

@smastelini smastelini commented Feb 25, 2021

As discussed in #476, currently the attribute observers, a.k.a. splitters, are picked by passing strings to the trees. Their parameters are set with dictionaries. This choice is not optimal and makes it difficult to document and add new splitters in the long run.

This PR makes the splitter selection similar to what is done with linear models' optimizers.

In summary, this PR brings to the table:

  • Rename: Attribute observer (AO) -> splitter.
  • Make splitters public and standardize their naming scheme.
  • Add a property to differentiate classification and regression splitters (suggestions for a better property name are welcome).
  • Splitters can be passed as a parameter to the trees. They are "deepcopied" to new leaves as they are created.
  • Remove redundant portions of the tree code.
  • Change indirectly impacted methods (ARF variants).
  • Apply misc/minor improvements to the tree module, e.g., fix mypy issues and standardize some variable naming patterns.
  • Improve documentation in tree.splitter and any mentions of splitters in the trees' docstrings.
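
To make the design in the list above concrete, here is a minimal, self-contained sketch of passing a splitter object to a tree and deep-copying it into each new leaf. None of these names come from this PR: `ToySplitter`, `ToyClassSplitter`, `ToyRegSplitter`, `ToyLeaf`, `ToyTreeClassifier`, and the `is_target_class` property are hypothetical stand-ins chosen for illustration, and the sketch keeps one splitter per leaf for simplicity.

```python
import copy
from abc import ABC, abstractmethod


class ToySplitter(ABC):
    """Hypothetical base class for splitters (not the library's actual API)."""

    @property
    def is_target_class(self) -> bool:
        # Assumed property name: True for classification, False for regression.
        return True

    @abstractmethod
    def update(self, att_val, target, sample_weight=1.0):
        """Update the statistics kept for one feature."""


class ToyClassSplitter(ToySplitter):
    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.n_seen = 0

    def update(self, att_val, target, sample_weight=1.0):
        self.n_seen += 1


class ToyRegSplitter(ToySplitter):
    def __init__(self):
        self.n_seen = 0

    @property
    def is_target_class(self) -> bool:
        return False

    def update(self, att_val, target, sample_weight=1.0):
        self.n_seen += 1


class ToyLeaf:
    def __init__(self, splitter):
        self.splitter = splitter


class ToyTreeClassifier:
    def __init__(self, splitter=None):
        if splitter is None:
            splitter = ToyClassSplitter()
        # Reject a regression splitter early instead of failing later.
        if not splitter.is_target_class:
            raise ValueError("A classification tree needs a classification splitter.")
        self.splitter = splitter

    def new_leaf(self):
        # Each leaf gets its own deep copy, so leaves never share statistics.
        return ToyLeaf(copy.deepcopy(self.splitter))


tree = ToyTreeClassifier(splitter=ToyClassSplitter(n_bins=5))
leaf_a, leaf_b = tree.new_leaf(), tree.new_leaf()
leaf_a.splitter.update(0.3, target=1)
print(leaf_a.splitter.n_seen, leaf_b.splitter.n_seen)  # prints "1 0"
```

Compared with passing a string plus a dictionary of parameters, the object carries its own configuration (here `n_bins`), so each splitter can document its parameters in its own docstring.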

What still needs to be done:

  • Create a page in the documentation explaining and comparing different splits in standalone trees and ensembles thereof. progressive_val_score is a good option, of course, but we could also use Trackers (Benchmarks refactoring #471).
  • Decide whether or not other internal parts of the tree module should be refactored/streamlined.
  • Ensure the final documentation renders nicely. :D
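
For readers unfamiliar with the evaluation scheme mentioned in the list above, the following is a rough sketch of progressive (test-then-train) validation: each sample is first used to test the model and only then to train it. This is not the library's `progressive_val_score` implementation; `progressive_mae`, `RunningMean`, and `LastValue` are hypothetical names, and the two toy models merely stand in for trees configured with different splitters.

```python
def progressive_mae(model, stream):
    """Mean absolute error where every sample is predicted before it is learned."""
    total_error, n_samples = 0.0, 0
    for x, y in stream:
        y_pred = model.predict_one(x)   # test first
        total_error += abs(y - y_pred)
        n_samples += 1
        model.learn_one(x, y)           # then train
    return total_error / n_samples if n_samples else 0.0


class RunningMean:
    """Toy model: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class LastValue:
    """Toy model: predicts the most recent target."""

    def __init__(self):
        self.last = 0.0

    def predict_one(self, x):
        return self.last

    def learn_one(self, x, y):
        self.last = y


stream = [({}, 1.0), ({}, 3.0), ({}, 2.0)]
for model in (RunningMean(), LastValue()):
    print(type(model).__name__, progressive_mae(model, stream))
```

The same loop could compare two trees that differ only in their splitter, as long as both see the stream in the same order.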

@smastelini
Member Author

Those changes bring joy to my heart 😄

However, this PR is still just a stub. Suggestions are welcome.

@MaxHalford and @jacobmontiel, thanks for the suggestions in #476. Is that what you had in mind?

@MaxHalford
Member

Looking good!

Member Author

@smastelini smastelini left a comment

I left some comments regarding my own changes. Some of them are reminders to myself, others are questions for further discussion. As I see it, the splitters will need a boost in documentation to facilitate their usage by the users.

river/tree/_nodes/efdtc_nodes.py (outdated, resolved)
river/tree/_nodes/efdtc_nodes.py (outdated, resolved)
river/tree/_nodes/htr_nodes.py (outdated, resolved)
river/tree/base_hoeffding_tree.py (outdated, resolved)
river/tree/splitter/__init__.py (resolved)
river/tree/splitter/base_splitter.py (outdated, resolved)
river/tree/splitter/qo_splitter.py (outdated, resolved)
@smastelini smastelini marked this pull request as ready for review March 1, 2021 18:03
@smastelini smastelini changed the title [WIP] Make tree splitters public Make tree splitters public Mar 1, 2021
@smastelini
Member Author

Since #471 is not ready yet, I think we can merge this PR when we are happy with the changes and wait for the Tracks to update our documentation page.

Again, I intend to provide some benchmarks of the splitters using synthetic datasets.

What do you think, @jacobmontiel and @MaxHalford?

@smastelini
Member Author

smastelini commented Mar 8, 2021

Hey @MaxHalford and @jacobmontiel, I added a page to the user guides that is a walkthrough of the tree module. It covers the different tree models available, model inspection, memory management, and splitters. I hope it helps the users.

Your feedback is welcome.

(Later today I'll add some basic coverage tests for the splitters.) I also added some coverage tests and brought back the tests concerning Mean, Var, and Cov, which were accidentally lost in an old commit.

@smastelini
Member Author

The failing tests are related to classifier chains. I believe they were recently updated. Pinging @MaxHalford.

@smastelini
Member Author

All set. If you require more changes before merging, please let me know, @jacobmontiel and @MaxHalford.

Member

@MaxHalford MaxHalford left a comment

Good job, I mean it! It looks like you put in a lot of work. The code looks much nicer IMO. And the fact that the parameters are not duplicated everywhere in the docs feels better too.

@@ -7,11 +9,13 @@
from .label_combination_hoeffding_tree import LabelCombinationHoeffdingTreeClassifier

__all__ = [
    "splitter",
    "HoeffdingTree",
Member

Not sure we want to expose this base class, right?

Member Author

It's kind of tricky. It has some important documentation, and that's the only reason it's public right now. I'm not sure this is the best approach, but it was a way to avoid repetition.

Member

I'm not 100% sure, but I think we can play with the __doc__ attribute of classes. So you could have some string that contains the doc, and just add it to the __doc__ of the class you want to document. I've never done it, but I've seen it done in (serious) projects :)
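
A minimal sketch of the `__doc__` idea described above, assuming a class decorator is an acceptable way to attach the shared text. The names `SHARED_PARAMETERS_DOC`, `with_shared_doc`, `ToyTreeClassifier`, and `ToyTreeRegressor`, as well as the parameter description, are hypothetical and only illustrate the mechanism.

```python
# Hypothetical shared documentation, written once.
SHARED_PARAMETERS_DOC = """
    Parameters
    ----------
    grace_period
        Number of samples a leaf observes between split attempts.
    """


def with_shared_doc(cls):
    """Class decorator that appends the shared text to the class docstring."""
    # `cls.__doc__` is None when the class has no docstring (or under -OO).
    cls.__doc__ = (cls.__doc__ or "") + SHARED_PARAMETERS_DOC
    return cls


@with_shared_doc
class ToyTreeClassifier:
    """Toy classifier (hypothetical, for illustration only)."""


@with_shared_doc
class ToyTreeRegressor:
    """Toy regressor (hypothetical, for illustration only)."""


print(ToyTreeClassifier.__doc__)
```

Both classes end up with the shared parameter section in their docstrings, so no base class has to be public just to host it. Whether a given documentation generator picks up docstrings modified this way would need to be checked.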

Contributor

I agree that the documentation is important. However, it does not justify exposing a class that is not intended to be used, as it would confuse users, who could end up trying to use it directly. On the other hand, it is unlikely that people will find or search for the documentation in a different class. We need to define places for generic documentation/guidance to point the user to.

Member Author

@smastelini smastelini Mar 12, 2021

That's a good point. Keep in mind, however, that we do have some other abstract classes that are exposed in the documentation. For example, optim.Optimizer. I am not sure whether or not this was intended.

> On the other hand, it is unlikely that people will find or search for the documentation in a different class.

That is a valid point. Currently, we point to tree.HoeffdingTree from the other trees. Something like: "for more information check tree.HoeffdingTree". We also state that HoeffdingTree is the base class, but I agree that this might not be enough.

Again, I would rather keep the plethora of somewhat obscure (yet so useful) parameters documented in the base class than expose them everywhere and confuse the user with too many choices to pick from. A page in the documentation could do the job, but I find it strange to document parameters outside of a class docstring. If we opt to replicate the parameter information everywhere, we can follow Max's suggestion.

matplotlib also relies a lot on **kwargs. Maybe we could provide some information about what the user can do and then point to the base class. That way, we limit the "data overhead" for the user.
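
The **kwargs idea above could look roughly like the sketch below: the subclass documents only its own parameter and forwards everything else to a base class that documents the shared ones. `_ToyBaseTree`, `ToyTreeRegressor`, and all parameter names and defaults are hypothetical; nothing here is taken from the actual tree classes.

```python
class _ToyBaseTree:
    """Hypothetical base class that documents the shared parameters.

    Parameters
    ----------
    grace_period
        Number of samples a leaf observes between split attempts.
    max_depth
        Maximum depth of the tree, or None for no limit.
    """

    def __init__(self, grace_period=200, max_depth=None):
        self.grace_period = grace_period
        self.max_depth = max_depth


class ToyTreeRegressor(_ToyBaseTree):
    """Toy regressor with one parameter of its own.

    Parameters
    ----------
    leaf_prediction
        How leaves make predictions.
    **kwargs
        Shared parameters, documented in `_ToyBaseTree`.
    """

    def __init__(self, leaf_prediction="mean", **kwargs):
        # Unknown keyword arguments raise TypeError in the base constructor.
        super().__init__(**kwargs)
        self.leaf_prediction = leaf_prediction


model = ToyTreeRegressor(leaf_prediction="mean", max_depth=3)
print(model.grace_period, model.max_depth)  # prints "200 3"
```

The trade-off is the one raised in this thread: the subclass signature stays short, but users have to follow the pointer to the base class to discover the forwarded parameters.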

Member Author

@MaxHalford, would it be possible to add this portion of the documentation at the "index" page of the tree module? We do not have that now, but it could be a viable solution.

I am also convinced that HoeffdingTree should not be public.

Member

Yes, we can and should put documentation in the __init__.py file, if that's what you mean.

Member Author

Sounds like a plan. I probably will not use the same notation (the Parameters section, for instance), but it will do the job!

river/tree/_nodes/base.py (outdated, resolved)
@codecov-io

codecov-io commented Mar 11, 2021

Codecov Report

Merging #488 (a0ff412) into master (9a70300) will increase coverage by 0.80%.
The diff coverage is 83.87%.

@@            Coverage Diff             @@
##           master     #488      +/-   ##
==========================================
+ Coverage   83.57%   84.37%   +0.80%     
==========================================
  Files         285      290       +5     
  Lines       14467    14153     -314     
==========================================
- Hits        12091    11942     -149     
+ Misses       2376     2211     -165     
Impacted Files Coverage Δ
river/tree/setup.py 0.00% <0.00%> (ø)
...iver/tree/_attribute_test/numeric_multiway_test.py 33.33% <33.33%> (ø)
river/tree/_nodes/arf_htr_nodes.py 81.81% <50.00%> (ø)
river/tree/_nodes/hatr_nodes.py 53.40% <50.00%> (-6.16%) ⬇️
river/tree/hoeffding_adaptive_tree_classifier.py 81.66% <50.00%> (ø)
river/tree/splitter/nominal_splitter_reg.py 25.00% <50.00%> (ø)
river/tree/extremely_fast_decision_tree.py 54.42% <55.55%> (+0.31%) ⬆️
river/tree/isoup_tree_regressor.py 73.77% <57.14%> (+0.33%) ⬆️
river/tree/_nodes/hatc_nodes.py 60.63% <60.00%> (-1.13%) ⬇️
river/tree/base_hoeffding_tree.py 66.81% <66.66%> (ø)
... and 163 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9a70300...a0ff412.
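
A quick check of the figures in the coverage diff above, sketched in Python: coverage rises even though the number of hits falls, because the total number of lines shrinks faster. The helper name `coverage_pct` is hypothetical, and truncating to two decimals is an assumption inferred from the reported numbers, not documented Codecov behaviour.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10_000) / 100


# Values copied from the report: hits and total lines before and after the PR.
before = coverage_pct(hits=12091, lines=14467)
after = coverage_pct(hits=11942, lines=14153)

print(before, after, round(after - before, 2))
```

In both columns hits plus misses equals the line count (12091 + 2376 = 14467 and 11942 + 2211 = 14153), so the table is internally consistent.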

Contributor

@jacobmontiel jacobmontiel left a comment

Awesome work @smastelini

I really like how the new naming and structure tidies up the code. Nice catch on those unnecessary parts of the code. The QO splitter refactoring is also nice. Not sure if it is listed in the PR description.

I left a couple of comments and changes. In general, the PR looks good.

Let me know if you need further information.

river/tree/_attribute_test/__init__.py (outdated, resolved)
river/tree/_nodes/arf_htc_nodes.py (resolved)
river/tree/_nodes/arf_htc_nodes.py (outdated, resolved)
river/tree/_nodes/base.py (resolved)
river/tree/splitter/ebst_splitter.py (outdated, resolved)
river/tree/splitter/base_splitter.py (outdated, resolved)
river/tree/splitter/exhaustive_splitter.py (outdated, resolved)
river/tree/splitter/histogram_splitter.py (resolved)
river/tree/splitter/nominal_class_splitter.py (outdated, resolved)
@smastelini
Member Author

smastelini commented Mar 16, 2021

I updated the documentation in the last split to make tree.HoeffdingTree hidden again. @jacobmontiel, please let me know if the documentation placement is satisfactory now :D

Some screenshots:

[Two screenshots of the updated documentation]

@smastelini
Member Author

Disclaimer: the failing test is unrelated to this PR.

Contributor

@jacobmontiel jacobmontiel left a comment

Thanks for the last changes @smastelini
Issues found will be addressed in another PR.

@smastelini smastelini merged commit b829a57 into master Mar 16, 2021
@smastelini smastelini deleted the splitters branch March 16, 2021 02:46
@smastelini
Member Author

Thanks for all your help and support, @jacobmontiel and @MaxHalford!
