Make tree splitters public #488

smastelini · 2021-02-25T20:09:47Z

As discussed in #476, currently the attribute observers, a.k.a. splitters, are picked by passing strings to the trees. Their parameters are set with dictionaries. This choice is not optimal and makes it difficult to document and add new splitters in the long run.

This PR makes the splitter selection similar to what is done with linear models' optimizers.

In summary, this PR brings to the table:

- Rename: Attribute observer (AO) -> splitter.
- Make splitters public and standardize their naming scheme.
- Add a property to differentiate classification and regression splitters (suggestions for a better property name are welcome).
- Splitters can be passed as a parameter to the trees. They are "deepcopied" to new leaves as they are created.
- Remove redundant portions of the tree code.
- Change indirectly impacted methods (ARF variants).
- Apply misc/minor improvements to the tree module, e.g., fix mypy issues and standardize some variable naming patterns.
- Improve documentation in tree.splitter and possible mentions of splitters in the trees' docstrings.
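
To make the splitter-as-parameter idea in the list above concrete, here is a minimal, self-contained sketch. It does not use river's actual classes: `ToySplitter`, `ToyClassSplitter`, `ToyRegSplitter`, `ToyLeaf`, `ToyTreeClassifier`, and the property name `is_target_class` are all hypothetical stand-ins chosen for illustration. It shows a tree receiving a splitter object (instead of a string plus a dictionary of parameters), checking the splitter's kind through a property, and deep-copying the splitter into each new leaf.

```python
import copy
from abc import ABC, abstractmethod


class ToySplitter(ABC):
    """Hypothetical splitter interface (not river's real base class)."""

    @abstractmethod
    def update(self, x, y):
        """Absorb one (feature value, target) observation."""

    @property
    @abstractmethod
    def is_target_class(self) -> bool:
        """True for classification splitters, False for regression ones."""


class ToyClassSplitter(ToySplitter):
    def __init__(self, n_bins=32):
        self.n_bins = n_bins
        self.n_seen = 0  # per-leaf state that must not be shared

    def update(self, x, y):
        self.n_seen += 1

    @property
    def is_target_class(self):
        return True


class ToyRegSplitter(ToyClassSplitter):
    @property
    def is_target_class(self):
        return False


class ToyLeaf:
    def __init__(self, splitter):
        # Each leaf gets its own copy, so statistics never leak between leaves.
        self.splitter = copy.deepcopy(splitter)


class ToyTreeClassifier:
    def __init__(self, splitter=None):
        if splitter is None:
            splitter = ToyClassSplitter()
        if not splitter.is_target_class:
            raise ValueError("a classification tree needs a classification splitter")
        self.splitter = splitter  # kept as a pristine prototype

    def new_leaf(self):
        return ToyLeaf(self.splitter)


tree = ToyTreeClassifier(splitter=ToyClassSplitter(n_bins=16))
leaf_a, leaf_b = tree.new_leaf(), tree.new_leaf()
leaf_a.splitter.update(0.5, 1)
```

After the single `update` call, only `leaf_a`'s copy has seen data; the prototype held by the tree and the copy in `leaf_b` are untouched, which is the point of deep-copying.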

What still needs to be done:

- Create a page in the documentation explaining and comparing different splits in standalone trees and ensembles thereof. progressive_val_score is a good option, of course, but we could also use Trackers (Benchmarks refactoring #471).
- Decide whether or not other internal parts of the tree module should be refactored/streamlined.
- Ensure the final documentation renders nicely. :D

smastelini · 2021-02-25T20:12:45Z

Those changes bring joy to my heart 😄

However, this PR is still just a stub. Suggestions are welcome.

@MaxHalford and @jacobmontiel, thanks for the suggestions in #476. Is that what you had in mind?

MaxHalford · 2021-02-25T20:37:37Z

Looking good!

smastelini

I left some comments regarding my own changes. Some of them are reminders to myself, others are questions for further discussion. As I see it, the splitters will need a boost in documentation to facilitate their usage by the users.

river/tree/_nodes/efdtc_nodes.py

river/tree/_nodes/htr_nodes.py

river/tree/base_hoeffding_tree.py

river/tree/splitter/__init__.py

river/tree/splitter/base_splitter.py

river/tree/splitter/qo_splitter.py

smastelini · 2021-03-01T18:12:16Z

Since #471 is not ready yet, I think we can merge this PR when we are happy with the changes and wait for the Tracks to update our documentation page.

Again, I intend to provide some benchmarks of the splitters using synthetic datasets.

What do you think, @jacobmontiel and @MaxHalford?

smastelini · 2021-03-08T18:03:51Z

Hey @MaxHalford and @jacobmontiel, I added a page in the user guides that is a walkthrough of the tree module. I talked about the different tree models available, model inspection, memory management, and splitters. I hope that helps the users.

Your feedback is welcome.

~~(Later today I'll add some basic coverage tests for the splitters)~~ I also added some coverage tests and brought back the tests concerning Mean, Var, Cov. Those were accidentally lost in an old commit.

smastelini · 2021-03-08T18:05:20Z

The failing tests are related to classifier chains. I believe they were recently updated. Pinging @MaxHalford.

smastelini · 2021-03-10T20:40:17Z

All set. If you require more changes before merging, please let me know, @jacobmontiel and @MaxHalford.

MaxHalford

Good job, I mean it! It looks like you put in a lot of work. The code looks much nicer IMO. And the fact that the parameters are not duplicated everywhere in the docs feels better too.

MaxHalford · 2021-03-11T00:39:09Z

river/tree/__init__.py

@@ -7,11 +9,13 @@
 from .label_combination_hoeffding_tree import LabelCombinationHoeffdingTreeClassifier

 __all__ = [
+    "splitter",
+    "HoeffdingTree",


Not sure we want to expose this base class, right?

It's kind of tricky. It has some important documentation, that's the only reason it's public right now. I'm not sure this is the best approach, but it was a way to avoid repetition.

I'm not 100% sure, but I think we can play with the __doc__ attribute of classes. So you could have some string that contains the doc, and just add it to the __doc__ of the class you want to add the documentation to. I've never done it but I've seen it done in (serious) projects :)

I agree that the documentation is important. However, it does not justify exposing a class that is not intended to be used as it would confuse the users and they could end up trying to use it directly. On the other hand, it is unlikely that people will find or search for the documentation in a different class. We need to define places for generic documentation/guidance to point the user to.

That's a good point. Keep in mind, however, that we do have some other abstract classes that are exposed in the documentation. For example, optim.Optimizer. I am not sure whether or not this was intended.

> On the other hand, it is unlikely that people will find or search for the documentation in a different class.

That is a valid point. Currently, we point to tree.HoeffdingTree from the other trees. Something like: "for more information check tree.HoeffdingTree". We also state that HoeffdingTree is the base class, but I agree that this might not be enough.

Again, I would rather keep the plethora of somewhat obscure (yet so useful) parameters documented in the base class than expose them everywhere and confuse the user with too many choices to pick from. A page in the documentation could do the work, but I find it strange to document parameters outside of a class docstring. If we opt for the first option and replicate the parameter information everywhere, we can follow Max's suggestion.

matplotlib also relies a lot on **kwargs. Maybe we could provide some pieces of information about what the user can do and then point to the base class. That way, we balance the amount of "data overhead" for the user.

@MaxHalford, would it be possible to add this portion of the documentation at the "index" page of the tree module? We do not have that now, but it could be a viable solution.

I am also convinced that HoeffdingTree should not be public.

Yes, we can and should put documentation in the __init__.py file, if that's what you mean.

Sounds like a plan. I probably will not use the same notation (the Parameters section, for instance), but it will do the work!

river/tree/_nodes/base.py

codecov-io · 2021-03-11T21:04:47Z

Codecov Report

Merging #488 (a0ff412) into master (9a70300) will increase coverage by 0.80%.
The diff coverage is 83.87%.

@@            Coverage Diff             @@
##           master     #488      +/-   ##
==========================================
+ Coverage   83.57%   84.37%   +0.80%     
==========================================
  Files         285      290       +5     
  Lines       14467    14153     -314     
==========================================
- Hits        12091    11942     -149     
+ Misses       2376     2211     -165
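
As a sanity check on the summary above, the figures are internally consistent: hits plus misses equals lines, and coverage is hits divided by lines. The short sketch below recomputes them. Note that the reported percentages match truncation to two decimals rather than rounding; that is an observation about these particular numbers, not a claim about how Codecov computes its report.

```python
def coverage_hundredths(hits, lines):
    """Coverage in hundredths of a percent, truncated (integer arithmetic)."""
    return hits * 10_000 // lines


master = {"lines": 14467, "hits": 12091, "misses": 2376}
pr = {"lines": 14153, "hits": 11942, "misses": 2211}

# Every line is either hit or missed.
for report in (master, pr):
    assert report["hits"] + report["misses"] == report["lines"]

before = coverage_hundredths(master["hits"], master["lines"])
after = coverage_hundredths(pr["hits"], pr["lines"])

summary = f"{before / 100:.2f}% -> {after / 100:.2f}% ({(after - before) / 100:+.2f}%)"
print(summary)  # 83.57% -> 84.37% (+0.80%)
```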

| Impacted Files | Coverage Δ | |
|---|---|---|
| river/tree/setup.py | `0.00% <0.00%> (ø)` | |
| ...iver/tree/_attribute_test/numeric_multiway_test.py | `33.33% <33.33%> (ø)` | |
| river/tree/_nodes/arf_htr_nodes.py | `81.81% <50.00%> (ø)` | |
| river/tree/_nodes/hatr_nodes.py | `53.40% <50.00%> (-6.16%)` | ⬇️ |
| river/tree/hoeffding_adaptive_tree_classifier.py | `81.66% <50.00%> (ø)` | |
| river/tree/splitter/nominal_splitter_reg.py | `25.00% <50.00%> (ø)` | |
| river/tree/extremely_fast_decision_tree.py | `54.42% <55.55%> (+0.31%)` | ⬆️ |
| river/tree/isoup_tree_regressor.py | `73.77% <57.14%> (+0.33%)` | ⬆️ |
| river/tree/_nodes/hatc_nodes.py | `60.63% <60.00%> (-1.13%)` | ⬇️ |
| river/tree/base_hoeffding_tree.py | `66.81% <66.66%> (ø)` | |

... and 163 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9a70300...a0ff412.

jacobmontiel

Awesome work @smastelini

I really like how the new naming and structure tidies up the code. Nice catch on those unnecessary parts of the code. The QO splitter refactoring is also nice. Not sure if it is listed in the PR description.

I left a couple of comments and changes. In general, the PR looks good.

Let me know if you need further information.

jacobmontiel · 2021-03-12T03:12:28Z

river/tree/_attribute_test/__init__.py

river/tree/_nodes/arf_htc_nodes.py

river/tree/_nodes/base.py

river/tree/splitter/ebst_splitter.py

river/tree/splitter/base_splitter.py

river/tree/splitter/exhaustive_splitter.py

river/tree/splitter/histogram_splitter.py

river/tree/splitter/nominal_class_splitter.py

smastelini · 2021-03-16T02:25:35Z

I updated the documentation in the last split to make tree.HoeffdingTree hidden again. @jacobmontiel, please let me know if the documentation placement is satisfactory now :D

Some screenshots:

smastelini · 2021-03-16T02:40:49Z

Disclaimer: the failing test is unrelated to this PR.

jacobmontiel

Thanks for the last changes @smastelini
Issues found will be addressed in another PR.

smastelini · 2021-03-16T02:47:39Z

Thanks for all your help and support, @jacobmontiel and @MaxHalford!

smastelini added 4 commits February 24, 2021 17:38

Attribute Observer -> Splitter

0c2e03e

basic refactoring of the nodes

b5d1f8e

classification trees now comply with the new way of handling tree splitters

e05e102

first draft ready!

f6a31c9

smastelini self-assigned this Feb 25, 2021

smastelini requested review from MaxHalford and jacobmontiel February 25, 2021 20:09

smastelini added the Enhancement label Feb 25, 2021

smastelini commented Feb 25, 2021

smastelini added 3 commits March 1, 2021 09:40

misc improvements

d6931d9

documentation improvements

b857cfa

doc improvements

f09cb08

smastelini marked this pull request as ready for review March 1, 2021 18:03

smastelini changed the title ~~[WIP] Make tree splitters public~~ Make tree splitters public Mar 1, 2021

smastelini added 5 commits March 1, 2021 15:30

minor fixes

926e118

Merge branch 'master' of https://github.com/online-ml/river into splitters

efcca2f

On Hoeffding Trees: a guideline

fe8c1ba

Fix documentation issues and update pages

6b5dd04

apply pre-commit actions

dfcbe8e

smastelini added 6 commits March 8, 2021 17:55

Add coverage tests

d12c913

pre-commit actions

7810820

rerun notebook

9356e9c

Merge branch 'master' of https://github.com/online-ml/river into splitters

4302f75

Merge branch 'master' of https://github.com/online-ml/river into splitters

4157ef8

fix typos

1a352b3

MaxHalford reviewed Mar 11, 2021

smastelini added 4 commits March 11, 2021 09:54

rename: AttributeSplitSuggestion -> SplitSuggestion

2e63165

multi-way numeric splits + QO

5e807da

multi-way numeric splits + QO

6a1c738

pre-commit actions

a93ec28

jacobmontiel requested changes Mar 12, 2021

smastelini added 10 commits March 12, 2021 09:06

rename split tests

9aeda54

rename nominal splitters

b71f853

Rename internal Node classes (EBSTSplitter and ExhaustiveSplitter)

3b4fed1

improve the docstrings of ARF-based tree nodes

6430a0e

use list comprehension in property

a0ff412

rename parameter in cond_proba

1cc741c

Merge branch 'master' of https://github.com/online-ml/river into splitters

5fb480a

update docstrings

ed7edec

Merge branch 'master' of https://github.com/online-ml/river into splitters

8548728

update documentation and make HoeffdingTree private

1cb6cff

jacobmontiel approved these changes Mar 16, 2021

smastelini merged commit b829a57 into master Mar 16, 2021

smastelini deleted the splitters branch March 16, 2021 02:46
