[REF] Decision Tree Modularization #756

Merged
merged 207 commits into from
May 11, 2023

Collaborator

@jbteves jbteves commented Jul 15, 2021

Closes #403, closes #808, closes #809, closes #889, closes #892, closes #936, closes #931, closes #927
Supersedes #592
 
Changes proposed in this pull request:
See #592
This replaces the inflexible decision tree in tedica.py with a modular structure that will allow for multiple default and user-defined decision trees along with a more interpretable and flexible system for tracking and understanding the results.
 
Noteworthy implemented features / changes:

  • Main workflow in tedana.py is simplified, and a separate workflow, ica_reclassify.py, can be used to manually change component classifications. (This removed 1/5 of the lines of code in tedana.py, which were only there to handle the manual classification condition.)
  • Minimal and kundu decision trees are fully functional
  • Moved towards using the terminology of “Component Selection” rather than “Decision Tree” to refer to the code that’s part of the selection process. “Decision Tree” is still used more specifically to refer to the steps to classify components.
  • ComponentSelector object created to include common elements from the selection process including the component_table and information about what happens along every step of the decision tree. Additional information that will now be tracked and stored includes:
    • component_table has columns for classification_tags instead of rationale so that each decision tree can include multiple descriptive tags per component. Also, an executed tree should only include accepted or rejected components. Components that were previously ignored are now accepted with classification_tags like Low variance or Accept borderline
    • cross_component_metrics will save values calculated across components, such as the kappa & rho elbows
    • component_status_table contains how component classifications changed along every node of the decision tree
    • tree contains the executed tree along with everything calculated or changed during execution
    • used_metrics is a list of all metrics used when running the tree. (This can potentially be used to calculate only a subset of metrics for a given input tree)
  • The new class is defined in ./selection/component_selector.py, the functions that define each node of a decision tree are in ./selection/selection_nodes.py, and some key common functions used by selection_nodes are in ./selection/selection_utils.py
    • By convention, functions in selection_nodes.py that can change component classifications begin with dec_ for decision, and functions that calculate cross_component_metrics begin with calc_
    • A key function in selection_nodes.py is dec_left_op_right, which can be used to change classifications based on the intersection of 1-3 boolean statements. This means most of the decision tree consists of modular functions that calculate cross_component_metrics, followed by tests of boolean conditional statements.
  • io.py is now used to output a registry (default is desc-tedana_registry.json) of all generated file names that can then be read in to load data rather than requiring follow-up programs, like RICA, to have multiple inputted file names.
  • New documentation changes include:
    • building_decision_trees.rst, which is designed to explain the whole process in depth
    • outputs.rst was bloated and now links to two other files
      • classification_output_descriptions.rst is an explanation of the new outputs that's targeted towards a user rather than a potential developer
      • output_file_descriptions.rst is an expanded and updated explanation of all the file names that also explains how the filename registry.json is used
  • Some terminology changes, such as using component_table instead of comptable in code
  • integration tests now store testing data in .testing_data_cache and only download data if the data on OSF was updated more recently than the local data.
  • Project package management now uses pyproject.toml
  • 100% testing coverage on all new component selection functions and a net increase in total coverage.
  • Minimum python version is now 3.8 and minimum pandas version is now 2.0
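
To illustrate the idea behind dec_left_op_right, here is a minimal sketch of a node that changes classifications when a single `left op right` comparison is true. This is not tedana's implementation: the function name `left_op_right_sketch`, its parameters, the list-of-dicts stand-in for component_table, the elbow value, and the tag strings are all assumptions made for this example, and it handles one boolean statement rather than the 1-3 described above.

```python
import operator

# Hypothetical sketch: names and signature are illustrative, not tedana's API.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def left_op_right_sketch(components, cross_component_metrics, decide_comps,
                         if_true, left, op, right, tag=None):
    """Reclassify components in `decide_comps` where `left op right` is true.

    `left` and `right` are either numbers or names. A name is looked up first
    as a per-component metric, then in `cross_component_metrics`.
    Returns the number of components whose classification changed.
    """

    def resolve(term, comp):
        if isinstance(term, str):
            return comp[term] if term in comp else cross_component_metrics[term]
        return term

    n_changed = 0
    for comp in components:
        # Only components with a classification listed in decide_comps are tested.
        if comp["classification"] not in decide_comps:
            continue
        if OPERATORS[op](resolve(left, comp), resolve(right, comp)):
            if comp["classification"] != if_true:
                n_changed += 1
            comp["classification"] = if_true
            if tag is not None:
                comp["classification_tags"].append(tag)
    return n_changed


components = [
    {"kappa": 80.0, "rho": 20.0, "classification": "unclassified", "classification_tags": []},
    {"kappa": 15.0, "rho": 40.0, "classification": "unclassified", "classification_tags": []},
]
cross_component_metrics = {"kappa_elbow": 30.0}  # made-up value for the example

# Node 1: reject components where rho > kappa.
left_op_right_sketch(components, cross_component_metrics, ["unclassified"],
                     "rejected", "rho", ">", "kappa", tag="Unlikely BOLD")
# Node 2: accept still-unclassified components at or above the kappa elbow.
left_op_right_sketch(components, cross_component_metrics, ["unclassified"],
                     "accepted", "kappa", ">=", "kappa_elbow", tag="Likely BOLD")
```

After both nodes run, the first component ends up accepted and the second rejected, and each carries the tag of the node that changed it. Chaining small nodes like this is what makes the tree expressible as data rather than hard-coded logic.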
     
    Remaining work to do before merging
  • Improve documentation
    • Link building_decision_trees.rst to the rest of the documentation
    • building_decision_trees.rst is both a guide for developers and an explanation of outputs that any user might want. Make sure the user-guide parts are more accessible within the existing documentation
    • rationale and descriptions of rationale are in several places in the documentation. Remove all and replace with an explanation of classification_tags
    • Make sure documentation and function docstrings render correctly.
    • Proofread documentation and docstrings
  • Triple check the kundu tree in ./resources/decision_trees/kundu.json matches the decision tree code in Main
  • Testing coverage is already very good. While this is fresh in memory, try to improve testing coverage more.
  • tedana reports
    • Add visualization of kappa and rho elbows to tedana reports
    • Add classification_tags to hover text
  • Take a look at the rho elbow for the minimal tree to see if it’s too aggressive (easier to do after updates to reports)
  • Change provisionalreject to unclassified in the kundu decision tree since those are accepted if not rejected by other criteria
  • Get the CLI working for ica_reclassify
  • @jbteves is working on mini-tools to directly compare the results in the current main with the results of this PR. This will include:
    • Script(s) and instructions to make it easier to run both main & this PR on the same data
    • Count and identify which components changed classifications between versions of the code and the % variance explained by those changes
    • ignored classification in main should be accepted classification in this PR with one of the following classification tags: “Low variance” “Accept borderline” “No provisional accept”
  • Add the names of the inputted files for each echo into desc-tedana_registry.json so that those file names can be automatically accessed
  • Make sure all written tsv files use "n/a" rather than an empty "" in rationale fields and other places because some programs don't like the empty fields.
  • Get RICA working with the changes from this PR
  • Make sure it works with AFNI
  • Make sure it works with fMRIPrep
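
As a rough illustration of the "n/a" item above, the sketch below writes a TSV with the standard-library `csv` module and replaces empty or missing fields with "n/a". The helper name `write_tsv_with_na` and the example row are assumptions for this example; the PR itself may handle this differently (for instance through pandas).

```python
import csv
import io


def write_tsv_with_na(rows, fieldnames, handle):
    """Write dict rows as TSV, replacing empty or missing values with 'n/a'."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Treat both None and the empty string as missing.
        cleaned = {
            key: ("n/a" if row.get(key) in (None, "") else row[key])
            for key in fieldnames
        }
        writer.writerow(cleaned)


buffer = io.StringIO()
write_tsv_with_na(
    [{"Component": "ICA_00", "classification": "accepted", "classification_tags": ""}],
    ["Component", "classification", "classification_tags"],
    buffer,
)
tsv_text = buffer.getvalue()
```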

Remaining work where we can really use more help

  • Once the mini-tools are created, have multiple people run on multiple datasets to make sure it runs, gives plausible results, and users with various levels of expertise understand what’s happening.
    • If you want to help with this, leave a comment or contact @handwerkerd
    • Does Main match the output of the kundu decision tree?
    • Is the minimal decision tree reliably more conservative? The minimal tree should accept some components that were rejected by the kundu tree. It’s possible the kundu tree will accept a few low variance components that are rejected by the minimal tree.
  • After documentation is cleaned up, we will need both developers and non-developer users to read them for clarity.
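
The comparison between main and this PR could be sketched roughly as follows. This is not the mini-tool mentioned above: the function name, the dict-based table layout, and the "variance explained" key are assumptions for illustration, and the sketch assumes both runs produced the same components in the same order. It treats ignored in main becoming accepted in this PR as expected rather than as a change.

```python
# Hypothetical sketch: table layout and key names are assumptions for illustration.
EXPECTED_RENAMES = {"ignored": "accepted"}


def compare_classifications(main_table, pr_table):
    """Compare two runs, keyed by component name.

    Returns (names of components whose classification changed unexpectedly,
    total % variance explained by those components).
    """
    changed = []
    variance_changed = 0.0
    for name, main_row in main_table.items():
        # Map the old label to the label it is expected to have in the PR.
        expected = EXPECTED_RENAMES.get(main_row["classification"],
                                        main_row["classification"])
        if pr_table[name]["classification"] != expected:
            changed.append(name)
            variance_changed += main_row["variance explained"]
    return changed, variance_changed


main_table = {
    "ICA_00": {"classification": "accepted", "variance explained": 50.0},
    "ICA_01": {"classification": "ignored", "variance explained": 5.0},
    "ICA_02": {"classification": "rejected", "variance explained": 20.0},
}
pr_table = {
    "ICA_00": {"classification": "accepted"},
    "ICA_01": {"classification": "accepted"},
    "ICA_02": {"classification": "accepted"},
}
changed, variance_changed = compare_classifications(main_table, pr_table)
```

In this made-up example only ICA_02 counts as changed, since ICA_01 went from ignored to accepted as expected.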

There are several improvements that aren’t necessary before merging this PR, which are opened as standalone issues:

Discussed but not going to open an issue unless others have specific use-cases planned for this:

  • ica_reclassify.py currently just works for manually changing accepted and rejected classifications. Either that function or another can be used to run a follow-up decision tree. Figure out if there are potential use-cases for this functionality and then update code.

@jbteves jbteves marked this pull request as draft July 15, 2021 22:58
Member

@tsalo tsalo left a comment


I have a couple of initial comments. Also, it looks like tedana/selection/DecisionTree.py and tedana/selection/decision_tree_class.py are duplicates.

"info": "Following the full decision tree designed by Prantik Kundu",
"report": "This is based on the minimal criteria of the original MEICA decision tree without the more agressive noise removal steps",
"refs": "Kundu 2013",
"necessary_metrics": [
Member


I think just metrics would be cleaner here.

Collaborator Author


Thoughts @handwerkerd ?

Member


There are two terms used in the code: necessary_metrics and used_metrics. necessary_metrics are declared up front. With this input, we can take a tree, make a list of all necessary metrics, and calculate only those metrics. used_metrics is added to as metrics are used by the decision tree. At the end, there is a check to make sure no used metrics were undeclared in necessary_metrics.

The other reason I used this terminology is, if we eventually do set up the code to calculate metrics based on a decision tree, then we can have necessary_metrics that are used in the tree and something like additional_metrics which should be calculated, but not used.

I'm not wed to this exact terminology, and am open to ideas for better descriptive terms, but I think metrics alone is insufficient.
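
A minimal sketch of the end-of-tree check described above, assuming both collections are simple lists of metric names. The function name `check_declared_metrics` and the example metric names are hypothetical, and the real code may warn or raise rather than return the differences.

```python
def check_declared_metrics(necessary_metrics, used_metrics):
    """Compare metrics declared up front against metrics actually used.

    Returns (undeclared, unused):
    - undeclared: used by the tree but missing from necessary_metrics
    - unused: declared in necessary_metrics but never used by the tree
    """
    necessary = set(necessary_metrics)
    used = set(used_metrics)
    return sorted(used - necessary), sorted(necessary - used)


# Made-up metric names for illustration.
undeclared, unused = check_declared_metrics(
    necessary_metrics=["kappa", "rho", "variance explained"],
    used_metrics=["kappa", "rho", "countsigFT2"],
)
```

Here `undeclared` flags a tree that uses a metric it never declared, which is the case the check is meant to catch. `unused` is the softer case of a declared metric that no node touched.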

@handwerkerd
Member

@ME-ICA/tedana-devs Josh and I have been working on this and there's been lots of progress. I updated the initial comment to keep track of what's done and what still needs to get done. The big accomplishment is that the minimal decision tree is fully functional with all the structural and functionality changes that were recently discussed.
Function docstrings are similarly updated (though I'm sure they won't all render perfectly) and I wrote a document explaining the whole process at: https://github.com/jbteves/tedana/blob/JT_DTM/docs/building_decision_trees.rst

As you can see, there's still a good bit to do, including making the functions underlying the kundu decision tree functional again. This obviously isn't ready for a full review, but feedback is welcome.

Collaborator

@eurunuela eurunuela left a comment


I have just gone through the docs page as requested in our last meeting and have made some comments mostly related to typos.

The text is clear to me, but I might be biased after you explained the ideas behind the modularization in our last meeting.

I'll have a look again once the code is ready for reviews.

@handwerkerd
Member

Here's the poll for scheduling a decision tree walk-through meeting sometime in the next two weeks: https://doodle.com/meeting/participate/id/9b6wn5Ld I've already limited to times I could be available.
@tsalo @eurunuela @notZaki & @jbteves all expressed interest in attending.

@eurunuela
Collaborator

I was reading the docs again and I think it would be super helpful if, every time we mention, e.g., necessary_metrics or functionname, we linked those keywords to where they are in the minimal decision tree. This way we direct users to an example, which will probably help them understand the docs better.

@jbteves
Collaborator Author

jbteves commented Apr 12, 2022

@handwerkerd I had to overwrite your changes, for simplicity, for the mmix/black formatting. Sorry.

@jbteves
Collaborator Author

jbteves commented Apr 13, 2022

@eurunuela as an FYI I have created a new class, InputHarvester, that reads an OutputGenerator's "registry" of files from a previous run. You can use this to get all of the information you might want from a tedana run. See "tedana_reclassify.py" for an example of how this is done.
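
For readers unfamiliar with the registry idea, here is a rough sketch of loading a JSON file registry and resolving its entries against the registry's own directory. It does not use InputHarvester or reproduce its interface: the function name `load_registry`, the flat key-to-relative-path layout, and the example keys and file names are assumptions, and the actual registry format may differ.

```python
import json
import tempfile
from pathlib import Path


def load_registry(registry_path):
    """Return {description: Path}, resolved against the registry's directory.

    Assumes (hypothetically) that the registry is a flat JSON object that maps
    a description to a file path relative to the registry file.
    """
    registry_path = Path(registry_path)
    with registry_path.open() as handle:
        registry = json.load(handle)
    base_dir = registry_path.parent
    return {key: base_dir / relative for key, relative in registry.items()}


# Build a throwaway registry to show the round trip.
with tempfile.TemporaryDirectory() as out_dir:
    registry_file = Path(out_dir) / "desc-tedana_registry.json"
    registry_file.write_text(json.dumps({
        "mixing": "desc-ICA_mixing.tsv",          # made-up key and file name
        "component table": "desc-tedana_metrics.tsv",  # made-up key and file name
    }))
    files = load_registry(registry_file)
    mixing_name = files["mixing"].name
```

The point of the design is that a follow-up tool only needs the one registry path, and looks up everything else by description.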

@jbteves
Collaborator Author

jbteves commented Apr 13, 2022

More general FYI: this is a huge breaking change, but basically the harder I tried to put --manacc into this framework, the worse it looked. So I gave up and created a new workflow, tedana_reclassify, that has --manacc and --manrej options. It appears to work locally. I can't tell about CI because of the jinja issue.

@codecov

codecov bot commented Aug 9, 2022

Codecov Report

Patch coverage: 96.18% and project coverage change: -4.34 ⚠️

Comparison is base (fb6e255) 93.30% compared to head (8dcdafa) 88.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
- Coverage   93.30%   88.97%   -4.34%     
==========================================
  Files          28       27       -1     
  Lines        2346     3373    +1027     
  Branches        0      617     +617     
==========================================
+ Hits         2189     3001     +812     
- Misses        157      226      +69     
- Partials        0      146     +146     
| Impacted Files | Coverage Δ | |
|---|---|---|
| tedana/decomposition/pca.py | 76.61% <ø> (-12.91%) | ⬇️ |
| tedana/utils.py | 94.59% <ø> (-2.71%) | ⬇️ |
| tedana/reporting/html_report.py | 91.39% <60.00%> (-8.61%) | ⬇️ |
| tedana/reporting/static_figures.py | 96.34% <66.66%> (-2.45%) | ⬇️ |
| tedana/docs.py | 77.35% <77.35%> (ø) | |
| tedana/reporting/dynamic_figures.py | 96.05% <81.25%> (-3.95%) | ⬇️ |
| tedana/workflows/tedana.py | 80.95% <85.71%> (-8.68%) | ⬇️ |
| tedana/io.py | 87.37% <87.35%> (-6.64%) | ⬇️ |
| tedana/workflows/ica_reclassify.py | 97.79% <97.79%> (ø) | |
| tedana/selection/component_selector.py | 99.01% <99.01%> (ø) | |
| ... and 7 more | | |

... and 14 files with indirect coverage changes


handwerkerd and others added 2 commits August 10, 2022 14:37
Fixed rho threshold error and added elbows to reports
handwerkerd and others added 14 commits February 28, 2023 14:34
* Cleans up how testing datasets are downloaded within test_integration.py. In Main & the current JT_DTM each dataset is downloaded in a slightly different way and the five-echo data are downloaded twice.
* Added `data_for_testing_info` which gives the file hash location and local directory name for each of the four files we download. All tests are updated to use this function.
* The local copy of testing data will now go into the `.testing_data_cache` subdirectory
* The downloaded testing data will be in separate directories from the outputs so the downloaded directories can be completely static
* When `download_test_data` is called, it will first download the metadata json to see if the last updated copy on osf.io is newer than the downloaded version and will only download if osf has a newer file. Downloading the metadata will happen frequently, but it will hopefully be fast.
* The logger is now used to give a warning if osf.io cannot be accessed, but it will still run using cached data
* Added dec_reclassify_high_var_comps plus

* clarified diff btwn rho_kundu and _liberal thresh

* Clarified docs for minimal tree
* Update gitignore.

* Delete _version.py

* Adopt new packaging.

* Ignore the _version.py file.
* Base the cache on pyproject.toml, not setup.cfg.

* Also drop use of setup.py in publishing action.
* ica_reclassify docs now rendering in usage.html

* moves file parsing to ica_reclassify_workflow

* added error checks and tests
* add pandas version check >= 1.5.2 and mod behavior (#938)

* add version check and mod behavior if pandas >= 1.5.2 to prevent error in writing csv

* formatting

* adding P. Molfese

---------

Co-authored-by: Molfese <[email protected]>

* readded InputHarvester and expanduser

* fixed handler base_dir path

* mixing matrix file always in registry

---------

Co-authored-by: Peter J. Molfese <[email protected]>
Co-authored-by: Molfese <[email protected]>
* Drop Python 3.6 and 3.7 support.

* line_terminator --> lineterminator
* Some contributor updates

* Added doc to Marco
* Added flow charts and some text

* Finished flow charts and text.

Co-authored-by: marco7877 <[email protected]>

---------

Co-authored-by: marco7877 <[email protected]>
handwerkerd previously approved these changes May 5, 2023
* Update docs.

* Update docs/building_decision_trees.rst

Co-authored-by: Dan Handwerker <[email protected]>

---------

Co-authored-by: Dan Handwerker <[email protected]>
* Output docs on one page

* added new multi-echo lectures
@handwerkerd
Member

@ME-ICA/tedana-devs Today at 2:00PM EST (Your time zone), we are planning to release the last version of tedana before the major refactor (v0.0.13) and then merge this PR and release the more modularized version (v23.0.0).

Since this is way-too-many years in the making, we'll do this over zoom. Come join the "fun" at https://nih.zoomgov.com/j/1612837388?pwd%3DK1drdXVkK0xER1hEbkNzbUljQ0ZoUT09&sa=D&source=calendar&usd=2&usg=AOvVaw1cvaBfqiQE-iNPsBQwHmk5
Meeting ID: 161 283 7388
Passcode: 153769

Member

@tsalo tsalo left a comment


LGTM!

7 participants