[REF] Decision Tree Modularization #756
Conversation
I have a couple of initial comments. Also, it looks like `tedana/selection/DecisionTree.py` and `tedana/selection/decision_tree_class.py` are duplicates.
"info": "Following the full decision tree designed by Prantik Kundu", | ||
"report": "This is based on the minimal criteria of the original MEICA decision tree without the more agressive noise removal steps", | ||
"refs": "Kundu 2013", | ||
"necessary_metrics": [ |
I think just `metrics` would be cleaner here.
Thoughts @handwerkerd ?
There are two terms used in the code: `necessary_metrics` and `used_metrics`. `necessary_metrics` are declared up front. With this input, we can take a tree, make a list of all necessary metrics, and calculate only those metrics. `used_metrics` is added to as metrics are used by the decision tree. At the end, there is a check to make sure no used metrics were undeclared in `necessary_metrics`.
The other reason I used this terminology is that, if we eventually do set up the code to calculate metrics based on a decision tree, then we can have `necessary_metrics` that are used in the tree and something like `additional_metrics` which should be calculated but not used.
I'm not wedded to this exact terminology, and am open to ideas for better descriptive terms, but I think `metrics` alone is insufficient.
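For readers skimming the thread, here is a minimal sketch of the kind of consistency check described above. The names `necessary_metrics` and `used_metrics` come from this PR; the function itself is illustrative, not the PR's actual implementation:

```python
def check_metrics(necessary_metrics, used_metrics):
    """Flag any metric the tree used but never declared up front."""
    undeclared = set(used_metrics) - set(necessary_metrics)
    if undeclared:
        raise ValueError(
            "Metrics used by the decision tree but not declared in "
            f"necessary_metrics: {sorted(undeclared)}"
        )


# 'rho' was used by a node but never declared as necessary, so this raises
try:
    check_metrics(necessary_metrics=["kappa"], used_metrics=["kappa", "rho"])
except ValueError as err:
    print(err)
```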
@ME-ICA/tedana-devs Josh and I have been working on this and there's been lots of progress. I updated the initial comment to keep track of what's done and what still needs to get done. The big accomplishment is that the minimal decision tree is fully functional with all the structural and functionality changes that were recently discussed. As you can see, there's still a good bit to do, including making the functions underlying the kundu decision tree functional again. This obviously isn't ready for a full review, but feedback is welcome.
I have just gone through the docs page as requested in our last meeting and have made some comments, mostly related to typos.
The text is clear to me, but I might be biased after you explained the ideas behind the modularization in our last meeting.
I'll have a look again once the code is ready for review.
Here's the poll for scheduling a decision tree walk-through meeting sometime in the next two weeks: https://doodle.com/meeting/participate/id/9b6wn5Ld I've already limited it to times I could be available.
I was reading the docs again and I think it would be super helpful if every time we mention, e.g., …
@handwerkerd had to overwrite your changes for simplicity for mmix/black formatting, sorry.
@eurunuela as an FYI I have created a new class, InputHarvester, that reads an OutputGenerator's "registry" of files from a previous run. You can use this to get all of the information you might want from a tedana run. See "tedana_reclassify.py" for an example of how this is done.
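As a rough illustration of what such a registry gives you, here is a plain-JSON sketch of reading it back. This is not the actual `InputHarvester` API; the output-directory layout and the assumption that every entry is a single file name are simplifications for the example:

```python
import json
from pathlib import Path


def load_registry(out_dir):
    """Read the file registry written by a previous tedana run and return a
    mapping from output names to absolute paths."""
    # Default registry name introduced in this PR
    registry_path = Path(out_dir) / "desc-tedana_registry.json"
    with open(registry_path) as registry_file:
        registry = json.load(registry_file)
    # Sketch assumption: each registry value is one file name relative to out_dir
    return {name: Path(out_dir) / fname for name, fname in registry.items()}


# files = load_registry("tedana_out")
# mixing_file = files["ICA mixing tsv"]  # hypothetical key name for illustration
```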
More general FYI: this is a huge breaking change but basically the harder I tried to put …
Codecov Report
Patch coverage:
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
- Coverage   93.30%   88.97%    -4.34%
==========================================
  Files          28       27        -1
  Lines        2346     3373     +1027
  Branches        0      617      +617
==========================================
+ Hits         2189     3001      +812
- Misses        157      226       +69
- Partials        0      146      +146
```

☔ View full report in Codecov by Sentry.
Fixed rho threshold error and added elbows to reports
* Cleans up how testing datasets are downloaded within test_integration.py. In Main & the current JT_DTM each dataset is downloaded in a slightly different way and the five-echo data are downloaded twice.
* Added `data_for_testing_info` which gives the file hash location and local directory name for each of the four files we download. All tests are updated to use this function.
* The local copy of testing data will now go into the `.testing_data_cache` subdirectory
* The downloaded testing data will be in separate directories from the outputs so the downloaded directories can be completely static
* When `download_test_data` is called, it will first download the metadata json to see if the last updated copy on osf.io is newer than the downloaded version and will only download if osf has a newer file. Downloading the metadata will happen frequently, but it will hopefully be fast.
* The logger is now used to give a warning if osf.io cannot be accessed, but it will still run using cached data
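A rough sketch of the freshness check described in that commit, under stated assumptions: the metadata URL and the `date_modified` field are placeholders rather than the actual osf.io API, and `requests` stands in for whatever download helper tedana uses:

```python
import json
import logging
from pathlib import Path

import requests

LGR = logging.getLogger(__name__)


def needs_download(metadata_url, cache_dir):
    """Return True when the remote copy looks newer than the cached one.
    Falls back to the cached data (with a warning) if osf.io is unreachable."""
    cached_meta = Path(cache_dir) / "metadata.json"
    try:
        remote_meta = requests.get(metadata_url, timeout=10).json()
    except requests.RequestException:
        LGR.warning("Could not reach osf.io; using cached testing data.")
        return False
    if not cached_meta.exists():
        return True
    local_meta = json.loads(cached_meta.read_text())
    # ISO-8601 timestamps compare correctly as strings; 'date_modified' is a
    # placeholder field name for this sketch
    return remote_meta["date_modified"] > local_meta["date_modified"]
```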
* Added dec_reclassify_high_var_comps
* Clarified the difference between the rho_kundu and _liberal thresholds
* Clarified docs for minimal tree

* Update gitignore.
* Delete _version.py
* Adopt new packaging.
* Ignore the _version.py file.

* ica_reclassify docs now rendering in usage.html
* moves file parsing to ica_reclassify_workflow
* added error checks and tests
* add pandas version check >= 1.5.2 and mod behavior (#938)
  * add version check and mod behavior if pandas >= 1.5.2 to prevent error in writing csv
  * formatting
  * adding P. Molfese
  Co-authored-by: Molfese <[email protected]>
* re-added InputHarvester and expanduser
* fixed handler base_dir path
* mixing matrix file always in registry

Co-authored-by: Peter J. Molfese <[email protected]>
Co-authored-by: Molfese <[email protected]>

* Drop Python 3.6 and 3.7 support.
* line_terminator --> lineterminator
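The pandas behavior change mentioned in the two commits above boils down to picking the right `to_csv` keyword for the installed pandas version. A minimal sketch, assuming `packaging` is available for the version comparison and using the 1.5.2 cutoff from the commit message (file and column names here are just example data):

```python
import pandas as pd
from packaging.version import Version

# pandas renamed the to_csv keyword line_terminator -> lineterminator; pass
# whichever name the installed version expects (1.5.2 is the cutoff used here).
newline_kwarg = (
    "lineterminator" if Version(pd.__version__) >= Version("1.5.2") else "line_terminator"
)

component_table = pd.DataFrame(
    {"Component": ["ICA_00", "ICA_01"], "classification": ["accepted", "rejected"]}
)
component_table.to_csv(
    "component_table.tsv", sep="\t", index=False, **{newline_kwarg: "\n"}
)
```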
* Some contributor updates
* Added doc to Marco

* Added flow charts and some text
* Finished flow charts and text.
Co-authored-by: marco7877 <[email protected]>

* Update docs.
* Update docs/building_decision_trees.rst
Co-authored-by: Dan Handwerker <[email protected]>

* Output docs on one page
* added new multi-echo lectures
@ME-ICA/tedana-devs Today at 2:00PM EST (Your time zone), we are planning to release the last version of tedana before the major refactor (v0.0.13) and then merge this PR and release the more modularized version (v23.0.0). Since this is way too many years in the making, we'll do this over zoom. Come join the "fun" at https://nih.zoomgov.com/j/1612837388?pwd%3DK1drdXVkK0xER1hEbkNzbUljQ0ZoUT09&sa=D&source=calendar&usd=2&usg=AOvVaw1cvaBfqiQE-iNPsBQwHmk5
LGTM!
Closes #403, closes #808, closes #809, closes #889, closes #892, closes #936, closes #931, closes #927
Supersedes #592
Changes proposed in this pull request:
See #592
This replaces the inflexible decision tree in tedica.py with a modular structure that will allow for multiple default and user-defined decision trees along with a more interpretable and flexible system for tracking and understanding the results.
Noteworthy implemented features / changes:
* `tedana.py` is simplified and a separate workflow, `ica_reclassify.py`, can be used to manually change component classifications. (Removed the 1/5 of the lines of code in `tedana.py` that were just used to handle the manual classification condition.)
* A `ComponentSelector` object was created to include common elements from the selection process, including the component table and information about what happens along every step of the decision tree. Additional information that will now be tracked and stored includes:
  * `component_table` has columns for `classification_tags` instead of `rationale` so that each decision tree can include multiple descriptive tags per component. Also, an executed tree should only include `accepted` or `rejected` components. `ignored` components are now `accepted` with `classification_tags` like `Low variance` or `Accept borderline`.
  * `cross_component_metrics` will save values calculated across components, such as the kappa & rho elbows.
  * `component_status_table` contains how component classifications changed along every node of the decision tree.
  * `tree` contains the executed tree along with everything calculated or changed during execution.
  * `used_metrics` is a list of all metrics used when running the tree. (Can potentially be used to calculate a subset of metrics given an inputted tree.)
* The `ComponentSelector` class is in `./selection/component_selector.py`, the functions that define each node of a decision tree are in `./selection/selection_nodes.py`, and some key common functions used by selection_nodes are in `./selection/selection_utils.py`.
* Functions that change component classifications begin with `dec_` for decision and functions that calculate `cross_component_metrics` begin with `calc_`.
* A general function, `dec_left_op_right`, can be used to change classifications based on the intersection of 1-3 boolean statements. This means most of the decision tree is modular functions that calculate `cross_component_metrics` and then tests of boolean conditional statements. (A sketch of one such node appears after this list.)
* `io.py` is now used to output a registry (default is `desc-tedana_registry.json`) of all generated file names that can then be read in to load data rather than requiring follow-up programs, like RICA, to have multiple inputted file names.
* New documentation includes `building_decision_trees.rst`, which is designed to explain the whole process in depth.
* `outputs.rst` was bloated and now links to two other files: `classification_output_descriptions.rst` is an explanation of the new outputs that's targeted towards a user rather than a potential developer, and `output_file_descriptions.rst` is an expanded and updated explanation of all the file names that also explains how the filename registry.json is used.
* Uses `component_table` instead of `comptable` in code.
* Testing data are now cached in `.testing_data_cache` and only downloaded if the data on OSF were updated more recently than the local data.
* Packaging now uses `pyproject.toml`.
Remaining work to do before merging
* Link `building_decision_trees.rst` to the rest of the documentation.
* `building_decision_trees.rst` is both a guide for developers and an explanation of outputs that any user might want. Make sure the user-guide parts are more accessible within the existing documentation.
* `rationale` and descriptions of rationale are in several places in the documentation. Remove all and replace with an explanation of `classification_tags`.
* Make sure `./resources/decision_trees/kundu.json` matches the decision tree code in Main.
* Add `classification_tags` to hover text.
* Change `provisionalreject` to `unclassified` in the kundu decision tree since those are accepted if not rejected by other criteria.
* The `ignored` classification in main should be an `accepted` classification in this PR with one of the following classification tags: “Low variance”, “Accept borderline”, “No provisional accept”.
* Make sure generated file names are included in `desc-tedana_registry.json` so that those file names can be automatically accessed.

Remaining work where we can really use more help
There are several improvements that aren’t necessary before merging this PR which are opened as stand-alone issues:
* Resolving `component_table` vs `comptable` naming and reducing places where tags with capitalization differences can cause problems.
and reducing places where tags with capitalization differences can cause problemsDiscussed but not going to open an issue unless others have specific use-cases planned for this:
* `ica_reclassify.py` currently just works for manually changing accepted and rejected classifications. Either that function or another can be used to run a follow-up decision tree. Figure out if there are potential use-cases for this functionality and then update code.