
Modularize the component selection decision tree #403

Closed
handwerkerd opened this issue Sep 27, 2019 · 14 comments · Fixed by #756
Labels: discussion (issues that still need to be discussed), TE-dependence (issues related to TE dependence metrics and component selection)

Comments

@handwerkerd
Member

We are planning to modularize the code for the metrics calculated on each component and the decision tree for accepting/rejecting/ignoring components at the 2019 Hackathon (#373). This issue is a place to focus any discussion on how we'll do this.

Next Steps

  • Make a general framework for how the modularization will work
  • Reorganize 58% of the code in tedana
@handwerkerd added the discussion label Sep 27, 2019
@handwerkerd
Member Author

I was talking about this with @afni-dglen and he made a suggestion for one issue that was bothering me. I wanted to document that suggestion while it's on my mind. Most of the metrics can easily be calculated modularly for a list of components. Some thresholds and steps in the decision tree, however, use what I call ranking metrics: take another metric, rank all the components by that value, and then make a threshold decision based on their ranks. This gets tricky to modularize because sometimes these ranks are calculated across all components and sometimes across only a subset of components, depending on what happened earlier in the decision tree, so they can't be completely modularized away from what will happen in the tree.

The suggestion is that all threshold and ranking functions have an input to specify which components to operate on, based on classification. That is, calculate the threshold on "all components", "only unclassified components", etc. To do this, we might need to add some intermediate classification labels (e.g. "high variance components"), but that might not be a bad thing.
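As a rough sketch of what that interface could look like (the function, column, and label names below are hypothetical placeholders, not existing tedana code, and it assumes the component table is a pandas DataFrame):

import pandas as pd


def rank_threshold(comptable, metric='kappa', decide_comps='unclassified', n_keep=2):
    """Hypothetical sketch of a ranking/threshold step restricted by classification."""
    if decide_comps == 'all components':
        selected = comptable.index
    else:
        selected = comptable.index[comptable['classification'] == decide_comps]
    # Rank only the selected components by the metric (highest value = rank 1)
    ranks = comptable.loc[selected, metric].rank(ascending=False)
    return ranks[ranks <= n_keep].index


# toy usage: keep the two highest-kappa components among the unclassified ones
comptable = pd.DataFrame({
    'kappa': [80., 45., 120., 60.],
    'classification': ['unclassified', 'rejected', 'unclassified', 'unclassified'],
})
print(rank_threshold(comptable, decide_comps='unclassified', n_keep=2).tolist())  # [0, 2]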

@handwerkerd
Member Author

Another comment from @afni-dglen was that, in the current decision tree, we only record the final decision in the tree before accept/reject/ignore. This becomes an issue if people can build their own decision trees using the more modular functions. For that case, we might want to track every node in the decision tree that a component touched, instead of only the final node. Attaching an ID to each node wouldn't be hard, but swapping out the current final label for a path of steps might be a bit more work.
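A minimal sketch of that bookkeeping, assuming a pandas component table; the column name and node IDs are made up for illustration:

import pandas as pd

# Hypothetical: each component keeps the ordered list of decision-tree nodes
# that touched it, rather than only the final label.
comptable = pd.DataFrame({
    'classification': ['accepted', 'rejected', 'ignored'],
    'node_path': ['node01', 'node01;node04', 'node01;node04;node07'],
})

# When a node evaluates some components, append its ID to their paths.
touched = [0, 2]  # indices of components this node operated on
comptable.loc[touched, 'node_path'] += ';node09'
print(comptable['node_path'].tolist())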

@handwerkerd changed the title from "Modularize the component selection decision Tree" to "Modularize the component selection decision tree" Sep 27, 2019
@afni-dglen

First, both suggestions formally came from @mrneont (Paul Taylor). The suggestion was actually to rank all components at the beginning, before any decisions. Then, at the time of any decision in the tree, operate only on the rankings of components that have some other required property. The rankings do not need to be recomputed because they are still relative to each other within the list. So, taking the list of (provisionally) accepted components below as an example, if one wants the top two components, then comp4 and comp5 are accepted.

comp4 rank=1 accept
comp5 rank=3 accept
comp6 rank=2 reject
comp7 rank=4 accept
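To make that concrete, here is a small sketch of selecting the top two among only the accepted components without recomputing ranks (assuming a pandas table holding the ranks from a single up-front ranking of all components):

import pandas as pd

comps = pd.DataFrame(
    {'rank': [1, 3, 2, 4],
     'classification': ['accept', 'accept', 'reject', 'accept']},
    index=['comp4', 'comp5', 'comp6', 'comp7'])

# Operate only on the provisionally accepted components; the ranks computed
# up front stay valid because they are still relative to each other.
accepted = comps[comps['classification'] == 'accept']
top_two = accepted.nsmallest(2, 'rank')
print(top_two.index.tolist())  # ['comp4', 'comp5']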

@jbteves
Collaborator

jbteves commented Sep 27, 2019

@afni-dglen and @handwerkerd: So to clarify, the discussion centered around tracking the classification globally (and adding some classifications) so that we have some states or codes associated with certain properties, as well as the accept/reject/ignore decision? So for a simpler example, we might have:

| Item | Classification | Decision  |
|------|----------------|-----------|
| 1    | Ice Cream      | Eat       |
| 2    | Mud            | Don't Eat |
| 3    | Rocks          | Don't Eat |

Should we also then add additional classifications, so that an item can be multiple types? This seems like a plausible long-term problem. So then we might want the ability to have something like

| Item | Classification(s)    | Decision  |
|------|----------------------|-----------|
| 1    | Ice Cream, Vanilla   | Don't Eat |
| 2    | Ice Cream, Chocolate | Eat       |
| 3    | Mud                  | Don't Eat |
| 4    | Rocks                | Don't Eat |
| 5    | Mud, Rocks           | Don't Eat |

(I hope if I use this example people are more likely to read, digest, and understand than if I say components and use real classifications).

I like this idea, if I'm understanding it correctly, because it would then be easier to change which classifications lead to acceptance/rejection and to permute different choices to see their outcomes.
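Continuing the toy example, one way to get that flexibility is to keep the classification-to-decision rules as plain data so they can be swapped or permuted; a sketch (all names made up):

# The reject rules are just data; permuting them changes the outcome
# without touching the classification code itself.
reject_if = {'Mud', 'Rocks', 'Vanilla'}

def decide(classifications):
    """Return 'Eat' unless any classification is on the reject list."""
    return "Don't Eat" if reject_if & set(classifications) else 'Eat'

print(decide(['Ice Cream', 'Chocolate']))  # Eat
print(decide(['Ice Cream', 'Vanilla']))    # Don't Eat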

@handwerkerd
Member Author

@jbteves That's about right. One downside of this approach is that every function that can alter the decision for an item would need a unique identifier. This might over-extend the current number-based system (the codes in https://tedana.readthedocs.io/en/latest/outputs.html#component-tables), but I don't see an option that wouldn't potentially over-extend that system.

@jbteves
Collaborator

jbteves commented Sep 27, 2019

I don't think that would over-extend it; it just makes things messy. We could use enums or named tuples or something to make the code legible. It just means that, in addition to adding a test, any classifier function would also need to add any new values it creates to the allowable values in the enum/tuple. That way an invalid value will also trigger an interpreter error, which would help any debugging around those methods. Thoughts @emdupre @tsalo ?
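A minimal sketch of the enum idea (the state names here are illustrative, not an agreed-upon set):

from enum import Enum


class ComponentState(Enum):
    # Any classifier function that introduces a new state would add it here,
    # so the allowable values live in one place.
    UNCLASSIFIED = 'unclassified'
    HIGH_VARIANCE = 'high variance'
    ACCEPTED = 'accepted'
    REJECTED = 'rejected'
    IGNORED = 'ignored'


# Looking up an unknown state fails loudly instead of silently passing through:
ComponentState('high variance')   # fine
# ComponentState('hgih variance') # would raise ValueError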

@afni-dglen

The unique codes in the component tables may be useful as a simple counting system, but I think they won't generalize well with even small additions or changes to the processing. Named processes, as @jbteves suggests, seem like a better idea. This isn't required for processing, but keeping a history or provenance that way may be useful for later processing or record keeping.

@tsalo
Member

tsalo commented Oct 1, 2019

Multiple metrics are calculated on subsets of components at certain stages in the decision tree. Is the proposed method able to work with those?

@jbteves
Collaborator

jbteves commented Oct 1, 2019

Yes, I think the idea is that you could basically select components in certain "states" and only use those. So for example, you could take all components classified as "noise," and analyze only those. It's just that we're proposing to do the following:

  1. Make each function that performs classification add a state to components that meet its classification spec
  2. Make a component table which can hold multiple states for each component
  3. Name the states rather than using "codes." Though you can use an enum to have it both ways, really.

@tsalo added the denoising and TE-dependence labels and removed the denoising label Oct 4, 2019
@tsalo
Member

tsalo commented Oct 12, 2019

Okay, does this pseudocode fit with the proposed approach, or are there elements that I'm missing?

import numpy as np


def selection_function(comptable, data, mixing_matrix):
    """Identify "hazelnut" items, limited to vanilla ice creams.

    Hazelnut items shall be rejected, as they exhibit TE-independence.
    """
    required_states = ['ice cream', 'vanilla']
    selection_states = ['hazelnut']

    # restrict this node to components that already carry all required states
    has_required = comptable['states'].apply(
        lambda states: all(rs in states.split(', ') for rs in required_states))
    node_index = comptable.loc[has_required].index

    node_comptable = comptable.loc[node_index]
    node_mixing_matrix = mixing_matrix[node_index, :]
    updated_comptable = comptable.copy()

    # internal metric calculation step
    # In this case we're getting the total value across each component's time series,
    # which is nonsense since they're all normalized and the sum would be 0.
    node_metric = np.sum(node_mixing_matrix, axis=1)

    # selection/classification step based on component table
    sel_index = node_comptable.loc[node_metric > 5.].index

    # update states and decisions: append the new state(s) and reject
    updated_comptable.loc[sel_index, 'states'] += ', ' + ', '.join(selection_states)
    updated_comptable.loc[sel_index, 'classification'] = 'rejected'
    return updated_comptable

EDIT: And I guess the selection trees would be something like:

def selection_tree_01(comptable, mmix, catd):
    # these functions happen to calculate metrics
    comptable = selection_function_01(comptable, mmix, catd)
    comptable = selection_function_02(comptable, mmix, catd)
    # this function only uses states to make decisions
    comptable = selection_function_03(comptable)
    return comptable

@handwerkerd
Member Author

I’m trying to get things ready for the modularization effort at the hackathon. I’ve sketched out issues and ideas for the full decision tree pipeline in this narrative document: https://docs.google.com/document/d/1zhbDGxNVyzSBIsncBDIO7l15VqXREIlgHb5il7RGrPg/edit?usp=sharing
I also took my two decision tree flow charts and added the locations in the code where each step happens (see attached). The document isn’t pretty, and I’m going to try to sketch out pseudocode more explicitly before the hackathon, but this seemed like the right point to share what I’ve been working on with @tsalo and others.
Tedana_ICAandComparisonMetrics_Flowchart_v2_WithCurrentCodeLocations.pdf
Tedana_ICADecisionTree_Flowchart_WithCurrentCodeLocations.pdf

@stale

stale bot commented Jan 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana :tada:!

The stale bot added the stale label Jan 27, 2020
@handwerkerd removed the stale label Jan 27, 2020
@jbteves
Collaborator

jbteves commented Jan 27, 2020

We're working on it, stale-bot, we promise.

@stale

stale bot commented Apr 26, 2020

This issue has been automatically marked as stale because it has not had any activity in 90 days. It will be closed in 600 days if no further activity occurs. Thank you for your contributions to tedana :tada:!
