
add tensorflow hub extractor #433

Merged · 51 commits · Jan 19, 2021
Conversation

Collaborator

@rbroc rbroc commented Dec 7, 2020

First draft of a (hierarchy of?) TensorFlow Hub extractor(s).
The current implementation provides a generic TFHub extractor that is agnostic to stimulus type, plus two subclasses for embedding models and classification models. At the moment, these are pretty much the only two task types for which input and output seem fairly standardized (see https://www.tensorflow.org/hub/common_saved_model_apis/images and https://www.tensorflow.org/hub/common_saved_model_apis/text).

Closes #428.

@rbroc rbroc changed the title add tensorflow hub extractor WIP: add tensorflow hub extractor Dec 8, 2020
@rbroc rbroc requested review from adelavega and tyarkoni December 11, 2020 12:19
Collaborator Author

rbroc commented Dec 11, 2020

@adelavega @tyarkoni the current implementation could be one way to go about this.

There's a generic extractor (TFHubExtractor) to which you can, in principle, pass any TFHub model. It packs the output into a number of feature columns with generic names, or into columns with custom names if you pass them (if only one name is passed, it packs everything into a single column). It's a class that lets you use any model, but at your own risk, as it remains rather agnostic about input type, input preprocessing, and output postprocessing.

Then, there are two modality-specific extractors, one for ImageStim and one for TextStim, for which the input format in TFHub is more standardized (see https://www.tensorflow.org/hub/common_saved_model_apis and following sections). The modality-specific extractors do the preprocessing needed to pass the data from each Stim type to the models, and the postprocessing needed to pass the output to the ExtractorResult object.

This should work for most image classification and embedding models, as well as text embedding models. In terms of output, these extractors are also pretty flexible. If you do not pass any labels/feature names, the extractor will just split the output into a number of features/columns with generic names ('feature_0', 'feature_1', ..., 'feature_n') according to the dimensionality of the model output. If you pass feature names, it will use those. If you pass a single feature name (e.g. 'embedding'), it will pack everything into one feature/column with that name.

We could also decide to be more specific and always constrain the dimensionality of the output to n_classes for classification, and to one feature (an embedding vector) for embedding models. We could do so by creating subclasses of TFHubImageExtractor for classification and embedding tasks, where all we do is force the feature name attribute to be an n_classes-long list (of either generic or user-specified labels) in the classification case, and a one-element feature list (e.g. ['embedding']) in the embedding case. I think all methods could be inherited as-is from the parent classes.

Let me know what you think - I'll go ahead and write tests if you like the current approach.
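For concreteness, here's a minimal sketch of how the interface described above might be used, assuming the __init__ signature shown in the diff (url_or_path, features, task, transform_out) and the usual pliers transform/to_df workflow. The text extractor's class name, the model URLs, file names, and feature labels are purely illustrative, not defaults baked into the implementation:

```python
from pliers.stimuli import ImageStim, TextStim
from pliers.extractors import TFHubImageExtractor, TFHubTextExtractor

# Classification-style output: passing a list of labels yields one named
# column per class; with no labels you'd get feature_0 ... feature_n instead.
img_ext = TFHubImageExtractor(
    'https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/4',
    features=['label_0', 'label_1']  # illustrative; real use needs n_classes labels
)
img_df = img_ext.transform(ImageStim('picture.jpg')).to_df()

# Embedding-style output: a single feature name packs the whole vector
# into one column with that name.
txt_ext = TFHubTextExtractor(
    'https://tfhub.dev/google/nnlm-en-dim128/2',
    features=['embedding']
)
txt_df = txt_ext.transform(TextStim(text='a sentence to embed')).to_df()
```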

Collaborator Author

rbroc commented Dec 11, 2020

One more note: right now the generic extractor crashes (TypeError: Can't instantiate abstract class TFHubExtractor with abstract methods _input_type) because _input_type is not specified, even when one specifies _optional_input_type (which, in my understanding, would be the most suitable thing here, as it would make it possible to accept any Stim depending on which model you pass).

Is this expected behavior? In my understanding, specifying _optional_input_type should override the need to specify _input_type.

If this is expected behavior, though, and I'm misunderstanding something: shall I create modality-specific classes for audio and video where _input_type is set? Or are there other ways to allow the generic extractor to accept any input, if we want to go for this option?
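For reference, the failure can be reproduced with a stripped-down stand-in for the pliers hierarchy, assuming (as the traceback suggests) that _input_type is declared abstract on the base class. The class names here are simplified and are not the actual pliers code:

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    # Stand-in for the pliers base class: _input_type is abstract, so a
    # subclass that does not define it cannot be instantiated.
    @property
    @abstractmethod
    def _input_type(self):
        pass

class GenericExtractor(BaseExtractor):
    # Defining only an optional input type does not satisfy the abstract
    # requirement, hence the TypeError at instantiation time.
    _optional_input_type = ()

GenericExtractor()
# TypeError: Can't instantiate abstract class GenericExtractor
# with abstract methods _input_type
```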

Collaborator

@tyarkoni tyarkoni left a comment

Very minor comments I leave to your discretion; otherwise, looks great!

url_or_path (str): url or path to TFHub model. You can
browse models at https://tfhub.dev/.
task (str): model task/domain identifier
features (optional): list of labels (for classification)
Collaborator

This might need more explanation; some examples might be helpful, especially if the semantics depend on the model being loaded.

Collaborator Author

Added more explanation in the new commits.

pliers/extractors/models.py

def __init__(self, url_or_path, features=None, task=None,
             transform_out=None, **kwargs):
    verify_dependencies(['tensorflow', 'tensorflow_hub',
Collaborator

Are all of these dependencies needed here? At least from the code, it looks like it might only be tensorflow_hub (for KerasLayer).

Collaborator Author

@rbroc rbroc Jan 5, 2021

Only tensorflow_hub is required, indeed.
I removed the tensorflow dependency and moved the attempt to import tensorflow_text to the text extractor.
It is only needed for some models, and there's no way to know whether it is needed until the model is called, so I've added a warning at initialization when the import fails.
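Roughly this pattern, as a sketch (not the exact pliers code):

```python
import warnings

try:
    # Some TFHub text models register custom ops through tensorflow_text when
    # the model is loaded, so the import is attempted up front.
    import tensorflow_text  # noqa: F401
except ImportError:
    warnings.warn(
        'tensorflow_text could not be imported. Some TensorFlow Hub text '
        'models will not work unless tensorflow_text is installed.')
```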

@rbroc rbroc changed the title WIP: add tensorflow hub extractor add tensorflow hub extractor Jan 5, 2021
Collaborator Author

rbroc commented Jan 5, 2021

@tyarkoni @adelavega I've added tests and implemented Tal's suggestions, so this is ready for a new review. Tests pass locally, but they are not triggered automatically on Travis, and I'm not sure whether (or how) I can start them manually. As with the BERT extractor, I expect these tests may cause some memory issues on Travis (each extractor is tested on more than one model), but let's see what happens.

By the way, there is probably also some updating to do for the BERT extractors; I can take care of that in a separate PR.

Collaborator

@tyarkoni tyarkoni left a comment

Looks good, just one minor suggestion.

    self.transform_inp = transform_inp
    super().__init__()

def get_feature_names(self, out):
Collaborator

This is currently publicly exposed, so we might want to move the check for self.features inside here (instead of doing it in _extract) and use that if available. Otherwise a user might naively call get_feature_names expecting to get the stored feature names, and instead they'll get the naive enumeration of feature_*.

Alternatively, if it's not meant to be public, maybe rename to _get_feature_names.

Collaborator Author

Thanks for the feedback and suggestion, Tal! Moved the check for self.features inside get_feature_names.
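In other words, roughly the following (a simplified sketch of the behavior described, not the exact implementation):

```python
class TFHubExtractor:  # simplified stand-in for the actual class
    def __init__(self, features=None):
        self.features = features

    def get_feature_names(self, out):
        # Prefer user-supplied feature names; otherwise fall back to generic
        # enumerated names based on the output's last dimension.
        if self.features is not None:
            return self.features if isinstance(self.features, list) \
                else [self.features]
        return [f'feature_{i}' for i in range(out.shape[-1])]
```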

Member

adelavega commented Jan 15, 2021

@rbroc I think that when running in forked mode, the pytest printout becomes very ugly and it looks like it hangs, but it doesn't. This happened to me as well.

It looks like there is one failure on 3.7 and 3.8:
worker 'gw1' crashed while running 'pliers/tests/extractors/test_model_extractors.py::test_tfhub_text'
It's possible that this single extractor is using far too much RAM and crashing regardless.

I pushed a commit that:

  • removes -n auto from the pytest call, which might make the printout nicer and make it less likely that two memory-intensive tests run at once
  • marks test_tfhub_text to run forked

I'm curious to see how much slower it will be without -n auto (it was ~12 minutes before). If tests still don't pass, problematic tests such as test_tfhub_text may have to be skipped unless a variable such as RUN_HIGH_MEM = True is set. You could then set that variable to True after the rest of the tests have run, and run the high-memory tests in their own line with --forked. If that still doesn't work, they may have to be skipped altogether on CI.
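If skipping becomes necessary, it could look roughly like this (a sketch assuming the pytest-forked plugin; RUN_HIGH_MEM is the hypothetical flag from the comment above, read from the environment here):

```python
import os
import pytest

# Only run high-memory TFHub tests when explicitly requested, e.g. in a
# separate CI step with RUN_HIGH_MEM=true and --forked.
RUN_HIGH_MEM = os.environ.get('RUN_HIGH_MEM', 'false').lower() == 'true'

@pytest.mark.forked  # provided by the pytest-forked plugin
@pytest.mark.skipif(not RUN_HIGH_MEM,
                    reason='high-memory TFHub test; set RUN_HIGH_MEM=true to run')
def test_tfhub_text():
    ...  # test body unchanged
```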

Member

Looks like -n auto works when the forked and non-forked tests are called separately. This takes the total execution time to ~21-28 minutes, with the first tests completing in only about 12-14 mins (Python 3.6 takes way less, for some odd reason).

@rbroc it's a bit outside the scope of this PR, so perhaps we could merge this and deal with it later, but the only thing left is to change the MetricExtractor test to use a more basic extractor.

Collaborator Author

rbroc commented Jan 19, 2021

thank you, @adelavega! 🙏 I've opened a separate issue for MetricExtractor and will look into it asap.
Tests are passing, so I assume this one is ready to merge?

Member

Sounds good, let's merge!

@adelavega adelavega merged commit e33e707 into PsychoinformaticsLab:master Jan 19, 2021