
add tensorflow hub extractor #433

Merged · 51 commits · Jan 19, 2021
Conversation

Collaborator

@rbroc rbroc commented Dec 7, 2020

First draft of a (hierarchy of?) TensorFlow Hub extractor(s).
The current implementation provides a generic TFHub extractor that is agnostic to stimulus type, plus two subclasses for embedding models and classification models. At the moment, these are pretty much the only two task types for which input and output seem fairly standardized (see https://www.tensorflow.org/hub/common_saved_model_apis/images and https://www.tensorflow.org/hub/common_saved_model_apis/text).

Closes #428.

@rbroc rbroc changed the title add tensorflow hub extractor WIP: add tensorflow hub extractor Dec 8, 2020
@rbroc rbroc requested review from adelavega and tyarkoni December 11, 2020 12:19
Collaborator Author

rbroc commented Dec 11, 2020

@adelavega @tyarkoni the current implementation could be one way to go about this.

There's a generic extractor (TFHubExtractor) to which you can, in principle, pass any TFHub model. It packs the output into a number of feature columns with generic names, or into columns with custom names if you pass them (if only one name is passed, it packs everything into a single column). It's a class that lets you use any model, but at your own risk, as it remains rather agnostic about input type, input preprocessing, and output postprocessing.

Then, there are two modality-specific extractors, one for ImageStim and one for TextStim, for which the input format in TFHub is more standardized (see https://www.tensorflow.org/hub/common_saved_model_apis and following sections). The modality-specific extractors do the preprocessing needed to pass the data from each Stim type to the models, and the postprocessing needed to pass the output to the ExtractorResult object.

This should work for most image classification and embedding models, as well as text embedding models. In terms of output, these extractors are also pretty flexible. If you do not pass any labels/feature names, the extractor will just split the output into a number of features/columns with generic names ('feature_0', 'feature_1', ..., 'feature_n') according to the dimensionality of the model output. If you pass feature names, it will use those. If you pass a single feature name (e.g. 'embedding'), it will pack everything into one feature/column with that name.

We could also decide to be more specific and always constrain the dimensionality of the output to n_classes for classification, and to one feature (an embedding vector) for embedding models. We could do so by creating subclasses of TFHubImageExtractor for classification and embedding tasks, where all we do is force the feature name attribute to be an n_classes-long list (of either generic or user-specified labels) in the classification case, and a one-element feature list (e.g. ['embedding']) in the embedding case. I think all methods could be inherited as-is from the parent classes.

Let me know what you think - I'll go ahead and write tests if you like the current approach.
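For concreteness, here's a minimal sketch of how the interface described above might be used, assuming the __init__ signature shown in the diff (url_or_path, features, task, transform_out) and the usual pliers transform/to_df workflow. The text extractor's class name, the model URLs, file names, and feature labels are purely illustrative, not defaults baked into the implementation:

```python
from pliers.stimuli import ImageStim, TextStim
from pliers.extractors import TFHubImageExtractor, TFHubTextExtractor

# Classification-style output: passing a list of labels yields one named
# column per class; with no labels you'd get feature_0 ... feature_n instead.
img_ext = TFHubImageExtractor(
    'https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/4',
    features=['label_0', 'label_1']  # illustrative; real use needs n_classes labels
)
img_df = img_ext.transform(ImageStim('picture.jpg')).to_df()

# Embedding-style output: a single feature name packs the whole vector
# into one column with that name.
txt_ext = TFHubTextExtractor(
    'https://tfhub.dev/google/nnlm-en-dim128/2',
    features=['embedding']
)
txt_df = txt_ext.transform(TextStim(text='a sentence to embed')).to_df()
```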

Collaborator Author

rbroc commented Dec 11, 2020

One more note: right now the generic extractor crashes (TypeError: Can't instantiate abstract class TFHubExtractor with abstract methods _input_type) because _input_type is not specified, even when one specifies _optional_input_type (which, in my understanding, would be the most suitable thing here, as it would make it possible to accept any Stim depending on which model you pass).

Is this expected behavior? In my understanding, specifying _optional_input_type should override the need to specify _input_type.

If this is expected behavior, though, and I'm misunderstanding something: shall I create modality-specific classes for audio and video where _input_type is set? Or are there other ways to allow the generic extractor to accept any input, if we want to go for this option?
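For reference, the failure can be reproduced with a stripped-down stand-in for the pliers hierarchy, assuming (as the traceback suggests) that _input_type is declared abstract on the base class. The class names here are simplified and are not the actual pliers code:

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    # Stand-in for the pliers base class: _input_type is abstract, so a
    # subclass that does not define it cannot be instantiated.
    @property
    @abstractmethod
    def _input_type(self):
        pass

class GenericExtractor(BaseExtractor):
    # Defining only an optional input type does not satisfy the abstract
    # requirement, hence the TypeError at instantiation time.
    _optional_input_type = ()

GenericExtractor()
# TypeError: Can't instantiate abstract class GenericExtractor
# with abstract methods _input_type
```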

Collaborator

@tyarkoni tyarkoni left a comment

Very minor comments I leave to your discretion; otherwise, looks great!

url_or_path (str): url or path to TFHub model. You can
browse models at https://tfhub.dev/.
task (str): model task/domain identifier
features (optional): list of labels (for classification)
Collaborator

This might need more explanation; some examples might be helpful, especially if the semantics depend on the model being loaded.

Collaborator Author

Added more explanation in the new commits.

pliers/extractors/models.py

def __init__(self, url_or_path, features=None, task=None,
             transform_out=None, **kwargs):
    verify_dependencies(['tensorflow', 'tensorflow_hub',
Collaborator

Are all of these dependencies needed here? At least from the code, it looks like it might only be tensorflow_hub (for KerasLayer).

Collaborator Author

@rbroc rbroc Jan 5, 2021

Only tensorflow_hub is required, indeed.
I removed the tensorflow dependency and moved the attempt to import tensorflow_text to the text extractor.
It is only needed for some models, and there's no way to know whether it is needed until the model is called, so I've added a warning at initialization when the import fails.
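Roughly this pattern, as a sketch (not the exact pliers code):

```python
import warnings

try:
    # Some TFHub text models register custom ops through tensorflow_text when
    # the model is loaded, so the import is attempted up front.
    import tensorflow_text  # noqa: F401
except ImportError:
    warnings.warn(
        'tensorflow_text could not be imported. Some TensorFlow Hub text '
        'models will not work unless tensorflow_text is installed.')
```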

@rbroc rbroc changed the title WIP: add tensorflow hub extractor add tensorflow hub extractor Jan 5, 2021
Collaborator Author

rbroc commented Jan 5, 2021

@tyarkoni @adelavega I've added tests and implemented Tal's suggestions, so this is ready for a new review. Tests pass locally, but they are not triggered automatically on Travis, and I'm not sure whether (or how) I can start them manually. As with the BERT extractor, I expect these tests may cause some memory issues on Travis (each extractor is tested on more than one model), but let's see what happens.

By the way, there is probably also some updating to do for the BERT extractors; I can take care of that in a separate PR.

Collaborator

@tyarkoni tyarkoni left a comment

Looks good, just one minor suggestion.

    self.transform_inp = transform_inp
    super().__init__()

def get_feature_names(self, out):
Collaborator

This is currently publicly exposed, so we might want to move the check for self.features inside here (instead of doing it in _extract) and use that if available. Otherwise a user might naively call get_feature_names expecting to get the stored feature names, and instead they'll get the naive enumeration of feature_*.

Alternatively, if it's not meant to be public, maybe rename to _get_feature_names.

Collaborator Author

Thanks for the feedback and suggestion, Tal! Moved the check for self.features inside get_feature_names.
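In other words, roughly the following (a simplified sketch of the behavior described, not the exact implementation):

```python
class TFHubExtractor:  # simplified stand-in for the actual class
    def __init__(self, features=None):
        self.features = features

    def get_feature_names(self, out):
        # Prefer user-supplied feature names; otherwise fall back to generic
        # enumerated names based on the output's last dimension.
        if self.features is not None:
            return self.features if isinstance(self.features, list) \
                else [self.features]
        return [f'feature_{i}' for i in range(out.shape[-1])]
```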

Member

adelavega commented Jan 15, 2021

@rbroc I think that when running in forked mode, the pytest printout becomes very ugly and it looks like it hangs, but it doesn't. This happened to me as well.

It looks like there is one failure on 3.7 and 3.8:
worker 'gw1' crashed while running 'pliers/tests/extractors/test_model_extractors.py::test_tfhub_text'
It's possible that this single extractor is using far too much RAM and crashing regardless.

I pushed a commit that:

  • removes -n auto from the pytest call, which might make the printout nicer and make it less likely that two memory-intensive tests run at once
  • marks test_tfhub_text to run forked

I'm curious to see how much slower it will be without -n auto (it was ~12 minutes before). If tests still don't pass, problematic tests such as test_tfhub_text may have to be skipped unless a variable such as RUN_HIGH_MEM = True is set. You could then set that variable to True after the rest of the tests have run, and run the high-memory tests in their own line with --forked. If that still doesn't work, they may have to be skipped altogether on CI.
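If skipping becomes necessary, it could look roughly like this (a sketch assuming the pytest-forked plugin; RUN_HIGH_MEM is the hypothetical flag from the comment above, read from the environment here):

```python
import os
import pytest

# Only run high-memory TFHub tests when explicitly requested, e.g. in a
# separate CI step with RUN_HIGH_MEM=true and --forked.
RUN_HIGH_MEM = os.environ.get('RUN_HIGH_MEM', 'false').lower() == 'true'

@pytest.mark.forked  # provided by the pytest-forked plugin
@pytest.mark.skipif(not RUN_HIGH_MEM,
                    reason='high-memory TFHub test; set RUN_HIGH_MEM=true to run')
def test_tfhub_text():
    ...  # test body unchanged
```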

Member

Looks like -n auto works when the forked and non-forked tests are called separately. This takes the total execution time to ~21-28 minutes, with the first tests completing in only about 12-14 mins (Python 3.6 takes way less, for some odd reason).

@rbroc it's a bit outside the scope of this PR, so perhaps we could merge this and deal with it later, but the only thing left is to change the MetricExtractor test to use a more basic extractor.

Collaborator Author

rbroc commented Jan 19, 2021

thank you, @adelavega! 🙏 I've opened a separate issue for MetricExtractor and will look into it asap.
Tests are passing, so I assume this one is ready to merge?

Member

Sounds good, let's merge!

@adelavega adelavega merged commit e33e707 into PsychoinformaticsLab:master Jan 19, 2021