Language Models from Transformers Lib #5187

dakshvar22 · 2020-02-04T14:57:27Z

Proposed changes:

Create a new NLP component, HFTransformersNLP, which tokenizes and featurizes incoming messages using the Transformers library.
Create LanguageModelTokenizers and LanguageModelFeaturizers, which use the information from HFTransformersNLP and set it correctly on the message object.
Architectures supported: Bert, OpenAIGPT, GPT-2, XLNet, DistilBert, Roberta
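
The division of labour described above (one component runs the language model once, and a tokenizer and a featurizer then copy its output onto the message) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Message` class, the step classes, the `"lm_docs"` key, and the whitespace tokenizer with toy feature vectors are all hypothetical stand-ins for the real components and the real model output.

```python
from typing import Any, Dict, List


class Message:
    """Hypothetical stand-in for the message object: text plus a key/value store."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data: Dict[str, Any] = {}


class SharedLanguageModelStep:
    """Plays the role of HFTransformersNLP: runs the model once and caches the result."""

    def process(self, message: Message) -> None:
        # Placeholder for the language model's own tokenizer.
        tokens: List[str] = message.text.split()
        # Placeholder "embeddings": one small vector per token.
        features = [[float(len(tok)), float(idx)] for idx, tok in enumerate(tokens)]
        message.data["lm_docs"] = {"tokens": tokens, "features": features}


class TokenizerStep:
    """Plays the role of a language model tokenizer: reuses the cached tokens."""

    def process(self, message: Message) -> None:
        message.data["tokens"] = message.data["lm_docs"]["tokens"]


class FeaturizerStep:
    """Plays the role of a language model featurizer: reuses the cached features."""

    def process(self, message: Message) -> None:
        message.data["dense_features"] = message.data["lm_docs"]["features"]


def run_pipeline(text: str) -> Message:
    """Run the three steps in order, so the model output is computed only once."""
    message = Message(text)
    for step in (SharedLanguageModelStep(), TokenizerStep(), FeaturizerStep()):
        step.process(message)
    return message
```

The point of the sketch is the ordering constraint: the shared step has to run before the tokenizer and featurizer, because both only read what it stored.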

Part of https://github.com/RasaHQ/research/issues/62

Status (please check what you already did):

added some tests for the functionality
updated the documentation
updated the changelog (please check changelog for instructions)
reformat files using black (please check Readme for instructions)

Ghostvv

Looks good! I have a couple of comments. Also, does it make sense to create 3 files for hf? Why don't we put all these helpers into 1 file?

rasa/nlu/constants.py

rasa/nlu/utils/hugging_face/hf_transformers.py

Co-Authored-By: Vladimir Vlasov <[email protected]>

tabergma

We need to add some tests for the components.

rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py

rasa/nlu/constants.py

rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py

rasa/nlu/utils/hugging_face/hf_transformers.py

tabergma · 2020-02-04T16:33:09Z

Can you also create a changelog entry and add some documentation? E.g. add the new components to https://rasa.com/docs/rasa/nlu/components/.

dakshvar22 · 2020-02-04T16:36:12Z

@tabergma Yes, tests and documentation are to be added. That wasn't ready. :)
@Ghostvv I feel readability is better with three files. From a maintenance perspective, registry.py and transformers_pre_post_processors.py are where the bulk of the changes would happen, since new models can basically be added by editing those two files. IMO the three files help in that sense. What do you think?
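
To make the maintenance argument concrete, here is a rough sketch of a registry keyed by architecture name, with per-model pre- and post-processing functions. It is an illustration under assumptions, not the contents of registry.py or transformers_pre_post_processors.py: every function name, the `REGISTRY` layout, the special-token ids, and the choice of mean pooling for models without special tokens are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

TokenIds = List[int]
Embeddings = List[List[float]]
# (sentence-level vector, per-token vectors)
Processed = Tuple[List[float], Embeddings]

# Illustrative special-token ids; real values come from the model's vocabulary.
CLS_ID, SEP_ID = 101, 102


def bert_pre(token_ids: TokenIds) -> TokenIds:
    """BERT-style input: wrap the sequence in [CLS] ... [SEP]."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


def bert_post(embeddings: Embeddings) -> Processed:
    """Use the [CLS] position as the sentence vector and drop both special positions."""
    return embeddings[0], embeddings[1:-1]


def gpt2_pre(token_ids: TokenIds) -> TokenIds:
    """No special tokens are added in this sketch."""
    return list(token_ids)


def gpt2_post(embeddings: Embeddings) -> Processed:
    """Assumption: with no special token, average the token vectors for the sentence."""
    dim = len(embeddings[0])
    sentence = [sum(row[i] for row in embeddings) / len(embeddings) for i in range(dim)]
    return sentence, embeddings


# Supporting a new architecture means adding one entry here plus its two functions.
REGISTRY: Dict[str, Tuple[Callable[[TokenIds], TokenIds], Callable[[Embeddings], Processed]]] = {
    "bert": (bert_pre, bert_post),
    "gpt2": (gpt2_pre, gpt2_post),
}


def featurize(
    model_name: str,
    token_ids: TokenIds,
    run_model: Callable[[TokenIds], Embeddings],
) -> Processed:
    """Look up the model's processors and apply them around the model call."""
    pre, post = REGISTRY[model_name]
    return post(run_model(pre(token_ids)))
```

Keeping the sentence vector and the per-token vectors as separate, ordered return values also makes it harder to mix the two up downstream.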

rasa/nlu/constants.py

Ghostvv

Are we going to add guidance on when to use what in a separate PR?

changelog/5187.feature.rst

docs/nlu/components.rst

Co-Authored-By: Vladimir Vlasov <[email protected]>

tests/nlu/training/test_train.py

rasa/nlu/tokenizers/lm_tokenizer.py

rasa/nlu/utils/hugging_face/hf_transformers.py

tests/nlu/extractors/test_crf_entity_extractor.py

tabergma

Apart from the comments I already made, it looks good 🚀 Great work!

dakshvar22 added 4 commits February 4, 2020 13:21

first implementation ready.

5a71c57

tested all available models. implementation works

d5a1b85

refactored class name

205c7bd

remove print statement

7eb475c

dakshvar22 requested review from tabergma and Ghostvv February 4, 2020 14:59

Ghostvv reviewed Feb 4, 2020

Apply suggestions from code review

cc55dfc

Co-Authored-By: Vladimir Vlasov <[email protected]>

tabergma reviewed Feb 4, 2020

dakshvar22 added 3 commits February 4, 2020 17:41

quick review comments. Tests WIP

576d8f4

Merge branch 'transformers_lm' of github.com:RasaHQ/rasa into transfo…

fab9122

…rmers_lm

fix imports

4c3f218

dakshvar22 commented Feb 4, 2020

rasa/nlu/constants.py Outdated

dakshvar22 added 6 commits February 5, 2020 01:54

bug fix to swap seq and sentence embeddings

49e9a15

tests for tokenizers are in

e990e8f

added featurizer tests

99bad36

added documentation

01c7de5

add changelog, move common method out of class

3259788

refactor spacy doc name

d10d73f

dakshvar22 requested review from Ghostvv and tabergma February 5, 2020 17:10

Ghostvv approved these changes Feb 6, 2020

changelog/5187.feature.rst Outdated

changelog/5187.feature.rst Outdated

docs/nlu/components.rst Outdated

dakshvar22 and others added 4 commits February 6, 2020 11:58

Apply suggestions from code review

452368d

Co-Authored-By: Vladimir Vlasov <[email protected]>

added new components to test pipelines

38b6a01

Merge branch 'transformers_lm' of github.com:RasaHQ/rasa into transfo…

7c654fd

…rmers_lm

created new pipeline for failing tests

7ccafc3

Ghostvv reviewed Feb 10, 2020

tests/nlu/training/test_train.py Outdated

separate pipeline for convert as well

6d9c886