GH-512: Minor embedding fixes #520

Merged
alanakbik merged 4 commits into release-0.4.1 from GH-512-embeddings on Feb 19, 2019
Conversation

alanakbik (Collaborator) commented on Feb 19, 2019

This PR fixes a few minor issues in preparation for the 0.4.1 release.

  • DocumentLSTMEmbeddings is marked as deprecated but not removed, so that serialized models keep working. The deprecation warning points to the new DocumentRNNEmbeddings class as the replacement (see the embedding sketch after this list).

  • BytePairEmbeddings previously threw an error for empty words; a null vector is now used for them instead.

  • The load_text_classification_corpus() method of the NLPTaskDataFetcher previously tokenized all corpora by default. However, some text classification corpora, such as TREC_6, are already tokenized, so an option to turn tokenization off was added (see the corpus-loading sketch after this list).

  • Added Turian embeddings, which are very small and should speed up integration tests. Load them with WordEmbeddings('turian').
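
A minimal usage sketch of the embedding classes touched above, assuming the flair 0.4.1 API; the example sentence and the printed shape are illustrative only.

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings, BytePairEmbeddings

# Turian word embeddings: small vectors, intended to keep integration tests fast.
turian = WordEmbeddings('turian')

# DocumentRNNEmbeddings replaces the deprecated DocumentLSTMEmbeddings; the old
# class still loads serialized models but emits a deprecation warning.
document_embeddings = DocumentRNNEmbeddings([turian])

sentence = Sentence('Minor embedding fixes for the 0.4.1 release .')
document_embeddings.embed(sentence)
print(sentence.get_embedding().shape)

# BytePairEmbeddings now maps empty words to a null vector instead of raising an error.
bpe = BytePairEmbeddings('en')
bpe.embed(Sentence('hello world'))
```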
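
A hedged sketch of loading an already-tokenized classification corpus with tokenization switched off. The method name follows the PR description; the data path and the use_tokenizer keyword are assumptions for illustration, not taken from this PR.

```python
from flair.data_fetcher import NLPTaskDataFetcher

# TREC_6 comes pre-tokenized, so the corpus can be loaded without running the tokenizer again.
corpus = NLPTaskDataFetcher.load_text_classification_corpus(
    'resources/tasks/trec_6',   # assumed folder containing the train/dev/test files
    use_tokenizer=False,        # assumed keyword that turns tokenization off
)
print(corpus)
```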

alanakbik merged commit d2682f8 into release-0.4.1 on Feb 19, 2019
alanakbik deleted the GH-512-embeddings branch on February 19, 2019 15:06
alanakbik restored the GH-512-embeddings branch on February 19, 2019 15:08
alanakbik deleted the GH-512-embeddings branch on February 19, 2019 18:47