
Is there any way I can continue training the language model on a specific domain? #121

Closed
dongfang91 opened this issue Sep 24, 2018 · 10 comments
Labels
language model, question

Comments

@dongfang91

Hi,

The language model is trained on the 1-billion-word corpus. I want to continue training it on my specific domain corpus. Can I do that in flair?

Thanks!

@alanakbik
Collaborator

Hello @dongfang91,

Yes, that is possible. You can do this by loading a saved language model and passing it to the language model trainer, e.g.:

from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load the saved language model
model = LanguageModel.load_language_model('your/saved/model.pt')

# make sure to use the same dictionary and direction as the saved model
dictionary = model.dictionary
is_forward_lm = model.is_forward_lm

# load your new corpus at the character level
corpus = TextCorpus('path/to/your/corpus', dictionary, is_forward_lm, character_level=True)

# pass corpus and pre-trained language model to trainer
trainer = LanguageModelTrainer(model, corpus)

# train with your favorite parameters
trainer.train('resources/taggers/language_model', learning_rate=5)

You may need to experiment with different learning rates. A corpus switch will likely confuse the learning, so the first epochs might be very unstable. You could try a learning rate of 5 or even lower.

We actually never tried switching corpora, so please let us know how well this works!
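
In case it helps anyone getting started: train() takes a few more knobs you may want to tune for a corpus switch. This is a rough sketch only; the keyword arguments shown (sequence_length, mini_batch_size, patience) may differ between Flair versions, so check the LanguageModelTrainer.train signature of your install and treat the values as placeholders:

# sketch: fine-tune with a lower learning rate; keyword names may vary by version
trainer.train('resources/taggers/language_model',
              sequence_length=250,
              mini_batch_size=100,
              learning_rate=5,
              patience=10)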

@dongfang91
Author

Yes, sure! Thanks a lot!

@aronszanto

@dongfang91 I'm about to do this as well, continuing training on the LMs associated with the forward/backward Flair embeddings with another corpus of about 800M words. Did you find anything of note? I'm especially interested in the learning rate and other tuning parameters.

Thanks!

@alanakbik
Collaborator

@aronszanto sounds interesting! Will you share your results and experience? This could help others who want to do something similar.

@MarcioPorto
Contributor

@alanakbik am I correct in assuming that I can only use the method you described above if there are no previously unseen words in the specific domain corpus? If that is correct, is there anything I can do if there are some words in my new corpus that don't show up in the original corpus the model was trained on?

@alanakbik
Collaborator

Yeah, that is generally correct, but we train our models at the character level, so the only case you could not handle is unseen words that consist of previously unseen characters. For instance, if you continued training on Arabic text with a language model whose dictionary contains only Latin characters. New words made up of the same characters are fine.
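
If you want to check this up front before fine-tuning, a minimal sketch along these lines should work; it assumes Flair's Dictionary exposes get_items() and uses a placeholder path for a plain-text sample of the new corpus:

from flair.models import LanguageModel

# sketch: compare the characters in a sample of the new corpus against the
# character dictionary of the saved model ('sample.txt' is a placeholder)
model = LanguageModel.load_language_model('your/saved/model.pt')
known_chars = set(model.dictionary.get_items())

with open('path/to/your/corpus/sample.txt', encoding='utf-8') as f:
    corpus_chars = set(f.read())

print('characters missing from the model dictionary:', sorted(corpus_chars - known_chars))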

@MarcioPorto
Contributor

@alanakbik Is there a way I can initialize a LanguageModel from an existing embedding like WordEmbeddings('en-crawl')? It's not immediately clear to me where the 'your/saved/model.pt' file is coming from.

@alanakbik
Collaborator

@MarcioPorto language models are trained at the character level in our case, so you cannot initialize one with word embeddings. You can either train your own language model from scratch by following these instructions, which will produce the model file to load.

Or you can use an existing language model that is shipped with Flair, by accessing the model in the FlairEmbeddings, like this:

model: LanguageModel = FlairEmbeddings('news-forward').lm
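
Putting this together with the fine-tuning snippet above, a sketch like the following should work; the corpus path is a placeholder and the training parameters are only examples:

from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# take the language model that ships with the 'news-forward' Flair embeddings
language_model = FlairEmbeddings('news-forward').lm

# reuse its dictionary and direction when building the new corpus
corpus = TextCorpus('path/to/your/corpus',
                    language_model.dictionary,
                    language_model.is_forward_lm,
                    character_level=True)

# fine-tune on the domain corpus
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model', learning_rate=5)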

@MarcioPorto
Contributor

@alanakbik Does flair currently support a way to fine-tune BERT embeddings natively, or would I have to follow the procedure described in the huggingface/pytorch-transformers documentation?

@alanakbik
Collaborator

@MarcioPorto we don't currently support that. We will add a native method for fine-tuning FlairEmbeddings soon. Maybe with the new pytorch-transformers library, we can also add such options for other embeddings in the future.
