
Is there any way I can continue training the language model on a specific domain? #121

Closed
dongfang91 opened this issue Sep 24, 2018 · 10 comments
Labels
language model, question

Comments

@dongfang91

Hi,

The language model is trained on the 1-billion-word corpus. I want to continue training it on my specific domain corpus. Can I do that in flair?

Thanks!

@alanakbik
Collaborator

Hello @dongfang91,

Yes, that is possible. You can do this by loading a saved language model and passing it to the language model trainer, e.g.:

from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load the saved language model
model = LanguageModel.load_language_model('your/saved/model.pt')

# make sure to use the same dictionary and direction as the saved model
dictionary = model.dictionary
is_forward_lm = model.is_forward_lm

# load your new corpus at the character level
corpus = TextCorpus('path/to/your/corpus', dictionary, is_forward_lm, character_level=True)

# pass corpus and pre-trained language model to trainer
trainer = LanguageModelTrainer(model, corpus)

# train with your favorite parameters
trainer.train('resources/taggers/language_model', learning_rate=5)

You may need to experiment with different learning rates. A corpus switch will likely confuse the learning, so the first epochs might be very unstable. You could try a learning rate of 5 or even lower.

We actually never tried switching corpora, so please let us know how well this works!
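
In case it helps anyone getting started: train() takes a few more knobs you may want to tune for a corpus switch. This is a rough sketch only; the keyword arguments shown (sequence_length, mini_batch_size, patience) may differ between Flair versions, so check the LanguageModelTrainer.train signature of your install and treat the values as placeholders:

# sketch: fine-tune with a lower learning rate; keyword names may vary by version
trainer.train('resources/taggers/language_model',
              sequence_length=250,
              mini_batch_size=100,
              learning_rate=5,
              patience=10)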

@dongfang91
Author

Yes, sure! Thanks a lot!

@aronszanto

@dongfang91 I'm about to do this as well, continuing training on the LMs associated with the forward/backward Flair embeddings with another corpus of about 800M words. Did you find anything of note? I'm especially interested in the learning rate and other tuning parameters.

Thanks!

@alanakbik
Collaborator

@aronszanto sounds interesting! Will you share your results and experience? This could help others who want to do something similar.

@MarcioPorto
Contributor

@alanakbik am I correct in assuming that I can only use the method you described above if there are no previously unseen words in the specific domain corpus? If that is correct, is there anything I can do if there are some words in my new corpus that don't show up in the original corpus the model was trained on?

@alanakbik
Collaborator

Yeah, that is generally correct, but we train our models at the character level, so the only case you could not handle is unseen words that consist of previously unseen characters. For instance, if you continued training on Arabic text with a language model whose dictionary contains only Latin characters. New words made up of the same characters are fine.
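
If you want to check this up front before fine-tuning, a minimal sketch along these lines should work; it assumes Flair's Dictionary exposes get_items() and uses a placeholder path for a plain-text sample of the new corpus:

from flair.models import LanguageModel

# sketch: compare the characters in a sample of the new corpus against the
# character dictionary of the saved model ('sample.txt' is a placeholder)
model = LanguageModel.load_language_model('your/saved/model.pt')
known_chars = set(model.dictionary.get_items())

with open('path/to/your/corpus/sample.txt', encoding='utf-8') as f:
    corpus_chars = set(f.read())

print('characters missing from the model dictionary:', sorted(corpus_chars - known_chars))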

@MarcioPorto
Contributor

@alanakbik Is there a way I can initialize a LanguageModel from an existing embedding like WordEmbeddings('en-crawl')? It's not immediately clear to me where the 'your/saved/model.pt' file is coming from.

@alanakbik
Collaborator

@MarcioPorto language models are trained at the character level in our case, so you cannot initialize one with word embeddings. You can either train your own language model from scratch by following these instructions, which will produce the model file to load.

Or you can use an existing language model that is shipped with Flair, by accessing the model in the FlairEmbeddings, like this:

model: LanguageModel = FlairEmbeddings('news-forward').lm
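
Putting this together with the fine-tuning snippet above, a sketch like the following should work; the corpus path is a placeholder and the training parameters are only examples:

from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# take the language model that ships with the 'news-forward' Flair embeddings
language_model = FlairEmbeddings('news-forward').lm

# reuse its dictionary and direction when building the new corpus
corpus = TextCorpus('path/to/your/corpus',
                    language_model.dictionary,
                    language_model.is_forward_lm,
                    character_level=True)

# fine-tune on the domain corpus
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model', learning_rate=5)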

@MarcioPorto
Contributor

@alanakbik Does flair currently support a way to fine-tune BERT embeddings natively, or would I have to follow the procedure described in the huggingface/pytorch-transformers documentation?

@alanakbik
Collaborator

@MarcioPorto we don't currently support that. We will add a native method for fine-tuning FlairEmbeddings soon. Maybe with the new pytorch-transformers library, we can also add such options for other embeddings in the future.
