Multilingual Language Models #614
Hi @stefan-it this is awesome, looking forward to integrating this!! Is there a paper on your results?
@alanakbik I was just thinking about submitting a workshop paper. But I just got an ACL rejection, because we just included the supplementary material inside the paper... this is a really demotivating factor 🤣
Yeah I think we all know the feeling :/ But there's always another conference on the horizon :) Any plans on putting the paper on arXiv?
Thank you very much @stefan-it. Coming here from #179. If you don't mind, could you please answer the following questions regarding the Arabic LM:
The reason I ask is that I am training an LM over a 1.5B-word Arabic corpus, and I would like some pointers on when to stop training. It's been 2 days on a K80 and after 1 epoch I am looking at a perplexity of roughly 3.6, and the learning rate has dropped to 5. Maybe @alanakbik could also share his experience with training over huge corpora. Thanks
Hello @zeeshansayyed we generally train for about 2 weeks. One thing I note is that your learning rate has already annealed to 5 after 2 days, which indicates that your patience may be too low. Try doubling the patience so that you train with a learning rate of 20 for a few more days.
Yes. On the 4th day, the learning rate fell to 1.25. In the code, the default patience seems to be 10, but elsewhere on the internet I have seen 3 or 5. Do you think a patience of 20 would do the trick? I will stop and restart training.
Yes, 20 will be better. You could even go higher if you have enough time (the higher the patience, the longer it trains).
Hi @zeeshansayyed :) to answer your questions: I trained all models for one epoch. The initial learning rate was 20. In my experiments I used a training corpus split size < 20, so that the learning rate never decreased.
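For reference, a minimal sketch of how these settings (initial learning rate, patience, corpus splits) map onto flair's LanguageModelTrainer; the corpus path, hidden_size and split layout here are illustrative assumptions rather than the exact setup used in this thread:

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# character-level dictionary shipped with flair
dictionary = Dictionary.load('chars')

# TextCorpus expects a folder with a train/ directory of split files,
# plus valid.txt and test.txt; validation (and with it learning-rate
# annealing) is typically checked after each split, so the number of
# splits interacts with the patience value below
corpus = TextCorpus('path/to/arabic_corpus', dictionary, True, character_level=True)

language_model = LanguageModel(dictionary, is_forward_lm=True, hidden_size=1024, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_models/ar-forward',
              sequence_length=250,
              mini_batch_size=100,
              learning_rate=20,  # initial learning rate, annealed on plateau
              patience=20,       # doubled from the default of 10, as suggested above
              max_epochs=10)
```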
@alanakbik @stefan-it I see that this issue has already been addressed with #761. However, do the old language models still remain available? For reproducibility, it would be good if one could choose between the different versions of pre-trained embeddings.
This was my fault 😅 I've talked with @alanakbik about that (I voted for overwriting the old models), but I think we then need to discuss a kind of "versioning schema". E.g. a more complex use case: 🤔
@stefan-it I think it would be worth opening an issue for this. I would be strongly in favor of keeping the "old" models for reproducibility. We might consider something similar to the spaCy model versioning scheme. They do in fact support the latter use case you mention (i.e.,
Hello @stefan-it @jantrienes yes this makes a lot of sense. Another question would be what happens if two groups independently contribute models for a language; for instance, we have LMs for Polish from different groups. In this case, one model is not an improved version of the other, but they were simply trained on different data with different parameters. How would we distinguish between the two, and also how would we choose which one to point to if
GH-614: re-added older LMs with version number
@stefan-it Congratulations on the great work. Is there a paper for your pre-trained models? How can I cite this if I want to? Thank you
Hi @abeermohamed1, unfortunately, there's no paper available. But you could cite the
@stefan-it Thank you. This is great work and effort. "But I just got an ACL rejection, because we just included the supplementary material inside the paper"
@stefan-it can you please tell me how to get a fast model? I have seen you added bg-X-fast
@stefan-it can you please let me know the size of the corpus required to fine-tune the news-forward model on social media data.
Hi @codemaster-22, I did not train the
@alanakbik can you please help me out with this ASAP? I am keen to start fine-tuning on English tweets.
Hi @codemaster-22 I trained the news model on the 1 billion word corpus, which in fact is about 800 million tokens of text.
@alanakbik I want to fine-tune the news-forward model on English tweets, so any suggestions on the number of tokens I should use?
If you have a large enough corpus of tweets, you might consider training a new model from scratch instead of fine-tuning the news model. The language is very different in style, and you might also need a different character dictionary for emojis etc. If you have around 100 million tokens of tweet text, that should be good, but more is obviously always better.
I have a lot of tweets, but I want to experiment with fine-tuning right away. Did you mean 100 million tokens for fine-tuning or for training from scratch?
100 million to train from scratch. I didn't do much fine-tuning of already trained flair embeddings, so I'm not really sure how much you need (probably less) or what the best parameters are. But be careful to set a low learning rate when fine-tuning.
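A rough sketch of what fine-tuning an existing flair character LM with a low learning rate could look like (the tweet corpus path, output path and hyperparameter values are illustrative assumptions):

```python
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# start from the pre-trained news-forward language model
language_model = FlairEmbeddings('news-forward').lm

# the fine-tuning corpus must be built with the dictionary and direction
# of the loaded model
dictionary = language_model.dictionary
corpus = TextCorpus('path/to/tweet_corpus', dictionary,
                    language_model.is_forward_lm, character_level=True)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_models/news-forward-tweets',
              sequence_length=250,
              mini_batch_size=100,
              learning_rate=5,  # much lower than the from-scratch default of 20
              patience=10)
```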
Hi @alanakbik @stefan-it, I started fine-tuning with learning rate 5, and below are the plots
Hi,
I trained language models for 16 languages on Wikipedia dumps + OPUS that can be integrated into flair :) This is the result of ~2 months of work.
Language models
Training data are (a) a recent Wikipedia dump and (b) corpora from OPUS. Training was done for one epoch over the full training corpus.
Download links:
Hyperparameters: hidden_size, nlayers, sequence_length, mini_batch_size
Instead of using common_chars, all characters from the training corpus are used as the vocabulary for language model training.
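As a hedged sketch of that vocabulary choice, the character dictionary could be built directly from the training corpus instead of loading flair's common character dictionary; the file paths below are placeholders:

```python
from flair.data import Dictionary

# collect every character that occurs in the training corpus,
# instead of relying on the pre-built common-characters dictionary
char_dictionary = Dictionary()
with open('path/to/full_training_corpus.txt', encoding='utf-8') as corpus_file:
    for line in corpus_file:
        for character in line:
            char_dictionary.add_item(character)

# the resulting dictionary is then passed to TextCorpus / LanguageModel
# in place of Dictionary.load('chars')
char_dictionary.save('resources/char_mappings/custom_char_dict')
```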
PoS Tagging on Universal Dependencies (v1.2)
To test the new language models on a downstream task, results for PoS tagging on Universal Dependencies (v1.2) are reported (with comparisons to other papers).
Hyperparameters:
hidden_size: 512
learning_rate: 0.1
mini_batch_size: 8
max_epochs: 500
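For orientation, a sketch of how such a PoS-tagging run could be set up in flair with these hyperparameters; UD_ENGLISH and the news-forward/backward embeddings are stand-in examples for the actual UD treebanks and the new multilingual LMs:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. load a Universal Dependencies corpus and build the UPOS tag dictionary
corpus = UD_ENGLISH()
tag_type = 'upos'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 2. stack forward and backward character LM embeddings
embeddings = StackedEmbeddings([FlairEmbeddings('news-forward'),
                                FlairEmbeddings('news-backward')])

# 3. sequence tagger with the hidden size reported above
tagger = SequenceTagger(hidden_size=512,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

# 4. train with the reported learning rate, batch size and epoch budget
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/upos',
              learning_rate=0.1,
              mini_batch_size=8,
              max_epochs=500)
```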
Results on Universal Dependencies show a new SOTA for all languages except Arabic and Indonesian.