
Multilingual Language Models #614

Closed · stefan-it opened this issue Mar 17, 2019 · 25 comments · Fixed by #761

@stefan-it (Member) commented Mar 17, 2019

Hi,

I trained language models for 16 languages on Wikipedia dumps + OPUS that can be integrated into flair :) This is the result of ~2 months of work.

Language models

Training data are (a) a recent Wikipedia dump and (b) corpora from OPUS. Training was done for one epoch over the full training corpus.

| Language (Code) | Tokens (training) | Forward ppl | Backward ppl |
| --------------- | ----------------- | ----------- | ------------ |
| Arabic (ar) | 736,512,400 | 3.39 | 3.45 |
| Bulgarian (bg) | 111,336,781 | 2.46 | 2.47 |
| Czech (cs) | 442,892,103 | 2.89 | 2.90 |
| Danish (da) | 325,816,384 | 2.62 | 2.68 |
| Basque (eu) | 36,424,055 | 2.64 | 2.31 |
| Persian (fa) | 146,619,206 | 3.68 | 3.66 |
| Finnish (fi) | 427,194,262 | 2.63 | 2.65 |
| Hebrew (he) | 502,949,245 | 3.84 | 3.87 |
| Hindi (hi) | 28,936,996 | 2.87 | 2.86 |
| Croatian (hr) | 625,084,958 | 3.13 | 3.20 |
| Indonesian (id) | 174,467,241 | 2.80 | 2.74 |
| Italian (it) | 1,549,430,560 | 2.62 | 2.63 |
| Dutch (nl) | 1,275,949,108 | 2.43 | 2.55 |
| Norwegian (no) | 156,076,225 | 3.01 | 3.01 |
| Polish (pl) | 1,428,604,528 | 2.95 | 2.84 |
| Slovenian (sl) | 419,744,423 | 2.88 | 2.91 |
| Swedish (sv) | 671,922,632 | 6.82 | 2.25 |

Download links:

wget https://schweter.eu/cloud/flair-lms/lm-ar-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-ar-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-bg-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-bg-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-cs-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-cs-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-da-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-da-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fa-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fa-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fi-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fi-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-he-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-he-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hi-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hi-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hr-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hr-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-id-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-id-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-it-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-it-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-nl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-nl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-no-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-no-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-pl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-pl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-opus-large-backward-v0.1.pt
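
A minimal usage sketch for a downloaded model (assuming flair's FlairEmbeddings accepts a local path to a .pt language model file; the example sentence is just a placeholder):

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# load a downloaded character LM from its local path
# (any of the .pt files above can be used instead of a built-in model name)
ar_forward = FlairEmbeddings('lm-ar-opus-large-forward-v0.1.pt')

# embed a placeholder sentence and inspect the token embeddings
sentence = Sentence('مرحبا بالعالم')
ar_forward.embed(sentence)

for token in sentence:
    print(token, token.embedding.shape)
```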

Hyperparameters:

| Parameter | Value |
| --------- | ----- |
| hidden_size | 2048 |
| nlayers | 1 |
| sequence_length | 250 |
| mini_batch_size | 100 |

Instead of using the common_chars dictionary, all characters from the training corpus are used as the vocabulary for language model training.
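
For illustration, a training sketch along these lines with the hyperparameters above and a character dictionary built from the corpus itself. This is not the exact training script: paths and file names are placeholders, the dictionary-building loop is a simplified assumption, and API names follow flair's LM training tutorial.

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# build a character dictionary that covers every character in the training
# corpus, instead of loading flair's pre-built common_chars dictionary
char_dictionary = Dictionary()
with open('/path/to/corpus/train/train_split_1.txt', encoding='utf-8') as f:
    for line in f:
        for char in line:
            char_dictionary.add_item(char)

# character-level corpus for a forward LM
is_forward_lm = True
corpus = TextCorpus('/path/to/corpus', char_dictionary, is_forward_lm, character_level=True)

# model with the hyperparameters from the table above
language_model = LanguageModel(char_dictionary, is_forward_lm, hidden_size=2048, nlayers=1)

# one epoch over the full training corpus
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/lm-forward',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=1,
              learning_rate=20)
```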

PoS Tagging on Universal Dependencies (v1.2)

To test the new language models on a downstream task, results for PoS tagging on Universal Dependencies (v1.2) are reported (with comparisons to other papers).

| Language (Code) | Yu et al. (2017) | Plank et al. (2016) | Yasunaga et al. (2017) | Flair | Δ |
| --------------- | ---------------- | ------------------- | ---------------------- | ----- | - |
| Arabic (ar) | 99.00 | 98.91 | n.a. | 98.86 | -0.14 |
| Bulgarian (bg) | 98.20 | 98.23 | 98.53 | 99.18 | 0.65 |
| Czech (cs) | 98.79 | 98.24 | 98.81 | 99.14 | 0.33 |
| Danish (da) | 95.92 | 96.35 | 96.74 | 98.48 | 1.74 🔥 |
| Basque (eu) | 94.94 | 95.51 | 94.71 | 97.30 | 1.79 🔥 |
| Persian (fa) | 97.12 | 97.60 | 97.51 | 98.15 | 0.55 |
| Finnish (fi) | 95.31 | 95.85 | 95.40 | 98.11 | 2.26 🔥 |
| Hebrew (he) | 96.04 | 96.96 | 97.43 | 97.67 | 0.24 |
| Hindi (hi) | 96.96 | 97.10 | 97.21 | 97.85 | 0.64 |
| Croatian (hr) | 95.05 | 96.82 | 96.32 | 97.43 | 0.61 |
| Indonesian (id) | 93.44 | 93.41 | 94.03 | 93.85 | -0.18 |
| Dutch (nl) | 93.11 | 93.82 | 93.09 | 94.03 | 0.21 |
| Norwegian (no) | 97.65 | 98.06 | 98.08 | 98.73 | 0.65 |
| Polish (pl) | 96.83 | 97.63 | 97.57 | 98.81 | 1.18 🔥 |
| Slovenian (sl) | 97.16 | 96.97 | 98.11 | 99.02 | 0.91 |
| Swedish (sv) | 96.28 | 96.69 | 96.70 | 98.54 | 1.84 🔥 |

Hyperparameters:

| Parameter | Value |
| --------- | ----- |
| hidden_size | 512 |
| learning_rate | 0.1 |
| mini_batch_size | 8 |
| max_epochs | 500 |

Results on Universal Dependencies show a new SOTA for all languages except Arabic and Indonesian.
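
For illustration, a sketch of how such a tagger could be trained in flair with the hyperparameters above, using Danish as an example. The corpus loader, tag type and paths are assumptions (flair's UD loaders fetch a newer UD release than v1.2, and `make_tag_dictionary` may be named `make_label_dictionary` in newer flair versions), so treat this as a sketch rather than the exact setup behind these numbers:

```python
from flair.datasets import UD_DANISH                      # assumption: flair's UD treebank loader
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. load a UD corpus and build the tag dictionary for universal PoS tags
corpus = UD_DANISH()
tag_type = 'upos'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 2. stack the forward and backward character LMs from the download list above
embeddings = StackedEmbeddings([
    FlairEmbeddings('lm-da-opus-large-forward-v0.1.pt'),
    FlairEmbeddings('lm-da-opus-large-backward-v0.1.pt'),
])

# 3. sequence tagger with the hidden size from the table above
tagger = SequenceTagger(hidden_size=512,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type,
                        use_crf=True)

# 4. train with the reported downstream hyperparameters
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/da-upos',
              learning_rate=0.1,
              mini_batch_size=8,
              max_epochs=500)
```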

@alanakbik (Collaborator) commented:

Hi @stefan-it this is awesome, looking forward to integrating this!!

Is there a paper on your results?

@stefan-it (Member, Author) commented Mar 24, 2019

@alanakbik I was just thinking about submitting a workshop paper. But I just got an ACL rejection because we included the supplementary material inside the paper... this is really demotivating 🤣

@alanakbik (Collaborator) commented:

Yeah, I think we all know the feeling :/ But there's always another conference on the horizon :) Any plans on putting the paper on arXiv?

stefan-it mentioned this issue May 24, 2019
@zeeshansayyed commented:

Thank you very much @stefan-it. Coming here from #179. If you don't mind, could you please answer the following questions regarding the Arabic LM:

  1. How many epochs did you train for?
  2. What was your initial learning rate? Can you provide a rough idea of how your learning rate dropped over the epochs?

The reason I ask is that I am training an LM on a 1.5B-word Arabic corpus and would like some pointers on when to stop training. After 2 days on a K80 and 1 epoch, I am looking at a perplexity of roughly 3.6, and the learning rate has dropped to 5.

Maybe @alanakbik could also share his experience training on huge corpora.

Thanks
Zeeshan

@alanakbik (Collaborator) commented:

Hello @zeeshansayyed, we generally train for about 2 weeks. One thing I note is that your learning rate has already annealed to 5 after 2 days, which indicates that your patience may be too low. Try doubling the patience so that you train with a learning rate of 20 for a few more days.
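
A sketch of where that patience value goes, assuming flair's LanguageModelTrainer API (corpus path and output directory are placeholders, and the setup mirrors the training sketch earlier in this thread):

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# same kind of setup as for the models above; paths are placeholders
is_forward_lm = True
dictionary = Dictionary.load('chars')
corpus = TextCorpus('/path/to/arabic/corpus', dictionary, is_forward_lm, character_level=True)
language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=2048, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/lm-ar-forward',
              sequence_length=250,
              mini_batch_size=100,
              learning_rate=20,   # start at 20 ...
              patience=20)        # ... and anneal less eagerly (default is 10)
```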

@zeeshansayyed commented:

Yes. On the 4th day, the learning rate fell to 1.25. In the code, the default patience seems to be 10, but elsewhere on the internet I have seen it set to 3 or 5. Do you think a patience of 20 would do the trick? I will stop and restart training.

@alanakbik (Collaborator) commented:

Yes, 20 will be better. You could even go higher if you have enough time (the higher the patience the longer it trains).

@stefan-it (Member, Author) commented:

Hi @zeeshansayyed :)

To answer your questions: I trained all models for one epoch, with an initial learning rate of 20. In my experiments I used fewer than 20 training corpus splits, so the learning rate never decreased.

@jantrienes commented:

@alanakbik @stefan-it I see that this issue has already been addressed with #761. However, do the old language models remain available? For reproducibility, it would be good if one could choose between the different versions of pre-trained embeddings.

@stefan-it (Member, Author) commented May 29, 2019

This was my fault 😅

I've talked with @alanakbik about that (I voted for overwriting the old models), but I think we then need to discuss a kind of "versioning scheme": e.g. nl-forward would always point to the latest version of the trained flair embeddings, while nl-forward-v0 or nl-forward-v1 would point to previous versions.

A more complex use case:
When you want to cite the exact version you used in a paper, it would be good to have a kind of "symlink" that points from nl-forward to nl-forward-v2, so you can cite nl-forward-v2 in the paper (in case nl-forward is later updated to v3).

🤔

@jantrienes commented:

@stefan-it I think it would be worth opening an issue for this. I would be strongly in favor of keeping the "old" models for reproducibility.

We might consider something similar to the spaCy model versioning scheme. They do in fact support the latter use case you mention (i.e., nl-forward points to latest model).

@alanakbik (Collaborator) commented May 29, 2019

Hello @stefan-it @jantrienes, yes, this makes a lot of sense. Another question is what happens if two groups independently contribute models for a language, for instance if we have LMs for Polish from different groups. In this case, one model is not an improved version of the other; they were simply trained on different data with different parameters. How would we distinguish between the two, and how would we choose which one to point to if pl-forward is selected?

@abeermohamed1 commented:

@stefan-it Congratulations on the great work. Is there a paper for your pre-trained models? How should I cite this if I want to? Thank you.

@stefan-it (Member, Author) commented:

Hi @abeermohamed1 ,

unfortunately there's no paper available, but you could cite the flair-lms repo:

https://github.com/flairNLP/flair-lms

@abeermohamed1 commented:

@stefan-it Thank you, this is great work and a great effort.
Can I ask what you mean by the quote below? I didn't understand why your paper got rejected. Could you please explain, as I am a beginner :) in research?

"But I just got an ACL rejection because we included the supplementary material inside the paper"

@codemaster-22 commented:

@stefan-it can you please tell me how to get a fast model? I have seen that you added bg-X-fast.

@codemaster-22 commented:

@stefan-it can you please let me know the corpus size required to fine-tune the news-forward model on social media data?

@stefan-it (Member, Author) commented:

Hi @codemaster-22, I did not train the news-forward model, but I'm sure that @alanakbik can help with the size of the training corpus!

@codemaster-22 commented:

@alanakbik can you please help me out with this ASAP? I am eager to start fine-tuning on English tweets.

@alanakbik (Collaborator) commented:

Hi @codemaster-22 I trained the news model on the 1 billion word corpus, which in fact is about 800 million tokens of text.

@codemaster-22 commented:

@alanakbik I want to fine-tune the news-forward model on English tweets, so any suggestions on the number of tokens I should use?

@alanakbik (Collaborator) commented:

If you have a large enough corpus of tweets, you might consider training a new model from scratch instead of fine-tuning the news model. The language is very different in style, and you might also need a different character dictionary for emojis etc. Around 100 million tokens of tweet text should be good, but more is obviously always better.

@codemaster-22 commented:

I have a lot of tweets, but I want to experiment with fine-tuning right away. Did you mean 100 million tokens for fine-tuning or for training from scratch?

@alanakbik (Collaborator) commented:

100 million to train from scratch. I haven't done much fine-tuning of already-trained flair embeddings, so I'm not really sure how much you need (probably less) or what the best parameters are. But be careful to set a low learning rate when fine-tuning.
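
A fine-tuning sketch along these lines, based on flair's LM fine-tuning tutorial: it reuses the news-forward model and its character dictionary. The corpus path is a placeholder and the low learning rate is an assumption, not a recommended value.

```python
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# start from the existing news-forward LM and reuse its character dictionary
language_model = FlairEmbeddings('news-forward').lm
is_forward_lm = language_model.is_forward_lm
dictionary = language_model.dictionary

# character-level corpus of tweets (path is a placeholder)
corpus = TextCorpus('/path/to/tweet/corpus', dictionary, is_forward_lm, character_level=True)

# fine-tune with a deliberately low learning rate
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/lm-tweets-forward',
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=0.5,   # assumption: much lower than the from-scratch value of 20
              patience=10)
```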

@codemaster-22 commented Jul 13, 2021

Hi @alanakbik @stefan-it, I started fine-tuning with a learning rate of 5, and below are the plots.

[Screenshot: training loss and learning rate plots, Jul 13, 2021]

These are the parameters I used:

sequence_length=100,
mini_batch_size=100,
learning_rate=5,
patience=10

Can you please suggest something to help me decrease the loss (get it below 1)?
