inconsistency in the notation / term epoch #89

Closed
iamyihwa opened this issue Aug 23, 2018 · 6 comments
Labels: enhancement (Improving of an existing feature), language model (Related to language model)

Comments

@iamyihwa commented Aug 23, 2018

Hello,
I am training a language model.
Before testing on the whole text, I was running it on a smaller text.

From my understanding, epoch vs. batch vs. mini-batch are defined as follows (from a post on Stack Exchange):
[image: definitions of epoch, batch, and mini-batch from a Stack Exchange post]

However, when I train the language model with the following parameters,

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_es_forward',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=5)
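
(For reference, language_model and corpus above are assumed to have been set up beforehand. A minimal sketch along the lines of the flair language-model tutorial is shown here; the corpus path and hidden size are placeholders, and the import paths may differ slightly between flair releases.)

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# forward, character-level language model
is_forward_lm = True

# default character dictionary shipped with flair
dictionary = Dictionary.load('chars')

# the corpus folder is expected to contain a train/ directory with split files,
# plus valid.txt and test.txt (path here is a placeholder)
corpus = TextCorpus('resources/corpora/es_wiki', dictionary, is_forward_lm, character_level=True)

# hidden_size and nlayers chosen arbitrarily for this sketch
language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=1024, nlayers=1)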

I get the following output.
Currently there are 51 input files. (I see that mini_batch_size is larger than the number of training files, so this might be a problem.)
However, in the output, the epoch number doesn't change; it only says end of split (1/51), and training finishes after (5/51).
I wonder if this is due to a different use of terminology?
Or does it, in this case, only go through 5 files and stop?
If I want to go through my entire dataset 100 times, for example, do I have to set max_epochs to 100 * number_of_train_files?

(The screenshot might be confusing, but the training finished after 5/51.)
[image: screenshot of the training log ending after split 5/51]

The original dataset that I have (a Spanish Wikipedia dump) currently consists of 2300 files (each file about 1 MB).

I intend to put about 5 files together to make a validation set, and another 5 files together to make a test set.

The rest of the files (about 2290) I will use to train the model.

If I want to pass over the data multiple times, what value should I use for max_epochs?

What was a good number of passes over the data when you trained your language models for English and German?

@alanakbik (Collaborator)

Yes, you are right that the notation is inconsistent here, since the parameter "epochs" is used to count training data splits, which is not intuitive. So until we fix this, you are correct: use 100 * number_of_train_files as max_epochs if you want to do 100 epochs.
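
As a minimal sketch of that workaround (the variable names below are hypothetical, chosen only for illustration):

# max_epochs currently counts splits, not full passes over the data
desired_passes = 100            # how many true epochs you want
number_of_train_files = 51      # number of files in your train folder
max_epochs = desired_passes * number_of_train_files  # value to pass to trainer.train()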

Generally, our advice is to set max_epochs to an extremely high number and run the training until the learning rate has annealed twice. The learning rate starts annealing when training yields few improvements, so when it has annealed a few times the model is as good as it can get. Also, we would recommend grouping the training files so that you have about 20-50 splits, so that you do not lose too much time validating at the end of each split, and setting patience to perhaps half the number of your training splits (see the sketch below)!
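
A minimal sketch of that grouping step, assuming the raw files sit in a hypothetical es_wiki_raw/ folder and the grouped splits go into the train/ folder that flair's TextCorpus reads; the 50-files-per-split figure is just an example:

import os
from pathlib import Path

# roughly 2290 small raw files (hypothetical folder layout)
raw_files = sorted(Path('es_wiki_raw').glob('*.txt'))
os.makedirs('corpus/train', exist_ok=True)

files_per_split = 50  # ~2290 / 50 gives about 46 training splits
for i in range(0, len(raw_files), files_per_split):
    split_path = Path('corpus/train') / f'train_split_{i // files_per_split}.txt'
    with open(split_path, 'w', encoding='utf-8') as out:
        for f in raw_files[i:i + files_per_split]:
            out.write(f.read_text(encoding='utf-8'))

# with ~46 splits, a patience of roughly half that (e.g. patience=23) matches the advice above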

@iamyihwa (Author)

Thanks @alanakbik for the clarification.
I will do as you suggest!

At the moment I am seeing that the loss is not decreasing very quickly, and the perplexity (ppl) also seems to have gotten stuck.
Is there anything I should be doing?
[image: screenshot of a training log with stalled loss and perplexity]

@iamyihwa (Author)

I have run it even further, but ppl and loss seem to be stuck. (This happened after 1.3 epochs or so.)
loss: 1.22, ppl: 3.37
[image: screenshot of the training log showing the plateau]

@alanakbik (Collaborator)

Hi, yes, it looks like the learning rate has annealed too quickly; it is at 0.00 in the output.

This happens because your training splits are too small, giving the learning rate too many opportunities to anneal. Either increase the size of your training splits or increase the patience. Or, even better, both :)

Try:

trainer.train('resources/taggers/language_model_es_forward',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=2000, 
              patience=100)

@iamyihwa (Author)

@alanakbik Yes, I have tried it with the larger data size (and also increased the number of hidden neurons, just in case), and it already seems better!

[image: screenshot of the improved training log]

Thanks @alanakbik for the suggestion, I will try with patience = 100!

BTW, does the language model require specific input? I have used one sentence per line as input. In addition to that, do I need to separate each token or do any normalization?

@tabergma added the enhancement (Improving of an existing feature) label on Oct 1, 2018
@tabergma added the language model (Related to language model) and release-0.3 labels on Oct 11, 2018
alanakbik pushed a commit that referenced this issue Oct 11, 2018
@alanakbik (Collaborator)

Term/epoch notation fixed in release-0.3.
