Get NAN loss after 35k steps #4
@StevenLOL I see this happen sometimes too -- it seems to be a very common problem with TensorFlow training in general.
I've been having this problem and I decreased the learning rate as per various discussions on SO, and that seemed to work. After a while I tried increasing it by 0.01 and started getting NaNs again. I've tried restoring the checkpoint and re-running with the lower learning rate, but I'm still seeing NaN. Does this mean my checkpoint is useless?
I am also getting NaN. Found out the culprit to be zero values in final_dists, which turn into log(0) in the loss.
@Rahul-Iisc Is your workaround to filter out the cases where final_dists contains zeros?
@hate5six I'm still thinking about an appropriate solution. Each dist is a tensor of shape (batch_size, extended_vsize), so I am not sure that simply filtering entries out is straightforward.
Trying to convert NaNs to 0 for now. Need to look further into why 0s come up in the distribution. @abisee @StevenLOL can you reopen the issue?
UPDATE: This didn't work. The loss ended up 0 instead of NaN.
The change below worked for me. Add the following to model.py, right after final_dists is computed:
final_dists = [tf.clip_by_value(dist, 1e-10, 1.) for dist in final_dists]
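For context, here is a minimal sketch of why the clip helps, under the assumption (suggested by the rest of this thread) that the loss takes the negative log of the target word's probability gathered from each per-step distribution in final_dists; once every entry is at least 1e-10, the log can no longer blow up:

import tensorflow as tf  # TF1-style API, as used in this project

def step_loss(dist, targets, batch_size):
    # dist: (batch_size, extended_vsize) probabilities; targets: (batch_size,) target word ids
    dist = tf.clip_by_value(dist, 1e-10, 1.)            # keep probabilities away from exact 0
    batch_nums = tf.range(0, limit=batch_size)
    indices = tf.stack((batch_nums, targets), axis=1)   # (batch_size, 2) index pairs
    gold_probs = tf.gather_nd(dist, indices)            # probability assigned to each target word
    return -tf.log(gold_probs)                          # finite, since every prob is >= 1e-10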
@lizaigaoge550 did this work for you?
@jamesposhtiger
Can we restore the already-trained model after it starts getting NaN?
@apoorv001 Probably not. This is where the concurrent eval job is useful: it keeps a separate checkpoint of the best model seen so far, which you can fall back on. In any case, I know the NaN thing is very annoying. I haven't had time recently, but I intend to look at the bug, understand what's going wrong, and fix it. Meanwhile, @Rahul-Iisc's solution appears to be working for several people.
Thanks @abisee for the clarification. However, I have had 2 different runs fail due to NaN after training for days; it would be a great favor to us if you could also upload the trained model along with the code.
@Rahul-Iisc I've had another look at the code. I see your point about the zero values in final_dists leading to log(0).
However, in theory those zero entries in final_dists shouldn't matter, because the loss only uses the probability of the target word at each step. So I think there must be something else wrong, either:
1. the zero entries are somehow entering the loss anyway, or
2. the target word itself is sometimes getting zero (or near-zero) probability in final_dists.
I think the second one seems more likely. I can try to investigate the problem, but it's tricky because sometimes you need to run for hours before you can replicate the error.
@abisee If fast replication is desired, I recommend training with extremely short sequence pairs, such as 10-2; NaN should occur when the training loss reaches 3.
@abisee Thanks. But can I use the saved checkpoints from eval to continue my training after the NaN occurred? I have removed everything in the train directory.
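One way that could work in principle, sketched with placeholder paths and under the assumption that tf.train.latest_checkpoint can find the eval job's best checkpoint (check how your eval checkpoints are actually named before relying on this):

import glob, os, shutil
import tensorflow as tf

eval_dir = "log_root/myexperiment/eval"     # placeholder paths -- adjust to your own layout
train_dir = "log_root/myexperiment/train"

best_ckpt = tf.train.latest_checkpoint(eval_dir)         # e.g. ".../bestmodel-12345"
for f in glob.glob(best_ckpt + ".*"):                    # copy the .index, .meta, .data-* files
    shutil.copy(f, train_dir)
new_prefix = os.path.join(train_dir, os.path.basename(best_ckpt))
tf.train.update_checkpoint_state(train_dir, new_prefix)  # make the train dir point at the copy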
I've looked further into this and still don't understand where the NaNs are coming from. I changed the code to detect when a NaN occurs, then dump the attention distribution, vocabulary distribution, final distribution and some other stuff to file. Looking at the dump file, I find that attn_dists and vocab_dists are both all NaN, on every decoder step, for every example in the batch, and across the encoder timesteps (for attn_dists). This is different from what I was expecting. I was expecting to find zero values in final_dists for the target words. Given this information, I don't see why adding epsilon to final_dists fixes the problem.
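For anyone who wants to catch this at the source rather than from a dump file, one option (a sketch, not the detection code described above) is to wrap the suspect tensors in tf.check_numerics, so the run fails at the first inf/NaN with a message naming the tensor. Here attn_dists, vocab_dists and final_dists are assumed to be the per-step distribution lists discussed in this thread:

attn_dists = [tf.check_numerics(d, "inf/NaN in attn_dist") for d in attn_dists]
vocab_dists = [tf.check_numerics(d, "inf/NaN in vocab_dist") for d in vocab_dists]
final_dists = [tf.check_numerics(d, "inf/NaN in final_dist") for d in final_dists]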
Thanks for your update! The strange thing is, it does work @abisee. By adding epsilon I have not encountered NaNs again, but it does affect training convergence a bit. By how much, I don't know.
@bluemonk482 I tried adding epsilon as mentioned in the previous discussions above. I still get the NaN loss after one day of training. Can you tell me what learning rate you used for your experiments? I tried 0.1.
@shahbazsyed Well, I used a learning rate as big as 0.15 and had no NaN error after adding epsilon. How did you add the epsilon? Like this (as @Rahul-Iisc has suggested)?
final_dists = [tf.clip_by_value(dist, 1e-10, 1.) for dist in final_dists]
@bluemonk482 Yes, I added epsilon just as @Rahul-Iisc suggested. I tried a higher learning rate of 0.2, but am still getting the same error. Did you change any other parameters? Mine are the following:
@shahbazsyed I see you have changed the default parameter settings. I used 128 for
@bluemonk482 Still got the NaN with the lower learning rate after 42000 steps. I tried decoding with the model trained so far; it was just a bunch of UNKs.
Use tfdbg.
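For reference, a minimal sketch of what that looks like with a TF1-style Session (sess is whatever session the training script already creates; only the wrapping is shown here):

from tensorflow.python import debug as tf_debug

sess = tf_debug.LocalCLIDebugWrapperSession(sess)                  # wrap the existing session
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)  # register the inf/NaN filter
# Inside the debugger CLI, `run -f has_inf_or_nan` stops at the first tensor containing inf/NaN.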
Hello everyone, and thanks for your patience. We've made a few changes that help with the NaN issue.
The README also contains a section about NaNs. Edit: We've also provided a pretrained model, which we trained using this new version of the code. See the README.
I encountered the same problem and found that it is because of the zero examples in the batch: if the training corpus doesn't fill the last batch completely, the remaining slots are zero examples. I just added a line next to line 329 in batcher.py, so that incomplete batches are skipped:
if len(b) != self._hps.batch_size:
    continue
Maybe this helps a little.
There might be another cause for NaNs. There are some stories in the dataset which contain only the highlights and not the article itself. That produces a sequence of [PAD] tokens with an attention distribution of all zeros, which makes the probability of [PAD] zero in the final distribution (because of the scatter function). And since the only target for that sequence is [PAD], with zero probability, it generates NaN. I removed those articles and it seems to be working.
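If it helps, a rough sketch of that kind of filter at preprocessing time (keep_example and pairs are made-up names for this illustration, not part of the repo's preprocessing scripts):

def keep_example(article, abstract):
    # Drop stories whose article text is empty (i.e. only highlights are present);
    # they would otherwise become all-[PAD] encoder inputs with zero attention mass.
    return len(article.strip()) > 0 and len(abstract.strip()) > 0

# pairs is assumed to be your own list of (article, abstract) strings.
pairs = [(a, s) for (a, s) in pairs if keep_example(a, s)]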
I also encounter NaNs when p_gen is 1 or 0. This means that either the vocabulary distribution (when p_gen is 0) or the copy distribution (when p_gen is 1) contributes nothing to the final distribution, so the target word can end up with zero probability.
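One possible guard (a sketch only, not the repo's official fix) is to keep p_gen strictly inside (0, 1) wherever it is computed, so neither distribution is ever multiplied by exactly zero:

# Keep the generation probability away from exactly 0 or 1 so that both the
# vocabulary and the copy distribution always contribute to the final distribution.
p_gen = tf.clip_by_value(p_gen, 1e-6, 1.0 - 1e-6)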
When I add this model to a Transformer, I get NaN when I train the model. I have tried many ways to solve this problem, but it still produces NaN. If anyone is interested, we can discuss it.
I'm trying your pointer-generator code built on the LCSTS dataset. I swapped in an Amazon reviews dataset for my academic work, and I'm getting all UNKs in the output. Please help if you have solved this.
Anyone got NaN?