Get NAN loss after 35k steps #4

Open
StevenLOL opened this issue May 5, 2017 · 31 comments

@StevenLOL

Anyone got NAN ?
[Image: selection_059 (screenshot of the loss graph)]

@abisee
Owner

abisee commented May 5, 2017

@StevenLOL I see this happen sometimes too -- seems to be a very common problem with Tensorflow training in general.

@hate5six

hate5six commented May 7, 2017

I've been having this problem and I decreased the learning rate as per various discussions on SO and that seemed to work. After a while I tried increasing it by 0.01 and started getting NaN's again. I've tried restoring the checkpoint and re-running with the lower learning rate, but I'm still seeing NaN. Does this mean my checkpoint is useless?

@QK-Rahul

QK-Rahul commented May 8, 2017

I am also getting NaN. Found out the culprit to be
line #227: log_dists = [tf.log(dist) for dist in final_dists]
in model.py

@hate5six

hate5six commented May 8, 2017

@Rahul-Iisc Is your workaround to filter out cases where dist == 0, for example:

log_dists = [tf.log(dist) for dist in final_dists if dist != 0] ?

@QK-Rahul

QK-Rahul commented May 8, 2017

@hate5six I'm still thinking about an appropriate solution. Each dist is a tensor of shape (batch_size, extended_vsize), so I am not sure that dist != 0 will work. Also, I want log_dists to have the same length as final_dists.
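
To illustrate the concern above, here is a minimal plain-Python sketch, with nested lists standing in for tensors. `safe_log_dists` and its `floor` value are hypothetical and do not come from model.py. In this analogue, the `if dist != 0` filter compares each whole distribution with 0 rather than each entry, whereas an elementwise floor keeps both the list length and the shape of every distribution:

```python
import math

def safe_log_dists(final_dists, floor=1e-10):
    """Take logs elementwise, flooring entries at `floor` (an assumed value)."""
    return [[[math.log(max(p, floor)) for p in row] for row in dist]
            for dist in final_dists]

# Two decoder steps, batch_size=1, extended_vsize=3 (toy numbers).
final_dists = [[[0.7, 0.3, 0.0]], [[0.5, 0.5, 0.0]]]

# The filter compares each whole distribution with 0, so nothing is removed
# and the zero entries are still present afterwards.
filtered = [dist for dist in final_dists if dist != 0]

# The elementwise version keeps one log-distribution per decoder step.
log_dists = safe_log_dists(final_dists)
```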

@QK-Rahul

QK-Rahul commented May 8, 2017

Trying to convert NaNs to 0 for now. Need to further look for the cause of 0s coming up in the distribution. @abisee @StevenLOL reopen the issue?

def _change_nan_to_number(self, tensor, number=1):
    return tf.where(tf.logical_not(tf.is_finite(tensor)), tf.ones_like(tensor) * number, tensor)

log_dists = [tf.log(self._change_nan_to_number(dist)) for dist in final_dists]

UPDATE: This didn't work. Loss ended up 0 instead of NaN.
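
One possible reading of why the loss collapsed to 0, sketched in plain Python rather than TensorFlow (`change_nan_to_number` mirrors the helper above, and the toy values are made up): every non-finite entry becomes 1, and log(1) is 0, so a distribution that is entirely NaN turns into a log-distribution of all zeros.

```python
import math

def change_nan_to_number(values, number=1.0):
    """Replace every non-finite entry with `number`, like the helper above."""
    return [v if math.isfinite(v) else number for v in values]

dist = [math.nan, math.nan, math.nan]   # a fully corrupted distribution
log_dist = [math.log(p) for p in change_nan_to_number(dist)]
loss = -log_dist[0]                     # negative log-likelihood of any target
```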

@abisee abisee reopened this May 8, 2017
@QK-Rahul

QK-Rahul commented May 9, 2017

The below change worked for me. Add the below code to def _calc_final_dist(self, vocab_dists, attn_dists). Info is in comments.

      # OOV part of vocab is max_art_oov long. Not all the sequences in a batch will have max_art_oov tokens.
      # That will cause some entries to be 0 in the distribution, which will result in NaN when calculating log_dists
      # Add a very small number to prevent that.

      def add_epsilon(dist, epsilon=sys.float_info.epsilon):
        epsilon_mask = tf.ones_like(dist) * epsilon
        return dist + epsilon_mask

      final_dists = [add_epsilon(dist) for dist in final_dists]
      
      return final_dists
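
For a sense of scale, here is a plain-Python sketch (not TensorFlow, toy values) of what the added epsilon does to a zero entry. `sys.float_info.epsilon` is the float64 machine epsilon, 2**-52, so a zero probability ends up with a log of roughly -36 instead of negative infinity:

```python
import math
import sys

def add_epsilon(dist, epsilon=sys.float_info.epsilon):
    """Plain-list version of the helper above."""
    return [p + epsilon for p in dist]

dist = [0.9, 0.1, 0.0]            # last entry: an unused OOV slot
smoothed = add_epsilon(dist)
log_dist = [math.log(p) for p in smoothed]
```

Note that the smoothed values no longer sum exactly to 1, since every entry is shifted upward.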

@lizaigaoge550

final_dists = [tf.clip_by_value(dist,1e-10,1.) for dist in final_dists]

@jamesposhtiger

@lizaigaoge550 did this work for you?
final_dists = [tf.clip_by_value(dist,1e-10,1.) for dist in final_dists]
Could you let me know which line you put this at?
Many thanks

@lizaigaoge550

@jamesposhtiger
after finishing the final_dists
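
A plain-Python sketch of the clipping approach, with lists in place of tensors. The `clip_by_value` function here is only a stand-in for `tf.clip_by_value`, and the distributions are toy values. The clip is applied once the final distributions have been assembled, just before taking logs:

```python
import math

def clip_by_value(dist, lo=1e-10, hi=1.0):
    """Clamp every entry into [lo, hi], like tf.clip_by_value on a list."""
    return [min(max(p, lo), hi) for p in dist]

final_dists = [[0.6, 0.4, 0.0], [1.0, 0.0, 0.0]]   # toy per-step distributions
final_dists = [clip_by_value(dist) for dist in final_dists]
log_dists = [[math.log(p) for p in dist] for dist in final_dists]
```

Unlike adding epsilon, clipping leaves entries that are already inside the range unchanged.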

pedrobalage added a commit to pedrobalage/pointer-generator that referenced this issue May 30, 2017
pedrobalage added a commit to pedrobalage/pointer-generator that referenced this issue May 31, 2017
@apoorv001

can we restore the already trained model after it starts getting NaN ?

@abisee
Owner

abisee commented Jun 6, 2017

@apoorv001 probably not. This is where the concurrent eval job is useful: it saves the 3 best checkpoints (according to dev set) at any time. So in theory it should never save a NaN model. This is what we used to recover from NaN problems.
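
The idea of keeping only the best few checkpoints can be sketched as follows. This is a hypothetical, simplified tracker in plain Python: `BestCheckpoints`, `maybe_keep`, and the checkpoint names are invented for illustration, and it is not the repository's eval code. A NaN dev loss is never kept because it fails the finiteness check:

```python
import math

class BestCheckpoints:
    """Track the `keep` checkpoints with the lowest dev-set loss."""

    def __init__(self, keep=3):
        self.keep = keep
        self.best = []  # list of (dev_loss, checkpoint_name), sorted ascending

    def maybe_keep(self, dev_loss, name):
        """Return True if the checkpoint is retained."""
        if not math.isfinite(dev_loss):
            return False          # never keep a NaN/inf model
        self.best.append((dev_loss, name))
        self.best.sort()
        dropped = self.best[self.keep:]
        self.best = self.best[:self.keep]
        return (dev_loss, name) not in dropped

tracker = BestCheckpoints(keep=3)
for loss, name in [(4.1, "ckpt-1"), (3.5, "ckpt-2"), (math.nan, "ckpt-3"),
                   (3.9, "ckpt-4"), (3.2, "ckpt-5")]:
    tracker.maybe_keep(loss, name)
```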

In any case, I know the NaN thing is very annoying. I haven't had time recently, but I intend to look at the bug, understand what's going wrong, and fix it. In the meantime, @Rahul-Iisc's solution appears to be working for several people.

@abisee abisee added the bug label Jun 6, 2017
@apoorv001

apoorv001 commented Jun 14, 2017

Thanks @abisee for the clarification. However, I have had 2 different runs fail due to NaN after training for days. It would be a great favor to us if you could also upload the trained model along with the code.

@abisee
Owner

abisee commented Jun 28, 2017

@Rahul-Iisc I've had another look at the code. I see your point about

OOV part of vocab is max_art_oov long. Not all the sequences in a batch will have max_art_oov tokens. That will cause some entries to be 0 in the distribution, which will result in NaN when calculating log_dists. Add a very small number to prevent that.

However, in theory those zero entries in final_dists (i.e. the NaN entries in log_dists) should never be used. The line losses = tf.gather_nd(-log_dist, indices), which is supposed to locate -log P(correct word) (equation 6 here) in log_dists, should only pick out source-text OOVs that are actually in the training example.
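
This argument can be illustrated with a plain-Python sketch (toy numbers; `gather` is only a stand-in for what the `tf.gather_nd` line is meant to do, and `safe_log` is a hypothetical helper). Even if some entries of the log-distribution are not finite, the forward loss stays finite as long as the gathered index points at a nonzero probability:

```python
import math

def safe_log(p):
    """log(p) that returns -inf for p == 0 instead of raising."""
    return math.log(p) if p > 0 else -math.inf

def gather(log_dist, target_index):
    """Pick out -log P(target) for a single example."""
    return -log_dist[target_index]

final_dist = [0.5, 0.3, 0.2, 0.0]        # last slot: OOV id unused by this example
log_dist = [safe_log(p) for p in final_dist]

loss_ok = gather(log_dist, 1)            # target that is present in the example
loss_bad = gather(log_dist, 3)           # target with zero probability
```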

So I think there must be something else wrong, either:

  1. Those NaN entries are getting picked out by the tf.gather_nd line (even though they shouldn't), or
  2. Some of the other entries of final_dists are zero (which they shouldn't be, because both vocab_dists and attn_dists are the result of softmax functions and they get combined using p_gen, which is the result of a sigmoid function). This could be due to an underflow problem.

I think the second one seems more likely. I can try to investigate the problem but it's tricky because sometimes you need to run for hours before you can replicate the error.
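
Whether underflow is the actual cause here is unconfirmed, but the mechanism itself is easy to show. In this plain-Python sketch (float64, toy logits), a softmax output and `1 - p_gen` both become exactly zero even though neither is zero mathematically:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([0.0, -800.0])   # exp(-800) underflows to 0.0
p_gen = sigmoid(40.0)            # rounds to exactly 1.0
copy_weight = 1.0 - p_gen        # so the copy term is weighted by 0.0
```

Lower-precision floats reach these limits at less extreme inputs.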

@tianjianjiang

@abisee If you want to replicate this quickly, I recommend training with an extremely short sequence pair, such as 10-2; NaN should occur when the training loss reaches 3.
I also find it concerning that the epsilon-added version seems to get stuck easily at a training loss of around 3.

@bwang482

bwang482 commented Jul 18, 2017

This is where the concurrent eval job is useful: it saves the 3 best checkpoints (according to dev set) at any time. So in theory it should never save a NaN model. This is what we used to recover from NaN problems.

@abisee Thanks. But can I use the saved checkpoints from eval for continuing my training after NaN occurred? I have removed everything in log_root/train and copied all necessary files from log_root/eval to log_root/train, and adjusted the filenames and what is in the checkpoint file. Now I have an error showing:

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for log_root/cnndm/train/model.ckpt-78447

falcondai added a commit to falcondai/pointer-generator that referenced this issue Jul 28, 2017
@abisee
Owner

abisee commented Aug 5, 2017

I've looked further into this and still don't understand where the NaNs are coming from. I changed the code to detect when a NaN occurs, then dump the attention distribution, vocabulary distribution, final distribution and some other stuff to file.

Looking at the dump file, I find that attn_dists and vocab_dists are both all NaN, on every decoder step, for every example in the batch and across the encoder timesteps (for attn_dists) and across the vocabulary (for vocab_dists). Consequently final_dists contains NaNs, therefore log_dists does and the final loss is NaN too.

This is different than what I was expecting. I was expecting to find zero values in final_dists and therefore NaNs in log_dists, but it seems that the problem occurs earlier, somehow causing NaNs everywhere in attn_dists and vocab_dists.

Given this information, I don't see why adding epsilon to final_dists works as a solution, because if attn_dists and vocab_dists contain NaNs, it shouldn't work.
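
The reasoning here rests on how NaN propagates through arithmetic, which a plain-Python sketch makes concrete (toy scalar values standing in for tensor entries):

```python
import math
import sys

eps = sys.float_info.epsilon

zero_entry = 0.0 + eps           # a zero becomes a tiny positive number
nan_entry = math.nan + eps       # a NaN stays NaN

log_of_zero_entry = math.log(zero_entry)   # finite
log_of_nan_entry = math.log(nan_entry)     # still NaN
```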

@bwang482

bwang482 commented Aug 7, 2017

Thanks for your update! The strange thing is, it does work @abisee. By adding epsilon I have not encountered NaNs again. But it does affect the training convergence a bit. By how much, I dunno.

@shahbazsyed

@bluemonk482 I tried adding epsilon as described above. I still get the NaN loss after one day of training. Can you tell me what learning rate you used for your experiments? I tried using 0.1.

@bwang482

@shahbazsyed Well I used as big a learning rate as 0.15, and had no NaN error after adding epsilon. How did you add the epsilon? Like this (as @Rahul-Iisc has suggested):

      def add_epsilon(dist, epsilon=sys.float_info.epsilon):
        epsilon_mask = tf.ones_like(dist) * epsilon
        return dist + epsilon_mask

      final_dists = [add_epsilon(dist) for dist in final_dists]

@shahbazsyed

shahbazsyed commented Aug 12, 2017

@bluemonk482 Yes, I added epsilon just as @Rahul-Iisc suggested. I tried with a higher learning rate of 0.2, but am still getting the same error. Did you change any other parameters? Mine are the following:

hidden_dim = 300
emb_dim = 256
coverage = true
lr = 0.2

@bwang482

@shahbazsyed I see you have changed the default parameter setting. I used 128 for emb_dim. And why did you increase your learning rate to 0.2 when you had NaN error at 0.1? You can try a smaller learning rate.

@shahbazsyed

@bluemonk482 Still got the NaN with lower learning rate after 42000 steps. I tried decoding with the model trained so far, it was just a bunch of UNKs.

@QK-Rahul

Use tfdbg.

@abisee
Owner

abisee commented Aug 16, 2017

Hello everyone, and thanks for your patience. We've made a few changes that help with the NaN issue.

  • We changed the way the log of the final distribution is calculated. This seemed to result in many fewer NaNs (but we still encountered some).
  • We provide a script to let you directly inspect the checkpoint file, to see whether it's corrupted by NaNs or not.
  • New flag to allow you to restore a best model from the eval directory.
  • Train job now halts when it encounters non-finite loss.
  • The train job now keeps 3 checkpoints at a time -- useful for restoring after NaN.
  • New flag to run Tensorflow Debugger.

The README also contains a section about NaNs.
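
As an illustration of the "halt on non-finite loss" behavior listed above, here is a hypothetical training-loop sketch in plain Python. `run_training`, `train_step`, and the fake loss values are invented for this example, and this is not the repository's code. The idea is to stop before further updates are made with corrupted values:

```python
import math

def run_training(train_step, max_steps):
    """Call train_step() until max_steps, stopping on a non-finite loss."""
    for step in range(max_steps):
        loss = train_step()
        if not math.isfinite(loss):
            raise FloatingPointError(
                "Loss is not finite at step %d. Stopping." % step)
    return max_steps

# Fake losses standing in for a real training step.
fake_losses = iter([3.2, 3.1, math.nan, 3.0])

try:
    run_training(lambda: next(fake_losses), max_steps=4)
    halted_at = None
except FloatingPointError as err:
    halted_at = str(err)
```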

Edit: We've also provided a pretrained model, which we trained using this new version of the code. See the README.

@eduOS

eduOS commented Jan 11, 2018

I encountered the same problem and found that it was caused by all-zero examples in a batch. If the training corpus contains $k * batch_size - a$ examples, where $0 < a < batch_size$, the last batch will contain $a$ zero examples. (single pass)

I just added a check after line 329 in batcher:

if len(b) != self._hps.batch_size:
    continue

Hope this helps a little.
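
A sketch of the same idea in plain Python (`make_batches` is a hypothetical helper, not the repository's batcher). With 10 examples and a batch size of 4, the final batch has only 2 examples, so it is skipped:

```python
def make_batches(examples, batch_size):
    """Group examples into batches, skipping any batch that is not full."""
    batches = []
    for i in range(0, len(examples), batch_size):
        b = examples[i:i + batch_size]
        if len(b) != batch_size:
            continue              # incomplete batch: its empty rows would be all zeros
        batches.append(b)
    return batches

batches = make_batches(list(range(10)), batch_size=4)
```

The trade-off is that the leftover examples are never used in a single pass.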

@mahnazkoupaee

There might be another cause of NaNs. Some stories in the dataset contain only highlights and not the article itself. That produces a sequence of [PAD] tokens whose attention distribution is all zeros, which makes the probability of [PAD] zero in the final distribution (because of the scatter function). Since the only target for that sequence is [PAD], which has zero probability, the loss becomes NaN. I removed those articles and it seems to be working.
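
Filtering such examples could look like the following plain-Python sketch. The `(article, abstract)` tuple layout and the `has_article` helper are assumptions made for illustration, not the dataset's actual format:

```python
def has_article(example):
    """True if the article text contains at least one non-whitespace token."""
    article, _abstract = example
    return len(article.split()) > 0

examples = [
    ("police said the suspect fled .", "suspect flees ."),
    ("", "highlight with no article ."),        # would encode to only [PAD]
    ("   ", "another empty article ."),
]
kept = [ex for ex in examples if has_article(ex)]
```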

@ygorg

ygorg commented Feb 6, 2019

I also encounter NaNs when pgen is 1 or 0. This means that either vocab_dists or attn_dists will be 0-filled, so final_dists will have 0s in it (for words only in the vocabulary or only in the input). If a word in the reference has 0 probability, the loss becomes inf (and backprop then puts NaNs in the layers).
I will add epsilon if pgen is 0 and remove epsilon if pgen is 1.
But like @tianjianjiang, I am concerned that this will bias the model.
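
The failure mode and one possible guard can be sketched in plain Python. The numbers are toy values, `final_dist` and `clamp_p_gen` are hypothetical helpers, and the `eps` value is an arbitrary assumption. With `p_gen` exactly 1, a word that appears only in the source gets probability 0; keeping `p_gen` strictly inside (0, 1) leaves it a small nonzero mass:

```python
def final_dist(p_gen, vocab_dist, attn_dist):
    """Mix generation and copy distributions over the same extended vocab."""
    return [p_gen * v + (1.0 - p_gen) * a
            for v, a in zip(vocab_dist, attn_dist)]

def clamp_p_gen(p_gen, eps=1e-6):
    """Keep p_gen inside [eps, 1 - eps]."""
    return min(max(p_gen, eps), 1.0 - eps)

vocab_dist = [0.7, 0.3, 0.0]   # last slot: source-only (OOV) word
attn_dist = [0.0, 0.2, 0.8]    # copy distribution scattered onto the same ids

unclamped = final_dist(1.0, vocab_dist, attn_dist)
clamped = final_dist(clamp_p_gen(1.0), vocab_dist, attn_dist)
```

Because the clamp acts on the mixing weight rather than on the mixed result, the clamped distribution still sums to 1 up to rounding.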

@xiongma

xiongma commented May 8, 2019

When I added this model to a Transformer, I got NaN during training. I have tried many ways to solve this problem, but the loss is still NaN. If anyone is interested, we can discuss it.

@GaneshDoosa

When I added this model to a Transformer, I got NaN during training. I have tried many ways to solve this problem, but the loss is still NaN. If anyone is interested, we can discuss it.

I'm trying your pointer-generator code built on the LCSTS dataset. I swapped the dataset for Amazon reviews for my academic work, and I'm getting all UNKs in the output. Please help if you have solved this.

@tohidarehman1988

Anyone got NAN ?
[Image: selection_059 (screenshot of the loss graph)]

Can you please tell me how you resolved this? Also, how did you plot this loss graph?
