Get NAN loss after 35k steps #4

StevenLOL · 2017-05-05T01:41:58Z

Anyone got NAN ?

abisee · 2017-05-05T01:47:30Z

@StevenLOL I see this happen sometimes too -- seems to be a very common problem with Tensorflow training in general.

hate5six · 2017-05-07T15:21:28Z

I've been having this problem and I decreased the learning rate as per various discussions on SO and that seemed to work. After a while I tried increasing it by 0.01 and started getting NaN's again. I've tried restoring the checkpoint and re-running with the lower learning rate, but I'm still seeing NaN. Does this mean my checkpoint is useless?

QK-Rahul · 2017-05-08T03:53:45Z

I am also getting NaN. Found out the culprit to be
line #227: log_dists = [tf.log(dist) for dist in final_dists]
in model.py

hate5six · 2017-05-08T03:56:19Z

@Rahul-Iisc Is your workaround to filter out cases where dist == 0, for example:

log_dists = [tf.log(dist) for dist in final_dists if dist != 0] ?

QK-Rahul · 2017-05-08T04:04:21Z

@hate5six I'm still thinking about an appropriate solution. Each dist is a tensor shape (batch_size, extended_vsize), so I am not sure if dist != 0 will work. Also, I want the log_dists length to be the same as final_dists

QK-Rahul · 2017-05-08T05:23:43Z

Trying to convert NaNs to 0 for now. Need to further look for the cause of 0s coming up in the distribution. @abisee @StevenLOL reopen the issue?

def _change_nan_to_number(self, tensor, number=1):
    return tf.where(tf.logical_not(tf.is_finite(tensor)), tf.ones_like(tensor) * number, tensor)

log_dists = [tf.log(self._change_nan_to_number(dist)) for dist in final_dists]

UPDATE: This didn't work. Loss ended up 0 instead of NaN.

QK-Rahul · 2017-05-09T01:15:46Z

The below change worked for me. Add the below code to def _calc_final_dist(self, vocab_dists, attn_dists). Info is in comments.

      # OOV part of vocab is max_art_oov long. Not all the sequences in a batch will have max_art_oov tokens.
      # That will cause some entries to be 0 in the distribution, which will result in NaN when calulating log_dists
      # Add a very small number to prevent that.

      def add_epsilon(dist, epsilon=sys.float_info.epsilon):
        epsilon_mask = tf.ones_like(dist) * epsilon
        return dist + epsilon_mask

      final_dists = [add_epsilon(dist) for dist in final_dists]
      
      return final_dists

lizaigaoge550 · 2017-05-09T03:18:59Z

final_dists = [tf.clip_by_value(dist,1e-10,1.) for dist in final_dists]

jamesposhtiger · 2017-05-25T06:15:57Z

@lizaigaoge550 did this work for you?
final_dists = [tf.clip_by_value(dist,1e-10,1.) for dist in final_dists]
Could you let me know which line you put this at?
Many thanks

lizaigaoge550 · 2017-05-27T08:12:51Z

@jamesposhtiger
after finishing the final_dists

apoorv001 · 2017-06-01T05:12:51Z

can we restore the already trained model after it starts getting NaN ?

abisee · 2017-06-06T16:50:59Z

@apoorv001 probably not. This is where the concurrent eval job is useful: it saves the 3 best checkpoints (according to dev set) at any time. So in theory it should never save a NaN model. This is what we used to recover from NaN problems.

In any case, I know the NaN thing is very annoying. I haven't had time recently, but I intend to look at the bug, understand what's going wrong, and fix it. In any case @Rahul-Iisc's solution appears to be working for several people currently.

apoorv001 · 2017-06-14T05:17:32Z

Thanks @abisee for the clarification, however, I have 2 differents runs failed due to NaN after training for days, it would be a great favor to us if you could also upload the trained model along with code.

abisee · 2017-06-28T22:01:24Z

@Rahul-Iisc I've had another look at the code. I see your point about

OOV part of vocab is max_art_oov long. Not all the sequences in a batch will have max_art_oov tokens. That will cause some entries to be 0 in the distribution, which will result in NaN when calulating log_dists. Add a very small number to prevent that.

However, in theory those zero entries in final_dists i.e. NaN entries in log_dists should never be used because the losses = tf.gather_nd(-log_dist, indices) line, which is supposed to locate -log P(correct word) (equation 6 here) in log_dists, should only pick out source-text-OOVs that are actually in the training example.

So I think there must be something else wrong, either:

Those NaN entries are getting picked out by the tf.gather_nd line (even though they shouldn't), or
Some of the other entries of final_dists are zero (which they shouldn't be, because both vocab_dists and attn_dists are result of softmax functions and they get combined using p_gen which is the result of a sigmoid function). Perhaps due to an underflow problem.

I think the second one seems more likely. I can try to investigate the problem but it's tricky because sometimes you need to run for hours before you can replicate the error.

tianjianjiang · 2017-06-29T00:04:52Z

@abisee If fast replicating is desired, I recommend train with extremely short sequence pair, such as 10-2, and NaN should occur when training loss reach 3.
I find it concerning that epsilon-added version seems easily got stuck with a training loss around 3, too.

bwang482 · 2017-07-18T18:25:19Z

This is where the concurrent eval job is useful: it saves the 3 best checkpoints (according to dev set) at any time. So in theory it should never save a NaN model. This is what we used to recover from NaN problems.

@abisee Thanks. But can I use the saved checkpoints from eval for continuing my training after NaN occurred? I have removed everything in log_root/train and copied all necessary files from log_root/eval to log_root/train, and adjusted the filenames and what is in the checkpoint file. Now I have an error showing:

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for log_root/cnndm/train/model.ckpt-78447

abisee · 2017-08-05T08:01:06Z

I've looked further into this and still don't understand where the NaNs are coming from. I changed the code to detect when a NaN occurs, then dump the attention distribution, vocabulary distribution, final distribution and some other stuff to file.

Looking at the dump file, I find that attn_dists and vocab_dists are both all NaN, on every decoder step, for every example in the batch and across the encoder timesteps (for attn_dists) and across the vocabulary (for vocab_dists). Consequently final_dists contains NaNs, therefore log_dists does and the final loss is NaN too.

This is different than what I was expecting. I was expecting to find zero values in final_dists and therefore NaNs in log_dists, but it seems that the problem occurs earlier, somehow causing NaNs everywhere in attn_dists and vocab_dists.

Given this information, I don't see why adding epsilon to final_dists works as a solution, because if attn_dists and vocab_dists contain NaNs, it shouldn't work.

bwang482 · 2017-08-07T20:31:19Z

Thanks for your update! The strange thing is, it does work @abisee. By adding epsilon I have not encountered NaNs again. But it does affect the training convergence a bit. By how much, I dunno.

shahbazsyed · 2017-08-10T12:14:56Z

@bluemonk482 I tried adding epsilon as mentioned in the previous discussions above. I still get the NaN loss after one day of training. Can you tell me what was the learning rate you used for your experiments? I tried using 0.1.

bwang482 · 2017-08-10T16:51:31Z

@shahbazsyed Well I used as big a learning rate as 0.15, and had no NaN error after adding epsilon. How did you add the epsilon? Like this (as @Rahul-Iisc has suggested):

      def add_epsilon(dist, epsilon=sys.float_info.epsilon):
        epsilon_mask = tf.ones_like(dist) * epsilon
        return dist + epsilon_mask

      final_dists = [add_epsilon(dist) for dist in final_dists]

shahbazsyed · 2017-08-12T09:13:53Z

@bluemonk482 Yes, I added epsilon just as @Rahul-Iisc suggested. I tried with a higher learning rate of 0.2 , but am still getting the same error. Did you change any other parameters ? Mine are the following :

hidden_dim = 300
emb_dim = 256
coverage = true
lr = 0.2

bwang482 · 2017-08-12T13:35:04Z

@shahbazsyed I see you have changed the default parameter setting. I used 128 for emb_dim. And why did you increase your learning rate to 0.2 when you had NaN error at 0.1? You can try a smaller learning rate.

shahbazsyed · 2017-08-14T11:08:08Z

@bluemonk482 Still got the NaN with lower learning rate after 42000 steps. I tried decoding with the model trained so far, it was just a bunch of UNKs.

QK-Rahul · 2017-08-14T13:14:45Z

Use tfdbg.

abisee · 2017-08-16T22:04:11Z

Hello everyone, and thanks for your patience. We've made a few changes that help with the NaN issue.

We changed the way the log of the final distribution is calculated. This seemed to result in many fewer NaNs (but we still encountered some).
We provide a script to let you directly inspect the checkpoint file, to see whether it's corrupted by NaNs or not.
New flag to allow you to restore a best model from the eval directory.
Train job now halts when it encounters non-finite loss.
The train job now keeps 3 checkpoints at a time -- useful for restoring after NaN.
New flag to run Tensorflow Debugger.

The README also contains a section about NaNs.

Edit: We've also provided a pretrained model, which we trained using this new version of the code. See the README.

eduOS · 2018-01-11T11:22:05Z

I encountered the same problem and found that it is because of the zeros examples in the batch. If the training corpus contains $k * batch_size - a$ where $a<k$ examples/samples there would be $k - a$ zero examples. (single pass)

I just add a line next to line 329 in batcher:

if len(b) != self._hps.batch_size:
    continue

May this help a little.

mahnazkoupaee · 2018-02-01T08:06:13Z

There might be another cause for NANs. There are some stories in the dataset which only contain highlights and not the article itself. That causes a sequence of [PAD] tokens with attention distribution of all zeros which causes the probability of [PAD] to be zero in final distribution (because of scatter function). And since the only target for that sequence is [PAD] with zero probability it will generate NAN. I removed those articles and it seems to be working.

ygorg · 2019-02-06T08:56:41Z

I also encounter NaNs when pgen is 1 or 0. This means that either vocab_dists or attn_dists will be 0-filled and final_dists will have 0s in it (for words only in vocabulary or only in input). This will cause the loss to be inf (and the backprop will put NaNs in layers) if a word in reference has a 0 probability.
I will add espilon if pgen is 0 and remove epsilon if pgen is 1.
But as @tianjianjiang, i am concerned that this will bias the model.

xiongma · 2019-05-08T11:53:14Z

when I add this model to transformer, I get nan, when I train model, I try many ways to solve this problem, but it still nan, if anyone interested in it, we can discuss.

GaneshDoosa · 2020-03-29T10:29:56Z

when I add this model to transformer, I get nan, when I train model, I try many ways to solve this problem, but it still nan, if anyone interested in it, we can discuss.

Please I'm trying your code of pointer generator built on LCSTS dataset. I changed the dataset with amazon reviews for my academic work purpose; I'm getting all UNKs in output..Please help if you have solved it

tohidarehman1988 · 2020-12-12T15:49:57Z

Anyone got NAN ?

can you please help me how you resolved this? how you print this loss graph?

StevenLOL closed this as completed May 5, 2017

abisee reopened this May 8, 2017

abisee mentioned this issue May 8, 2017

Nan #5

Closed

pedrobalage added a commit to pedrobalage/pointer-generator that referenced this issue May 30, 2017

Fixes issue abisee#4 for NaN loss

b3cae0e

pedrobalage added a commit to pedrobalage/pointer-generator that referenced this issue May 31, 2017

Fixes issue abisee#4 for NaN loss

434d696

hate5six mentioned this issue Jun 6, 2017

Failure to replicate results #16

Closed

abisee added the bug label Jun 6, 2017

falcondai added a commit to falcondai/pointer-generator that referenced this issue Jul 28, 2017

fix abisee#4 nan during training

65e2f9b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get NAN loss after 35k steps #4

Get NAN loss after 35k steps #4

StevenLOL commented May 5, 2017

abisee commented May 5, 2017

hate5six commented May 7, 2017

QK-Rahul commented May 8, 2017

hate5six commented May 8, 2017

QK-Rahul commented May 8, 2017 •

edited

Loading

QK-Rahul commented May 8, 2017 •

edited

Loading

QK-Rahul commented May 9, 2017 •

edited

Loading

lizaigaoge550 commented May 9, 2017

jamesposhtiger commented May 25, 2017

lizaigaoge550 commented May 27, 2017

apoorv001 commented Jun 1, 2017

abisee commented Jun 6, 2017

apoorv001 commented Jun 14, 2017 •

edited

Loading

abisee commented Jun 28, 2017

tianjianjiang commented Jun 29, 2017

bwang482 commented Jul 18, 2017 •

edited

Loading

abisee commented Aug 5, 2017

bwang482 commented Aug 7, 2017 •

edited

Loading

shahbazsyed commented Aug 10, 2017

bwang482 commented Aug 10, 2017

shahbazsyed commented Aug 12, 2017 •

edited

Loading

bwang482 commented Aug 12, 2017

shahbazsyed commented Aug 14, 2017

QK-Rahul commented Aug 14, 2017

abisee commented Aug 16, 2017 •

edited

Loading

eduOS commented Jan 11, 2018 •

edited

Loading

mahnazkoupaee commented Feb 1, 2018

ygorg commented Feb 6, 2019

xiongma commented May 8, 2019

GaneshDoosa commented Mar 29, 2020

tohidarehman1988 commented Dec 12, 2020

Get NAN loss after 35k steps #4

Get NAN loss after 35k steps #4

Comments

StevenLOL commented May 5, 2017

abisee commented May 5, 2017

hate5six commented May 7, 2017

QK-Rahul commented May 8, 2017

hate5six commented May 8, 2017

QK-Rahul commented May 8, 2017 • edited Loading

QK-Rahul commented May 8, 2017 • edited Loading

QK-Rahul commented May 9, 2017 • edited Loading

lizaigaoge550 commented May 9, 2017

jamesposhtiger commented May 25, 2017

lizaigaoge550 commented May 27, 2017

apoorv001 commented Jun 1, 2017

abisee commented Jun 6, 2017

apoorv001 commented Jun 14, 2017 • edited Loading

abisee commented Jun 28, 2017

tianjianjiang commented Jun 29, 2017

bwang482 commented Jul 18, 2017 • edited Loading

abisee commented Aug 5, 2017

bwang482 commented Aug 7, 2017 • edited Loading

shahbazsyed commented Aug 10, 2017

bwang482 commented Aug 10, 2017

shahbazsyed commented Aug 12, 2017 • edited Loading

bwang482 commented Aug 12, 2017

shahbazsyed commented Aug 14, 2017

QK-Rahul commented Aug 14, 2017

abisee commented Aug 16, 2017 • edited Loading

eduOS commented Jan 11, 2018 • edited Loading

mahnazkoupaee commented Feb 1, 2018

ygorg commented Feb 6, 2019

xiongma commented May 8, 2019

GaneshDoosa commented Mar 29, 2020

tohidarehman1988 commented Dec 12, 2020

QK-Rahul commented May 8, 2017 •

edited

Loading

QK-Rahul commented May 8, 2017 •

edited

Loading

QK-Rahul commented May 9, 2017 •

edited

Loading

apoorv001 commented Jun 14, 2017 •

edited

Loading

bwang482 commented Jul 18, 2017 •

edited

Loading

bwang482 commented Aug 7, 2017 •

edited

Loading

shahbazsyed commented Aug 12, 2017 •

edited

Loading

abisee commented Aug 16, 2017 •

edited

Loading

eduOS commented Jan 11, 2018 •

edited

Loading