nan during training. #11

logodeeplearning · 2018-12-30T01:22:08Z

Hi @songdejia, thanks for trying to port EAST from TensorFlow. But when I train this model on COCO 2014 or Oxford SynthText, the loss becomes nan during training. Any ideas?

Please see the training log below:

Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Exception continue
Exception in getitem, and choose another index:4393
EAST <==> TRAIN <==> Epoch: [0][1/227] Loss 0.0231 Avg Loss 0.0250)

EAST <==> TRAIN <==> Epoch: [0][2/227] Loss 0.0282 Avg Loss 0.0260)

EAST <==> TRAIN <==> Epoch: [0][3/227] Loss 0.0313 Avg Loss 0.0273)

EAST <==> TRAIN <==> Epoch: [0][4/227] Loss 0.0271 Avg Loss 0.0273)

EAST <==> TRAIN <==> Epoch: [0][5/227] Loss 0.0206 Avg Loss 0.0262)

EAST <==> TRAIN <==> Epoch: [0][6/227] Loss 0.0300 Avg Loss 0.0267)

EAST <==> TRAIN <==> Epoch: [0][7/227] Loss 0.0239 Avg Loss 0.0264)

EAST <==> TRAIN <==> Epoch: [0][8/227] Loss 0.0271 Avg Loss 0.0265)

EAST <==> TRAIN <==> Epoch: [0][9/227] Loss 0.0284 Avg Loss 0.0266)

EAST <==> TRAIN <==> Epoch: [0][10/227] Loss 0.0197 Avg Loss 0.0260)

EAST <==> TRAIN <==> Epoch: [0][11/227] Loss nan Avg Loss nan)

EAST <==> TRAIN <==> Epoch: [0][12/227] Loss nan Avg Loss nan)

logodeeplearning · 2018-12-30T16:54:48Z

@Caius-Lu @songdejia has this happened to you too? I am trying to debug it. Suggestions welcome.

viig99 · 2019-01-12T04:51:42Z

Getting the same issue

BYJRK · 2019-03-31T12:25:58Z

My guess is that data augmentation occasionally produces invalid samples, and such a sample makes the loss of its batch nan. Tracking down which specific training images are responsible can be tedious, so my solution is to check whether the loss is nan before backpropagation and, if it is, skip that batch without any update.

Specifically, I modified the code in main.py as follows:

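# Placed after loss1 is computed and before backpropagation (needs numpy as np)
loss_check = loss1.cpu().detach().numpy()
if np.any(np.isnan(loss_check)):
    print('loss = nan, skip this batch')
    # Clear any stale gradients, then move on to the next batch
    optimizer.zero_grad()
    continue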
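
The same guard can be sketched without any framework, which also makes it easy to record which batches were skipped so the offending images can be looked up later. This is only a minimal sketch under assumptions: `train_one_epoch`, `compute_loss`, and `apply_update` are hypothetical names that stand in for the real training loop, the forward pass, and the backward pass plus optimizer step. It also treats an infinite loss as bad, which goes slightly beyond the nan check above.

```python
import math


def train_one_epoch(batches, compute_loss, apply_update):
    """Run one epoch, skipping every batch whose loss is nan or infinite.

    compute_loss and apply_update are hypothetical callables standing in
    for the forward pass and for backward() + optimizer.step().
    Returns (number of batches that were applied, indices of skipped batches).
    """
    updated = 0
    skipped = []
    for i, batch in enumerate(batches):
        loss = float(compute_loss(batch))
        if not math.isfinite(loss):
            # Bad batch: remember its index and skip it before any update
            print(f"batch {i}: loss = {loss}, skip this batch")
            skipped.append(i)
            continue
        apply_update(loss)
        updated += 1
    return updated, skipped


if __name__ == "__main__":
    # Toy stand-ins: each "batch" is already its own loss value
    demo_losses = [0.0231, float("nan"), 0.0282]
    applied = []
    print(train_one_epoch(demo_losses, lambda b: b, applied.append))
```

In a PyTorch loop the equivalent test on a scalar loss tensor would be `torch.isfinite(loss)`. If a large share of batches ends up in the skipped list, the data pipeline itself probably needs fixing rather than skipping.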

saharudra · 2019-04-01T18:42:26Z

@BYJRK What were your results on the ICDAR dataset?

BYJRK · 2019-04-09T12:43:14Z

@saharudra On ICDAR 2015 I can achieve an hmean of at most 0.7, after modifying the thresholds in eval.py and training for about 400 epochs. TBH, I don't think this will reproduce the performance reported in the paper. Anyway, I'm still trying to figure out how it differs from the TensorFlow version.
