nan during training. #11

Open
logodeeplearning opened this issue Dec 30, 2018 · 6 comments
Comments

@logodeeplearning

Hi @songdejia, thanks for porting EAST from TensorFlow. When I try to train this model on COCO 2014 or Oxford SynthText, the loss becomes NaN during training. Any ideas?

Please see the training log below:

Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Exception continue
Exception in getitem, and choose another index:4393
EAST <==> TRAIN <==> Epoch: [0][1/227] Loss 0.0231 Avg Loss 0.0250)

EAST <==> TRAIN <==> Epoch: [0][2/227] Loss 0.0282 Avg Loss 0.0260)

EAST <==> TRAIN <==> Epoch: [0][3/227] Loss 0.0313 Avg Loss 0.0273)

EAST <==> TRAIN <==> Epoch: [0][4/227] Loss 0.0271 Avg Loss 0.0273)

EAST <==> TRAIN <==> Epoch: [0][5/227] Loss 0.0206 Avg Loss 0.0262)

EAST <==> TRAIN <==> Epoch: [0][6/227] Loss 0.0300 Avg Loss 0.0267)

EAST <==> TRAIN <==> Epoch: [0][7/227] Loss 0.0239 Avg Loss 0.0264)

EAST <==> TRAIN <==> Epoch: [0][8/227] Loss 0.0271 Avg Loss 0.0265)

EAST <==> TRAIN <==> Epoch: [0][9/227] Loss 0.0284 Avg Loss 0.0266)

EAST <==> TRAIN <==> Epoch: [0][10/227] Loss 0.0197 Avg Loss 0.0260)

EAST <==> TRAIN <==> Epoch: [0][11/227] Loss nan Avg Loss nan)

EAST <==> TRAIN <==> Epoch: [0][12/227] Loss nan Avg Loss nan)
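
For a long log like the one above, it can help to locate the first iteration where the loss stops being finite, since that is the batch worth inspecting. The following is only a sketch: the regular expression assumes the exact line format shown in this log, and `first_non_finite` is a hypothetical helper, not part of the repository.

```python
import math
import re

# Assumed line format, copied from the log in this issue:
# "EAST <==> TRAIN <==> Epoch: [0][11/227] Loss nan Avg Loss nan)"
LINE_RE = re.compile(r"Epoch: \[(\d+)\]\[(\d+)/\d+\] Loss (\S+)")


def first_non_finite(log_text):
    """Return (epoch, iteration) of the first non-finite loss, or None."""
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a training progress line
        loss = float(match.group(3))  # float("nan") parses to NaN
        if not math.isfinite(loss):
            return int(match.group(1)), int(match.group(2))
    return None


sample = """EAST <==> TRAIN <==> Epoch: [0][10/227] Loss 0.0197 Avg Loss 0.0260)
EAST <==> TRAIN <==> Epoch: [0][11/227] Loss nan Avg Loss nan)
EAST <==> TRAIN <==> Epoch: [0][12/227] Loss nan Avg Loss nan)"""

print(first_non_finite(sample))  # (0, 11)
```

Here the loss first becomes NaN at iteration 11 of epoch 0. Every later line also shows NaN, which is consistent with NaN propagating through the weights after one bad update, although the log alone does not prove that.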

@logodeeplearning
Author

@Caius-Lu @songdejia has this happened to you too? I am trying to debug it. Suggestions are welcome.

@viig99

viig99 commented Jan 12, 2019

I'm getting the same issue.

@BYJRK

BYJRK commented Mar 31, 2019

My guess is that data augmentation occasionally produces invalid samples, and those make the loss of that batch NaN. Tracking down which specific training images are responsible can be tedious, so my solution is to check whether the loss is NaN before back propagation and, if it is, skip the batch without any updates.

Specifically, I modified the code in main.py as:

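# Inside the training loop, after loss1 is computed and before back propagation.
# Requires `import numpy as np` at the top of main.py.
loss_check = loss1.cpu().detach().numpy()
if np.any(np.isnan(loss_check)):
    print('loss = nan, skip this batch')
    # Clear any stale gradients, then move on to the next batch.
    optimizer.zero_grad()
    continue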
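
The same guard can be illustrated without the repository or PyTorch. The sketch below is a toy, not the code from main.py: it fits a single weight `w` with plain-Python SGD, and `train_skipping_bad_batches` and the sample batches are hypothetical. It checks `math.isfinite`, which also catches infinite losses, not only NaN.

```python
import math


def train_skipping_bad_batches(batches, lr=0.1):
    """Fit y = w * x by SGD, skipping any batch whose loss is not finite."""
    w = 0.0
    skipped = 0
    for x, y in batches:
        err = w * x - y
        loss = err * err
        if not math.isfinite(loss):
            # Bad batch: leave w untouched, exactly as if the batch never existed.
            skipped += 1
            continue
        grad = 2.0 * err * x  # d(loss)/dw
        w -= lr * grad
    return w, skipped


# Hypothetical data: two of the five batches are corrupted.
batches = [
    (1.0, 2.0),
    (float("nan"), 2.0),   # NaN input -> NaN loss
    (1.0, 2.0),
    (1.0, float("inf")),   # infinite target -> infinite loss
    (1.0, 2.0),
]

w, skipped = train_skipping_bad_batches(batches)
print(skipped, math.isfinite(w))  # 2 True
```

In PyTorch, a check along the lines of `torch.isfinite(loss1).all()` should play the same role as `math.isfinite` here, but that substitution has not been tested against this repository. Note that skipping hides the symptom; the invalid samples are still in the dataset.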

@saharudra

@BYJRK What were your results on the ICDAR dataset?

@BYJRK

BYJRK commented Apr 9, 2019

@saharudra On ICDAR 2015 I can achieve at most 0.7 hmean, after modifying the thresholds in eval.py and training for about 400 epochs. TBH, I don't think this will reproduce the performance reported in the paper. Anyway, I'm still trying to figure out how it differs from the TensorFlow version.
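
For readers unfamiliar with the metric: hmean here is the harmonic mean of precision and recall (the F1 score), so changing detection thresholds moves it by trading one against the other. A minimal sketch follows; the function name and the numbers are illustrative assumptions, not values from eval.py or from this thread.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical numbers, for illustration only (not results from this thread).
score = hmean(0.8, 0.6)
print(round(score, 4))  # 0.6857
```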
