
Why is the recognition accuracy different from the paper? #4

Open
zobeirraisi opened this issue Mar 21, 2020 · 12 comments

zobeirraisi commented Mar 21, 2020

I applied the pre-trained model to the ICDAR15 dataset, but the results are different from the ones reported in the paper.


Jyouhou commented Mar 22, 2020

Hi @zobeirraisi

I am also interested in this work. It'd be greatly appreciated if you could post the results on the datasets that you have tried.

Author

zobeirraisi commented Mar 22, 2020

> Hi @zobeirraisi
> I am also interested in this work. It'd be greatly appreciated if you could post the results on the datasets that you have tried.


Hi @Jyouhou
These are my results for the ICDAR15 dataset:
Link


Jyouhou commented Mar 22, 2020

Thanks @zobeirraisi
So the actual accuracy is ~71%.
We can wait for a response from the authors.

Owner

fengxinjie commented Mar 22, 2020

There is label noise in the IC15 test set, and I have relabeled it.

@fengxinjie
Owner

> Hi @zobeirraisi
> I am also interested in this work. It'd be greatly appreciated if you could post the results on the datasets that you have tried.
>
> Hi @Jyouhou
> These are my results for the ICDAR15 dataset:
> Link

I checked my prediction results, and I don't know why our results differ. For example,
word_26_00.png##Kappa##Kappa##
word_27_00.png##CAUTION##CAUTION##
word_50_00.png##l:HOU##:HOU##
... are all correct in my predictions.
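
(For reference, a minimal sketch of how one could score such a ##-delimited result file. The assumed line format filename##ground_truth##prediction## is inferred from the examples above and is not specified by this repository; the path is a placeholder.)

```python
# Hypothetical scorer for a ##-delimited prediction file.
# Assumed line format (based on the examples above): filename##ground_truth##prediction##
def score(path, case_sensitive=True):
    total, wrong = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("##")
            if len(parts) < 3:
                continue  # skip malformed or empty lines
            gt, pred = parts[1], parts[2]
            if not case_sensitive:
                gt, pred = gt.lower(), pred.lower()
            total += 1
            wrong += int(gt != pred)
    print(f"Summary: # wrong: {wrong} # total: {total} "
          f"wrong {100.0 * wrong / max(total, 1):.2f}%")

score("predictions.txt")  # placeholder path
```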

@fengxinjie
Owner

> Hi @zobeirraisi
> I am also interested in this work. It'd be greatly appreciated if you could post the results on the datasets that you have tried.
>
> Hi @Jyouhou
> These are my results for the ICDAR15 dataset:
> Link

I think you should crop the test images using coords.txt first, and then run prediction.
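
(A minimal sketch of that cropping step. It assumes coords.txt lists one word box per line as image_name,x1,y1,x2,y2; the actual format of coords.txt in this repository may differ, and the paths are placeholders.)

```python
# Hypothetical pre-processing: crop word images from the full test images
# before running predict.py. Assumed coords.txt line format:
#   image_name,x1,y1,x2,y2   (the repository's actual format may differ)
import os
from PIL import Image

def crop_words(image_dir, coords_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(coords_file, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            name, x1, y1, x2, y2 = line.strip().split(",")
            img = Image.open(os.path.join(image_dir, name))
            word = img.crop((int(x1), int(y1), int(x2), int(y2)))
            word.save(os.path.join(out_dir, f"word_{idx}_{name}"))

crop_words("test_images", "coords.txt", "cropped_words")  # placeholder paths
```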

@li10141110

@Jyouhou @zobeirraisi Hi, can you tell us more about your pretrained model?

@delveintodetail

My guess is that the performance of this implementation should be 85% on IIIT-5K.

@delveintodetail

> @delveintodetail have you trained it? The developer did not reply clearly on the matter of training, whether he crops the ICDAR words, or what...

It is not because of the data preprocessing; the evaluation in this code is wrong.

@li10141110

@delveintodetail Is there something wrong in the predict.py file?


gussmith commented Apr 21, 2020

I have been training this model on the ICDAR 2015 Word Recognition dataset (IC15), using the code provided, with no relabeling of the mislabeled data.

In order to recognize all the characters in the datasets, the vocab used was:
vocab = "<=,.+:;-!?$%#&*' ()@éÉ/\[]0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ>"+'"'+"´"+"΅"
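
(For context, a minimal sketch of turning such a vocab string into character/index lookup tables; the helper names are mine, and how this repository actually handles padding and the start/end symbols is an assumption.)

```python
# Hypothetical helpers around a vocab string like the one above.
# Note: the backslash is escaped here; whether '<' and '>' serve as
# start/end tokens in this repository is an assumption.
vocab = "<=,.+:;-!?$%#&*' ()@éÉ/\\[]0123456789abcdefghijklmnopqrstuvwxyz" \
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ>" + '"' + "´" + "΅"

char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = {i: ch for i, ch in enumerate(vocab)}

def encode(text):
    """Map a label string to integer indices for the model."""
    return [char2idx[ch] for ch in text]

def decode(indices):
    """Map indices back to a string, e.g. for inspecting predictions."""
    return "".join(idx2char[i] for i in indices)

assert decode(encode("CAUTION")) == "CAUTION"
```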

If one keeps training and relies only on the loss on the test dataset, the model will overfit, and I have obtained several different models with 100% accuracy on the test dataset.
This means that even the mislabeled data is reproduced exactly as the human annotator labeled it, errors included.
(Note: the model is only trained on the training dataset, never on the test dataset! Yet the models that performed best at inference on the test dataset were saved as training progressed.)

Typically, such models may have relatively poor performance on the training data itself:
On testing data:
Summary: # wrong: 0 # total: 2077 wrong 0.0%
On training data:
Summary: # wrong: 1959 # total: 4468 wrong 43.85%

Starting from scratch, and saving only the models that improve the inference performance on both the test data and the training data, one can get results like this after 1533 epochs with batch_size = 64 (a sketch of this selection rule follows the numbers below):
on test data:
Summary: #wrong: 11 #total: 2077 wrong 0.5%
on training data:
Summary: #wrong: 620 #total: 4468 wrong 13.9%
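
(A minimal sketch of the "improve on both splits" selection rule described above; train_one_epoch and evaluate are placeholders for the repository's training and inference routines, not its actual API.)

```python
# Hypothetical checkpoint selection: save only when the wrong-word rate
# improves on BOTH the training and the test split. `train_one_epoch` and
# `evaluate` are placeholders, not functions from this repository.
import torch

def train_with_dual_criterion(model, optimizer, train_loader, test_loader,
                              train_one_epoch, evaluate, epochs, path="best.pt"):
    best_train_err = best_test_err = float("inf")
    for epoch in range(epochs):
        train_one_epoch(model, optimizer, train_loader)
        train_err = evaluate(model, train_loader)  # wrong-word rate on training split
        test_err = evaluate(model, test_loader)    # wrong-word rate on test split
        if train_err < best_train_err and test_err < best_test_err:
            best_train_err, best_test_err = train_err, test_err
            torch.save(model.state_dict(), path)
            print(f"epoch {epoch}: saved (train wrong {train_err:.1%}, "
                  f"test wrong {test_err:.1%})")
```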

Inspection shows that some of these models give the same answer as the human annotator on some of the mislabeled data, at least on the test dataset.

As training progresses and new models are saved, the inference performance improves mainly on the training dataset, while improving more slowly on the test dataset.

Thus this model seems like overkill for the ICDAR 2015 dataset, and the mislabeling makes comparisons difficult.


Update: The model continued training and these are the results:
loss for test during training: 0.006546
loss for training data during training: 0.027809

inference on test data:
Summary: #wrong: 0 #total: 2077 wrong 0.0%
inference on training data:
Summary: #wrong: 129 #total: 4468 wrong 2.887%

Other training and tests with synthetic images suggest that it does not generalize so well.

@gussmith

The results above were obtained with the code provided as-is.
Since then, I have realized, from my own results and from reading others', that there is apparently an error in the code, which essentially trains the network when the validation is run. It is part of the initial code provided in the Annotated Transformer that the authors refer to.
See issue #7: "testloss would lead to model update on eval mode".
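
(For readers hitting the same problem: in Annotated Transformer-style code, the backward/step happens inside the loss-compute object, so reusing a loss compute that was built with the optimizer during validation can update the weights. The sketch below is simplified and only guards that pattern; the exact code in this repository and the fix discussed in issue #7 may differ.)

```python
# Simplified sketch of an Annotated-Transformer-style loss compute with the
# backward/step guarded, so that an evaluation pass cannot update the model.
# `opt` is assumed to be a NoamOpt-like wrapper and `run_epoch` a function
# like the one in the Annotated Transformer; both are assumptions here.
import torch

class SimpleLossCompute:
    def __init__(self, generator, criterion, opt=None):
        self.generator, self.criterion, self.opt = generator, criterion, opt

    def __call__(self, x, y, norm):
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        # In the problematic pattern, loss.backward() runs unconditionally, so a
        # loss compute constructed with an optimizer also trains during "evaluation".
        if self.opt is not None:
            loss.backward()
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.item() * norm

def run_validation(model, data_iter, generator, criterion, run_epoch):
    """Evaluate with no optimizer and no gradients, so weights cannot change."""
    model.eval()
    with torch.no_grad():
        return run_epoch(data_iter, model,
                         SimpleLossCompute(generator, criterion, opt=None))
```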
