Entire lines of text missing. Different missing when psm = 3, 6, 11 #1339
Comments
Supplement: On the first image, the 2nd line is still ignored even if I mask the 1st or the 3rd line (image neither cropped nor resized). |
གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་ |
I think the page segmentation is not working because the text lines are too close and the diacritics are merging with previous/next line. There is a config variable which can be tried for this -
It works well with your cropped first image - I think the slight white border around the image helps too. I used the following command:
Here is the output:
If most of your text is like this, the variable should be added to the bod.config file; otherwise just pass the config variable as part of the command. |
Hi, @Shreeshrii Thanks for your kind reply! I tried your solution. Actually, the cropped picture worked even without this configuration. |
For this picture, the Again, thanks a lot! |
Sorry. Don't know how the page layout analysis works. |
@zdenop Label with 4.0x |
I don't know if this is the right place, but I get missing words, and even several words at once, in the German text, processing with tessdata_best. |
@yurytch Yes, please provide the image so that we can test with the latest version. |
Fine, only I can't find where to attach the files here. |
Please try with the latest commit. |
@yurytch The image is a 6MB+ jp2 file, yet the image is not very clear. I converted it to png for testing, since I haven't built leptonica with jp2 support. @amitdo Tried with the latest commit from yesterday. OCRed files attached. ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_best-deu-1.txt ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_fast-deu-1.txt |
@Shreeshrii Yes, thank you very much. I have just completed the test run with today's git checkout (with the JP2-enabled leptonica, as before). Those drop-outs are gone now. |
Possible, because I used a different version of image. Though there are too many differences. I will install jp2 library and try again. |
@yurytch I am attaching the png version that I used. |
Not successful in building leptonica with jp2. So trying to install from ppa ... @AlexanderP Your ppa has both openjpeg2 and leptonlib. I installed liblept and libleptonica-dev from there as well as libopenjp2-7 libopenjp2-7-dev. But tesseract is not showing jp2 support. What else do I need to do for it?
|
@stweil Please see Is it possible to get different results from same traineddata and image? |
I would not say no as I can imagine reasons why the same Tesseract version with same traineddata could give different results for the same image. If we can confirm such differences, that is clearly something which needs to get fixed. Results must be reproducible. |
If you have leptonica with jp2 support, please try with the image linked in #1339 (comment) and compare your result to https://yadi.sk/i/bAykSKIW3TsjBS. I had converted the image to png, so it is not the exact same image; those results with deu from best and fast, as well as the image, are also in this thread. |
png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later. |
@Shreeshrii |
Hey guys, the tesseract versions on MY side WERE different. I was following the initial advice by @Shreeshrii. |
Oh, I see. I've posted only the results with today's git and JP2. |
@yurytch Please confirm which tessdata you used. Tessdata_fast? Also, was it with the default psm? |
Alex,
Please see
https://github.com/DanBloomberg/leptonica/blob/master/src/environ.h
/*-------------------------------------------------------------------------*
* Leptonica supports OpenJPEG 2.0+. If you have a version of openjpeg *
* (HAVE_LIBJP2K == 1) that is >= 2.0, set the path to the openjpeg.h *
* header in angle brackets here. *
*-------------------------------------------------------------------------*/
#define LIBJP2K_HEADER <openjpeg-2.3/openjpeg.h>
Though, setting HAVE_LIBJP2K to 1 in environ.h did not work for me when
I tried to build with it.
On Thu 29 Mar, 2018, 8:40 PM Alexander Pozdnyakov wrote:
> @Shreeshrii
> As I understand, openjpeg version 2.3 or higher is needed.
|
Sorry, I see in the change log now
Modified jpeg2000 header to use openjpeg 2.3.
|
@Shreeshrii yes, right, the default PSM and tessdata_best. I get poor-ish results with tessdata_fast, so don't even keep it on disk. Linux 64 bit, FWIW. |
@Shreeshrii I compiled the leptonica by means of cmake. |
@AlexanderP Thank you for following up. I went back to autotools because the cmake version was too slow on my pc (I run WSL on windows 10). I built openjpeg from source and leptonica build was able to find it.
I am finding quite a bit of difference between the text recognized on my PC and the text recognized by @yurytch, using the same traineddata and the same images with the same tesseract code. However, the hardware, OS and leptonica version may be different. The locale may also be different. I am hoping that @stweil will be able to investigate and figure it out. |
Would it make sense to add openjpeg-2.3 to the PPA? |
Yes, I think it will be helpful to add openjpeg-2.3 to PPA. Thanks! |
Opened a new issue for the recent discussion on this thread. Original issue of complete lines being dropped during recognition still exists. |
Sorry that it took some time. I have now done some tests with those images. Neither the jp2 nor the png image includes resolution information. That explains why earlier versions I get the same results from the original jp2 image and from a png image made from the jp2 by using convert. That confirms my earlier statement that the image format should not make a difference when both formats are lossless. @Shreeshrii, your png image differs from mine:
I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting finding is that Tesseract evidently gets different results from grayscale and color images, even when both look exactly the same. This needs more investigation. @yurytch, I cannot reproduce your results. Could you try the latest Debian packages for tesseract-ocr? In my tests, those Debian packages and the latest Git master show the same results. |
@stweil Could different results be because of different versions of leptonica library? #1339 (comment) |
I used 1.75.3-4 from Debian testing, but can repeat the test with an older version. |
Answering several days' worth of msgs at once, the results of OCR'ing the initial JP2: https://yadi.sk/d/G2scDhj53TsU52 with fresh versions: Leptonica 1.75.3, Tesseract Open Source OCR Engine v4.0.0-beta.1-203-g45bb are here: There are no obvious dropouts here, however dropouts (1-3 letters at a time) still happen, with that version, too. Is it possible to make tesseract output some kind of placeholder or tag into OCR'ed text? Same for the bogus letters introduced into the text, like in the example, 'güuülden' or 'fruünflich'. Could tesseract be made to output anything not recognised reliably enough, as some kind of 'empty glyph' or tag? Format is irrelevant for the result, and DPI setting seems to be ignored by tesseract. I've put the density field into TIFF file by ImageMagick's 'convert', and it is there, verified by 'identify' tool, but tesseract still goes 'estimating the resolution'. |
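One way to approximate the "placeholder for unreliable text" idea asked about above is to post-process Tesseract's TSV output, which carries a per-word confidence in its `conf` column (produced with something like `tesseract image.tif out -l deu tsv`). The following is only a sketch under assumptions: the function name, the `[?]` placeholder and the threshold of 60 are made up for illustration, and it cannot mark text that Tesseract dropped entirely, only words it reported with low confidence.

```python
import csv
import io


def mark_low_confidence(tsv_text, min_conf=60.0, placeholder="[?]"):
    """Rebuild text lines from Tesseract TSV output, replacing every word
    whose confidence is below min_conf with a placeholder.

    min_conf and placeholder are illustrative choices, not Tesseract defaults.
    """
    lines = {}  # (page, block, paragraph, line) -> words in reading order
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if row["level"] != "5":  # only level-5 rows describe words
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        text = row["text"] or ""
        word = text if float(row["conf"]) >= min_conf else placeholder
        lines.setdefault(key, []).append(word)
    return "\n".join(" ".join(words) for _, words in sorted(lines.items()))
```

Reading the `.tsv` file and passing its contents to `mark_low_confidence` yields plain text in which doubtful words are visibly flagged for manual review.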
See report regarding missing line in Persian |
This is very annoying... Irregularly for same image document format, few lines are missing (sometimes). |
@bhaveshvyas007: No, we cannot. We do not have crystal balls to determine your image, the tesseract version, the command you run, the OS you use and other essential information. |
@zdenop @amitdo Ubuntu version : Ubuntu 18.04.3 LTS tesseract 4.1.0 Command I run : Btw, I even tried -l eng, and psm 3 or 4 but same line is missing always. Btw that missing line is just a address line with city, state and zip : |
Samples : Command : Tesseract Version Details : OS Version : Ubuntu 18.04.3 LTS Issue: sometimes the line which says Account #: is also missing. Btw these issues are happening only with few samples, it works fine for most of them. What I tried Question : |
Hey guys, you can see the image attached below for reference. For example, with psm 6 the first line ("date prepared") is missing, and the policy numbers are not correct. Can you tell me how I can solve this issue? |
Most likely this is caused by Tesseract's otsu thresholding. |
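To illustrate why a global Otsu threshold can make faint text disappear, here is a minimal textbook version of the method in pure Python. It is not Tesseract's own implementation, and the pixel values in the example are invented: a page with dark text, a light background, and a smaller amount of faint gray text.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes the between-class
    variance; pixels <= t count as ink, pixels > t as background."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


# Invented example: dark text (20), faint text (200), light background (240).
page = [20] * 100 + [200] * 50 + [240] * 850
t = otsu_threshold(page)
faint_text_kept = 200 <= t  # False here: the faint text is treated as background
```

In this toy case the single global threshold separates the dark text from everything else, so the faint text ends up on the background side. Binarizing the image yourself before passing it to Tesseract is one way to work around this kind of loss.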
From the command line help:
It does not make sense to use psm 6 on your image, which has multiple blocks. You are misleading Tesseract. Did you try psm 3? |
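For reference, `tesseract --help-psm` describes mode 3 as fully automatic page segmentation without OSD (the default), mode 6 as a single uniform block of text, and mode 11 as sparse text. When it is unclear which mode suits an image, a quick side-by-side run can help. The sketch below assumes the `tesseract` binary is on the PATH; the helper names are made up for illustration.

```python
import subprocess


def tesseract_command(image, psm, lang="eng"):
    """Build a Tesseract command line that prints the recognized text to stdout."""
    return ["tesseract", image, "stdout", "--psm", str(psm), "-l", lang]


def ocr_lines(image, psm, lang="eng"):
    """Run Tesseract once and return the non-empty output lines."""
    result = subprocess.run(
        tesseract_command(image, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def compare_psm(image, modes=(3, 6, 11), lang="eng"):
    """Return {psm: lines} so dropped lines can be spotted per mode."""
    return {psm: ocr_lines(image, psm, lang) for psm in modes}
```

Comparing the number of lines returned per mode, for example `{psm: len(lines) for psm, lines in compare_psm("page.png").items()}`, gives a first hint of which mode drops text on a given page.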
Hey @amitdo, psm 3 works great. I don't remember why I changed the default psm. |
https://user-images.githubusercontent.com/29722986/116717573-a6c66180-a9f6-11eb-85af-1d364de7e3ee.png |
You are right in your assumption. Tesseract's layout analysis can't cope with connected lines. |
Thanks for confirming. Can you tell me whether there is any way we can handle this? |
If you can find another tool that will correctly segment the lines, you can then run Tesseract on each line. |
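As a rough idea of what such an external line segmenter does, here is a very simple horizontal projection profile in pure Python. It is only a sketch under assumptions: the input is an already binarized image given as rows of 0/1 values, and `min_ink_fraction` is an invented tuning knob. Rows whose ink share stays at or below it are treated as gaps, which lets a few touching strokes between two lines be ignored. Real documents with heavily connected lines will likely need a more capable tool.

```python
def find_line_bands(binary, min_ink_fraction=0.1):
    """Split a binarized page into horizontal text bands.

    binary: list of equally long, non-empty rows of 0/1 values (1 = ink).
    Returns (top, bottom) row index pairs, with bottom exclusive.
    """
    bands = []
    start = None
    for y, row in enumerate(binary):
        is_text = sum(row) / len(row) > min_ink_fraction
        if is_text and start is None:
            start = y                  # a new band begins
        elif not is_text and start is not None:
            bands.append((start, y))   # the current band ends
            start = None
    if start is not None:
        bands.append((start, len(binary)))
    return bands
```

Each band can then be cropped from the original image and passed to Tesseract with `--psm 7`, which treats the image as a single text line. Note that cutting at the band edges may clip strokes that reach into the neighbouring line.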
@amitdo Can't we train Tesseract to identify the line bounding boxes? After all, we provide line-level bounding box information when training Tesseract on our own images. |
No, the layout analysis part is not trainable. |
Environment
Current Behavior:
Brief description:
6.1. psm 3 and psm 6 skip different parts of text based on font size #538
6.2. LSTM: Words dropped during recognition #681 (tried the solution, does not fix this problem)
6.3. Page Layout Issues #1319
Test image:
https://user-images.githubusercontent.com/15245190/36480676-2820ca12-1748-11e8-9964-7c45a86426a5.png
Recognized with tessdata_best/bod.traineddata.
First 3 lines:
PSM==6
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ། ༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ༈
02 (2nd line missing)
03 (3rd line missing)
PSM==11
All lines are complete but some are shattered and more inaccurate.
PSM==3
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ལམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 (2nd line missing)
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུངྱེ་ཤྲཱིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་པོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུའངྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་
PSM==3, same image but slightly rotated and cropped
https://user-images.githubusercontent.com/15245190/36482692-13cdb550-174f-11e8-9378-b8617342594c.png
01 ༄༅།། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡
Another test image with its fourth line missing:
https://user-images.githubusercontent.com/15245190/36481051-87d49898-1749-11e8-9fb0-cfa4334d2445.png
Do you have any idea or suggestion about what I should do? Thanks a lot! @Shreeshrii @amitdo