Entire lines of text missing. Different missing when psm = 3, 6, 11 #1339

zc813 · 2018-02-21T13:39:45Z

Environment

Tesseract Version: 4.0.0 (master, not the latest)
Platform: Win 7 (x64)

Current Behavior:

Brief description:

One or more entire lines are missing when recognizing Tibetan.
Different lines are missing when psm = 3, 6, or 11.
If the image is slightly rotated or cropped, the missing line might come back.
After building from source at yesterday's latest commit ("Don't drop words with low certainty", #1264), the same lines are still missing, although the recognized lines are more complete.
When using a specially trained model, the lines that are missing might differ.
Similar issues:
6.1. #538 psm 3 and psm 6 skip different parts of text based on font size
6.2. #681 LSTM: Words dropped during recognition (tried the solution, does not fix this problem)
6.3. #1319 Page Layout Issues

Test image:

https://user-images.githubusercontent.com/15245190/36480676-2820ca12-1748-11e8-9964-7c45a86426a5.png

Recognized with tessdata_best/bod.traineddata.
First 3 lines:

PSM==6
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ། ༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ༈
02 (2nd line missing)
03 (3rd line missing)

PSM==11
All lines are complete but some are shattered and more inaccurate.

PSM==3
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ལམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 (2nd line missing)
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུངྱེ་ཤྲཱིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་པོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུའངྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་

PSM==3, same image but slightly rotated and cropped
https://user-images.githubusercontent.com/15245190/36482692-13cdb550-174f-11e8-9378-b8617342594c.png

01 ༄༅།། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡

Another test image with its fourth line missing:
https://user-images.githubusercontent.com/15245190/36481051-87d49898-1749-11e8-9fb0-cfa4334d2445.png

Do you have any idea what causes this, or any suggestion for what I should do? Thanks a lot! @Shreeshrii @amitdo

zc813 · 2018-02-21T14:53:12Z

Supplement: On the first image, the 2nd line is still ignored even when I mask the 1st or the 3rd line (without cropping or resizing).

Shreeshrii · 2018-02-21T16:21:00Z

Tibetan1-Line2.txt

གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་

Shreeshrii · 2018-02-21T16:27:28Z

I think the page segmentation is not working because the text lines are too close and the diacritics are merging with previous/next line.

There is a config variable which can be tried for this -

# extra space to allow for diacritics above and below the characters
textord_min_linesize 2.5

It works well with your cropped first image - I think the slight white border around the image helps too.

I used the following command:

tesseract Tibetan1.png Tibetan -l bod --psm 6 -c textord_min_linesize=2.5

Here is the output:

༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་ཕོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་
བཅད་པ་བརྒྱད་པ་འཆད་པ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཁྱིམ་པ་རྗེས་སུ་འཛིན་པ་སོ་སོར་ཐར་པའི་སྡོམ་པའི་ཆོ་ག་ཐར་ལམ་རབ་གསལ་འགྱུར་མེད་རྡོ་རྗེའི་གསུང་ལྡེབ།༡ ༢ འཕགས་པ་
གཞི་ཐམས་ཅད་ཡོད་པར་སྨྲ་བའི་དགེ་ཚུལ་གྱི་ཚིག་ལེའུར་བྱས་པ་ལྡེབ།° དགེ་སློང་ཕའི་སོ་སོར་ཐར་པའི་མདོ་རྩ་བ་ལྡེབ།༣༦ རབ་བྱུང་གི་གཞིའི་ཆོ་ག་རིན་ཆེན་ཐེམ་སྐས་དྷརྨ་ཤྲིའི་
གསུང་ལྡེབ།༤༣ ལོ་མིང་རེའུ་མིག་ལྡེབ།༡ སྡོམ་པ་འབུལ་ཆོག་ལྡེབ།༢ བསླབ་པ་ཡོངས་སུ་སྦྱོང་བ་གཞི་གསུམ་གྱི་ཆོ་ག་ཐར་གླིང་དུ་བགྲོད་པའི་གྲུ་ཆེན་དུས་བརྗོད་རེའུ་མིག་བཅས་

If most of your text is like this, the variable should be added to the bod.config file; otherwise just pass it as a config variable on the command line.

zc813 · 2018-02-22T01:38:03Z

Hi, @Shreeshrii Thanks for your kind reply! I tried your solution. Actually, the cropped picture worked even without this configuration.
With the uncropped picture, setting this config variable worked only when textord_min_linesize was exactly 0.82. Neither 0.81 nor 0.83 works. And this value depends on the picture.
Do you have any idea? Thanks very much!

zc813 · 2018-02-22T01:57:24Z

For this picture, textord_min_linesize has to be set to a number between 0.96 and 0.99. Neither smaller nor greater values work:
[image: 1_page_075]
Greater values cause incomplete results, while smaller values lead to wrong recognition.

Again, thanks a lot!
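
The narrow working range reported above can be explored with a small sweep. The following is a minimal sketch, assuming the same command line Shreeshrii used earlier in the thread; the helper names (build_sweep_commands, count_nonempty_lines) and the output naming scheme are hypothetical. It only builds the commands, and counting non-empty output lines is just one possible way to spot a dropped line.

```python
import shlex


def build_sweep_commands(image, out_prefix, lang="bod", psm=6,
                         start=0.80, stop=1.00, step=0.01):
    """Return one tesseract argv list per textord_min_linesize value.

    The flags mirror the command quoted earlier in the thread:
    tesseract IMAGE OUT -l bod --psm 6 -c textord_min_linesize=VALUE
    """
    commands = []
    # Index the steps with integers to avoid accumulating float error.
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        value = round(start + i * step, 2)
        commands.append([
            "tesseract", image, f"{out_prefix}_{value:.2f}",
            "-l", lang, "--psm", str(psm),
            "-c", f"textord_min_linesize={value:.2f}",
        ])
    return commands


def count_nonempty_lines(text):
    """Count recognized lines in an output text, ignoring blank ones."""
    return sum(1 for line in text.splitlines() if line.strip())


# Print the commands; each one could be run with subprocess.run(argv, check=True).
for argv in build_sweep_commands("Tibetan1.png", "Tibetan"):
    print(shlex.join(argv))
```

Comparing count_nonempty_lines over the resulting .txt files against the number of lines visible in the image shows which values, if any, recover the missing line for a given picture.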

Shreeshrii · 2018-02-22T04:00:21Z

> And this value depends on the picture.
> Do you have any idea?

Sorry. Don't know how the page layout analysis works.

Shreeshrii · 2018-03-27T05:31:03Z

@zdenop Please label with: 4.0x, Accuracy

yurytch · 2018-03-29T07:52:27Z

I don't know if this is the right place, but I get missing words, sometimes several words at once, in German text processed with tessdata_best.
I can provide the scan in question if necessary; it is in the public domain.

Shreeshrii · 2018-03-29T09:00:52Z

@yurytch Yes, please provide the image so that we can test with the latest version.

yurytch · 2018-03-29T09:42:38Z

Fine, but I can't find where to attach files here.
So I've put the image and text OCR'ed from it on the cloud.
The tesseract was built from source from git checkout 2018-01-06, used with tessdata_best.
The '***' in the .TXT were added by hand, to mark where the letters or complete words were dropped out without any indication from tesseract.
https://yadi.sk/d/G2scDhj53TsU52
https://yadi.sk/i/JTh7Ixnv3TsTxY

amitdo · 2018-03-29T10:03:45Z

Please try with the latest commit.

Shreeshrii · 2018-03-29T10:15:48Z

@yurytch The image is a 6MB+ jp2 file, yet the image is not very clear. I converted it to png for testing, since I haven't built leptonica with jp2 support.

@amitdo Tried with latest commit from yesterday. OCRed files attached.

ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_best-deu-1.txt

ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_fast-deu-1.txt

yurytch · 2018-03-29T11:30:54Z

@Shreeshrii Yes, thank you very much. Have completed the test run with the today's git C/O right now (with the JP2-enabled leptonica, as before). Those drop-outs are gone now.
'My' results are different from 'yours' (was that to be expected?), not always to the good.
For the reference:
https://yadi.sk/i/bAykSKIW3TsjBS

Shreeshrii · 2018-03-29T12:21:55Z

> 'My' results are different from 'yours' (was that to be expected?), not always to the good.

Possible, because I used a different version of image. Though there are too many differences.

I will install jp2 library and try again.

Shreeshrii · 2018-03-29T12:23:24Z

@yurytch I am attaching the png version that I used.

Shreeshrii · 2018-03-29T14:03:47Z

> I will install jp2 library and try again.

Not successful in building leptonica with jp2. So trying to install from ppa ...

@AlexanderP Your ppa has both openjpeg2 and leptonlib. I installed liblept and libleptonica-dev from there as well as libopenjp2-7 libopenjp2-7-dev.

But tesseract is not showing jp2 support. What else do I need to do for it?


tesseract 4.0.0-beta.1-59-g2cc4
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0

 Found AVX
 Found SSE

Shreeshrii · 2018-03-29T14:23:22Z

@stweil Please see
#1339 (comment)

Is it possible to get different results from same traineddata and image?

stweil · 2018-03-29T14:32:41Z

I would not say no as I can imagine reasons why the same Tesseract version with same traineddata could give different results for the same image.

If we can confirm such differences, that is clearly something which needs to get fixed. Results must be reproducible.

Shreeshrii · 2018-03-29T14:48:58Z

If you have leptonica with jp2 support please try with the image linked in #1339 (comment)

And compare your result to

https://yadi.sk/i/bAykSKIW3TsjBS

I had converted the image to png, so it is not the exact same image. My results with deu from tessdata_best and tessdata_fast, as well as the png image, are also in this thread.

stweil · 2018-03-29T15:09:52Z

png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later.

AlexanderP · 2018-03-29T15:10:03Z

@Shreeshrii
As I understand it, openjpeg version 2.3 or higher is needed.

yurytch · 2018-03-29T15:12:12Z

Hey guys, the tesseract versions on MY side WERE different. I was following the initial advice by @Shreeshrii.
The 1st text I posted here was generated in 2018-01-06_git tesseract, the 2nd one - in today's_git tesseract. Leptonica 1.74.1, same version in both cases.

yurytch · 2018-03-29T15:19:13Z

Oh, I see. I've posted only the results with today's git and JP2.
I'm now posting the results for today's git and @Shreeshrii's PNG.
https://yadi.sk/i/Xf0LOw2g3TtJwP

Shreeshrii · 2018-03-29T15:30:10Z

@yurytch Please confirm which tessdata you used. Tessdata_fast?

Also was it with default psm?

Shreeshrii · 2018-03-29T15:44:08Z

Alex, please see https://github.com/DanBloomberg/leptonica/blob/master/src/environ.h

/*-------------------------------------------------------------------------*
 * Leptonica supports OpenJPEG 2.0+.  If you have a version of openjpeg    *
 * (HAVE_LIBJP2K == 1) that is >= 2.0, set the path to the openjpeg.h      *
 * header in angle brackets here.                                          *
 *-------------------------------------------------------------------------*/
#define LIBJP2K_HEADER <openjpeg-2.3/openjpeg.h>

Though, setting HAVE_LIBJP2K = 1 in environ.h did not work for me when I tried to build with it.


Shreeshrii · 2018-03-29T15:47:02Z

Sorry, I now see this entry in the change log: "Modified jpeg2000 header to use openjpeg 2.3".


yurytch · 2018-03-29T16:15:46Z

@Shreeshrii yes, right, the default PSM and tessdata_best. I get poor-ish results with tessdata_fast, so don't even keep it on disk. Linux 64 bit, FWIW.

AlexanderP · 2018-03-30T17:48:53Z

@Shreeshrii I compiled leptonica with cmake.
jpeg2000 is not supported, although according to the build log it is picked up.

Shreeshrii · 2018-03-30T18:01:04Z

@AlexanderP Thank you for following up.

I went back to autotools because the cmake version was too slow on my pc (I run WSL on windows 10).

I built openjpeg from source and leptonica build was able to find it.

tesseract -v
tesseract 4.0.0-beta.1-64-gd284
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.3.0
 Found AVX
 Found SSE

I am finding quite a bit of difference between the text recognized on my PC and the results from @yurytch, using the same traineddata and the same images with the same tesseract code. However, the hardware, OS, and leptonica version may be different. The locale may also be different.

I am hoping that @stweil will be able to investigate and figure it out.

AlexanderP · 2018-03-30T18:25:49Z

Would it make sense to add openjpeg-2.3 to the ppa?

Shreeshrii · 2018-03-30T18:32:04Z

Yes, I think it will be helpful to add openjpeg-2.3 to PPA. Thanks!

Shreeshrii · 2018-04-26T08:25:55Z

Opened a new issue for the recent discussion on this thread.

Original issue of complete lines being dropped during recognition still exists.

stweil · 2018-04-28T09:12:13Z

> png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later.

Sorry that it took some time. I have now done some tests with those images.

Neither the jp2 nor the png image includes resolution information. That explains why earlier versions of Tesseract (which assumed 70 DPI before 2017-09-08) get different results than newer versions (which estimate a resolution of 179 DPI). Neither 70 DPI nor 179 DPI is correct for the test image, so I expect that the result could be better with the right resolution.

I get the same results from the original jp2 image and from a png image made from the jp2 by using convert. That confirms my earlier statement that the image format should not make a difference when both formats are lossless.

@Shreeshrii, your png image differs from mine:

$ ls -l *png
-rw-r--r-- 1 stweil stweil  812518 Apr 28 10:29 ZeitschriftFuerHistorischeWaffenkunde5_0122.jp2.png
-rw-r--r-- 1 stweil stweil 1576022 Apr 28 08:55 ZeitschriftFuerHistorischeWaffenkunde5_0122.png
$ file *png
ZeitschriftFuerHistorischeWaffenkunde5_0122.jp2.png: PNG image data, 1335 x 1602, 8-bit grayscale, non-interlaced
ZeitschriftFuerHistorischeWaffenkunde5_0122.png:     PNG image data, 1335 x 1602, 8-bit/color RGB, non-interlaced

I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting result is that Tesseract evidently gets different results from grayscale and color images, even when both look exactly the same. This needs more investigation.

@yurytch, I cannot reproduce your results. Could you try the latest Debian packages for tesseract-ocr? In my tests, those Debian packages and latest Git master show the same results.

Shreeshrii · 2018-04-30T05:40:03Z

@stweil Could different results be because of different versions of leptonica library?

#1339 (comment)
@yurytch is using 1.74.1 , I am using more recent versions.

stweil · 2018-04-30T06:10:59Z

I used 1.75.3-4 from Debian testing, but can repeat the test with an older version.

yurytch · 2018-04-30T09:26:35Z

Answering several days' worth of msgs at once, the results of OCR'ing the initial JP2: https://yadi.sk/d/G2scDhj53TsU52

with fresh versions: Leptonica 1.75.3, Tesseract Open Source OCR Engine v4.0.0-beta.1-203-g45bb

are here:
https://yadi.sk/i/7JBcJVzY3Uxw7L

There are no obvious dropouts here, however dropouts (1-3 letters at a time) still happen, with that version, too. Is it possible to make tesseract output some kind of placeholder or tag into OCR'ed text?

Same for the bogus letters introduced into the text, like in the example, 'güuülden' or 'fruünflich'. Could tesseract be made to output anything not recognised reliably enough, as some kind of 'empty glyph' or tag?

Format is irrelevant for the result, and DPI setting seems to be ignored by tesseract. I've put the density field into TIFF file by ImageMagick's 'convert', and it is there, verified by 'identify' tool, but tesseract still goes 'estimating the resolution'.
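
On the resolution point, Tesseract 4 builds accept an explicit resolution on the command line, either as the --dpi option or as the user_defined_dpi config variable. The build used in this comment may predate the --dpi option, so treat the following as a minimal sketch under that assumption. The helper name build_command_with_dpi is hypothetical, and 300 is only a placeholder value, since the thread never states the true resolution of the scan.

```python
def build_command_with_dpi(image, out_base, dpi, lang="deu", use_variable=False):
    """Build a tesseract argv list that states the image resolution explicitly.

    use_variable=False -> pass the resolution as "--dpi N"
    use_variable=True  -> pass it as "-c user_defined_dpi=N"
    """
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    argv = ["tesseract", image, out_base, "-l", lang]
    if use_variable:
        argv += ["-c", f"user_defined_dpi={dpi}"]
    else:
        argv += ["--dpi", str(dpi)]
    return argv


# Placeholder resolution; replace 300 with the real scan resolution if known.
print(" ".join(build_command_with_dpi("scan.tif", "scan", 300)))
```

If the value is accepted, the "estimating the resolution" step should no longer be needed for that run; whether that changes the dropouts discussed here is untested.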

Shreeshrii · 2018-05-22T06:02:04Z

See report regarding missing line in Persian

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/dG3hcHQrlr8/cTksOnoxCgAJ

bhaveshvyas007 · 2020-01-21T16:13:23Z

This is very annoying... For documents in the same image format, a few lines are sometimes missing, with no obvious pattern.
Mine is simple English.
Can anyone tell what causes the missing lines?

zdenop · 2020-01-21T17:41:50Z

@bhaveshvyas007: No, we cannot. We do not have crystal balls to determine your image, tesseract version, the command you run, the OS you use, and other required information.

bhaveshvyas007 · 2020-01-22T14:03:58Z

@zdenop @amitdo
I can't share the image, but the details are below. (Since this is an open issue, I didn't think of sharing the version info, sorry.)

Ubuntu version : Ubuntu 18.04.3 LTS
Tesseract Version

tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

Command I run :
tesseract public/uploads/1579701431490/sample.pdf.jpg public/uploads/1579701431490/result -l engfast --psm 6 hocr

Btw, I even tried -l eng, and psm 3 or 4, but the same line is always missing.

Btw, that missing line is just an address line with city, state and zip :

bhaveshvyas007 · 2020-02-28T13:17:26Z

@Shreeshrii @zdenop

Samples :
I have added a few dummy samples containing the jpeg files and the hocr result here

Command :
tesseract ./1582890068747/Barack,_Obama.pdf.jpg ./1582890068747/result -l engfast --psm 6 hocr

Tesseract Version Details :
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE

OS Version : Ubuntu 18.04.3 LTS

Issue:
If you look at the Patient Address line, it is missing from the hocr output, i.e.

Sometimes the line that says Account #: is also missing.
i.e.:

Btw, these issues happen only with a few samples; it works fine for most of them.

What I tried
I tried the languages eng and eng-best, but without success.
I can't change the psm mode because --psm 6 works fine for most of the samples, and I don't want to change the parser code.

Question :
Can anyone figure out why such long lines are completely missing from the output?
Is there any solution?

kbrajwani · 2021-04-27T09:23:48Z

Hey guys,
Have you found any solution? I am facing the same issue.
System configuration:
tesseract 5.0.0-alpha-20210401-71-g2be89
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

You can see the image attached below for reference.
https://user-images.githubusercontent.com/29722986/116217982-b67c4680-a767-11eb-8005-f7bb0c8f55c3.png

Let's say I am using psm 6: the first line, "date prepared", is missing, and the policy numbers are not correct.
If I use psm 11, the "po box" line is missing, and there are also spaces inside words,
like in AMERICAN AGENCY === ER ICAN AGENCY
94-23 JAMAICA === 94-23 JAMA ICA

Can you tell me how I can solve this issue?

amitdo · 2021-04-27T11:20:57Z

@stweil,

> I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting result is that Tesseract evidently gets different results from grayscale and color images, even when both look exactly the same. This needs more investigation.

Most likely this is caused by Tesseract's otsu thresholding.
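
To illustrate why a global threshold could make two visually identical images diverge, here is a minimal plain-Python sketch of Otsu's method. It is not Tesseract's actual implementation, and how Tesseract derives its histogram from a color image is not covered here. The point is only that the threshold is computed from the pixel histogram, so small differences in pixel values can move it and flip faint pixels between foreground and background.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t maximizing between-class variance.

    Class 0 holds values <= t, class 1 holds values > t.
    On ties, the lowest such t is returned.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0 so far
    sum0 = 0    # sum of pixel values in class 0 so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break               # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 1 if it is above the threshold, else 0."""
    return [1 if p > t else 0 for p in pixels]


# Two nearly identical synthetic images (hypothetical data):
image_a = [10] * 50 + [200] * 50
image_b = [12] * 50 + [200] * 50
t_a, t_b = otsu_threshold(image_a), otsu_threshold(image_b)
# A pixel of value 11 lands on different sides of the two thresholds.
print(t_a, t_b, binarize([11], t_a), binarize([11], t_b))
```

A grayscale file and an RGB file of the same page can carry slightly different pixel values after conversion, which is one plausible route to the different OCR results reported above.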

amitdo · 2021-04-27T11:42:25Z

@kbrajwani

> Let's say I am using psm 6

From the command line help:

> 6 Assume a single uniform block of text.

It does not make sense to use psm 6 on your image, which has multiple blocks. You are misleading Tesseract.

Did you try to use psm 3?

kbrajwani · 2021-04-27T12:16:36Z

Hey @amitdo, psm 3 works great. I don't remember why I changed the default psm.
Thanks

kbrajwani · 2021-04-30T15:29:19Z

https://user-images.githubusercontent.com/29722986/116717573-a6c66180-a9f6-11eb-85af-1d364de7e3ee.png
Hey @amitdo, please look at this image: lines are missing with psm 3.
Possibly the issue is that the lines are connected.

amitdo · 2021-04-30T15:47:46Z

> Possibly the issue is that the lines are connected.

You are right in your assumption. Tesseract's layout analysis can't cope with connected lines.

kbrajwani · 2021-04-30T16:05:42Z

Thanks for confirming. Can you tell me whether there is any way we can handle this?

amitdo · 2021-04-30T18:46:24Z

If you can find another tool that will correctly segment the lines, you can then run Tesseract on each line.
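
As a rough illustration of that workflow, the sketch below finds horizontal text bands with a row projection profile and builds one Tesseract command per band using --psm 7 (treat the image as a single text line). This is a deliberately simple stand-in for "another tool", not a recommendation: a projection profile struggles when lines touch heavily, although raising min_ink lets rows containing only a few ink pixels (for example, touching descenders) count as gaps. The function names and the binarized input format are assumptions, and cropping each band to its own image file (which needs an image library) is left out.

```python
def find_line_bands(binary_rows, min_ink=1, min_height=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain text.

    binary_rows: list of rows, each a list of 0/1 values (1 = ink).
    A row counts as text when it has at least min_ink ink pixels.
    Bands shorter than min_height rows are discarded as noise.
    """
    bands = []
    start = None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                       # a band begins
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))    # a band ends
            start = None
    if start is not None and len(binary_rows) - start >= min_height:
        bands.append((start, len(binary_rows)))
    return bands


def line_command(line_image, out_base, lang="eng"):
    """Build a tesseract argv list for one pre-cropped line image."""
    return ["tesseract", line_image, out_base, "-l", lang, "--psm", "7"]


# Tiny synthetic page: two text bands separated by an empty row.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
for i, (top, bottom) in enumerate(find_line_bands(page)):
    # line_{i}.png is a hypothetical file holding rows top..bottom-1.
    print(top, bottom, " ".join(line_command(f"line_{i}.png", f"line_{i}")))
```

The per-line outputs would then be concatenated in band order to rebuild the page text.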

kbrajwani · 2021-05-05T17:19:06Z

@amitdo Hey, can't we train tesseract to identify the line bounding boxes? After all, we provide line-level bounding box information when training tesseract on our own images.
Thanks

amitdo · 2021-05-05T18:15:49Z

No, the layout analysis part is not trainable.

zc813 changed the title ~~Entire lines of text missing based on psm~~ Entire lines of text missing. Different missing when psm = 3, 6, 11 Feb 21, 2018

zdenop added the accuracy label Mar 27, 2018

Shreeshrii mentioned this issue Apr 26, 2018

Different results using same image and same version of tesseract #1530

Open

amitdo added the layout analysis label Apr 25, 2020

amitdo added the binarization label Apr 27, 2021
