Tesseract 4.0 hangs when processing a particular image #2288

Open
lewislun opened this issue Mar 4, 2019 · 18 comments

Comments

@lewislun

lewislun commented Mar 4, 2019

Environment

  • Tesseract Version: tesseract 4.0.0-beta.1
    leptonica-1.75.3
    libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
  • Platform: Ubuntu 18.04.1 LTS

Current Behavior:

Tesseract hangs when running the following command:
tesseract failed-image.jpeg output.txt

output message:

Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 207

Tesseract neither stops nor prints any further message after that.
Other images work fine; I only have trouble processing this particular image.
I also found that the image after being processed by Tesseract (or Leptonica?) looks weird; I don't know whether that is related.

failed-image.jpeg: https://drive.google.com/open?id=1HsgCbtuNpgf_XxzjkekXU9-uuiWDsV0H
tessinput.tif: https://drive.google.com/open?id=1sE8Nn5rykSWPT6PMF3nFSonPMT9y-H61

Expected Behavior:

Tesseract should either give an error message or finish OCR on the image, even if the image quality is bad.
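
Until the loop itself is fixed, a caller can at least bound the run time from outside the process. The following is a minimal sketch of that workaround, assuming the `tesseract` binary is on `PATH`; the helper name `run_with_timeout` and the 60-second limit are illustrative choices, not part of Tesseract.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stderr_text).

    Returns (None, partial_stderr_text) if the process had to be killed
    because it exceeded timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run() kills the child before re-raising TimeoutExpired.
        return None, (exc.stderr or b"").decode(errors="replace")
    return proc.returncode, proc.stderr.decode(errors="replace")


# Hypothetical usage, mirroring the command from the report:
#   code, messages = run_with_timeout(
#       ["tesseract", "failed-image.jpeg", "output.txt"], 60)
#   if code is None:
#       print("tesseract did not finish within 60 s", file=sys.stderr)
```

This does not fix anything inside Tesseract; it only turns a hang into a detectable failure so a batch job can skip the image and continue.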

@zdenop
Contributor

zdenop commented Mar 4, 2019

  1. Your Tesseract version is outdated.
  2. JPEG is not a suitable format for OCR (JPEG compression artifacts).
  3. Your input is not suitable for Tesseract's binarization (Otsu) algorithm (the result is what you see in tessinput.tif). Did you read the ImproveQuality wiki?
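
For context on point 3: Otsu's method picks a single global threshold for the whole image by maximising the between-class variance of the grey-level histogram, which is why it can fail when text and background do not form two clear intensity groups. The sketch below is a plain-Python illustration of the idea, not Tesseract's or Leptonica's actual implementation; the function name `otsu_threshold` is made up for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance.

    pixels: iterable of ints in 0..255. Pixels <= t form the dark class,
    pixels > t form the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of intensities in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

With a cleanly bimodal input the returned threshold falls between the two groups; with low-contrast or unevenly lit input there may be no single level that separates text from background.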

@stweil
Copy link
Member

stweil commented Mar 4, 2019

The problem also exists with the latest code. This might be another example of issue #2196.

@stweil stweil added the bug label Mar 4, 2019
@stweil stweil added this to the 4.1.0 milestone Mar 4, 2019
@stweil
Member

stweil commented Mar 4, 2019

Tesseract hangs in an endless loop here:

(gdb) i s
#0  tesseract::ColPartitionGrid::FindPartitionPartners (this=0x555557d2ea90) at ../../../../../src/textord/colpartitiongrid.cpp:1190
#1  0x00005555555ffdc0 in tesseract::ColumnFinder::FindBlocks (this=0x555557d2e950, pageseg_mode=tesseract::PSM_AUTO, scaled_color=0x0, scaled_factor=-1, 
    input_block=0x555557d1ed60, photo_mask_pix=0x5555559592d0, thresholds_pix=0x555555958550, grey_pix=0x5555559585a0, pixa_debug=0x7ffff69159f0, blocks=0x7fffffffd130, 
    diacritic_blobs=0x7fffffffd208, to_blocks=0x7fffffffd210) at ../../../../../src/textord/colfind.cpp:432
#2  0x00005555555ca938 in tesseract::Tesseract::AutoPageSeg (this=0x7ffff68f2010, pageseg_mode=tesseract::PSM_AUTO, blocks=0x555555955720, to_blocks=0x7fffffffd210, 
    diacritic_blobs=0x7fffffffd208, osd_tess=0x0, osr=0x7fffffffd5d0) at ../../../../../src/ccmain/pagesegmain.cpp:226
#3  0x00005555555ca4d7 in tesseract::Tesseract::SegmentPage (this=0x7ffff68f2010, input_file=0x55555595ed90, blocks=0x555555955720, osd_tess=0x0, osr=0x7fffffffd5d0)
    at ../../../../../src/ccmain/pagesegmain.cpp:139
#4  0x0000555555584380 in tesseract::TessBaseAPI::FindLines (this=0x5555558ce280 <main::api>) at ../../../../../src/api/baseapi.cpp:2090
#5  0x000055555557f7cd in tesseract::TessBaseAPI::Recognize (this=0x5555558ce280 <main::api>, monitor=0x0) at ../../../../../src/api/baseapi.cpp:835
#6  0x0000555555580fa6 in tesseract::TessBaseAPI::ProcessPage (this=0x5555558ce280 <main::api>, pix=0x5555559583c0, page_index=0, 
    filename=0x7fffffffe744 "issue/2288/failed-image.jpeg", retry_config=0x0, timeout_millisec=0, renderer=0x555555950840) at ../../../../../src/api/baseapi.cpp:1228
#7  0x0000555555580d3a in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x5555558ce280 <main::api>, filename=0x7fffffffe744 "issue/2288/failed-image.jpeg", 
    retry_config=0x0, timeout_millisec=0, renderer=0x555555950840) at ../../../../../src/api/baseapi.cpp:1186
#8  0x00005555555806d1 in tesseract::TessBaseAPI::ProcessPages (this=0x5555558ce280 <main::api>, filename=0x7fffffffe744 "issue/2288/failed-image.jpeg", retry_config=0x0, 
    timeout_millisec=0, renderer=0x555555950840) at ../../../../../src/api/baseapi.cpp:1076
#9  0x000055555557b3ae in main (argc=3, argv=0x7fffffffe498) at ../../../../../src/api/tesseractmain.cpp:745

Issue #2196 has a different stack, so it looks like we have two issues with images causing an endless loop in the layout detection.

@zdenop
Contributor

zdenop commented Mar 4, 2019

Yes, the endless loop is a problem; that is the reason I keep the issue open.
But points 2 and 3 can help to avoid the problem, and even if there were no endless loop, OCR would not produce the expected results on such input.

@amitdo
Collaborator

amitdo commented Mar 4, 2019

The main issue here is Tesseract's binarization method.

I used GIMP's thresholding (60-255) to produce this image.

i2288-bin-60-255
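
The manual step above amounts to a fixed range threshold: pixels whose grey value lies in 60..255 become white and everything darker becomes black. A minimal plain-Python sketch of that operation follows; the function name `threshold_range` and the nested-list image representation are assumptions made for illustration, and this is not GIMP's code.

```python
def threshold_range(gray, low=60, high=255):
    """Binarize a greyscale image given as a list of rows of ints (0..255).

    Pixels with low <= value <= high map to white (255), all others to
    black (0).
    """
    return [[255 if low <= p <= high else 0 for p in row] for row in gray]
```

Because the cut-off is chosen by hand for this particular image, it sidesteps the automatic threshold selection, which is also why it does not generalise to other inputs.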

output with best:

 
 

Great Daddy, 2014 ELE

Acrylic on canvas
200 x 300 cm

 
Error during processing.

@zdenop
Contributor

zdenop commented Mar 4, 2019

@amitdo: I do not think the main issue is Tesseract's binarization method. It works well in most cases (see e.g. 2264), but not in all. I expect that if we replace it with something else, we will get similar reports for other kinds of images.

Anyway, a patch for automatic selection of the best binarization algorithm is welcome ;-)

And of course the infinite loop in Tesseract should be fixed too.

@stweil
Member

stweil commented Mar 4, 2019

Automatic selection would be great, but a first step could be to offer some binarization algorithms, so the user has a choice (command line option or config parameter).

@chintler

I'm facing this issue too. Are there any updates or workarounds that I can try, including what @stweil suggested?

@stweil stweil modified the milestones: 4.1.0, 5.0.0 Nov 27, 2019
@Ra-Na

Ra-Na commented Dec 4, 2019

Same here. Ubuntu 18.04, tesseract 4.0.0-beta.1.

@Ra-Na

Ra-Na commented Dec 4, 2019

After updating Tesseract to version 4.1.1 on Ubuntu 18.04.3, the issue is gone (in my case).
Version 4.1.1 has to be installed manually. Ubuntu 18.04 users can simply run:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt upgrade

(Details here)
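
To confirm which version is actually in use after the upgrade, the first line of the `tesseract --version` output can be checked. The sketch below only parses text that has already been captured; the function names are made up for this example, and any pre-release suffix such as `-beta.1` is ignored, which is a simplification.

```python
import re


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from `tesseract --version` text.

    Only the first line is inspected. Returns None if no version is found.
    A suffix such as "-beta.1" or "-rc2" is dropped (simplifying assumption).
    """
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def is_at_least(version_output, minimum=(4, 1, 1)):
    """True if the parsed version is >= minimum (compared as int tuples)."""
    version = parse_tesseract_version(version_output)
    return version is not None and version >= minimum
```

For the environment in the original report, `is_at_least("tesseract 4.0.0-beta.1")` is false, while the `tesseract 4.1.1` banner passes the check.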

languitar added a commit to languitar/paperless that referenced this issue Feb 16, 2020
This makes tesseract 4.1 available, which fixes some things like infinite
processing loops on some documents:
tesseract-ocr/tesseract#2288 (comment)

Some dependencies had to be bumped for being compatible with the new Alpine
libraries.
languitar added a commit to languitar/paperless that referenced this issue Feb 29, 2020
This makes tesseract 4.1 available, which fixes some things like infinite
processing loops on some documents: tesseract-ocr/tesseract#2288
languitar added a commit to languitar/paperless that referenced this issue Mar 1, 2020
This makes tesseract 4.1 available, which fixes some things like infinite
processing loops on some documents: tesseract-ocr/tesseract#2288
@amitdo
Collaborator

amitdo commented Mar 6, 2020

@lewislun, was this issue solved for your case with version 4.1.1 or the current code in the master branch?

BastianPoe pushed a commit to BastianPoe/paperless that referenced this issue Jun 16, 2020
This makes tesseract 4.1 available, which fixes some things like infinite
processing loops on some documents: tesseract-ocr/tesseract#2288
@jcrogel

jcrogel commented Jul 29, 2020

I am still seeing this on 4.1.1 with PNG files.

@zdenop
Contributor

zdenop commented Jul 29, 2020

@jcrogel: Without an image that can help to find the problem, your comment is useless.

@saikalyan9981

saikalyan9981 commented Jan 25, 2021

I'm trying to use "Tesseract Open Source OCR Engine v4.1.1-rc2-20-g01fb with Leptonica" on the following image:
Image
It's stuck.
@zdenop, can you help with this and suggest a workaround? For now, it works fine with --oem 0 (legacy).

@Shreeshrii
Collaborator

@saikalyan9981 Works fine with the current code from the repo. The time taken differs depending on the traineddata file being used.

(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata_best
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m33.252s
user    1m47.232s
sys     0m0.826s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata_fast
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m12.468s
user    0m30.834s
sys     0m0.593s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m18.681s
user    0m53.303s
sys     0m0.714s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 0
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.286s
user    0m54.827s
sys     0m0.696s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 1
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m18.088s
user    0m51.650s
sys     0m0.760s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 2
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.176s
user    0m54.583s
sys     0m0.744s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 3
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.216s
user    0m54.951s
sys     0m0.682s

@saikalyan9981

@Shreeshrii Thanks a lot, I'll use v5.0.0. I think the issue is with v4.1.1.

@Ra-Na

Ra-Na commented Jan 25, 2021

I just ran

tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

(output of tesseract --version)

on the above image without any issues.

@amitdo
Collaborator

amitdo commented May 15, 2021

With the code from #3418, processing ends after 7 seconds when Sauvola binarization is used, but the output is garbage.
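
For readers unfamiliar with it, Sauvola binarization is a local method: each pixel is compared with a threshold T = m * (1 + k * (s / R - 1)) computed from the mean m and standard deviation s of its neighbourhood. The sketch below is a slow plain-Python illustration of that formula and is not the code from #3418; the function names, the window radius, and the values k = 0.2 and R = 128 are assumptions chosen for the example.

```python
import math


def sauvola_threshold(window, k=0.2, r=128.0):
    """Sauvola threshold for one neighbourhood given as a flat list of ints."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in window) / n)
    return mean * (1.0 + k * (std / r - 1.0))


def sauvola_binarize(gray, radius=1, k=0.2, r=128.0):
    """Binarize a greyscale image (list of rows of ints in 0..255).

    Each pixel is compared with the threshold of its (2*radius+1)-wide
    square neighbourhood, clipped at the image border.
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            t = sauvola_threshold(window, k, r)
            row.append(255 if gray[y][x] > t else 0)
        out.append(row)
    return out
```

A local threshold adapts to uneven backgrounds, but as the result above shows, a binarization that finishes quickly does not by itself guarantee usable OCR output.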

@amitdo amitdo modified the milestones: 5.0.0, 6.0.0 Aug 16, 2021