psm 3 and psm 6 skip different parts of text based on font size #538

Shreeshrii · 2016-12-06T12:24:43Z

In an image with Hindi text in various fonts, some of it at a very large size, the two page segmentation modes pick up different parts of the text:

psm 3 - recognizes text at large font size
psm 6 - recognizes text at smaller font size

The input image and output files are attached.

sample6-psm3.txt
sample6-psm6.txt
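
To see exactly which lines each mode drops, the two outputs can be compared line by line. The following is only a rough sketch: the helper names `run_tesseract`, `nonempty_lines` and `lines_only_in` are made up for illustration, the command line mirrors the `tesseract <image> - -l hin --psm N` invocation used later in this thread, and the comparison is an exact string match, so a line that is recognized slightly differently in the two modes will be reported as missing from both sides.

```python
import subprocess


def run_tesseract(image_path: str, psm: int, lang: str = "hin") -> str:
    """Run the tesseract CLI (assumed to be on PATH) and return stdout text."""
    result = subprocess.run(
        ["tesseract", image_path, "-", "-l", lang, "--psm", str(psm)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def nonempty_lines(text: str) -> list[str]:
    """Split OCR output into stripped lines, dropping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def lines_only_in(first: str, second: str) -> list[str]:
    """Lines of `first` whose exact text never appears in `second`."""
    seen = set(nonempty_lines(second))
    return [line for line in nonempty_lines(first) if line not in seen]
```

With the attached files read into strings, `lines_only_in(psm3_text, psm6_text)` would list what only psm 3 produced, and swapping the arguments gives what only psm 6 produced. `run_tesseract` is optional and only needed when regenerating the outputs.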

Shreeshrii · 2017-09-11T14:41:30Z

psm 3 - recognizes text at large font size
psm 6 - recognizes text at smaller font size

The result is similar with the best traineddata and current code, though there are differences between best/Devanagari, best/hin and best/san.

psm 3 treats smaller text as 'diacritics'. Log for the jpg at 600 dpi:

**************************** ./fontsize.jpg **********************************
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 29 diacritics

real    0m14.309s
user    0m13.406s
sys     0m0.656s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 145 diacritics

real    0m8.970s
user    0m8.469s
sys     0m0.359s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 145 diacritics

real    0m4.188s
user    0m3.813s
sys     0m0.375s

Shreeshrii · 2017-09-11T14:47:59Z

The same image at 300 dpi gets fewer blobs recognized as diacritics.

**************************** ./fontsize.jpg **********************************
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 3 diacritics

real    0m18.550s
user    0m17.688s
sys     0m0.688s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 17 diacritics

real    0m16.757s
user    0m15.203s
sys     0m0.438s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 17 diacritics

real    0m6.672s
user    0m6.125s
sys     0m0.438s
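
For comparing runs at different resolutions, the "Detected N diacritics" counts can be pulled out of a saved log. This is a small sketch that assumes the log has been captured into a string and that the message has exactly the form shown in the logs above; the function name `diacritic_counts` is hypothetical.

```python
import re

# Matches the message exactly as it appears in the logs above.
DIACRITICS_RE = re.compile(r"^Detected (\d+) diacritics$", re.MULTILINE)


def diacritic_counts(log_text: str) -> list[int]:
    """Return every diacritics count found in the log, in order of appearance."""
    return [int(count) for count in DIACRITICS_RE.findall(log_text)]
```

Runs that print no such line contribute nothing to the list, so the number of counts may be smaller than the number of runs.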

Shreeshrii · 2018-04-26T08:31:25Z

See the hOCR bbox-related issue, with differences in processing between psm 3 and 6, posted in the forum - https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/W19nmACyri8/Ipi5a-6hBQAJ

We depend on accurate bbox tags to do further processing of the extracted text.
The problem is more acute when using PSM 6 (Single block of text).
However, even with PSM 3, it happens sometimes, but very very rarely.

sreenathbh · 2018-04-30T10:00:19Z

I built Tesseract 4 Beta from the git repository on 64-bit Debian.
The beta version shows the same behaviour.
Here is the image; it is a large page.

Here is the generated hOCR file:
page_2.txt

The bbox tag for the word PLAINTIFF is set to 0,0,3400,4400.
Thanks,
Sreenath
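
A word bbox like the one above, covering the whole page, can be detected mechanically in the hOCR output. The sketch below assumes the usual hOCR layout, where the page element has class `ocr_page`, each word has class `ocrx_word`, and both carry a `title` attribute containing `bbox x0 y0 x1 y1`. The names `HocrBoxes` and `page_sized_words` are made up for this example, and it only flags the symptom; it does not correct the box.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collect (text, word_bbox, page_bbox) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.page_bbox = None    # bbox of the ocr_page currently being read
        self.words = []          # list of (text, word_bbox, page_bbox)
        self._word_bbox = None   # set while inside an ocrx_word element
        self._word_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        match = BBOX_RE.search(attrs.get("title") or "")
        if match is None:
            return
        bbox = tuple(int(value) for value in match.groups())
        classes = (attrs.get("class") or "").split()
        if "ocr_page" in classes:
            self.page_bbox = bbox
        elif "ocrx_word" in classes:
            self._word_bbox = bbox
            self._word_text = []

    def handle_data(self, data):
        if self._word_bbox is not None:
            self._word_text.append(data)

    def handle_endtag(self, tag):
        # The first closing span after a word start closes that word.
        if tag == "span" and self._word_bbox is not None:
            text = "".join(self._word_text).strip()
            self.words.append((text, self._word_bbox, self.page_bbox))
            self._word_bbox = None


def page_sized_words(hocr: str) -> list[str]:
    """Return the words whose bbox is identical to their page's bbox."""
    parser = HocrBoxes()
    parser.feed(hocr)
    return [text for text, bbox, page in parser.words if bbox == page]
```

Running `page_sized_words` on the contents of the hOCR file would list the affected words, which makes it easier to compare how often the problem shows up under psm 3 and psm 6.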

amitdo · 2018-04-30T10:15:37Z

@sreenathbh

Your issue is similar to #1015.

We depend on accurate bbox tags to do further processing of the extracted text.

See #1276

Shreeshrii · 2020-12-10T12:34:29Z

 tesseract -v
tesseract 5.0.0-alpha-839-gd93e
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found NEON
 Found OpenMP 201511

Still the same result:

 tesseract 538.jpg - -l hin --psm 6
जोधपुर के लिए और मिली 64 करोड़ की सौगात
विस्कॉन्सिन राज्य के ओक फ्रीक शहर में एक गुछुद्वारे में हुई गोलीबारी की
पूपणी सलाहकार पंकज पचौरी ले बीबीसी से बातचीत इस्तीफ़ा भेज दिया है और इसे मंजूरी के लिए राष्ट्रपति के पास भेज
में इस बात की पुष्टि कर दी है कि कृष्णा ले प्रधानमंत्री के पास अपना दिया जाएगा. कृष्णा ले रविवार को संभावित फेखदल से सिर्फ़ दो दिल
जट रिस्की, आफ्टर विस्की

tesseract 538.jpg - -l hin --psm 3
Detected 157 diacritics
सनी लियोनी

जोधपुर के लिए और मिली 64 करोड़ की सौगात
तिस्कॉल्सिल कप

0 दर (]

विस्कॉन्सिन राज्य के ओक फ्रीक शहर में एक गुछुद्वारे में हुई गोलीबारी की

बेल्जियम

मूषना और प्रभारण

इसे मुंजूरी के लिए

जट रिस्की, आफ्टर विस्की

Shreeshrii changed the title ~~LSTM: psm 3 and psm 6 recognize different parts of text in multiple fonts~~ psm 3 and psm 6 skip different parts of text based on font size Sep 11, 2017

Shreeshrii mentioned this issue Sep 11, 2017:
Method WordFontAttributes does not work #1074 (Closed)

This was referenced Feb 21, 2018:
Don't drop words with low certainty #1264 (Merged)
Entire lines of text missing. Different missing when psm = 3, 6, 11 #1339 (Open)

amitdo added the layout analysis label Apr 27, 2021
