psm 3 and psm 6 skip different parts of text based on font size #538

Open
Shreeshrii opened this issue Dec 6, 2016 · 6 comments

@Shreeshrii
Collaborator

In an image with Hindi text in various fonts, some of it at a very large size, the two page segmentation modes recognize different parts of the text:

psm 3 - recognizes text at large font size
psm 6 - recognizes text at smaller font size

The input image and output files are attached:
sample6
sample6-psm3.txt
sample6-psm6.txt
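
To reproduce the comparison, something like the following could run both page segmentation modes on the same image. This is a minimal sketch, not the script used for this report: it assumes `tesseract` is on the PATH, that the `hin` traineddata is installed, and that the build accepts the `--psm` spelling of the option (older 3.x builds used `-psm`). The helper names and the file name `sample6.jpg` are hypothetical.

```python
import subprocess


def build_command(image_path, lang, psm):
    """Return the argv list for one Tesseract run that writes text to stdout."""
    return ["tesseract", image_path, "-", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, lang, psm):
    """Run Tesseract once and return the recognized text.

    Assumes the tesseract binary is installed; raises CalledProcessError
    if it exits with a non-zero status.
    """
    result = subprocess.run(
        build_command(image_path, lang, psm),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def compare_modes(image_path, lang="hin", modes=(3, 6)):
    """Return a dict mapping each psm value to the text it produced."""
    return {psm: run_ocr(image_path, lang, psm) for psm in modes}
```

Calling `compare_modes("sample6.jpg")` would return the psm 3 and psm 6 text keyed by mode, ready to be saved or compared.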

@Shreeshrii
Collaborator Author

Shreeshrii commented Sep 11, 2017

psm 3 - recognizes text at large font size
psm 6 - recognizes text at smaller font size

The result is similar with the best traineddata and current code, though there are differences between best/Devanagari, best/hin and best/san.

psm 3 treats smaller text as 'diacritics'. Output for the jpg at 600 dpi:

**************************** ./fontsize.jpg **********************************
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 29 diacritics

real    0m14.309s
user    0m13.406s
sys     0m0.656s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 145 diacritics

real    0m8.970s
user    0m8.469s
sys     0m0.359s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 145 diacritics

real    0m4.188s
user    0m3.813s
sys     0m0.375s

@Shreeshrii
Collaborator Author

The same image at 300 dpi gets fewer blobs recognized as diacritics.

**************************** ./fontsize.jpg **********************************
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 3 diacritics

real    0m18.550s
user    0m17.688s
sys     0m0.688s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 17 diacritics

real    0m16.757s
user    0m15.203s
sys     0m0.438s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 17 diacritics

real    0m6.672s
user    0m6.125s
sys     0m0.438s
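
The diacritic counts in the 600 dpi and 300 dpi logs can be pulled out mechanically for a side-by-side view. A small sketch, assuming the message keeps the form `Detected N diacritics`; the function name is made up for illustration, and the log excerpts are the "Detected" lines copied from the output above.

```python
import re

# Matches the diagnostic line Tesseract printed in the logs above.
DIACRITICS_RE = re.compile(r"Detected (\d+) diacritics")


def diacritic_counts(log_text):
    """Return the diacritic count reported by each run, in log order."""
    return [int(n) for n in DIACRITICS_RE.findall(log_text)]


LOG_600DPI = """\
Detected 29 diacritics
Detected 145 diacritics
Detected 145 diacritics
"""

LOG_300DPI = """\
Detected 3 diacritics
Detected 17 diacritics
Detected 17 diacritics
"""

for dpi, log in ((600, LOG_600DPI), (300, LOG_300DPI)):
    print(dpi, "dpi:", diacritic_counts(log))
```

This prints `600 dpi: [29, 145, 145]` and `300 dpi: [3, 17, 17]`, one count per run in each log.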

Shreeshrii changed the title from "LSTM: psm 3 and psm 6 recognize different parts of text in multiple fonts" to "psm 3 and psm 6 skip different parts of text based on font size" on Sep 11, 2017
@Shreeshrii
Collaborator Author

See the hOCR bbox-related issue, with differences in processing for psm 3 and 6, posted in the forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/W19nmACyri8/Ipi5a-6hBQAJ

We depend on accurate bbox tags to do further processing of the extracted text.
The problem is more acute when using PSM 6 (Single block of text).
However, even with PSM 3, it happens sometimes, but very very rarely.

@sreenathbh

I built Tesseract 4 Beta from the git repository on Debian (64-bit).
The behaviour is the same with the beta version.
Here is the image; it is a large page.
page_2

Here is the generated hOCR file:
page_2.txt

The bbox tag for the word PLAINTIFF is set to 0,0,3400,4400.
Thanks,
Sreenath
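
A word whose bbox equals the page bbox, as reported for PLAINTIFF, can be flagged automatically before any further processing. This is a rough sketch using only the Python standard library. It assumes the usual hOCR layout (an `ocr_page` element and `ocrx_word` spans, each carrying a `bbox x0 y0 x1 y1` entry in its `title` attribute). The class and function names are hypothetical, and the sample fragment is invented for illustration, not taken from page_2.txt.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(v) for v in match.groups()) if match else None


class HocrWords(HTMLParser):
    """Collect the page bbox and every (text, bbox) pair for ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.page_bbox = None
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_page" in classes:
            self.page_bbox = parse_bbox(attrs.get("title"))
        elif "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title"))
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def page_sized_words(hocr):
    """Return the text of words whose bbox is identical to the page bbox."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return [text for text, bbox in parser.words if bbox == parser.page_bbox]


# Invented fragment: only the PLAINTIFF bbox values come from the report above.
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 3400 4400; ppageno 0'>
<span class='ocrx_word' id='word_1_1' title='bbox 0 0 3400 4400; x_wconf 90'>PLAINTIFF</span>
<span class='ocrx_word' id='word_1_2' title='bbox 100 200 400 260; x_wconf 95'>versus</span>
</div>
"""

print(page_sized_words(SAMPLE))
```

A check like this only detects the symptom, so downstream code can skip or re-examine the affected words; it does not correct the bbox.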

@amitdo
Collaborator

amitdo commented Apr 30, 2018

@sreenathbh

Your issue is similar to:
#1015

We depend on accurate bbox tags to do further processing of the extracted text.

See #1276

@Shreeshrii
Collaborator Author

 tesseract -v
tesseract 5.0.0-alpha-839-gd93e
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found NEON
 Found OpenMP 201511

Still the same result...

 tesseract 538.jpg - -l hin --psm 6
जोधपुर के लिए और मिली 64 करोड़ की सौगात
विस्कॉन्सिन राज्य के ओक फ्रीक शहर में एक गुछुद्वारे में हुई गोलीबारी की
पूपणी सलाहकार पंकज पचौरी ले बीबीसी से बातचीत इस्तीफ़ा भेज दिया है और इसे मंजूरी के लिए राष्ट्रपति के पास भेज
में इस बात की पुष्टि कर दी है कि कृष्णा ले प्रधानमंत्री के पास अपना दिया जाएगा. कृष्णा ले रविवार को संभावित फेखदल से सिर्फ़ दो दिल
जट रिस्की, आफ्टर विस्की
tesseract 538.jpg - -l hin --psm 3
Detected 157 diacritics
सनी लियोनी

जोधपुर के लिए और मिली 64 करोड़ की सौगात
तिस्कॉल्सिल कप

0 दर (]

विस्कॉन्सिन राज्य के ओक फ्रीक शहर में एक गुछुद्वारे में हुई गोलीबारी की

बेल्जियम

मूषना और प्रभारण

इसे मुंजूरी के लिए

जट रिस्की, आफ्टर विस्की
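
To see at a glance which lines each mode picked up, the two outputs can be compared line by line. In the run above, for instance, सनी लियोनी appears only under psm 3, while the longer small-font lines appear only under psm 6. The following is a minimal sketch with hypothetical function names. It matches lines exactly after stripping whitespace, so a line that differs by even one character counts as missing from the other mode.

```python
def nonempty_lines(text):
    """Split OCR output into stripped lines, dropping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_outputs(text_a, text_b):
    """Group lines into those found in both outputs and those unique to each."""
    lines_a, lines_b = nonempty_lines(text_a), nonempty_lines(text_b)
    set_a, set_b = set(lines_a), set(lines_b)
    return {
        "both": [line for line in lines_a if line in set_b],
        "only_a": [line for line in lines_a if line not in set_b],
        "only_b": [line for line in lines_b if line not in set_a],
    }
```

Passing the psm 3 text as `text_a` and the psm 6 text as `text_b` would list the large-font lines under `only_a` and the small-font lines under `only_b`, assuming the pattern reported in this issue holds.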
