Tesseract segmentation fault when using Arabic and English #1275
Comments
Please try tesseract Arabic_to_English.png -l Arabic output and report the result. |
Image is not attached. |
From the master branch? From which repo did you download the traineddata? |
@Shreeshrii
|
@amitdo I have updated my OP. I did not download any test data |
Duplicates #235 |
@amitdo Thank you for the link. Is there any way I can get an alert if and when tesseract will work with ara+eng in the future? |
That combination should already work with the latest experimental Tesseract 4 (which also supports an alternative combination Arabic+Latin), but I have no personal experience with |
An interesting discussion: |
Repeating myself, just to try this GitHub feature: Duplicate of #235 |
@Freedomafia If you use tesseract4 (LSTM engine) with Arabic (which has both ara and eng already), it works fine. See attached. |
Hi @Shreeshrii, I did manage to use it with Tesseract v4 two days ago, but I encountered some issues, which I will post as a comment on this thread later today. Many thanks. |
Hi @Shreeshrii @amitdo @stweil. Thank you for your help so far. I have two questions: (1) The text files produced with Arabic and with Arabic+English are not great. When I use English only, it gets all the English words right, but it guesses English characters whenever it meets Arabic letters. I am thinking of replacing the Arabic letters with empty spaces, relying on what I believe to be the case: that the English-only LSTM model will produce a low confidence score when it comes across Arabic letters. Is there any way to extract confidence scores per letter/character? (2) I could not run all the Tesseract 4 features. I ran the following lines:
--oem 1 and --oem 3 produced results (see attached). However, --oem 0 and --oem 2 both failed to produce OCR text, and both returned the same error message:
Many thanks :) outputoem1_AE.txt |
--oem 1 is LSTM, and --oem 3 is the default, which should fall back to --oem 1, so the results should be the same.
|
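To make the engine modes and language combinations discussed here easier to compare, the following is a minimal sketch that only builds the command lines, one per `--oem` value, without running them. The mode descriptions follow what `tesseract --help-oem` prints in Tesseract 4; the helper name `build_tesseract_command` and the output base names are hypothetical. Note that `--oem 0` and `--oem 2` need traineddata files that contain the legacy model, so LSTM-only files (such as those in tessdata_best and tessdata_fast) will not work with them; that is a possible cause of the failures reported above, not something confirmed in this thread.

```python
# Engine modes as listed by `tesseract --help-oem` in Tesseract 4.
OEM_MODES = {
    0: "legacy engine only",
    1: "neural nets LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def build_tesseract_command(image, output_base, languages, oem=None):
    """Return the argv list for a tesseract CLI call (hypothetical helper; it does not run anything)."""
    if not languages:
        raise ValueError("at least one language is required")
    if oem is not None and oem not in OEM_MODES:
        raise ValueError(f"unknown --oem value: {oem}")
    # Multiple languages are combined with '+', e.g. ara+eng.
    cmd = ["tesseract", image, output_base, "-l", "+".join(languages)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    return cmd


if __name__ == "__main__":
    for oem in sorted(OEM_MODES):
        cmd = build_tesseract_command(
            "Arabic_to_English.png", f"output_oem{oem}", ["ara", "eng"], oem
        )
        print(f"{OEM_MODES[oem]}: {' '.join(cmd)}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to actually run each mode and compare the outputs side by side.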
see There is a debug type of config variable you can set to see details such as |
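For the confidence question asked earlier in the thread, one alternative to reading the debug log is Tesseract's TSV output, which reports a confidence value per word (not per character). The sketch below assumes the output was produced with something like `tesseract Arabic_to_English.png output -l eng tsv`, which writes `output.tsv`, and it replaces every word below a threshold with spaces. The function name `blank_low_confidence_words` and the threshold of 60 are illustrative assumptions, not values recommended anywhere in this thread.

```python
import csv
import io


def blank_low_confidence_words(tsv_text, min_conf=60.0):
    """Rebuild text lines from Tesseract TSV, replacing low-confidence words with spaces."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}  # (page, block, paragraph, line) -> words in reading order
    for row in reader:
        if row["level"] != "5":  # level 5 rows describe individual words
            continue
        word = row["text"] or ""
        if not word.strip():
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        if float(row["conf"]) < min_conf:
            word = " " * len(word)  # keep the width, drop the guessed text
        lines.setdefault(key, []).append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]


if __name__ == "__main__":
    with open("output.tsv", encoding="utf-8") as handle:
        for line in blank_low_confidence_words(handle.read()):
            print(line)
```

Whether the English-only model really gives low confidence on Arabic words is the assumption from the question above; it is worth checking the `conf` column on a few known Arabic regions before picking a threshold.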
You’re a star @Shreeshrii . I will test this out and report back here. Many thanks. |
To enable the debug info related to this, update the config called logfile.
config
command
The tesseract.log generated by the above will be along the following lines.
|
Thank you very much @Shreeshrii . This is very detailed and beneficial |
Running tesseract imgara.png -l ara output gives an error like: mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file adaptmatch.cpp, line 537 |
When I ran the OCR with Arabic only and with English only, it worked. However, when I tried to use both languages simultaneously on a picture of a scanned page, I got a 'segmentation fault'. Here is a link to the image of the scanned page, taken from an Arabic-English dictionary: https://imgur.com/a/K8bqz.
My bash script was:
tesseract Arabic_to_English.png -l eng+ara output
However, the terminal reported a 'segmentation fault'. Full error message:
I wanted to ask whether tesseract is able to work with English and Arabic simultaneously.
Environment
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
Current Behavior:
Expected Behavior:
Suggested Fix: