When OCR'ing English (Latin) text with diacritics it doesn't always recognise them. The diacritics in my document are part of surnames originating from Hungary and Belgium.
I've tried with just English, English + Hungarian dictionaries, also tried with Latin script (which has extended character map) to no avail.
The words: poéme, pathétique, animé are recognised.
The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.
The words csárdás, Telmányi, Dvořák are recognised only with Latin script.
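To make reports like the one above easier to reproduce, here is a minimal Python sketch that checks which expected words survive in the OCR text layer. It assumes the text has already been extracted from the output PDF into a string by some other tool. The function names `strip_diacritics` and `classify_words` are hypothetical helpers for illustration and are not part of OCRmyPDF or Tesseract.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks, e.g. 'Jenő' -> 'Jeno'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_words(ocr_text: str, expected: list[str]) -> dict[str, str]:
    """Label each expected word by how it appears in the OCR text.

    'ok'         : the word appears with its diacritics intact
    'diacritics' : only a form with missing or different diacritics appears
    'missing'    : the word does not appear at all
    """
    # Normalise to NFC so precomposed and decomposed forms compare equal.
    text = unicodedata.normalize("NFC", ocr_text)
    bare_text = strip_diacritics(text)
    result = {}
    for word in expected:
        target = unicodedata.normalize("NFC", word)
        if target in text:
            result[word] = "ok"
        elif strip_diacritics(target) in bare_text:
            result[word] = "diacritics"
        else:
            result[word] = "missing"
    return result


if __name__ == "__main__":
    # Made-up OCR output for illustration only, not taken from Booklet.pdf.
    sample = "Ysaye and Jenö played a csárdás"
    print(classify_words(sample, ["Ysaÿe", "Jenő", "csárdás", "Petőfi"]))
```

A 'diacritics' result suggests that the letters were read but the marks were dropped or misread, which is the kind of detail worth including in a report.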
OCRmyPDF mainly transcribes the OCR output from Tesseract into a PDF. It does not perform OCR itself, so it usually cannot fix issues like missed diacritics.
If you run ocrmypdf --keep-temporary-files ... a folder will be produced containing the page images that were sent to OCR. Your best bet is to take these images and report them to https://github.com/tesseract-ocr/tesseract/ as missed recognition. The font in the sample is a little unusual and the spacing between letters and their diacritic seems to be more than typical. It's possible Tesseract would need training to recognize this unusual font.
Just possibly, using ocrmypdf --oversample 600 may improve results. This causes OCRmyPDF to generate higher resolution images, which can help with diacritic detection.
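As a rough sketch of how the two suggestions above could be combined in a script, the following Python helper assembles an `ocrmypdf` command line. The `--keep-temporary-files` and `--oversample` options are the ones mentioned above, and `-l` selects the OCR languages. The helper name `build_ocrmypdf_command` and the output file name `Booklet_ocr.pdf` are assumptions made for illustration. The script only prints the command and does not run it.

```python
def build_ocrmypdf_command(
    input_pdf,
    output_pdf,
    languages=("eng",),
    oversample=None,
    keep_temporary_files=False,
):
    """Return an argument list for ocrmypdf (hypothetical helper)."""
    # Multiple languages are joined with '+', e.g. eng+hun.
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if oversample is not None:
        # Ask OCRmyPDF to render page images at this DPI before OCR.
        cmd += ["--oversample", str(oversample)]
    if keep_temporary_files:
        # Keep the working folder so the page images can be inspected.
        cmd.append("--keep-temporary-files")
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    cmd = build_ocrmypdf_command(
        "Booklet.pdf",
        "Booklet_ocr.pdf",  # assumed output name
        languages=("eng", "hun"),
        oversample=600,
        keep_temporary_files=True,
    )
    # The list could be passed to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if the file names contain spaces.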
Describe the bug
When OCR'ing English (Latin) text with diacritics it doesn't always recognise them. The diacritics in my document are part of surnames originating from Hungary and Belgium.
I've tried with just English, English + Hungarian dictionaries, also tried with Latin script (which has extended character map) to no avail.
The words: poéme, pathétique, animé are recognised.
The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.
The words csárdás, Telmányi, Dvořák are recognised only with Latin script.
Steps to reproduce
Files
Source file: Booklet.pdf
How did you download and install the software?
Linux package manager (apt, dnf, etc.)
OCRmyPDF version
16.3.1+dfsg1
Relevant log output
No response