Selecting multiple languages for OCR #305

vivadavid · 2024-02-05T19:36:09Z

Hi,

I wanted to suggest the possibility of selecting more than one language for the OCR engine, which would help with multilingual documents. The way it works now, you can only select one language at a time.

On a separate note, I wanted to ask a question (I apologize if this is explained somewhere else and I couldn't find it). When you open a PDF document and then apply OCR to it, is the OCR added as a new layer with no further changes to the document, or is a completely new PDF generated, with an inevitable reduction in quality compared to the original?

Thanks!

cyanfish · 2024-02-05T20:38:32Z

OCR works as a new layer. Image editing (e.g. rotation, crop) is what you want to avoid to keep the original quality.

vivadavid · 2024-02-05T20:47:24Z

> OCR works as a new layer. Image editing (e.g. rotation, crop) is what you want to avoid to keep the original quality.

Great to know, thanks!

Also, could I suggest listing the Tesseract version in the Releases section on GitHub (whenever a new version is included) and in the About section of the programme? I'm currently not sure which Tesseract version is included. Version 5.3.4 was recently released, though the Mannheim binaries are still on 5.3.3.

Thanks again!

cyanfish · 2024-02-05T20:54:14Z

I don't update Tesseract often as changes rarely affect the functionality NAPS2 uses. You can check the version used here.

vivadavid · 2024-02-05T21:08:03Z

> I don't update Tesseract often as changes rarely affect the functionality NAPS2 uses. You can check the version used here.

Thanks for the information: I can see it's currently on 5.2.0. It'd be nice to have the latest version, but I understand that updating it takes time. However, I'd like to point out that 5.3.3 included a fix for an issue that can affect the quality of the OCR:

tesseract-ocr/tesseract#4014
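
For anyone who wants to check whether a standalone Tesseract install already includes that fix, here is a minimal sketch. It assumes a `tesseract` executable on the PATH (not the copy bundled with NAPS2) and that the first line of `tesseract --version` looks like `tesseract 5.3.4`; the function names are hypothetical.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from `tesseract --version` text."""
    # Some builds print "tesseract v5.x.y", so the "v" is optional.
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no Tesseract version found in output")
    return tuple(int(part) for part in match.groups())


def includes_fix(version_output, fixed_in=(5, 3, 3)):
    """True if the reported version is at least the release with the fix."""
    return parse_tesseract_version(version_output) >= fixed_in


def installed_version_output():
    """Run `tesseract --version` (assumes tesseract is on the PATH)."""
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Combine both streams in case a build reports the version on stderr.
    return result.stdout + result.stderr
```

Calling `includes_fix(installed_version_output())` would then report whether the local install is at 5.3.3 or later.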

cyanfish · 2024-02-05T21:34:10Z

Thanks for pointing that out, I'll update that for the next NAPS2 version.

Tarek-Hasan · 2024-02-07T15:00:11Z

Hi,
You should check out OCRmyPDF; integrating it with NAPS2 could help with OCR-related issues. The tool is built on Tesseract and specializes in adding OCR to PDFs. It supports multiple languages. Unlike other PDF OCR tools, it also doesn't change the resolution of the embedded images.

cyanfish · 2024-03-11T04:49:35Z

Multiple Languages can now be selected as an option (in the "OCR language" dropdown) in 7.4.0.

Also 7.4.0 has updated Tesseract to 5.3.4.
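
For readers scripting Tesseract directly rather than going through the NAPS2 dropdown, multiple languages are passed to the Tesseract command line as a single `+`-separated value for `-l`. The sketch below only builds the argument list; the file names and the helper name are hypothetical, and the language data files (e.g. `eng`, `spa`) are assumed to be installed.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Return an argv list that runs Tesseract with several languages."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes multiple languages as one "+"-separated -l value.
    lang_arg = "+".join(languages)
    # The trailing "pdf" config asks Tesseract to write a searchable PDF.
    return ["tesseract", image_path, output_base, "-l", lang_arg, "pdf"]


if __name__ == "__main__":
    # Hypothetical input image and output base name.
    print(" ".join(build_tesseract_command("scan.png", "scan_ocr", ["eng", "spa"])))
```

The resulting list can be handed to `subprocess.run` as-is, which avoids shell quoting problems with file names.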

vivadavid · 2024-03-11T09:02:05Z

Thank you for adding multiple language selection in the latest release. Appreciated!

vivadavid · 2024-03-11T18:47:06Z

> Multiple Languages can now be selected as an option (in the "OCR language" dropdown) in 7.4.0.
>
> Also 7.4.0 has updated Tesseract to 5.3.4.

I wanted to ask you a question, though: I can see no binaries for Tesseract 5.3.4 from Mannheim. Did you get the binaries from another source, or did you compile them yourself?

cyanfish · 2024-03-11T18:52:15Z

I compile them myself. https://github.com/cyanfish/naps2-tesseract has the compiled binaries and my scripts, which include all the flags etc. to keep the compiled size down (<5MB).

vivadavid · 2024-03-11T18:56:32Z

Interesting: thanks!

cyanfish added the feature and ocr labels on Feb 5, 2024

cyanfish closed this as completed Mar 11, 2024
