OCRLinguist

A tool that translates multilingual legal-document PDFs to English locally, using Tesseract or EasyOCR for text recognition and EasyNMT's Opus-MT machine translation model

How to use this tool?

Clone the repository to your system
Run the SettingUpEnvironment Jupyter Notebook the first time you use the tool
Place the PDFs in "PDFDataFolder" and specify either a single document name or "all" to process every PDF
Run the ConvertPDF2Image Jupyter Notebook if you haven't already converted the PDFs, or place the images directly in "ImageDataFolder"
Choose which of the following models to run

Primary Model - OCR-Translation1-(Nolanguagespecified): Uses Tesseract for OCR and the Opus-MT model for translation. Automatically detects the language and translates the text to English
OCR-Translation1-(languagespecified): Uses EasyOCR for OCR and the Opus-MT model for translation. EasyOCR requires you to know the language of the PDF in advance and specify it in the Jupyter Notebook
Use in case of special characters -
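
The two model options above differ only in the OCR step. The following is a minimal sketch of that choice, not the notebooks' actual code. The function names are hypothetical, the Tesseract call uses its default language data, and the EasyOCR language codes are whatever you pass in. The third-party packages (pytesseract, Pillow, easyocr, easynmt) are imported lazily so the file loads without them.

```python
def ocr_with_tesseract(image_path):
    """OCR one image with Tesseract through pytesseract (default language data)."""
    import pytesseract  # third-party, imported only when this path is used
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_with_easyocr(image_path, languages):
    """OCR one image with EasyOCR; `languages` is a list of codes such as ["fr"]."""
    import easyocr  # third-party

    reader = easyocr.Reader(list(languages))
    # detail=0 returns only the recognized strings, without boxes or scores.
    return "\n".join(reader.readtext(str(image_path), detail=0))


def translate_to_english(text):
    """Translate text to English with EasyNMT's Opus-MT model."""
    from easynmt import EasyNMT  # third-party

    model = EasyNMT("opus-mt")
    # No source_lang is given, so EasyNMT tries to detect it.
    return model.translate(text, target_lang="en")


def image_to_english(image_path, languages=None, ocr=None, translate=None):
    """Hypothetical helper: pick the OCR engine, then translate the result.

    languages=None selects the Tesseract path; a list of language codes
    selects the EasyOCR path. `ocr` and `translate` can be overridden.
    """
    if ocr is None:
        if languages:
            ocr = lambda path: ocr_with_easyocr(path, languages)
        else:
            ocr = ocr_with_tesseract
    if translate is None:
        translate = translate_to_english

    text = ocr(image_path)
    # Skip translation for pages where OCR found nothing.
    return translate(text) if text.strip() else ""
```

Calling image_to_english("page1.png") would take the Tesseract path, while image_to_english("page1.png", languages=["fr"]) would take the EasyOCR path. Both file name and language code are placeholders.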

You will find the translated .txt file in the "OutputText" folder
To convert the .txt file to PDF, run the Text2PDF Jupyter Notebook
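
The PDF selection and conversion steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the ConvertPDF2Image notebook: select_pdfs and pdf_to_images are hypothetical names, the page-image naming scheme and dpi value are made up, and pdf2image (which needs poppler) is assumed to be the conversion library.

```python
from pathlib import Path


def select_pdfs(pdf_dir, choice):
    """Return the PDFs to process: one named document, or every PDF for "all"."""
    pdf_dir = Path(pdf_dir)
    if choice.strip().lower() == "all":
        return sorted(p for p in pdf_dir.iterdir() if p.suffix.lower() == ".pdf")

    # Accept the document name with or without the .pdf extension.
    name = choice if choice.lower().endswith(".pdf") else choice + ".pdf"
    path = pdf_dir / name
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")
    return [path]


def pdf_to_images(pdf_path, image_dir, dpi=200):
    """Render each page of a PDF to a PNG file and return the written paths."""
    from pdf2image import convert_from_path  # third-party, requires poppler

    pdf_path = Path(pdf_path)
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        # Assumed naming scheme: <document>_page<N>.png
        target = image_dir / f"{pdf_path.stem}_page{number}.png"
        page.save(target, "PNG")
        written.append(target)
    return written
```

A typical call would loop over select_pdfs("PDFDataFolder", "all") and pass each path to pdf_to_images(path, "ImageDataFolder").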

For the poppler file

Windows

Download the latest package from http://blog.alivate.com.au/poppler-windows/
Extract the package
Move the extracted directory to the desired place on your system
Add the bin/ directory to your PATH
Verify the setup by opening cmd and confirming that pdftoppm -h runs
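
The same check can be done from Python before running the notebooks. This is a small sketch using only the standard library; find_pdftoppm and poppler_is_callable are hypothetical helper names, and the optional poppler_bin argument is assumed to be the bin/ directory of the extracted package.

```python
import shutil
import subprocess


def find_pdftoppm(poppler_bin=None):
    """Locate pdftoppm in poppler_bin if given, otherwise on PATH."""
    search_path = str(poppler_bin) if poppler_bin is not None else None
    return shutil.which("pdftoppm", path=search_path)


def poppler_is_callable(poppler_bin=None):
    """Return True if pdftoppm was found and the process could be started."""
    exe = find_pdftoppm(poppler_bin)
    if exe is None:
        return False
    try:
        # Same call as the manual check; only launch success is tested here.
        subprocess.run([exe, "-h"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return True
```

If editing PATH is not an option, pdf2image's convert_from_path also accepts a poppler_path argument pointing at the same bin/ directory.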

For Tesseract-OCR

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

Note: Check the Tesseract path from your installation. The default installation path was r'C:\Program Files\Tesseract-OCR\tesseract', but it may differ on your system. Set tesseract_cmd to the location where Tesseract is actually installed locally
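
Instead of hard-coding the path, it can be resolved at runtime. The sketch below is one possible approach, not what the notebooks do: resolve_tesseract_cmd and configure_pytesseract are hypothetical helpers, and the fallback list contains only the default path quoted above.

```python
import shutil
from pathlib import Path

# Default Windows install location quoted in this README; may differ locally.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract"


def resolve_tesseract_cmd(candidates=(DEFAULT_TESSERACT,), search_path=None):
    """Return a usable tesseract path, or None if nothing was found.

    PATH (or search_path, if given) is checked first, then each candidate
    path with and without an .exe suffix.
    """
    found = shutil.which("tesseract", path=search_path)
    if found:
        return found
    for candidate in candidates:
        for path in (Path(candidate), Path(str(candidate) + ".exe")):
            if path.is_file():
                return str(path)
    return None


def configure_pytesseract(cmd=None):
    """Point pytesseract at the resolved executable and return the path used."""
    import pytesseract  # third-party, imported only when configuring

    cmd = cmd or resolve_tesseract_cmd()
    if cmd is None:
        raise FileNotFoundError("Tesseract not found; pass its path explicitly")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling configure_pytesseract() once at the top of a notebook would replace the hard-coded assignment. Passing an explicit path still works for non-standard installs.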


Priyanshiguptaaa/OCRLinguist
