PW_OCR_ADVANCED

A QGIS processing algorithm that recognizes text in raster images within the input polygon features and saves it as an attribute value in the output layer.

The PW_OCR_ADVANCED script processes data differently from PW_OCR. Check the specifications of both scripts to choose the better one for your application.

Citing

If use of the script leads to a scientific publication, please acknowledge it by citing:

Graszka, O. (2021). Automatyzacja procesu rozpoznawania i weryfikacji nazw geograficznych ze źródeł historycznych na przykładzie Słownika geograficznego Królestwa Polskiego. W T. Epsztein (red.), Od Słownika geograficznego Królestwa Polskiego do map topograficznych Wojskowego Instytutu Geograficznego (s. 23–32).

Python Tesseract

The script uses the Pytesseract library and requires it to be installed. After installation, update the path to your Tesseract executable at the beginning of the script.

# Path to the Tesseract executable in your installation directory.
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

You can set the Pytesseract configuration (page segmentation mode and OCR engine model) using the comboboxes in the script's graphical interface. To use a language other than Polish, edit the following line in the code:

data = pytesseract.image_to_data(Raster_lyr.source(), lang='pol', config=self.config, output_type=Output.DICT)
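As a rough sketch of how the two combobox values might end up in the `config` string: Tesseract accepts the `--psm` and `--oem` command-line flags, and Pytesseract forwards the `config` string to it. The helper name `build_tesseract_config` and its range checks are illustrative assumptions, not code taken from the script.

```python
def build_tesseract_config(psm: int, oem: int) -> str:
    """Build a Tesseract config string (hypothetical helper, not part of the script).

    psm: page segmentation mode, 0-13 in Tesseract 4 and later.
    oem: OCR engine mode, 0-3.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"page segmentation mode out of range: {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"OCR engine mode out of range: {oem}")
    return f"--psm {psm} --oem {oem}"


config = build_tesseract_config(psm=6, oem=3)
print(config)

# Assumed usage with another language (requires the matching traineddata file):
# data = pytesseract.image_to_data(raster_path, lang='deu', config=config,
#                                  output_type=Output.DICT)
```

The language code passed as `lang` has to match a language data file installed with Tesseract, for example `pol` for Polish or `deu` for German.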

Algorithm

This algorithm iterates over all input raster layers and processes each of them according to the scheme below:

Recognizing all words on the sheet and returning a table with their pixel coordinates, width, height, recognition confidence and recognized text (Pytesseract library).
Iterating over all features overlapping the raster and collecting all words whose centroids lie inside the feature boundaries.
Merging the words collected for each feature into a sentence.
Adding the recognized text (sentence) as an attribute value to the text output field of the feature.
Adding the recognition confidence, as a list of percentage values (one per word), to the confidence output field.

Cekcyn Polski -> [96,71]
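The scheme above can be illustrated with a minimal sketch. It assumes the word table has the layout returned by `pytesseract.image_to_data(..., output_type=Output.DICT)` (keys `left`, `top`, `width`, `height`, `conf`, `text`) and, as a simplification, replaces the polygon feature with an axis-aligned rectangle in pixel coordinates. The function name `words_in_box` and the sample values are hypothetical; the actual script works with QGIS geometries.

```python
def words_in_box(data, box):
    """Merge words whose centroid falls inside box = (xmin, ymin, xmax, ymax).

    Returns the merged sentence and the list of per-word confidences.
    """
    xmin, ymin, xmax, ymax = box
    words, confidences = [], []
    for i, text in enumerate(data["text"]):
        conf = int(float(data["conf"][i]))
        # Skip empty entries and non-word rows (reported with confidence -1).
        if not text.strip() or conf < 0:
            continue
        # Centroid of the word's bounding box in pixel coordinates.
        cx = data["left"][i] + data["width"][i] / 2
        cy = data["top"][i] + data["height"][i] / 2
        if xmin <= cx <= xmax and ymin <= cy <= ymax:
            words.append(text)
            confidences.append(conf)
    return " ".join(words), confidences


# Made-up word table for illustration only.
ocr = {
    "left":   [100, 190, 600],
    "top":    [50, 52, 400],
    "width":  [80, 70, 90],
    "height": [20, 20, 20],
    "conf":   [96, 71, 88],
    "text":   ["Cekcyn", "Polski", "Tuchola"],
}

sentence, confidences = words_in_box(ocr, box=(90, 40, 300, 80))
print(sentence, "->", confidences)  # Cekcyn Polski -> [96, 71]
```

Testing the centroid rather than the whole bounding box means a word that slightly overlaps a feature edge is still assigned to exactly one feature.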

Restrictions

This script works properly only if the edges of the rectified rasters are parallel to the axes of the QGIS project coordinate reference system.
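One plausible reason for this restriction, stated here as an assumption rather than a description of the script's internals, is that pixel positions can be converted to map coordinates by simple linear scaling of the raster extent only when the raster is axis-aligned. The sketch below shows that conversion; the function name `pixel_to_map` and the sample extent are hypothetical.

```python
def pixel_to_map(col, row, extent, size):
    """Convert a pixel position to map coordinates for an axis-aligned raster.

    extent: (xmin, ymin, xmax, ymax) of the raster in map units.
    size:   (width_px, height_px) of the raster in pixels.
    Pixel rows grow downwards, so the y axis is flipped.
    """
    xmin, ymin, xmax, ymax = extent
    width_px, height_px = size
    x = xmin + col * (xmax - xmin) / width_px
    y = ymax - row * (ymax - ymin) / height_px
    return x, y


extent = (1000.0, 2000.0, 1400.0, 2300.0)  # made-up extent in map units
size = (800, 600)                          # made-up raster size in pixels

print(pixel_to_map(0, 0, extent, size))      # (1000.0, 2300.0)
print(pixel_to_map(400, 300, extent, size))  # (1200.0, 2150.0)
```

For a rotated raster this scaling would place words at the wrong map positions, because the extent then describes the bounding box of the rotated image rather than its edges.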

Parameters

Input polygon layer

The polygon features within which text will be recognized.

Text output field

The field in the input table to which the recognized text will be added.

Confidence output field

The field in the input table to which the text recognition confidence will be added. Confidence is saved as a list, with one value for each word.

Run for all raster layers

If checked, the algorithm will recognize text from all active raster layers.

Input raster layer

If the above checkbox is unchecked, the algorithm will recognize text only from this raster layer.
For multiband raster images, only the first band will be used.

Page Segmentation Mode

Tesseract page segmentation mode (the `--psm` option).

OCR Engine Model

Tesseract OCR engine mode (the `--oem` option).

Add words recognized with zero confidence

If enabled, words recognized with zero confidence will be added too.
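A minimal sketch of how such an option could act on the collected words, assuming each word is held as a `(text, confidence)` pair. The function name `filter_words` and the sample words are hypothetical and only illustrate the effect of the checkbox.

```python
def filter_words(words, keep_zero_conf):
    """Drop zero-confidence words unless keep_zero_conf is True.

    words: list of (text, confidence) pairs with confidence in percent.
    """
    return [
        (text, conf)
        for text, conf in words
        if conf > 0 or (keep_zero_conf and conf == 0)
    ]


# Made-up recognition result for illustration only.
words = [("Cekcyn", 96), ("~", 0), ("Polski", 71)]

print(filter_words(words, keep_zero_conf=False))
print(filter_words(words, keep_zero_conf=True))
```

Leaving the option off keeps likely noise such as stray marks out of the output text, while turning it on preserves every word Tesseract reported.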

Output layer

Location of the output layer with the filled text attribute.
