OskarGraszka/PW_OCR

PW_OCR

QGIS processing algorithm which recognizes text from raster images inside input polygon features and saves as attribute value of output layer.

The PW_OCR_ADVANCED script processes features differently from PW_OCR. Check the specifications of both scripts to choose the one that better suits your application.

Citing

If usage of the script leads to a scientific publication, please acknowledge this fact by citing:

Graszka, O. (2021). Automatyzacja procesu rozpoznawania i weryfikacji nazw geograficznych ze źródeł historycznych na przykładzie Słownika geograficznego Królestwa Polskiego. W T. Epsztein (red.), Od Słownika geograficznego Królestwa Polskiego do map topograficznych Wojskowego Instytutu Geograficznego (s. 23–32).

Python Tesseract

The PW_OCR script uses the Pytesseract library, which must be installed separately. After installing it, update the path to your Tesseract executable at the beginning of the script.

# Path to the Tesseract executable inside your installation directory.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
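
If you prefer not to hard-code the path, the executable can be looked up before it is assigned. The following is a minimal sketch, not part of PW_OCR: `find_tesseract` and `configure_pytesseract` are hypothetical helper names, and the fallback path is only the default Windows location quoted above.

```python
import os
import shutil


def find_tesseract(fallback=r"C:\Program Files\Tesseract-OCR\tesseract.exe"):
    """Return a path to the Tesseract executable, or None if it cannot be found."""
    # Prefer an executable that is already on PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try the fallback location (an assumption; adjust to your system).
    if os.path.isfile(fallback):
        return fallback
    return None


def configure_pytesseract():
    """Point Pytesseract at the located executable (hypothetical helper)."""
    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("Tesseract executable not found")
    # Imported here so the lookup helper above works without Pytesseract installed.
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract()` once at the top of the script would replace the manual assignment.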

You can set the Pytesseract configuration (page segmentation mode and OCR engine mode) using the combo boxes in the script's graphical interface. To use a language other than Polish, edit the following line in the code:

text = pytesseract.image_to_string(img, lang='pol', config=self.config)
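
To show how the language and the two interface settings could reach Tesseract, here is a small sketch. It assumes the configuration is passed as a `--psm`/`--oem` string, which is how Tesseract accepts these options; `build_tesseract_args` is a hypothetical helper, not a function in the script.

```python
def build_tesseract_args(psm=3, oem=3, lang="pol"):
    """Return (lang, config) suitable for pytesseract.image_to_string."""
    # Tesseract defines page segmentation modes 0-13 and engine modes 0-3.
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return lang, f"--psm {psm} --oem {oem}"


# Example: English text treated as a single uniform block (PSM 6).
lang, config = build_tesseract_args(psm=6, oem=1, lang="eng")
# text = pytesseract.image_to_string(img, lang=lang, config=config)
```

The language code must match a trained data file installed with Tesseract, for example `eng` or `deu`; several languages can be combined with `+`, as in `pol+eng`.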

Algorithm

Schema

The algorithm iterates over all input polygon features and processes each one as follows:

  • Exports the feature as a separate shapefile in the temporary files location.
  • Clips the raster overlapping the feature to the boundary of that shapefile and saves the result in the temporary files location (using the GDAL library).
  • Recognizes text in the clipped raster image (using the Pytesseract library).
  • Adds the recognized text as an attribute value in the output feature's field.
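
The four steps above can be sketched as a plain loop. This is an illustration under stated assumptions, not the script's actual code: the three step functions are passed in as parameters, and the names `run_ocr_over_features`, `export_feature`, `clip_raster`, `recognize_text` and the temporary file names are all hypothetical.

```python
from pathlib import Path


def run_ocr_over_features(features, export_feature, clip_raster, recognize_text,
                          temp_dir, field_name="text"):
    """Apply the four steps to every feature and return updated attributes.

    features: iterable of (feature_id, attributes_dict) pairs.
    """
    temp_dir = Path(temp_dir)
    results = []
    for feature_id, attributes in features:
        shp_path = temp_dir / f"feature_{feature_id}.shp"
        tif_path = temp_dir / f"clip_{feature_id}.tif"
        export_feature(feature_id, shp_path)   # 1. single-feature shapefile
        clip_raster(shp_path, tif_path)        # 2. raster clipped to the feature
        text = recognize_text(tif_path)        # 3. OCR on the clipped image
        updated = dict(attributes)             # copy so the input stays untouched
        updated[field_name] = text             # 4. store text in the output field
        results.append((feature_id, updated))
    return results
```

In a real run, `clip_raster` might wrap a GDAL clipping call and `recognize_text` might wrap `pytesseract.image_to_string`; passing them in keeps the loop easy to test with stand-in functions.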

Parameters

Input polygon layer
The polygon features within which text is recognized.

Text output field
The field in the input table to which the recognized text will be added.

Run for all raster layers
If checked, the algorithm recognizes text from all active raster layers.

Input raster layer
If the checkbox above is unchecked, the algorithm recognizes text only from this raster layer.
For multiband raster images, only the first band is used.

Page Segmentation Mode
Tesseract page segmentation mode (the --psm option).

OCR Engine Model
Tesseract OCR engine mode (the --oem option).

Remove comma
If the last character of the recognized text is a comma, it is removed.
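
A minimal sketch of this rule follows. `remove_trailing_comma` is a hypothetical name, and stripping trailing whitespace first is an assumption, made because OCR output often ends with a newline or form feed that would otherwise hide the comma.

```python
def remove_trailing_comma(text):
    """Drop one trailing comma, ignoring trailing whitespace (assumption)."""
    cleaned = text.rstrip()      # removes trailing spaces, newlines, form feeds
    if cleaned.endswith(","):
        cleaned = cleaned[:-1]   # remove only the final comma
    return cleaned
```

Commas inside the text are left alone; only a comma at the very end is removed.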

Temporary files location
Location of intermediate files, such as the image converted to 8-bit TIFF, the image clipped to a single feature, and the shapefile containing only one feature. These files are created while iterating over the input features.
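
As an illustration of the 8-bit conversion, the sketch below builds a `gdal_translate` command line that writes the first band as an 8-bit GeoTIFF. It is one possible approach and not necessarily how PW_OCR does it; `gdal_translate_args` is a hypothetical helper, and it assumes the GDAL command-line tools are installed.

```python
def gdal_translate_args(src_path, dst_path, band=1):
    """Build a gdal_translate command that writes one band as 8-bit GeoTIFF."""
    return [
        "gdal_translate",
        "-of", "GTiff",    # output format: GeoTIFF
        "-ot", "Byte",     # output data type: 8-bit unsigned
        "-scale",          # rescale source min/max to 0-255
        "-b", str(band),   # select a single band (the first by default)
        str(src_path),
        str(dst_path),
    ]


# Example usage (requires the GDAL command-line tools):
# import subprocess
# subprocess.run(gdal_translate_args("input.tif", "temp/input_8bit.tif"), check=True)
```

Building the argument list separately makes it possible to inspect or log the command before running it.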

Output layer
Location of the output layer with the text attribute filled in.

See also

PW_OCR_ADVANCED

PW_ABBREVIATIONS
