uzn format
Nick White edited this page Aug 21, 2014 · 1 revision
uzn is a simple text file format for describing sections of a scanned image. The migneuzn tool outputs in this format for its segmentation.
The format is simply one zone per line, with whitespace-separated fields:
left top width height freetext
So an example .uzn file is this:
395 368 1633 78 Text/Latin
2030 368 1634 78 Text/Greek
388 478 1633 2275 Text/Greek
2031 478 1634 2275 Text/Latin
396 2852 1633 1002 Text/Greek
2018 2852 1634 1002 Text/Latin
471 3960 1565 75 Text/Latin
1639 4141 685 62 AppCrit
394 4293 3249 1482 AppCrit
4078 462 5 606 AppCrit
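As an illustration of the layout above, here is a minimal Python sketch of a reader for this format. The names `Zone` and `parse_uzn` are hypothetical and not part of any tool mentioned here. It assumes the four numeric fields are integers, that blank lines can be skipped, and that the free text runs to the end of the line; the page does not say whether the free text may contain spaces, so treat that as a guess.

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    """One line of a uzn file (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int
    label: str


def parse_uzn(text: str) -> List[Zone]:
    """Parse the contents of a uzn file into a list of zones."""
    zones = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        # Split into at most five fields so the free text is kept whole.
        parts = line.split(None, 4)
        if len(parts) < 4:
            raise ValueError(f"line {number}: expected at least 4 fields")
        left, top, width, height = (int(p) for p in parts[:4])
        label = parts[4] if len(parts) == 5 else ""
        zones.append(Zone(left, top, width, height, label))
    return zones


if __name__ == "__main__":
    sample = "395 368 1633 78 Text/Latin\n1639 4141 685 62 AppCrit\n"
    for zone in parse_uzn(sample):
        print(zone)
```

Returning named tuples keeps the fields readable (`zone.width`, `zone.label`) while still allowing plain tuple unpacking.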
Tesseract can read in uzn files, and use them instead of doing its own segmentation, on two conditions:
- The segmentation mode PSM_SINGLE_COLUMN must be used (which is the default for the migneocr tool).
- The uzn file must be named <imagebase>.uzn, where <imagebase> is the path of the image, without the file extension. So for scan001.png the uzn file must be named scan001.uzn.
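The naming rule can be sketched in Python as follows. The function name `uzn_path_for` is hypothetical, and the sketch assumes that "file extension" means only the final suffix of the image name.

```python
from pathlib import Path


def uzn_path_for(image_path: str) -> Path:
    """Return the uzn path for an image: same path, final extension replaced by .uzn."""
    return Path(image_path).with_suffix(".uzn")


if __name__ == "__main__":
    print(uzn_path_for("scan001.png"))  # scan001.uzn
```

Only the last suffix is replaced, so a name such as `scan.001.png` would map to `scan.001.uzn` under this assumption.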
The format is sometimes called a "zone file," and was created for the UNLV OCR tests in the 1990s. The name probably comes from "UNLV Zone."