Nick White edited this page Aug 21, 2014 · 1 revision

uzn is a simple text file format for describing sections (zones) of a scanned image. The migneuzn tool writes its segmentation results in this format.

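Each line describes one zone as whitespace-separated fields: four integers giving the zone's rectangle, followed by a free-text label: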

left top width height freetext

An example .uzn file:

  395   368  1633    78 Text/Latin
 2030   368  1634    78 Text/Greek
  388   478  1633  2275 Text/Greek
 2031   478  1634  2275 Text/Latin
  396  2852  1633  1002 Text/Greek
 2018  2852  1634  1002 Text/Latin
  471  3960  1565    75 Text/Latin
 1639  4141   685    62 AppCrit
  394  4293  3249  1482 AppCrit
 4078   462     5   606 AppCrit
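
To make the field layout concrete, here is a minimal Python sketch of a reader for this format. The names `Zone` and `parse_uzn` are hypothetical and are not part of Tesseract or the migne tools. The sketch assumes the first four fields are always integers, that blank lines can be ignored, and that everything after the fourth field belongs to the label.

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    """One line of a uzn file (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int
    label: str


def parse_uzn(text: str) -> List[Zone]:
    """Parse the contents of a uzn file into a list of zones."""
    zones = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        # Split at most four times so the label keeps any internal spaces.
        parts = line.split(None, 4)
        if len(parts) < 4:
            raise ValueError(f"expected at least 4 fields, got: {line!r}")
        left, top, width, height = (int(p) for p in parts[:4])
        label = parts[4].strip() if len(parts) > 4 else ""
        zones.append(Zone(left, top, width, height, label))
    return zones


sample = """\
  395   368  1633    78 Text/Latin
 2030   368  1634    78 Text/Greek
"""
zones = parse_uzn(sample)
```

With the two sample lines above, `zones[0]` holds the Latin text zone and `zones[1]` the Greek one.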

Tesseract can read uzn files and use them instead of doing its own segmentation, provided two conditions are met:

  1. The segmentation mode PSM_SINGLE_COLUMN must be used (which is the default for the migneocr tool)
  2. The uzn file must be named <imagebase>.uzn, where <imagebase> is the path of the image, without the file extension. So for scan001.png the uzn file must be named scan001.uzn.
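
The naming rule in the second condition can be sketched in Python as follows. The function name `uzn_path_for` is hypothetical, and the sketch assumes that only the final extension is removed from the image path.

```python
import os.path


def uzn_path_for(image_path: str) -> str:
    """Return the uzn file name expected next to the given image."""
    # Drop the last extension (assumption: only the final one) and add .uzn.
    base, _ext = os.path.splitext(image_path)
    return base + ".uzn"


expected = uzn_path_for("scan001.png")
```

Here `expected` is `scan001.uzn`, matching the example in the list above.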

The format is sometimes called a "zone file," and was created for the UNLV OCR tests in the 1990s. The name probably comes from "UNLV Zone."
