NoraCorrection

OlaBauge edited this page Oct 13, 2009 · 11 revisions

Overview

Detecting gibberish

In many stretches of text, pdfBox is unable to map glyphs to normative code points. It then numbers the glyphs instead and outputs them in a form such as "g81g72g81g3g82g80g3g76g81g86g87". The prefix before each number may be a single letter (g, a, x) or a longer string such as 'affi' (in connection with Arabic glyphs) or 'cid' (seen only in documents produced on a Mac).

Most of these unmapped glyphs can be found with the regex  \d(([A-Za-z]+)\d+){3}  which matches a digit followed by three consecutive groups of letters and then digits, with nothing in between. A high proportion of text like this indicates that the document is mostly unreadable.
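As a rough illustration of the idea, the proportion of matched text could be computed along these lines. This is a minimal Python sketch, not the Perl script linked below; the function name `gibberish_ratio` and the choice to measure coverage as a fraction of non-whitespace characters are assumptions made for the example.

```python
import re

# The regex quoted above: a digit, then three letters-then-digits groups.
ANON_GLYPH = re.compile(r"\d(([A-Za-z]+)\d+){3}")


def gibberish_ratio(text):
    """Return the fraction of non-whitespace characters covered by matches.

    Hypothetical helper: the scoring used by the original script may differ.
    """
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    # Matches contain only letters and digits, so their lengths can be
    # compared directly with the non-whitespace character count.
    matched = sum(len(m.group(0)) for m in ANON_GLYPH.finditer(text))
    return matched / total


if __name__ == "__main__":
    sample = "g81g72g81g3g82g80g3g76g81g86g87"
    print(gibberish_ratio(sample))
    print(gibberish_ratio("Ordinary readable text scores zero."))
```

Because each match covers exactly three groups, the ratio undercounts long runs of unmapped glyphs somewhat, so any cut-off for "mostly unreadable" would have to be chosen by inspecting real documents.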

Classification of documents, sorted by score: [http://heim.ifi.uio.no/olasba/nora/anonglyphs06.log]

Source code: [http://heim.ifi.uio.no/olasba/nora/findanonglyph06.pl.txt]

  • —Ola, 13 Oct