NoraCorrection

Overview

Detecting gibberish

On many stretches of text where PDFBox cannot map glyphs to normative code points, it numbers the glyphs instead and outputs them in a form such as "g81g72g81g3g82g80g3g76g81g86g87". The letter prefix may be g, a, or x; it may also be a longer string such as 'affi' (in connection with Arabic glyphs) or 'cid' (seen only in documents produced on a Mac).

Most of these unmapped glyphs can be found with the regex \d(([A-Za-z]+)\d+){3}, which matches a digit followed by three consecutive groups of letters and digits with nothing in between. A high proportion of text like this indicates that the document is mostly unreadable.
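As a rough illustration of how such a score could be computed, here is a minimal Python sketch. It is not the linked Perl script: the function name `gibberish_score` and the choice of "fraction of characters covered by matches" as the measure are assumptions made for this example. The regex is the one given above, written with non-capturing groups.

```python
import re

# Regex from the text above; non-capturing groups so finditer yields whole matches.
ANON_GLYPH_RUN = re.compile(r"\d(?:[A-Za-z]+\d+){3}")


def gibberish_score(text: str) -> float:
    """Return the fraction of characters covered by regex matches.

    Hypothetical measure for illustration; the original script may
    score documents differently.
    """
    if not text:
        return 0.0
    covered = sum(m.end() - m.start() for m in ANON_GLYPH_RUN.finditer(text))
    return covered / len(text)


if __name__ == "__main__":
    # Sample taken from the text above, plus an ordinary readable sentence.
    print(gibberish_score("g81g72g81g3g82g80g3g76g81g86g87"))
    print(gibberish_score("Plain readable text from 2006."))
```

Because the pattern starts with a digit and matches do not overlap, even a stretch consisting purely of numbered glyphs scores somewhat below 1.0 under this measure, so any cut-off for "mostly unreadable" would need to be calibrated on real documents.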

Classification of documents, sorted by score: [http://heim.ifi.uio.no/olasba/nora/anonglyphs06.log]

Source code: [http://heim.ifi.uio.no/olasba/nora/findanonglyph06.pl.txt]

Ola, 13 Oct
