NoraCorrection
On many stretches of text where PDFBox is unable to map glyphs to standard code points, it instead numbers the glyphs and outputs them in the form "g81g72g81g3g82g80g3g76g81g86g87" etc. The prefix before each number is usually a single letter (g, a, or x), but it may also be a string such as 'affi' (in connection with Arabic glyphs) or 'cid' (seen only in documents produced on a Mac).
Most of these unmapped glyphs can be found with the regex \d(([A-Za-z]+)\d+){3}, which matches a digit followed by three consecutive groups of letters and digits with nothing in between. A high proportion of text like this indicates that the document is mostly unreadable.
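As a rough illustration of how such a score could be computed, here is a minimal Python sketch. It is not the Perl script linked below. The function name anon_glyph_ratio and the choice to measure the matched share of non-whitespace characters are assumptions made for this example; only the regex itself comes from the text above.

```python
import re

# Regex quoted in the text above, with non-capturing groups.
# re.ASCII restricts \d to 0-9 so the match is limited to ASCII digits.
ANON_GLYPH = re.compile(r"\d(?:[A-Za-z]+\d+){3}", re.ASCII)


def anon_glyph_ratio(text: str) -> float:
    """Return the share of non-whitespace characters covered by matches.

    Hypothetical scoring rule for illustration; the original script may
    score documents differently.
    """
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    # Matches never contain whitespace, so matched <= total.
    matched = sum(len(m.group(0)) for m in ANON_GLYPH.finditer(text))
    return matched / total


if __name__ == "__main__":
    garbled = "g81g72g81g3g82g80g3g76g81g86g87"
    readable = "This is ordinary readable text from 2009."
    print(anon_glyph_ratio(garbled))
    print(anon_glyph_ratio(readable))
```

Because matches do not overlap and a trailing run of fewer than three groups is not counted, the ratio stays below 1 even for fully garbled text. Any cut-off for calling a document unreadable would need to be tuned on real extraction output.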
Classification of documents, sorted by score: [http://heim.ifi.uio.no/olasba/nora/anonglyphs06.log]
Source code: [http://heim.ifi.uio.no/olasba/nora/findanonglyph06.pl.txt]
- —Ola, 13 Oct