Commit
Replace pdf.ttf with sharp2.ttf, keep name the same
As discussed at length in issue #182, the existing pdf.ttf causes difficulties for certain PDF viewers, in part because the old file had zero advance width. In testing, sharp2.ttf seems to be the best available compromise, although it is not perfect and causes some visual difficulties in Evince. It does seem to fix the problems in Kindle and OS X Preview.
b30930b
Considering the Ghostscript bug and the related discussion at http://bugs.ghostscript.com/show_bug.cgi?id=696116, can pdf.ttf somehow be adjusted to work around the "gs" behavior?
I updated the font metrics in pdf.ttf for Firefox compatibility. That change should have reached GitHub recently as part of Tesseract 4.x, so the first thing to do is probably to retest once the dust settles. [EDIT: I am going to sit down and figure out the current state of affairs on Monday, before I confuse myself and everybody else.]
PS. Don't commit compatibility changes to PDF generation without my involvement. It is very easy to break one thing while fixing another.
Okay, Ray I've confirmed that the github pdf.ttf needs updating. Talking to Ray...
This is what we ultimately want.
$ md5sum pdf.ttf
e436074b54ed9cc5bf4789f79059b01b pdf.ttf
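For anyone who wants to script the same check, here is a minimal sketch in Python. The expected digest is the one quoted above; the function names and the idea of pointing it at a local copy of pdf.ttf are illustrative assumptions, not part of Tesseract.

```python
import hashlib

# Digest quoted in the comment above for the desired pdf.ttf.
EXPECTED_MD5 = "e436074b54ed9cc5bf4789f79059b01b"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_matches(path, expected=EXPECTED_MD5):
    """True if the file at `path` has the expected MD5 digest (hypothetical helper)."""
    return md5_of_file(path) == expected.lower()
```

Calling `font_matches("pdf.ttf")` from the directory that holds the font should then agree with the `md5sum` output.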
This issue I reported to Ghostscript is related:
http://bugs.ghostscript.com/show_bug.cgi?id=696874
OCR output produced by Tesseract will survive Ghostscript pdfwrite only in versions 9.20 and later. Versions <= 9.19 have a bug that can corrupt the character mapping if characters above U+00FF appear. That can easily happen for "plain English" if Tesseract misdetects a diacritic, or picks up a ligature or special character.
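A rough way to tell whether a given piece of OCR text could trigger this is to look for code points above U+00FF. The sketch below is only an illustration: both helper names are hypothetical, and the version check assumes a plain dotted version string such as "9.19".

```python
def chars_above_latin1(text):
    """Return the sorted distinct characters in `text` with code points above U+00FF."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


def ghostscript_is_affected(version):
    """True for versions <= 9.19, per the comment above.

    Assumes a dotted numeric string like "9.19" or "9.20"; only the
    first two components are compared.
    """
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor <= (9, 19)


def at_risk(text, gs_version):
    """True if the text contains characters above U+00FF and the Ghostscript version is affected."""
    return ghostscript_is_affected(gs_version) and bool(chars_above_latin1(text))
```

For example, a misread ligature such as "ﬃ" (U+FB03) in an otherwise English word is enough to make `at_risk` return true for an affected version, while "café" is not, because "é" is U+00E9.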