Misaligned text selection due to overriding font metrics from dummy invisible font file (used when OCRing) #6863
Comments
This might be related to #6509. /cc @yurydelendik
@rossj Can you help me determine exactly which font metrics disagree? I am the right person to make an update with respect to Tesseract. Here's the current situation at HEAD in Tesseract.
We have the same problem with our OCR'ed PDFs generated with Tesseract. Is there any plan to fix this in the future? If we can help in any way to test the issue, please ask.
Sorry for the delayed response @jbreiden; I missed your reply notification. I believe the discrepancy is in the […] I believe setting […] As a potentially separate but related issue, I believe Tesseract's PDF generator is intended to create half-height highlights (see this comment). I think setting the PDF font metric ascent to be 1000 (and setting the corresponding […]
@rossj Thank you, I raised ascent to 1024 in the TTF and tested in Firefox version 46.0.1. Although there are significant improvements, we're not quite there yet. Any thoughts? (Images below are Before vs After the TTF change)
I get better but not perfect results on Firefox with additional fiddling. Specifically switching some of the various yMax numbers in the font to 2048 and switching PDF height metrics to 1000 as per your suggestions. My guess is most PDF viewers are primarily looking at the metrics from the PDF file itself, and getting good results. Whereas Firefox is listening to the font and still finding some trouble in there. I just don't know what that trouble is.
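For readers following the numbers above, here is a minimal sketch of how a metric stored in font units relates to the 1000-units-per-em glyph space that PDF FontDescriptor values use. The `unitsPerEm` value of 2048 is an assumption for illustration; the thread does not state what the dummy font actually uses, and the function name is hypothetical.

```python
def to_pdf_glyph_units(value_in_font_units: float, units_per_em: int) -> float:
    """Scale a font-file metric to the 1000-units-per-em space of a PDF FontDescriptor."""
    return value_in_font_units * 1000.0 / units_per_em


# ASSUMPTION: unitsPerEm = 2048 (not confirmed anywhere in this thread).
UNITS_PER_EM = 2048

print(to_pdf_glyph_units(1024, UNITS_PER_EM))  # an ascent of 1024 font units
print(to_pdf_glyph_units(2048, UNITS_PER_EM))  # a yMax of 2048 font units
```

Under that assumption, a font-file value of 2048 is what would line up with a FontDescriptor value of 1000, while 1024 would correspond to only half of it.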
Here's another example to play with. I recommend taking a look at the calculation that handles Tz, especially when it is doing a horizontal squeeze (e.g. anything less than 100).
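As a reference point for the Tz calculation, the sketch below follows the horizontal displacement formula from the PDF specification's text-positioning rules, tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th, with Th = Tz/100. It is not pdf.js code; the function and parameter names are hypothetical, and the TJ adjustment is left out for brevity.

```python
def horizontal_advance(glyph_width: float, font_size: float,
                       tz_percent: float = 100.0,
                       char_spacing: float = 0.0,
                       word_spacing: float = 0.0,
                       is_space: bool = False) -> float:
    """Advance for one glyph in text space units.

    glyph_width is in glyph space (thousandths of an em), as in a PDF
    /Widths array. Word spacing applies only to the space character.
    """
    th = tz_percent / 100.0                 # Th: horizontal scaling factor
    tw = word_spacing if is_space else 0.0  # Tw
    return ((glyph_width / 1000.0) * font_size + char_spacing + tw) * th


# A 500-unit-wide glyph at 12 pt, unscaled and squeezed to Tz = 50.
print(horizontal_advance(500, 12))
print(horizontal_advance(500, 12, tz_percent=50))
```

Note that Th multiplies the character and word spacing as well as the glyph width, so a viewer that scales only the glyph width would drift on lines with a horizontal squeeze.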
Any progress on this?
I just tested against Firefox 52.8.1. Looks even worse, especially copy-paste performance. Would love to discuss with a pdf.js programmer.
Closing since most of these files are fixed now, and this issue contains lots of different files with different problems, making it less actionable. The Tesseract problem is still tracked in another issue. If there are remaining issues, please open a new issue per problem.
Whilst upgrading PDF.js, I noticed a potential regression where text selection / highlighting of hidden text in OCRd documents appears vertically offset from the text in the background image.
I tracked the issue down to pull request #6601 (merged in commit bb29e13), which overrides the PDF FontDescriptor metrics with metrics from the header of the embedded font file. Below is a link to a test file that displays the issue, along with screenshots using commit bb29e13 and the previous commit 9a830a7.
ocr-highlight.pdf
Commit bb29e13
![Commit bb29e13307a813226e08db6db412b505ab9dc781](https://cloud.githubusercontent.com/assets/735679/12287585/4c1bfbcc-b993-11e5-8cdd-dba6fe0b0e11.png)
Commit 9a830a7
![Commit 9a830a7b624679ace8fcae5b91ad80d70d91ed1f](https://cloud.githubusercontent.com/assets/735679/12287957/230b5428-b996-11e5-9ce6-f792521ad059.png)
It seems that in order to create the overlaid invisible text, some OCR software uses a dummy TTF file with dummy metric information and a single glyph, and the PDF FontDescriptor information is expected to be used instead. Tesseract OCR uses this technique. You can see a discussion of the method in their code here.
It doesn't seem that Tesseract changes the PDF FontDescriptor information based on the font that is being OCR'd, so it does seem reasonable for them to update the dummy .ttf file to contain the same metrics that are used in the FontDescriptor, which I believe would fix the issue for new PDFs generated by Tesseract. However, this won't help PDFs that are already out in the wild with this discrepancy.
Perhaps it would be possible to detect these dummy fonts and then trust the FontDescriptor information?
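One possible shape for such a check, as a rough sketch only: treat an embedded font with almost no glyphs as a placeholder and fall back to the FontDescriptor values. This is not how pdf.js does it; the `Metrics` type, the function name, the glyph-count threshold, and the numbers in the usage lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    ascent: float   # in 1000-units-per-em glyph space
    descent: float


def choose_metrics(descriptor: Metrics, font_file: Metrics,
                   glyph_count: int, max_dummy_glyphs: int = 2) -> Metrics:
    """Prefer FontDescriptor metrics when the embedded font looks like a dummy.

    ASSUMPTION: a dummy OCR font has at most `max_dummy_glyphs` glyphs
    (for example .notdef plus one invisible glyph). A real heuristic
    would likely need more signals than the glyph count alone.
    """
    looks_like_dummy = glyph_count <= max_dummy_glyphs
    return descriptor if looks_like_dummy else font_file


# Illustrative values only; not taken from any real file.
descriptor = Metrics(ascent=1000, descent=-1)
font_file = Metrics(ascent=500, descent=0)

print(choose_metrics(descriptor, font_file, glyph_count=2))
print(choose_metrics(descriptor, font_file, glyph_count=250))
```

The trade-off is that a glyph-count threshold could misfire on legitimately tiny subset fonts, so any real check would need testing against files like ocr-highlight.pdf as well as ordinary subset fonts.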