[api-minor] Propagate the translated font name to TextContentItem for system fonts #15659
However, I suppose that the effect of this wouldn't be that severe considering how the caching is currently implemented (essentially using the pdf.js/src/display/text_layer.js Lines 130 to 135 in eda51d1
pdf.js/src/display/text_layer.js Lines 49 to 53 in eda51d1
The more problematic part is the API "breakage", since all of this is very old code. @timvandermeij How do you feel about this change? Edit: It just occurred to me that one additional point in favour of this change is probably that it's already inconsistent, since setting
Hi Jonas, thanks for taking a look at this so quickly, and for the explanations. To provide some additional context, the reason for this change is that it can be useful to look up the original font name, bolding, etc. of extracted text (e.g., for document understanding applications). Right now, this is possible for non-system fonts, but as far as I can tell, it's not feasible for system fonts because of the name mismatch. So even though there's no requirement for the fonts to match between the two modes, I think it would make the API more useful to maintain that correspondence. Another way to do this would be to store a separate field (e.g., "fontObjectName" or some such) with the TextContentItem (perhaps only when it differs from the fontName, i.e., only for system fonts), so that callers can do a lookup via item.fontObjectName || item.fontName. This would make the extraction output a bit larger but maintain backwards compatibility. Thoughts?
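For illustration, the fallback lookup described in this comment could be sketched roughly as below. This is only a sketch: `fontObjectName` is the field name floated in the comment and not an existing TextContentItem property, `resolveFontKey` is a hypothetical helper, and the `g_d0_f2` key is made up for the example.

```javascript
// Sketch only: `fontObjectName` is the field proposed in this comment,
// not an existing property of TextContentItem.
function resolveFontKey(item) {
  // Prefer the separate object name when it is present (system fonts);
  // otherwise fall back to fontName, which already matches for non-system fonts.
  return item.fontObjectName || item.fontName;
}

// Illustrative items; the names and values are assumptions for the example.
const items = [
  { str: "embedded", fontName: "g_d0_f1" },
  { str: "system", fontName: "Times", fontObjectName: "g_d0_f2" },
];

const keys = items.map(resolveFontKey);
console.log(keys);
```

Existing callers that only read `item.fontName` would be unaffected by such a field, which is the backwards-compatibility argument made in the comment.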
In this case it's pretty clear why a third-party implementation might want to do this; however (in general) please keep in mind that additional context usually belongs in the commit message (and not just the PR description/discussion) in order to aid future code archaeology :-)
This sounds very much like PR #10753, which we really don't want to do because of its overhead.
(You can obviously wait with addressing these comments until we've agreed to the general approach.)
Good to know, I'll definitely keep that in mind for future commits.
That makes sense. If I'm reading correctly, this change would enable the use case that PR #10753, Issue #15651, etc. are interested in, without the memory overhead. Thanks for the code review as well. I'll wait for Tim to chime in on the general approach - in the meantime, would you prefer that I apply fixes by amending the original commit, or by tacking on another commit and squashing them if and when the PR gets merged?
In terms of the general approach, I'd say that if it's already inconsistent (either in terms of system versus non-system fonts behaving differently and the
This allows font data for system fonts to be looked up in the PDFObjects.
0d6d7b2 to 36fb5c1
/botio test
From: Bot.io (Windows)
Received
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/e24e3021dcb54cb/output.txt
From: Bot.io (Linux m4)
Received
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/aea9cd687619c41/output.txt
From: Bot.io (Linux m4)
Failed
Full output at http://54.241.84.105:8877/aea9cd687619c41/output.txt
Total script time: 25.41 mins
Image differences available at: http://54.241.84.105:8877/aea9cd687619c41/reftest-analyzer.html#web=eq.log
From: Bot.io (Windows)
Failed
Full output at http://54.193.163.58:8877/e24e3021dcb54cb/output.txt
Total script time: 32.12 mins
Thank you!
This allows font data for system fonts to be looked up in PDFObjects. Without this, only text with non-system fonts gets a font name that matches the name in PDFObjects (e.g., "g_d0_f1"), while text with system fonts gets an untranslated name (e.g., "Times").
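As a rough sketch of what this enables for callers, the snippet below pairs each text item with the font object stored under its `fontName`. It makes several assumptions: `objs` is a stand-in for a PDFObjects-like store exposing `has(id)` and `get(id)`, `fontsForItems` is a hypothetical helper, and the `g_d0_f2` key and the font fields are made up for the example.

```javascript
// Sketch only: `objs` stands in for a PDFObjects-like store (assumed to
// expose has(id) and get(id)); `fontsForItems` is a hypothetical helper.
function fontsForItems(textItems, objs) {
  return textItems.map(item => ({
    str: item.str,
    fontName: item.fontName,
    // null when the name has no matching entry, e.g. an untranslated "Times".
    font: objs.has(item.fontName) ? objs.get(item.fontName) : null,
  }));
}

// Stand-in store; the keys and font fields shown here are illustrative.
const objs = new Map([
  ["g_d0_f1", { name: "ABCDEF+SomeFont", bold: false }],
  ["g_d0_f2", { name: "Times", bold: false }],
]);

// Untranslated system font name: no entry is found in the store.
const before = fontsForItems([{ str: "Hello", fontName: "Times" }], objs);
// Translated name: the lookup finds the font object.
const after = fontsForItems([{ str: "Hello", fontName: "g_d0_f2" }], objs);

console.log(before[0].font);
console.log(after[0].font.name);
```

With real pdf.js, the store would presumably be `page.commonObjs` and the items would come from `page.getTextContent()`; those details are outside what this thread shows, so treat them as assumptions.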