Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-minor] Propagate the translated font name to TextContentItem for system fonts #15659

Merged
merged 1 commit into from
Nov 8, 2022

Conversation

sxyuan
Copy link
Contributor

@sxyuan sxyuan commented Nov 4, 2022

This allows font data for system fonts to be looked up in PDFObjects. Without this, only text with non-system fonts get a font name that matches the name in PDFObjects (e.g., "g_d0_f1") while text with system fonts get an untranslated name (e.g., "Times").

@Snuffleupagus

This comment was marked as outdated.

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Nov 4, 2022

Furthermore this patch will cause a lot more font switching in the textLayer thus (potentially) affecting performance,

However, I suppose that the effect of this wouldn't be that severe considering how the caching is currently implemented (essentially using the Font.prototype.fallbackName property):

const style = styles[geom.fontName];
if (style.vertical) {
angle += Math.PI / 2;
}
const fontHeight = Math.hypot(tx[2], tx[3]);
const fontAscent = fontHeight * getAscent(style.fontFamily, ctx);
and
function getAscent(fontFamily, ctx) {
const cachedAscent = ascentCache.get(fontFamily);
if (cachedAscent) {
return cachedAscent;
}

and would also break API-compatibility for something that's existed for (at least) ten years.

The more problematic part is the API "breakage", since all of this is very old code. @timvandermeij How do you feel about this change?

Edit: It just occurred to me that one additional point in favour of this change is probably that it's already inconsistent, since setting useSystemFonts=false and thus triggering use of the standardFontData will already (effectively) enable this behaviour for non-embedded standard fonts.

@Snuffleupagus Snuffleupagus reopened this Nov 4, 2022
@sxyuan
Copy link
Contributor Author

sxyuan commented Nov 4, 2022

Hi Jonas, thanks for taking a look at this so quickly, and for the explanations. To provide some additional context, the reason for this change is because it can be useful to look up the original font name, bolding, etc. of extracted text (e.g., for document understanding applications). Right now, this is possible for non-system fonts, but as far as I can tell, it's not feasible to do this for system fonts because of the name mismatch. So even though there's no requirement for the fonts to match between the two modes, I think it would make the API more useful to maintain that correspondence.

Another way to do this would be to store a separate field (e.g., "fontObjectName" or some such) with the TextContentItem (perhaps only when it differs from the fontName, i.e., only for system fonts), and callers can do a lookup via item.fontObjectName || item.fontName. This would make the extraction output a bit larger but maintain backwards compatibility. Thoughts?

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Nov 4, 2022

To provide some additional context, the reason for this change is because it can be useful to look up the original font name, bolding, etc. of extracted text (e.g., for document understanding applications).

In this case it's pretty clear why a third-party implementation might want to do this; however (in general) please keep in mind that additional context usually belongs in the commit message (and not just the PR description/discussion) in order to aid future code archaeology :-)

Another way to do this would be to store a separate field (e.g., "fontObjectName" or some such) with the TextContentItem (perhaps only when it differs from the fontName, i.e., only for system fonts), and callers can do a lookup via item.fontObjectName || item.fontName. This would make the extraction output a bit larger but maintain backwards compatibility. Thoughts?

This sounds like very much like PR #10753, which we really don't want to do because of its overhead.

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(You can obviously wait with addressing these comments until we've agreed to the general approach.)

@sxyuan
Copy link
Contributor Author

sxyuan commented Nov 4, 2022

In this case it's pretty clear why a third-party implementation might want to do this; however (in general) please keep in mind that additional context usually belongs in the commit message (and not just the PR description/discussion) in order to aid future code archaeology :-)

Good to know, I'll definitely keep that in mind for future commits.

This sounds like very much like PR #10753, which we really don't want to do because of its overhead.

That makes sense. If I'm reading correctly, this change would enable the use case that PR #10753, Issue #15651, etc. are interested in, without the memory overhead.

Thanks for the code review as well. I'll wait for Tim to chime in on the general approach - in the meantime, would you prefer that I apply fixes by amending the original commit, or by tacking on another commit and squashing them if and when the PR gets merged?

@timvandermeij
Copy link
Contributor

timvandermeij commented Nov 5, 2022

In terms of the general approach, I'd say that if it's already inconsistent (either in terms of system versus non-systems fonts behaving differently and the useSystemFonts option also adding a difference in behavior) it sounds better to make it consistent. If we mark this as [api-minor] we at least indicate it's an API change for safety, but while I can't fully oversee if people actually rely on the current behavior I don't think that's the case.

@Snuffleupagus Snuffleupagus changed the title Propagate the translated font name to TextContentItem for system fonts. [api-minor] Propagate the translated font name to TextContentItem for system fonts Nov 5, 2022
This allows font data for system fonts to be looked up in the
PDFObjects.
@sxyuan sxyuan force-pushed the system-font-name-fix branch from 0d6d7b2 to 36fb5c1 Compare November 8, 2022 19:16
@Snuffleupagus
Copy link
Collaborator

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Nov 8, 2022

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/e24e3021dcb54cb/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Nov 8, 2022

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/aea9cd687619c41/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Nov 8, 2022

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/aea9cd687619c41/output.txt

Total script time: 25.41 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 8

Image differences available at: http://54.241.84.105:8877/aea9cd687619c41/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented Nov 8, 2022

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/e24e3021dcb54cb/output.txt

Total script time: 32.12 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: Passed

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants