Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 7613, Parse AcroForm Dictionary #8625

Closed
wants to merge 37 commits into from

Conversation

dmitryskey
Copy link
Contributor

@dmitryskey dmitryskey commented Jul 6, 2017

#7613, Parse AcroForm dictionary and 'DR' appearance and set annotation wizard font properties.

It's related to the section:

Appearances

  • It seems like the current font code for text widgets is dead as fontRefName is never set...
  • Parse AcroForm dictionary

There are two commits, first is dealing with fix itself, second updates unit tests

@timvandermeij
Copy link
Contributor

timvandermeij commented Jul 14, 2017

First of all, thank you for working on this!

The core of the change looks good, but I'm also noticing a lot of other changes that do not appear to be related to the appearances issue. One of those changes is making the factory asynchronous; why is that necessary/helpful? Another one is the xref check in the factory; why do we need that (xref should always be there)? Finally, why do we need the entire AnnotationWorkerTask?

In short, it looks like this patch could be simplified a lot, which will also make reviewing considerably easier. The core of the change appears to be https://github.com/mozilla/pdf.js/pull/8625/files#diff-18ca06e2bd5be4c9132b51f78388b8f0R697; let's keep it mostly at that and avoid other changes that are not directly related to the appearance issue.

@dmitryskey
Copy link
Contributor Author

Hi!

Thank you very much for the code review, I really appreciate it. Let me explain my vision of the problem and the general approach. The annotation field font information is stored as /DA field like '/Helv 0 Tf 0 g', currently it's saved in the defaultAnnotation property as a string and therefore should be parsed further. I definitely didn't want to write a mini-parser of any kind and wanted to re-utilize an existing PartialEvaluator. But it required the following changes

  1. It works asynchronously, so I had to change the way annotation factory was working. Also as a result unit tests were modified
  2. Parser needs a worker. But I really tried to minimize changes and didn't want to change API for annotations property (document.js:319), therefore I preferred to create an internal worker mockup. This can be area of improvement, I just not such familiar with PDF.JS code base, it may be better to change this property to the function with worker parameter, get rid of internal mockup and just use the standard worker

As for the xref - I feel it can be simplified, but it's only way I found to save parsed font info for the annotation layer factory.

screen shot 2017-07-14 at 9 26 12 am

@timvandermeij
Copy link
Contributor

Thank you for the clarification; it helps to understand your choices. The re-use strategy is good and it is definitely what we need to do here. I'll check if we can simplify the code, e.g., I have a feeling that the worker task may be solved differently to require less code, but I'm not too familiar with that part of the code base myself as well. If others are reading along, any pointers are appreciated :-)

@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/3152ba1194fa817/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/3152ba1194fa817/output.txt

Total script time: 2.36 mins

Published

@dmitryskey
Copy link
Contributor Author

dmitryskey commented Jan 24, 2018

Hi!

Is there any plans to review and potentially merge these changes? They're used in the site https://www.smartformsondemand.org/ and it seems working fine in production.

@timvandermeij
Copy link
Contributor

timvandermeij commented Jan 24, 2018

I think there is some interesting code in this pull request, but at the moment I'm lacking the time to properly go over the changes. Just keep the pull request open so it remains in our review queue and we can look at it. If possible, can we split this functionality into a commit per feature? It looks like a lot of functionality is in here (obtaining the AcroForm dictionary, parsing it, styling the various components) and it's much easier to review and test if there is one commit per functional change. Usually this is done by splitting it up into smaller pull requests, for example a first one that only contains the functionality of obtaining the AcroForm dictionary. After that, a pull request for parsing it can be made, et cetera.

@dmitryskey
Copy link
Contributor Author

dmitryskey commented Jan 24, 2018

Ok, sounds like a plan. It'll take some time to slice it and test every sub-commit, but it seems reasonable.

So let me confirm. I'll start submitting these changes in smaller pull requests and you review and merge the incoming request. Then next part can be created unless we get all functionality added.

@timvandermeij
Copy link
Contributor

timvandermeij commented Jan 25, 2018

Yes, let's do it one pull request at a time. Make every pull request contain one piece of complete functionality including tests (unit and/or reference tests) and keep it as simple as possible. This will make it much easier to verify and review and easier to get the functionality merged. Thank you!

@escapewindow
Copy link
Contributor

Hi @dmitryskey,

As far as I can tell, 1) we split out initial logic and landed in #9822 , but 2) we haven't ported the rest, and 3) we still want these changes. Is that accurate?

Do you mind if I try to rescue the rest of this PR? I started unbitrotting, but thought I should check before continuing further.

@dmitryskey
Copy link
Contributor Author

dmitryskey commented Jul 18, 2020

Hi @escapewindow,

Yes - we split this PR into two parts - make annotation layer parsing async (which was landed as #9822) and further parsing of the annotations in order to extract font info etc.

The main problem with the rest of the code was that I wasn't able to use the same worker and had to create a new one for the additional parsing. It works perfectly fine in my own branch which I use for the very specific PDF "US I9", but in general we have to eliminate this part because of performance concerns.

As far as I remember there was another PR where author parsed annotation info with regex, but for the arbitrary PDF it might have even bigger performance issues.

Feel free to refactor this part, the most critical part s getting rid of a separate AnnotationWorkerTask. I'd rather start with converting this PR to ES6 first like it was done in other requests, and then try to improve the approach in general.

Feel free to ask any questions and thank you for taking care about this PR, once merged it'll make my branch for I-9 a way easier to maintain. Also I'd rather recommend to use the master branch from https://github.com/dmitryskey/pdf.js, this PR is outdated and I'd rather to create a new one from this repository

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants