Fix highlight match indexes when text normalization changes length #9448
317a5eb to 1b0e63c
Thank you for working on this! Normalization is nothing more than a mapping, so wouldn't it be possible (and simpler) to just denormalize, i.e., reverse the mapping, when needed? Even though this solution works, I find it a bit hard to follow in the code, primarily how
Hi Tim, thanks for taking a look. In general, I think that transforming a normalized string back to its original is not possible (purely based on the normalization mapping). For example, both strings
It would be possible to leave the body text as-is, and then search in it for a normalized and denormalized version of the search query, e.g. leave the body text as
Given this permutation complication with only normalizing one of the strings, I believe that normalizing both content and query strings is still the best solution. I'm sorry that my implementation of
First, during normalization, any length changes that occur on the content string are tracked in an array.
The diff array here contains a single entry, indicating that the normalized string becomes 2 characters longer at index 4 of the original string, because the single character
Then, after finding a match index on the normalized string, the function
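The diff-tracking idea described in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the replacement table and the names `normalizeWithDiffs`, `toOriginalIndex` and `findInOriginal` are hypothetical, only single characters that expand into longer strings are handled, and the sample string "abc ½ def" is chosen so that the fraction sits at index 4.

```javascript
// Hypothetical replacement table; the real normalization table is larger.
const REPLACEMENTS = new Map([
  ["½", "1/2"],
  ["¼", "1/4"],
  ["’", "'"],
]);

// Normalize `text` and record each length change as [originalIndex, delta].
function normalizeWithDiffs(text) {
  const diffs = [];
  let normalized = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const replacement = REPLACEMENTS.has(ch) ? REPLACEMENTS.get(ch) : ch;
    if (replacement.length !== ch.length) {
      diffs.push([i, replacement.length - ch.length]);
    }
    normalized += replacement;
  }
  return { normalized, diffs };
}

// Map an index in the normalized string back to an index in the original.
// Assumes every delta is positive (replacements only make the string longer).
function toOriginalIndex(diffs, normalizedIndex) {
  let shift = 0; // total growth contributed by earlier replacements
  for (const [originalIndex, delta] of diffs) {
    const normalizedStart = originalIndex + shift;
    if (normalizedIndex <= normalizedStart) {
      break; // the index lies at or before this replacement
    }
    if (normalizedIndex <= normalizedStart + delta) {
      return originalIndex; // the index falls inside the expanded character
    }
    shift += delta;
  }
  return normalizedIndex - shift;
}

// Search on normalized strings, then report index and length in the original.
function findInOriginal(content, query) {
  const { normalized, diffs } = normalizeWithDiffs(content);
  const normalizedQuery = normalizeWithDiffs(query).normalized;
  if (normalizedQuery.length === 0) {
    return null;
  }
  const start = normalized.indexOf(normalizedQuery);
  if (start === -1) {
    return null;
  }
  const last = start + normalizedQuery.length - 1; // last matched character
  const index = toOriginalIndex(diffs, start);
  const end = toOriginalIndex(diffs, last) + 1;
  return { index, length: end - index };
}

console.log(normalizeWithDiffs("abc ½ def").diffs); // [ [ 4, 2 ] ]
console.log(findInOriginal("abc ½ def", "def")); // { index: 6, length: 3 }
console.log(findInOriginal("abc ½ def", "1/2")); // { index: 4, length: 1 }
```

Mapping the last matched character rather than the exclusive end keeps a match that ends inside an expanded character from spilling into the following text.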
Thank you for explaining this in more detail since it does clarify the implementation choices quite well! Perhaps we can put the small example of normalization and denormalization in the comment for
// Prepare arrays for storing the matches.
if (!this.pageMatchesLength) {
  this.pageMatchesLength = [];
}
pageMatches is initialized in the reset method (refer to https://github.com/rossj/pdf.js/blob/1b0e63c19e0c496c51cadb4aaa2d6e5cf0c7d8d6/web/pdf_find_controller.js#L67), so let's do that for this one too to avoid this check here and simplify this a bit. Note though that there already appears to be a member variable named this.pageMatchesLength, so I think you need to choose a different name here.
You are right that this.pageMatchesLength is already in use for the word (non-phrase) search, and also for the actual highlighting in text_layer_builder.js here. Previously, phrase search did not require tracking separate match lengths, but it does with this update, and by re-using the same member variable in the updated phrase-search function, it prevents us from having to further update text_layer_builder.js. Additionally, the current searching method is either phrase matching, or word matching, but not both, so I don't think there is any conflict with the use of this.pageMatchesLength.
I agree that the initialization could be handled better. I just copied the initialization from _calculateWordMatch, but I think it would be better to update both methods and initialize it as you suggest.
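The initialization change discussed in this thread could look roughly like the sketch below. The class `FindControllerSketch` and its `recordMatch` method are hypothetical stand-ins for illustration, not the actual pdf.js controller; only the member names pageMatches and pageMatchesLength come from the discussion.

```javascript
// Hypothetical, simplified stand-in for a find controller (not the pdf.js class).
class FindControllerSketch {
  constructor() {
    this.reset();
  }

  // Both arrays are created here, so the match-calculation methods no longer
  // need an `if (!this.pageMatchesLength)` guard before using them.
  reset() {
    this.pageMatches = [];
    this.pageMatchesLength = [];
  }

  // Store one match (start index and length) for the given page.
  recordMatch(pageIndex, matchIndex, matchLength) {
    if (!this.pageMatches[pageIndex]) {
      this.pageMatches[pageIndex] = [];
      this.pageMatchesLength[pageIndex] = [];
    }
    this.pageMatches[pageIndex].push(matchIndex);
    this.pageMatchesLength[pageIndex].push(matchLength);
  }
}

const controller = new FindControllerSketch();
controller.recordMatch(0, 4, 1);
console.log(controller.pageMatchesLength); // [ [ 1 ] ]
```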
@rossj Kudos on the PR. I'm using pdf.js to search math curriculum/worksheets. Eagerly awaiting this to be merged :) 👍
Closing since this is replaced by #12855. Thanks.
That PR is now merged. Thanks again @rossj for providing the initial version here!
One source of search highlighting problems is that character normalization / replacement takes place on the content string and the query string prior to searching, and this normalization does not necessarily preserve the string's length or match indexes. For example, the single-character Unicode fraction ½ is normalized to the 3-character string "1/2". This normalization causes the match index of any text after such a normalized character to be off by 2, resulting in improper highlighting when applied to the original content string.
For example, here are the results for searching for "fraction" on the master branch:
[image]
The first highlight is correct, the 2nd is off by 2 characters due to the single preceding fraction, and the 3rd is off by 8 characters due to the 4 preceding fractions.
Furthermore, searching for "1/2" currently matches the ½ character due to normalization, but the highlight is 3 characters long instead of 1, so it also covers the subsequent 2 characters.
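Both symptoms can be reproduced with a few lines of JavaScript. This is only a sketch: `naiveHighlight` is a hypothetical helper that normalizes with a single ½ replacement and then applies the normalized match index and length directly to the original string, which is the behavior described above, not the actual pdf.js code.

```javascript
// Hypothetical helper: search on normalized strings, then (incorrectly) use
// the resulting index and length on the original, un-normalized content.
function naiveHighlight(content, query) {
  const normalize = s => s.replace(/½/g, "1/2");
  const normalizedContent = normalize(content);
  const normalizedQuery = normalize(query);
  const index = normalizedContent.indexOf(normalizedQuery);
  if (index === -1) {
    return null;
  }
  return {
    index,
    text: content.slice(index, index + normalizedQuery.length),
  };
}

// "fraction" really starts at index 2, but the reported index is off by 2.
console.log(naiveHighlight("½ fraction", "fraction"));
// { index: 4, text: 'action' }

// The match for "1/2" is 3 characters long, so 2 extra characters are covered.
console.log(naiveHighlight("½ fraction", "1/2"));
// { index: 0, text: '½ f' }
```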
This PR fixes these issues by tracking any length changes / offsets during text normalization, and then using these offsets to transform a match index in the normalized string into a match index in the original string.
This PR does the following:
With these changes, here is the above search for "fraction" again:
[image]
And here is the result of searching for "/2":
[image]
Attached is the PDF used for these screenshots:
fraction-highlight.pdf