Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix highlight match indexes when text normalization changes length #9448

Closed
wants to merge 1 commit into from

Conversation

rossj
Copy link
Contributor

@rossj rossj commented Feb 6, 2018

One source of search highlighting problems is due to the fact that character normalization / replacement takes place on the content string and the query string prior to searching, and this normalization does not necessarily preserve the strings length / match indexes. For example, the single character Unicode fraction ½ is normalized to the 3-character string "1/2". This normalization causes the match index of any text after such a normalized character to be off by 2, resulting in improper highlighting when applied to the original content string.

For example, here are the results for searching for "fraction" on the master branch:
image
The first highlight is correct, the 2nd one is off by 2 characters due to the single preceding fraction, and the 3rd is off by 8 characters due to the preceding 4 fractions

Furthermore, currently searching for "1/2" will match the ½ character due to normalization, but the highlight will be 3 characters long instead of 1, so will also highlight the subsequent 2 characters.

This PR fixes these issues by tracking any length changes / offsets during text normalization, and then uses these offsets to transform a match index in the normalized string to a match index in the original string.

This PR does the following:

  • Fixes search highlight positioning after any number of normalized characters
  • Properly handles highlight match length when match contains one of these special normalized characters, e.g. a match for "1/2" against ½ will have a proper length of 1.
  • Allows proper matching & highlighting of a partial-replacement match, e.g. searching for all of "1/2", "1/", "/2", "1", "2" will all match the character ½.
  • Opens the door for further normalization / replacements of any length, e.g. more fractions or ☺ => "smiley"
  • Works from search bar and the multi-word hash / querystring search feature

With these changes, here is the above search for "fraction" again:
image

And here is the result of searching for "/2":
image

Attached is the PDF used for these screenshots:
fraction-highlight.pdf

@rossj rossj force-pushed the highlight-idx-fix branch from 317a5eb to 1b0e63c Compare February 18, 2018 18:04
@timvandermeij
Copy link
Contributor

Thank you for working on this! Normalization is nothing more than a mapping, so wouldn't it be possible (and simpler) to just denormalize, i.e., reverse the mapping, when needed? Even though this solution works, I find it a bit hard to follow in the code, primarily how _fixMatchIdx works exactly. At the very least this would need a better name, which also applies to the variables such as text2 and matchEnd2.

@rossj
Copy link
Contributor Author

rossj commented Feb 19, 2018

Hi Tim, thanks for taking a look.

In general, I think that transforming a normalized string back to its original is not possible (purely based on the normalization mapping). For example, both strings "1/2" and "½" would normalize to "1/2", but it both cases denormalizing "1/2" would result in "½", which is not the correct original text for the 1st input string.

It would be possible to leave the body text as-is, and then search in it for a normalized and denormalized version of the search query, e.g. leave the body text as "½" and transform a query of "1/2" into essentially "1/2" OR "½". However, this approach becomes more complicated if the search query contains multiple normalized characters, as all permutations would technically need to be included as ORd search phases. For example, a query of "1/2 1/2" would need to be run including the 4 phrases "1/2 1/2" OR "½ ½" OR "1/2 ½" OR "½ 1/2".

Given this permutation complication with only normalizing one of the strings, I believe that normalizing both content and query strings is still the best solution.

I'm sorry that my implementation of _fixMatchIdx is a bit convoluted. I can try to refactor it to be more clear. Here is an attempt to explain what is happening in more detail with an example:

First, during normalization, any length changes that occur on the content string are tracked in an array.
As an example, here is the result of normalizing a content string that has a length change:

Original:   "aaa ½ bbb"
Normalized: "aaa 1/2 bbb"
Diff array: [[4, 2]]

The diff array here contains a single entry, indicating that the normalized string becomes 2 characters longer at index 4 of the original string, because the single character "½" was replaced with 3 ("1/2").

Then, after finding a match index on the normalized string, the function _fixMatchIdx uses this diff array to adjust the match index as necessary, so that it is correct given the original string. For example, a search for "aaa" would have a match index of 0, which is correct in both strings, however a search for "bbb" would have a match index of 8 in the normalized string, but it should be 6 based on the original string.

_fixMatchIdx does this normalization by iterating through the diff array, accumulating a sum of the length changes between the two strings, until it's iterated through all of the diffs, or its gotten to a diff which occurs after the normalized match idx, and is therefore not relevant. The accumulated difference is then subtracted from the normalized match index to give the original match index.

@timvandermeij
Copy link
Contributor

Thank you for explaining this in more detail since it does clarify the implementation choices quite well! Perhaps we can put the small example of normalization and denormalization in the comment for _fixMatchIdx? Would _denormalizeMatches perhaps be a better name for the method? Let's first update the method/variable names/comments as indicated to be a bit more descriptive and see if you can simplify the code a bit because I think that would answer most of the questions.


// Prepare arrays for storing the matches.
if (!this.pageMatchesLength) {
this.pageMatchesLength = [];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pageMatches is initialized in the reset method (refer to https://github.com/rossj/pdf.js/blob/1b0e63c19e0c496c51cadb4aaa2d6e5cf0c7d8d6/web/pdf_find_controller.js#L67), so let's do that for this one too to avoid this check here and simplify this a bit. Note though that there already appears to be a member variable named this.pageMatchesLength, so I think you need to choose a different name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that this.pageMatchesLength is already in use for the word (non-phrase) search, and also for the actual highlighting in text_layer_builder.js here. Previously, phrase search did not require tracking separate match lengths, but it does with this update, and by re-using the same member variable in the updated phrase-search function, it prevents us from having to further update text_layer_builder.js. Additionally, the current searching method is either phrase matching, or word matching, but not both, so I don't think there is any conflict with the use of this.pageMatchesLength.

I agree that the initialization could be handled better. I just copied the initialization from _calculateWordMatch, but I think it would be better to update both methods and initialize it as you suggest

@jmealo
Copy link

jmealo commented Mar 9, 2018

@rossj Kudos on the PR. I'm using pdf.js to search math curriculum/worksheets. Eagerly awaiting this to be merged :) 👍

@timvandermeij
Copy link
Contributor

Closing since this is replaced by #12855. Thanks.

@timvandermeij
Copy link
Contributor

That PR is now merged. Thanks again @rossj for providing the initial version here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants