Fix highlight match indexes when text normalization changes length #9448

rossj · 2018-02-06T21:38:30Z

One source of search highlighting problems is due to the fact that character normalization / replacement takes place on the content string and the query string prior to searching, and this normalization does not necessarily preserve the strings length / match indexes. For example, the single character Unicode fraction ½ is normalized to the 3-character string "1/2". This normalization causes the match index of any text after such a normalized character to be off by 2, resulting in improper highlighting when applied to the original content string.

For example, here are the results for searching for "fraction" on the master branch:

The first highlight is correct, the 2nd one is off by 2 characters due to the single preceding fraction, and the 3rd is off by 8 characters due to the preceding 4 fractions

Furthermore, currently searching for "1/2" will match the ½ character due to normalization, but the highlight will be 3 characters long instead of 1, so will also highlight the subsequent 2 characters.

This PR fixes these issues by tracking any length changes / offsets during text normalization, and then uses these offsets to transform a match index in the normalized string to a match index in the original string.

This PR does the following:

Fixes search highlight positioning after any number of normalized characters
Properly handles highlight match length when match contains one of these special normalized characters, e.g. a match for "1/2" against ½ will have a proper length of 1.
Allows proper matching & highlighting of a partial-replacement match, e.g. searching for all of "1/2", "1/", "/2", "1", "2" will all match the character ½.
Opens the door for further normalization / replacements of any length, e.g. more fractions or ☺ => "smiley"
Works from search bar and the multi-word hash / querystring search feature

With these changes, here is the above search for "fraction" again:

And here is the result of searching for "/2":

Attached is the PDF used for these screenshots:
fraction-highlight.pdf

timvandermeij · 2018-02-19T21:36:35Z

Thank you for working on this! Normalization is nothing more than a mapping, so wouldn't it be possible (and simpler) to just denormalize, i.e., reverse the mapping, when needed? Even though this solution works, I find it a bit hard to follow in the code, primarily how _fixMatchIdx works exactly. At the very least this would need a better name, which also applies to the variables such as text2 and matchEnd2.

rossj · 2018-02-19T22:57:10Z

Hi Tim, thanks for taking a look.

In general, I think that transforming a normalized string back to its original is not possible (purely based on the normalization mapping). For example, both strings "1/2" and "½" would normalize to "1/2", but it both cases denormalizing "1/2" would result in "½", which is not the correct original text for the 1st input string.

It would be possible to leave the body text as-is, and then search in it for a normalized and denormalized version of the search query, e.g. leave the body text as "½" and transform a query of "1/2" into essentially "1/2" OR "½". However, this approach becomes more complicated if the search query contains multiple normalized characters, as all permutations would technically need to be included as ORd search phases. For example, a query of "1/2 1/2" would need to be run including the 4 phrases "1/2 1/2" OR "½ ½" OR "1/2 ½" OR "½ 1/2".

Given this permutation complication with only normalizing one of the strings, I believe that normalizing both content and query strings is still the best solution.

I'm sorry that my implementation of _fixMatchIdx is a bit convoluted. I can try to refactor it to be more clear. Here is an attempt to explain what is happening in more detail with an example:

First, during normalization, any length changes that occur on the content string are tracked in an array.
As an example, here is the result of normalizing a content string that has a length change:

Original:   "aaa ½ bbb"
Normalized: "aaa 1/2 bbb"
Diff array: [[4, 2]]

The diff array here contains a single entry, indicating that the normalized string becomes 2 characters longer at index 4 of the original string, because the single character "½" was replaced with 3 ("1/2").

Then, after finding a match index on the normalized string, the function _fixMatchIdx uses this diff array to adjust the match index as necessary, so that it is correct given the original string. For example, a search for "aaa" would have a match index of 0, which is correct in both strings, however a search for "bbb" would have a match index of 8 in the normalized string, but it should be 6 based on the original string.

_fixMatchIdx does this normalization by iterating through the diff array, accumulating a sum of the length changes between the two strings, until it's iterated through all of the diffs, or its gotten to a diff which occurs after the normalized match idx, and is therefore not relevant. The accumulated difference is then subtracted from the normalized match index to give the original match index.

timvandermeij · 2018-02-20T22:51:08Z

Thank you for explaining this in more detail since it does clarify the implementation choices quite well! Perhaps we can put the small example of normalization and denormalization in the comment for _fixMatchIdx? Would _denormalizeMatches perhaps be a better name for the method? Let's first update the method/variable names/comments as indicated to be a bit more descriptive and see if you can simplify the code a bit because I think that would answer most of the questions.

timvandermeij · 2018-02-20T22:53:25Z

web/pdf_find_controller.js

+
+    // Prepare arrays for storing the matches.
+    if (!this.pageMatchesLength) {
+      this.pageMatchesLength = [];


pageMatches is initialized in the reset method (refer to https://github.com/rossj/pdf.js/blob/1b0e63c19e0c496c51cadb4aaa2d6e5cf0c7d8d6/web/pdf_find_controller.js#L67), so let's do that for this one too to avoid this check here and simplify this a bit. Note though that there already appears to be a member variable named this.pageMatchesLength, so I think you need to choose a different name here.

You are right that this.pageMatchesLength is already in use for the word (non-phrase) search, and also for the actual highlighting in text_layer_builder.js here. Previously, phrase search did not require tracking separate match lengths, but it does with this update, and by re-using the same member variable in the updated phrase-search function, it prevents us from having to further update text_layer_builder.js. Additionally, the current searching method is either phrase matching, or word matching, but not both, so I don't think there is any conflict with the use of this.pageMatchesLength.

I agree that the initialization could be handled better. I just copied the initialization from _calculateWordMatch, but I think it would be better to update both methods and initialize it as you suggest

jmealo · 2018-03-09T00:04:41Z

@rossj Kudos on the PR. I'm using pdf.js to search math curriculum/worksheets. Eagerly awaiting this to be merged :) 👍

timvandermeij · 2021-01-12T19:59:47Z

Closing since this is replaced by #12855. Thanks.

timvandermeij · 2021-01-12T21:06:16Z

That PR is now merged. Thanks again @rossj for providing the initial version here!

timvandermeij added the text-selection label Feb 6, 2018

Fix highlight match indexes when text normalization changes length

1b0e63c

rossj force-pushed the highlight-idx-fix branch from 317a5eb to 1b0e63c Compare February 18, 2018 18:04

timvandermeij reviewed Feb 20, 2018

View reviewed changes

timvandermeij mentioned this pull request Jul 4, 2018

PDFJS highlight is not correct. #9306

Open

PeterW570 mentioned this pull request Jul 23, 2020

textlayer misalignment on Chinese document #12110

Closed

Snuffleupagus mentioned this pull request Jan 12, 2021

[api-minor] Highlight search results correctly for normalized text (PR 9448) #12855

Merged

timvandermeij closed this Jan 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix highlight match indexes when text normalization changes length #9448

Fix highlight match indexes when text normalization changes length #9448

rossj commented Feb 6, 2018

timvandermeij commented Feb 19, 2018

rossj commented Feb 19, 2018

timvandermeij commented Feb 20, 2018

timvandermeij Feb 20, 2018

rossj Feb 20, 2018

jmealo commented Mar 9, 2018

timvandermeij commented Jan 12, 2021

timvandermeij commented Jan 12, 2021

Fix highlight match indexes when text normalization changes length #9448

Fix highlight match indexes when text normalization changes length #9448

Conversation

rossj commented Feb 6, 2018

timvandermeij commented Feb 19, 2018

rossj commented Feb 19, 2018

timvandermeij commented Feb 20, 2018

timvandermeij Feb 20, 2018

Choose a reason for hiding this comment

rossj Feb 20, 2018

Choose a reason for hiding this comment

jmealo commented Mar 9, 2018

timvandermeij commented Jan 12, 2021

timvandermeij commented Jan 12, 2021