Cache filtering by font #64

jstockwin · 2020-04-16T14:27:06Z

Filtering by font occurs frequently, and involves checking the font of each element.

It could be worth caching the filtered elements for each font somewhere on the document. This shouldn't be done at load, only when the function for each font is called.

pauloAmaral · 2020-05-04T10:25:42Z

This should probably be done in all of the filtering functions that operate on immutable properties (i.e., fonts, text, pages).

Also, maybe using https://docs.python.org/3/library/functools.html#functools.lru_cache would be a good idea.

jstockwin · 2020-05-04T10:52:47Z

@pauloAmaral Hmm, the problem with e.g. caching by text is that there are so many possible inputs. The reason I was thinking about fonts was that there is a very limited set of fonts for each document.

I don't think it'd be as simple as using lru_cache because you can filter_by_font on any element list, and depending on the elements within the list it will return something different. You'd end up with a different cache for each element list, which feels a bit pointless?
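As a rough illustration of that concern, here is a minimal sketch (not the library's code) of an lru_cache keyed on the element list's indexes plus the font. FONTS and the standalone filter_by_font function are hypothetical stand-ins for the real document and element list.

```python
from functools import lru_cache
from typing import FrozenSet, Tuple

# Hypothetical data: the font of each element, by element index.
FONTS: Tuple[str, ...] = ("A", "B", "A", "C")


@lru_cache(maxsize=None)
def filter_by_font(indexes: FrozenSet[int], font: str) -> FrozenSet[int]:
    # The cache key is (indexes, font), so every distinct element list
    # gets its own cache entry even when the font is the same.
    return frozenset(i for i in indexes if FONTS[i] == font)


whole = frozenset(range(4))
subset = frozenset({2, 3})

filter_by_font(whole, "A")   # miss: scans 4 elements
filter_by_font(subset, "A")  # miss: different list, so a separate entry
filter_by_font(whole, "A")   # hit: only an identical list reuses the cache

info = filter_by_font.cache_info()
```

The cache only helps when exactly the same list is filtered by the same font again, which is why a per-document index looks more attractive.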

I was thinking of having an _elements_by_font: Dict[str, Set[int]] on the document (mapping each font to the set of element indexes that use it), and then filter_by_font on the element list would just do self.indexes & document._elements_by_font[font] to get the new indexes.

I think the set operation should be quite quick. Obviously it does mean that in some cases you'd be considering all the elements in the document even when your element list is tiny, but I do think it's probably a lot quicker than checking all the elements manually in each case.
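A minimal sketch of that idea, under some assumptions: the Element, Document and ElementList classes below are simplified stand-ins rather than the library's real ones, and the _indexes_for_font helper and scans counter are hypothetical names added for illustration. Each font's index set is built on the first lookup for that font, not at load.

```python
from typing import Dict, Iterable, List, Set


class Element:
    def __init__(self, font: str, text: str) -> None:
        self.font = font
        self.text = text


class Document:
    def __init__(self, elements: List[Element]) -> None:
        self.elements = elements
        # Filled lazily, one font at a time, never at load.
        self._elements_by_font: Dict[str, Set[int]] = {}
        self.scans = 0  # hypothetical counter, only to show cache reuse

    def _indexes_for_font(self, font: str) -> Set[int]:
        if font not in self._elements_by_font:
            # First request for this font: scan the whole document once.
            self.scans += 1
            self._elements_by_font[font] = {
                i for i, element in enumerate(self.elements)
                if element.font == font
            }
        return self._elements_by_font[font]


class ElementList:
    def __init__(self, document: Document, indexes: Iterable[int]) -> None:
        self.document = document
        self.indexes = frozenset(indexes)

    def filter_by_font(self, font: str) -> "ElementList":
        # Intersect this list's indexes with the document-wide set.
        cached = self.document._indexes_for_font(font)
        return ElementList(self.document, self.indexes & cached)


doc = Document([Element(f, "x") for f in ("A", "B", "A", "C")])
everything = ElementList(doc, range(4))
tail = ElementList(doc, [2, 3])
```

Here everything.filter_by_font("A") and tail.filter_by_font("A") share a single document scan; the second call is just a set intersection.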

What do you think? Perhaps I'm overcomplicating it somehow, but I don't see how it would really work for text etc. How do you envisage it working for e.g. text? Additionally, I don't think it makes sense for pages, since that should already be pretty efficient: we just look up the page and get its start and end index.
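For comparison, a small sketch of why the page case is already cheap, assuming each page records the first and last element index it contains. PAGE_BOUNDS and both functions are hypothetical and only illustrate the lookup described above.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical: page number -> (start index, end index), both inclusive.
PAGE_BOUNDS: Dict[int, Tuple[int, int]] = {1: (0, 4), 2: (5, 7)}


def indexes_on_page(page_number: int) -> FrozenSet[int]:
    # No element is inspected: the bounds alone give the index range.
    start, end = PAGE_BOUNDS[page_number]
    return frozenset(range(start, end + 1))


def filter_by_page(indexes: FrozenSet[int], page_number: int) -> FrozenSet[int]:
    return indexes & indexes_on_page(page_number)


on_page_two = filter_by_page(frozenset({3, 4, 5, 6}), 2)
```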

pauloAmaral · 2020-05-04T11:05:08Z

I think I agree with both your points:

There are potentially too many inputs for text, so caching doesn't make sense there, and looking at the code, fetching by pages is fast enough that no cache is needed.
I admit I didn't consider that we would end up with too many caches... Having an _elements_by_font seems to be the right answer for this case.

jstockwin added the priority: low, difficulty: medium, component: components, and enhancement labels Apr 16, 2020

jstockwin self-assigned this Apr 16, 2020

pauloAmaral mentioned this issue May 4, 2020

[filtering] Cache filtering by fonts #73

Merged

jstockwin assigned pauloAmaral and unassigned jstockwin May 4, 2020

pauloAmaral closed this as completed in #73 May 6, 2020
