Cache filtering by font #64

Closed
jstockwin opened this issue Apr 16, 2020 · 3 comments · Fixed by #73

Comments

@jstockwin
Owner

Filtering by font occurs frequently, and involves checking the font of each element.

It could be worth caching the filtered elements for each font somewhere on the document. This shouldn't be done at load, only when the function for each font is called.
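A minimal sketch of what "cache on first call, not at load" could look like. The `Element` and `Document` classes, the `_font_cache` attribute, and the `indexes_for_font` method are hypothetical names for illustration, not the library's actual API:

```python
from collections import namedtuple
from typing import Dict, FrozenSet, List

# Hypothetical stand-in for a parsed element: its position and its font.
Element = namedtuple("Element", ["index", "font"])


class Document:
    def __init__(self, elements: List[Element]) -> None:
        self.elements = elements
        # Starts empty: nothing is computed when the document is loaded.
        self._font_cache: Dict[str, FrozenSet[int]] = {}
        self.scans = 0  # counts full scans, only to make the caching visible

    def indexes_for_font(self, font: str) -> FrozenSet[int]:
        # Scan every element only the first time a given font is requested.
        if font not in self._font_cache:
            self.scans += 1
            self._font_cache[font] = frozenset(
                e.index for e in self.elements if e.font == font
            )
        return self._font_cache[font]


doc = Document(
    [Element(0, "Arial,12"), Element(1, "Times,10"), Element(2, "Arial,12")]
)
first = doc.indexes_for_font("Arial,12")   # triggers the scan
second = doc.indexes_for_font("Arial,12")  # served from the cache
```

Fonts that are never asked for cost nothing, and each requested font is scanned at most once per document.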

@pauloAmaral

This should probably be done in all of the filtering functions that operate on immutable properties (i.e., fonts, text, pages).

Also, maybe using https://docs.python.org/3/library/functools.html#functools.lru_cache would be a good idea.

@jstockwin
Owner Author

@pauloAmaral Hmm, the problem with e.g. caching by text is that there are so many possible inputs. The reason I was thinking about fonts was that there is a very limited set of fonts for each document.

I don't think it'd be as simple as using lru_cache because you can filter_by_font on any element list, and depending on the elements within the list it will return something different. You'd end up with a different cache for each element list, which feels a bit pointless?
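To illustrate the concern, here is a rough sketch with a hypothetical, heavily simplified `ElementList` (not the real class). `functools.lru_cache` keys on all arguments, including `self`, so the same font looked up on two different lists produces two separate cache entries:

```python
from functools import lru_cache


class ElementList:
    # Hypothetical simplification: maps element index -> font name.
    def __init__(self, fonts_by_index):
        self.fonts_by_index = dict(fonts_by_index)

    @lru_cache(maxsize=None)
    def filter_by_font(self, font):
        # The cache key is (self, font), so results are never shared
        # between different ElementList instances.
        return frozenset(
            index for index, f in self.fonts_by_index.items() if f == font
        )


a = ElementList({0: "Arial", 1: "Times"})
b = ElementList({0: "Arial"})

a.filter_by_font("Arial")  # miss
a.filter_by_font("Arial")  # hit: same list, same font
b.filter_by_font("Arial")  # miss: different list, so a new cache entry

info = ElementList.filter_by_font.cache_info()
```

A cache on the method also holds a reference to every list it has seen, so short-lived element lists would stay alive for as long as the cache does.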

I was thinking of having an _elements_by_font: Dict[str, Set[int]] on the document (mapping each font to a set of element indexes), and then the filter_by_font on the element list would just do self.indexes & document._elements_by_font[font] to get the new indexes.

I think the set operation should be quite quick. Obviously it does mean that in some cases you'd be considering all the elements in the document even when your element list is tiny, but I do think it's probably a lot quicker than checking all the elements manually in each case.
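Roughly, the intersection could look like the sketch below. The `Document` and `ElementList` classes here are simplified, hypothetical stand-ins, and for brevity the index is built up front in the constructor rather than lazily on first use:

```python
from typing import Dict, Iterable, List, Set


class Document:
    def __init__(self, fonts: List[str]) -> None:
        # fonts[i] is the font of element i (an assumption for this sketch).
        self._elements_by_font: Dict[str, Set[int]] = {}
        for index, font in enumerate(fonts):
            self._elements_by_font.setdefault(font, set()).add(index)


class ElementList:
    def __init__(self, document: Document, indexes: Iterable[int]) -> None:
        self.document = document
        self.indexes = frozenset(indexes)

    def filter_by_font(self, font: str) -> "ElementList":
        # One set intersection instead of checking each element's font.
        matching = self.document._elements_by_font.get(font, set())
        return ElementList(self.document, self.indexes & matching)


doc = Document(["Arial", "Times", "Arial", "Times", "Arial"])
subset = ElementList(doc, {0, 1, 2})
arial = subset.filter_by_font("Arial")
missing = subset.filter_by_font("Courier")
```

Using `.get(font, set())` means an unknown font yields an empty list instead of a `KeyError`; whether that is the right behaviour is a design choice.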

What do you think? Perhaps I'm overcomplicating it somehow, but I don't see how it would really work for text etc. How do you envisage it working for e.g. text? Additionally, I don't think it makes sense for pages, since that should already be pretty efficient: we just look up the page and get the start and end index.

@pauloAmaral

I think I agree with both your points:

  1. There are potentially too many inputs for text, so caching doesn't make sense there, and looking at the code, fetching by pages is fast enough that there's no need for a cache.
  2. I admit I didn't consider that we would end up with too many caches... Having an _elements_by_font seems to be the right answer for this case.
