-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[filtering] Cache filtering by fonts #73
Conversation
py_pdf_parser/filtering.py
Outdated
non_cached_fonts = fonts - self.document._elements_indexes_by_font.keys() | ||
if non_cached_fonts: | ||
# If we don't have cached elements for any of the required fonts, build | ||
# the cache for the non cached fonts. | ||
for element in self: | ||
if element.font not in non_cached_fonts: | ||
continue | ||
|
||
self.document._elements_indexes_by_font[element.font].add( | ||
element._index | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I don't think this works.
Say I have element A on page 1 and element B on page 2, both with the same font.
If I do document.get_page(1).elements.filter_by_font(font)
, then element A gets added to the cache.
Then when I do document.get_page(2).elements.filter_by_font(font)
, non_cached_fonts
will be empty, and you code will return an empty element list since B isn't in the cache?
I think you want something like:
non_cached_fonts = fonts - self.document._elements_indexes_by_font.keys()
if non_cached_fonts:
# If we don't have cached elements for any of the required fonts, build
# the cache
self.document._elements_indexes_by_font[font] = set(element._index for element in document.elements if element.font == font)
However, I also think this might be cleaner as a private method on the document? So e.g. document._elements_with_font(font)
, which uses the __element_indexes_by_font
variable? It seems weird to me that a function external to the PDFDocument
is handling a cache on the PDFDocument
, if that makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well spotted, I forgot that I was actually looping the elements list and not the document itself!
@@ -157,13 +157,17 @@ def test_filter_by_font(self): | |||
self.assertEqual(len(doc.elements.filter_by_font("hello,1")), 0) | |||
|
|||
self.assertEqual(len(doc.elements.filter_by_font("foo,2")), 1) | |||
# Check if "foo,2" has been added to cache | |||
self.assertEqual(doc._elements_indexes_by_font, {"foo,2": {0}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate there's no styleguide in this project for this, but we don't use the curly bracket notation for sets anywhere in this project so perhaps this should be set([0])
? (and elsewhere in this file)
86ed0e0
to
a3932e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small things, otherwise LGTM thanks!
|
||
def _element_indexes_with_fonts(self, *fonts) -> Set[int]: | ||
""" | ||
Returns all the indexes of elements with given fonts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just say "For internal use only, used to cache fonts. If you want to filter by fonts you should use elements.filter_by_fonts
instead." or something similar?
Sorry @pauloAmaral can you also edit the CHANGELOG please? There will be a checklist when you open the PR once #70 is merged. |
Co-authored-by: Jake Stockwin <[email protected]>
Co-authored-by: Jake Stockwin <[email protected]>
bf0f1e2
to
0ff19e8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thing...
Co-authored-by: Jake Stockwin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Closes #64