Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[filtering] Cache filtering by fonts #73

Merged
merged 6 commits into from
May 6, 2020
Merged

Conversation

pauloAmaral
Copy link

Closes #64

py_pdf_parser/components.py Outdated Show resolved Hide resolved
py_pdf_parser/components.py Outdated Show resolved Hide resolved
Comment on lines 216 to 226
non_cached_fonts = fonts - self.document._elements_indexes_by_font.keys()
if non_cached_fonts:
# If we don't have cached elements for any of the required fonts, build
# the cache for the non cached fonts.
for element in self:
if element.font not in non_cached_fonts:
continue

self.document._elements_indexes_by_font[element.font].add(
element._index
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't think this works.

Say I have element A on page 1 and element B on page 2, both with the same font.

If I do document.get_page(1).elements.filter_by_font(font), then element A gets added to the cache.
Then when I do document.get_page(2).elements.filter_by_font(font), non_cached_fonts will be empty, and you code will return an empty element list since B isn't in the cache?

I think you want something like:

        non_cached_fonts = fonts - self.document._elements_indexes_by_font.keys()
        if non_cached_fonts:
            # If we don't have cached elements for any of the required fonts, build
            # the cache
            self.document._elements_indexes_by_font[font] = set(element._index for element in document.elements if element.font == font)

However, I also think this might be cleaner as a private method on the document? So e.g. document._elements_with_font(font), which uses the __element_indexes_by_font variable? It seems weird to me that a function external to the PDFDocument is handling a cache on the PDFDocument, if that makes sense?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted, I forgot that I was actually looping the elements list and not the document itself!

@@ -157,13 +157,17 @@ def test_filter_by_font(self):
self.assertEqual(len(doc.elements.filter_by_font("hello,1")), 0)

self.assertEqual(len(doc.elements.filter_by_font("foo,2")), 1)
# Check if "foo,2" has been added to cache
self.assertEqual(doc._elements_indexes_by_font, {"foo,2": {0}})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate there's no styleguide in this project for this, but we don't use the curly bracket notation for sets anywhere in this project so perhaps this should be set([0])? (and elsewhere in this file)

@pauloAmaral pauloAmaral force-pushed the cache-filtering-by-font branch 4 times, most recently from 86ed0e0 to a3932e1 Compare May 4, 2020 13:41
@pauloAmaral pauloAmaral requested a review from jstockwin May 4, 2020 13:42
Copy link
Owner

@jstockwin jstockwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small things, otherwise LGTM thanks!

py_pdf_parser/components.py Outdated Show resolved Hide resolved

def _element_indexes_with_fonts(self, *fonts) -> Set[int]:
"""
Returns all the indexes of elements with given fonts.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just say "For internal use only, used to cache fonts. If you want to filter by fonts you should use elements.filter_by_fonts instead." or something similar?

@jstockwin
Copy link
Owner

Sorry @pauloAmaral can you also edit the CHANGELOG please?

There will be a checklist when you open the PR once #70 is merged.

@pauloAmaral pauloAmaral force-pushed the cache-filtering-by-font branch from bf0f1e2 to 0ff19e8 Compare May 4, 2020 15:56
@pauloAmaral pauloAmaral requested a review from jstockwin May 4, 2020 16:02
Copy link
Owner

@jstockwin jstockwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing...

CHANGELOG.md Outdated Show resolved Hide resolved
Co-authored-by: Jake Stockwin <[email protected]>
@pauloAmaral pauloAmaral requested a review from jstockwin May 4, 2020 16:09
Copy link
Owner

@jstockwin jstockwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@pauloAmaral pauloAmaral merged commit 797f8b6 into master May 6, 2020
@pauloAmaral pauloAmaral deleted the cache-filtering-by-font branch May 6, 2020 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache filtering by font
3 participants