Deprecate get_value_by_key_path and replace with xpath_search #626

Merged: MaxDall merged 8 commits into master from deprecate-value-by-key-path on Oct 13, 2024

@MaxDall MaxDall commented Oct 3, 2024

This PR deprecates the get_value_by_key_path method of the LinkedDataMapping class and replaces all existing occurrences with xpath_search.

Regarding efficiency, here is a runtime comparison along with a script to reproduce it.
The script loads one article from The West Australian, because its linked data (LD) is by far the largest, and looks up a specific value (or key path) 100,000 times. For more details, see Script.

Script
from timeit import timeit

setup = '''
from fundus import PublisherCollection, Crawler
from lxml.etree import XPath

# Pre-compile the XPath once so that stmt3 does not pay for parsing the query string
test_xpath = XPath("NewsArticle/datePublished")

crawler = Crawler(PublisherCollection.au.WestAustralian)

# Fetch a single article and keep its linked data (LD) for the lookups
for article in crawler.crawl(max_articles=1):
    ld = article.ld
'''

stmt1 = 'ld.get_value_by_key_path(["NewsArticle", "datePublished"])'
stmt2 = 'ld.xpath_search("NewsArticle/datePublished")'
stmt3 = 'ld.xpath_search(test_xpath)'

length = 100000

time1 = timeit(setup=setup, stmt=stmt1, number=length)
time2 = timeit(setup=setup, stmt=stmt2, number=length)
time3 = timeit(setup=setup, stmt=stmt3, number=length)

print(f"Loading a value {length} times")
print(f"key_path took {time1:.2f} seconds, with an avg of {time1/length:.2e}")
print(f"xpath_search without pre-compiled XPath took {time2:.2f} seconds, with an avg. of {time2/length:.2e}")
print(f"xpath_search with pre-compiled XPath took {time3:.2f} seconds, with an avg. of {time3/length:.2e}")
print(f"using pre-compiled XPath is {time2/time3:.2f} times faster")
print(f"using key_path is {time3/time1:.2f} times faster than xpath_search")

Output

Loading a value 100000 times
key_path took 0.07 seconds, with an avg of 6.98e-07
xpath_search without pre-compiled XPath took 2.42 seconds, with an avg. of 2.42e-05
xpath_search with pre-compiled XPath took 1.51 seconds, with an avg. of 1.51e-05
using pre-compiled XPath is 1.60 times faster
using key_path is 21.65 times faster than xpath_search
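
The gap in the output above is plausible if a key-path lookup amounts to a handful of dictionary accesses, while an XPath query has to be parsed (unless pre-compiled) and then evaluated. As a rough illustration, here is a minimal sketch of a key-path walk over nested dicts. The helper name `get_by_key_path`, its `default` parameter, and the sample data are all hypothetical; this is not the fundus implementation.

```python
from typing import Any, Dict, List, Optional


def get_by_key_path(data: Dict[str, Any], key_path: List[str], default: Optional[Any] = None) -> Any:
    """Walk nested dicts along key_path; return default as soon as a key is missing."""
    current: Any = data
    for key in key_path:
        # Stop early if we hit a non-dict value or an unknown key
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical sample data, shaped like a small piece of linked data
sample_ld = {"NewsArticle": {"datePublished": "2024-01-01T00:00:00"}}

print(get_by_key_path(sample_ld, ["NewsArticle", "datePublished"]))
print(get_by_key_path(sample_ld, ["NewsArticle", "missing"]))
```

Each lookup costs one dictionary access per path element, which is consistent with the sub-microsecond average reported for key_path.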

For better typing, this PR adds a scalar parameter to xpath_search; passing scalar=True indicates that the caller expects an optional scalar value in return.

@MaxDall MaxDall requested review from addie9800 and dobbersc October 3, 2024 16:25

@addie9800 addie9800 left a comment

Looks like a good change. Thanks 👍

def xpath_search(self, query: Union[XPath, str], scalar: Literal[True] = True) -> Optional[Any]:
...

def xpath_search(self, query: Union[XPath, str], scalar: bool = False) -> Union[Any, List[Any]]:

addie9800 (Collaborator) commented:

I'm not that familiar with typing: is this third variation necessary for the case where a variable is passed to scalar, so that it's undetermined at type-checking time whether it will be True or False?

Also, if this is the case, shouldn't the return type also be Union[Optional[Any], List[Any]], to be consistent with the case of Literal[True]?

MaxDall (Collaborator, Author) replied:

No, you don't have to type hint the original function; it's just for IDEs like PyCharm.

You're right, I forgot the Optional there.
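
For readers less familiar with the pattern discussed here, the following is a minimal sketch of how `typing.overload` with `Literal` can tie the return type to a boolean flag. The class `LinkedDataSketch`, its `search` method, and the dict-based lookup are hypothetical stand-ins chosen for illustration; the behaviour for `scalar=True` (first match or `None`) is an assumption, not the fundus implementation.

```python
from typing import Any, Dict, List, Literal, Optional, Union, overload


class LinkedDataSketch:
    """Hypothetical stand-in used only to illustrate the overloads; not the fundus class."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    # scalar omitted or literally False: the checker sees a list
    @overload
    def search(self, query: str, scalar: Literal[False] = False) -> List[Any]: ...

    # scalar literally True: the checker sees an optional single value
    @overload
    def search(self, query: str, scalar: Literal[True]) -> Optional[Any]: ...

    # Fallback for a plain bool variable whose value is unknown to the checker
    @overload
    def search(self, query: str, scalar: bool) -> Union[Optional[Any], List[Any]]: ...

    def search(self, query: str, scalar: bool = False) -> Union[Optional[Any], List[Any]]:
        matches = list(self._data.get(query, []))
        if not scalar:
            return matches
        # Assumption for this sketch: return the first match, or None if there is none
        return matches[0] if matches else None


ld = LinkedDataSketch({"headline": ["A"], "author": ["x", "y"]})
print(ld.search("headline", scalar=True))
print(ld.search("author"))
```

The overload stubs exist only for type checkers and IDEs; at runtime only the final definition is called. The fallback overload matters to a checker such as mypy when the argument is a non-literal bool, since such a value matches neither `Literal` variant.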

@MaxDall MaxDall requested a review from addie9800 October 8, 2024 18:52

@addie9800 addie9800 left a comment

Looks good 👍

@MaxDall MaxDall merged commit a14fbb3 into master Oct 13, 2024
5 checks passed
@MaxDall MaxDall deleted the deprecate-value-by-key-path branch October 13, 2024 15:28

2 participants