Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache parse_links() by --find-links html page content #5

Merged
merged 9 commits into from
Feb 14, 2020

Conversation

cosmicexplorer
Copy link

@cosmicexplorer cosmicexplorer commented Feb 13, 2020

Cherry-pick of pypa#7729! This change is intended to fix pex-tool/pex#887.

Copy link

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so you're not trying to cache between runs, just within a single run.

Would there be even more benefit in the future to caching between runs?

@cosmicexplorer
Copy link
Author

cosmicexplorer commented Feb 13, 2020

Would there be even more benefit in the future to caching between runs?

EDITED THIS COMMENT A LOT!!

See maybe-related convo over at the pip PR in pypa#7729 (comment).

pypa#5670 is working against us a little here, because we don't want to have caching that avoids reading in updates from publishing. I'm actually not sure if I can see an obvious form of caching between runs that improves on this PR without causing issues like pypa#5670.

@cosmicexplorer
Copy link
Author

Hm, I think the above is correct, but also think a really relevant optimization would be to cache the html page fetching as well for --find-links repos (currently the page is still re-fetched on each transitive requirement -- this caching just avoids re-parsing the page constantly). That would probably take the shape of another dictionary somewhere nearby to avoid fetching an HTMLPage more than once, keyed by url. I'll use the same benchmark script as in pypa#7729 to see whether that change affects performance too.

@cosmicexplorer
Copy link
Author

...even locally (without going across the network to retrieve the --find-links page), it appears that caching the html page itself by url has an impact on performance. Via the benchmark script at pypa#7729, I'm seeing pip complete at pretty reliably 11.8 seconds with the changes in 1d8f381, and 11.9-12.+ seconds with just the parse_links() caching. I'm going to propose adding this additional level of caching to the upstream pip PR as well.

@cosmicexplorer
Copy link
Author

Proposed the change in the upstream pip PR: pypa#7729 (comment).

@cosmicexplorer cosmicexplorer force-pushed the pants-fork-cache-parse-links branch from 1d8f381 to d0127a6 Compare February 14, 2020 19:31
@benjyw benjyw merged commit 5eb9470 into pex-tool:master Feb 14, 2020
@cosmicexplorer
Copy link
Author

(we will be cherry-picking the resulting changes from pypa#7729, if they differ from this, after it is approved and merged!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the pip-powered resolve in pex 2 will re-tokenize --find-links pages on each transitive requirement
2 participants