Use git's partial clone feature to speed up pip #9086

Merged
merged 1 commit into from
Aug 15, 2021


nipunn1313
Contributor

Clone with --filter=blob:none, which fetches all commit and tree
metadata up front but fetches blobs only on demand, as
checkout needs them. Since pip typically needs the blobs for
only a single revision, this can be a big improvement, especially
when fetching from repositories with a lot of history and
particularly on slower network connections.

Added unit test for the rev-less path. Confirmed that both
of the if/else paths are tested by the unit tests.
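
A minimal sketch of the idea described above, not pip's actual implementation. The helper name `build_clone_args` and its parameters are hypothetical; only the `--filter=blob:none` flag comes from this PR.

```python
from typing import List


def build_clone_args(url: str, dest: str, partial: bool = True) -> List[str]:
    """Build the argv for a git clone, optionally as a blobless partial clone.

    Hypothetical helper for illustration only.
    """
    args = ["git", "clone"]
    if partial:
        # blob:none downloads commits and trees up front; file contents
        # (blobs) are fetched later, only for what checkout actually needs.
        args.append("--filter=blob:none")
    args += [url, dest]
    return args


if __name__ == "__main__":
    print(" ".join(build_clone_args("https://github.com/pypa/pip", "pip")))
```

With `partial=False` the same helper produces an ordinary full clone, which is the shape a fallback path would take.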

@nipunn1313 nipunn1313 force-pushed the partial_clone branch 4 times, most recently from 4f9f4d3 to f7ac0d5 Compare November 2, 2020 02:38
@nipunn1313
Contributor Author

Unfortunately, I found that this only works if the server has a couple of config options set.

I've set them in the unit tests here, and I know that github supports those config options, but without a way to detect if the server supports filtering, I'm not sure if it's possible to proceed here

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Apr 3, 2021
@uranusjr
Member

uranusjr commented Apr 3, 2021

What would happen when the server is not configured that way? Since the vast majority of our users likely use services that do support this, it may be reasonable to implement a fallback mechanism if possible.

@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Aug 6, 2021
@nipunn1313
Contributor Author

Sorry for the delay. Fell off my radar, but back to clean it up.

I updated the test to confirm that the git client actually behaves totally fine when the server lacks support for filtering (it emits a warning upon git pull, but proceeds correctly without filtering). As you mentioned, the majority of users at this point are using services that will support this feature.

@nipunn1313
Contributor Author

nipunn1313 commented Aug 7, 2021

Ended up running black on test_vcs_git.py to fix some line length issues the linter brought up. I split up the commit which runs black and the commit which makes the feature change (you can see it in the commits tab on github)

@uranusjr
Member

uranusjr commented Aug 7, 2021

Looks good to me now, @pypa/pip-committers do we want a --use-feature or --use-deprecated flag for this so the user has a way out if this somehow does not work for them?

@sbidoul sbidoul added the C: vcs pip's interaction with version control systems like git, svn and bzr label Aug 7, 2021
@sbidoul
Member

sbidoul commented Aug 7, 2021

Cross-referencing #9603 and #9607

@sbidoul
Member

sbidoul commented Aug 7, 2021

do we want a --use-feature or --use-deprecated flag for this so the user has a way out if this somehow does not work for them?

To answer that it would help knowing if this feature has been supported by git (client) since long enough.
If yes, then I think a feature flag is not necessary as compatibility does not seem to be an issue, and we can think of adding it if and when we discover problematic situations.

@uranusjr uranusjr added this to the 21.3 milestone Aug 7, 2021
@nipunn1313
Contributor Author

It's been around since around 2018 https://git-scm.com/docs/partial-clone since git 2.19 (though support increased over the versions)

If using an older version of git which doesn't understand the "--filter" flag, it may break pip. In that case, we could offer something like "--use-deprecated", or even fall back silently to cloning without the --filter flag.

I tested manually by building old versions of git from source. v2.16.0 fails because it does not recognize the --filter flag

➜  git git:(2512f15446) ✗ git describe
v2.16.0
➜  git git:(2512f15446) ✗ bin-wrappers/git clone --filter=blob:none https://github.com/pypa/pip
error: unknown option `filter=blob:none'

and v2.17.0 seems to work fine

➜  git git:(468165c1d8) ✗ git describe
v2.17.0
➜  git git:(468165c1d8) ✗ bin-wrappers/git clone --filter=blob:none https://github.com/pypa/pip
Cloning into 'pip'...
remote: Enumerating objects: 54569, done.

According to https://en.wikipedia.org/wiki/Git#Releases 2.17 was released in 2018

Depending on our willingness to break people using >3 year old versions of git, we could opt to include a --use-deprecated or a silent fallback w/o the --filter flag

@nipunn1313
Contributor Author

Doing a case study with ubuntu:

ubuntu 18.04 LTS ships with 2.17 https://packages.ubuntu.com/bionic/git

However, if you go back further to ubuntu 16.04 LTS, it ships with 2.7.4 of git https://launchpad.net/ubuntu/xenial/+source/git

Ubuntu 16.04 LTS reached the end of standard support recently, in 2021. There is still some kind of "extended security maintenance" until 2024, but I suspect we can leave those folks on their own, installing older versions of pip.
https://ubuntu.com/about/release-cycle

@uranusjr
Member

uranusjr commented Aug 8, 2021

I happened to have the opportunity to deal with this “foo was released 5 years ago so we could probably drop it” stuff, and it turns out it’s not at all uncommon for pip users to run things on pretty old setups. If the support was first released in 2018, we probably couldn’t safely assume its existence until somewhere around 2028 (and people will still loudly complain; we’ll just have better grounds for refuting them by then)

@sbidoul
Member

sbidoul commented Aug 8, 2021

2018 is not that old. So I'd suggest the silent fallback on older git version. We already have get_git_version() to help with that.
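
A rough sketch of what such a silent fallback could look like, under stated assumptions: the helper names `parse_git_version` and `clone_flags` are hypothetical (this is not pip's code), and the 2.17 threshold is taken from the manual tests reported earlier in this thread, where v2.16.0 rejected `--filter` and v2.17.0 accepted it.

```python
import re
from typing import List, Tuple

# Assumption from the manual tests in this thread: 2.17 is the first
# client version that accepts `git clone --filter=...`.
MIN_PARTIAL_CLONE_VERSION = (2, 17)


def parse_git_version(output: str) -> Tuple[int, ...]:
    """Extract (major, minor) from `git --version` output.

    Returns an empty tuple when the output is not recognized, which
    compares lower than any real version and so disables the filter.
    """
    match = re.search(r"git version (\d+)\.(\d+)", output)
    if match is None:
        return ()
    return (int(match.group(1)), int(match.group(2)))


def clone_flags(version_output: str) -> List[str]:
    """Return extra clone flags, silently omitting --filter on old clients."""
    if parse_git_version(version_output) >= MIN_PARTIAL_CLONE_VERSION:
        return ["--filter=blob:none"]
    return []


if __name__ == "__main__":
    for sample in ("git version 2.17.0", "git version 2.7.4"):
        print(sample, "->", clone_flags(sample))
```

Comparing tuples rather than strings matters here: as strings, "2.7" would sort after "2.17".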

@uranusjr
Member

uranusjr commented Aug 8, 2021

We need to detect the Git server's version as well, so get_git_version isn't enough (and I don't think there's a reliable way to detect that).

@nipunn1313
Contributor Author

nipunn1313 commented Aug 8, 2021 via email

@nipunn1313
Contributor Author

I confirmed that since git client version 2.17, cloning with --filter is fine regardless of whether the server supports it

➜  clonetestdir ~/src/git/bin-wrappers/git --version
git version 2.17.0
➜  clonetestdir ~/src/git/bin-wrappers/git clone --filter=blob:none file:///`pwd`/clonetest clonetest2
Cloning into 'clonetest2'...
warning: filtering not recognized by server, ignoring
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (3/3), done.

Agreed that 2018 is recent and that it'll probably be a long time before we stop seeing older versions. At my last company, we were still using ubuntu 14.04 (trying hard to get off it, but it wasn't so easy). I'll update the diff to leverage get_git_version

@nipunn1313 nipunn1313 force-pushed the partial_clone branch 2 times, most recently from 745ca16 to bf90abf Compare August 9, 2021 22:07
@uranusjr uranusjr requested a review from sbidoul August 11, 2021 05:24
Member

@sbidoul sbidoul left a comment

Great contrib thanks !

Perhaps, in a followup, it would be interesting to lru_cache get_git_version().
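
For illustration, caching a version lookup with `functools.lru_cache` could look like the sketch below. Both function names are hypothetical, and `run_git_version` is a stand-in that returns a fixed string instead of spawning `git --version`, so that the example runs without git installed.

```python
import functools
from typing import List

# Records every simulated subprocess call so the caching effect is visible.
calls: List[str] = []


def run_git_version() -> str:
    """Stand-in for running `git --version` (an assumption for this sketch)."""
    calls.append("git --version")
    return "git version 2.32.0"


@functools.lru_cache(maxsize=None)
def cached_git_version() -> str:
    """Look up the git version once per process and reuse the result."""
    return run_git_version()


if __name__ == "__main__":
    cached_git_version()
    cached_git_version()
    print(f"version lookups performed: {len(calls)}")
```

The trade-off is that a cached value would not notice git being upgraded while the process is running, which seems acceptable for a short-lived command.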

@sbidoul
Member

sbidoul commented Aug 11, 2021

One last question, have you done some feature and/or performance comparisons between --filter=blob:none and --filter=tree:0 ? Any particular reason you chose one over the other ?

@nipunn1313
Contributor Author

One last question, have you done some feature and/or performance comparisons between --filter=blob:none and --filter=tree:0 ? Any particular reason you chose one over the other ?

No I haven't!

Docs are here https://git-scm.com/docs/git-rev-list#Documentation/git-rev-list.txt---filterltfilter-specgt

From reading the docs, --filter=blob:none is sort of the 0IQ - don't get anything option. tree:0 seems similar to blob:none, but tree:1 could be interesting. I imagine git clone is implicitly doing a checkout of HEAD after cloning so it's somewhat equivalent to tree:1.

I have done some perf comparisons between --filter=blob:none and a full clone, and it's a pretty major difference on large repositories with lots of history. I did the comparisons at my last company on the monorepo, and it was something like a 40m clone -> an 8m clone (very slow, I know).

@nipunn1313
Contributor Author

(rebased to handle conflicts). Should be better now.

@sbidoul
Member

sbidoul commented Aug 15, 2021

Ok, I did some more testing and all looks good. I also re-read this GitHub article which seems to indicate that blobless clone is the good compromise for pip.

Thanks again!

@sbidoul sbidoul merged commit 01308b8 into pypa:main Aug 15, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 25, 2021
@ichard26 ichard26 added the type: performance Commands take too long to run label Dec 26, 2024
Labels
bot:chronographer:provided C: vcs pip's interaction with version control systems like git, svn and bzr type: performance Commands take too long to run