Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Git: find a "better" way to handle tokens than git credential store #1051

Closed
Wauplin opened this issue Sep 14, 2022 · 22 comments · Fixed by #1138
Closed

Git: find a "better" way to handle tokens than git credential store #1051

Wauplin opened this issue Sep 14, 2022 · 22 comments · Fixed by #1138
Assignees
Milestone

Comments

@Wauplin
Copy link
Contributor

Wauplin commented Sep 14, 2022

Mentioned in #1043 (comment).

Currently we store the user token for git commands in the git-credential-store. This is the default git storage that stores creds in plain text in a file. huggingface_hub warns the user to use it by default to avoid problems (by running git config --global credential.helper store). In a perfect world, it would be good to use the default credential helper from the user. In particular, macos users have a macosxkeychain tool by default to securely handle credentials.

Another possibility is to not store the credential in git and automatically fill the values (from python) when git requires them (in the Repository module).

Note: I am no expert on that topic so any addition is welcomed here :)

Useful links:

(Edit: also to mention that when a user do huggingface-cli login or notebook_login(), the token is also stored locally in plain text in the home directory ~/.huggingface/token to be reused in API calls. Changing this is out of topic for this issue)

@julien-c
Copy link
Member

For context, do you think Repository (which uses git under the hood) is still widely used, now that we have more powerful HTTP-based APIs to store repos w/o git? How do you anticipate its usage to evolve in the coming months (based on usage in the downstream libraries)

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

Based on @LysandreJik's comment #939 (comment), Repository is still planned to be the default way to push data on the hub while training in transformers. I am not sure we should anticipate a usage decrease soon as the http-based endpoints doesn't offer the same flexibility yet. @LysandreJik, do you mind sharing what are the benefits of the git-based approach and what is missing to the http-based approach to close the gap ?

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

kinda related, @julien-c do you know if we track Repository usage on our side ?
This is something we can do as it is using the credentials user=hf_user; password=token. Any user cloning/pushing from outside huggingface_hub lib will use their own username/password right ?

@julien-c
Copy link
Member

(lysandre will know best but i think it might be the support for asynchronous aka. non-blocking pushes i.e. decoupling committing and pushing, which is useful during training as you need to push large files and don't want that to be blocking)

(Also it's quite natural, indeed, to have a local git repo for a model you are currently training)

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

About async commit/push, I totally get it (we can give priority to #939).

About having a local git repo for a model you are training, that's something the git-based approach cannot be replaced right ? So we should not expect Repository to be abandoned (and therefore keep improving what we can).

@julien-c
Copy link
Member

That's an open question. I think it would be interesting to poll the community on this (in particular maintainers of downstream libraries, both HF and third party ones)

@LysandreJik
Copy link
Member

LysandreJik commented Sep 15, 2022

About the issue in general:

An important aspect that we would want to keep were we to move away from using git-credential store is for huggingface-cli login to still have side-effects on non-python-runtime tasks.

The token stored in ~/.huggingface/token could be used by all of our Python code, including Repository without much issue. However, I would expect the following workflow to work as well (and so do users, as we have had this issue in the past):

huggingface-cli login
git clone https://huggingface.co/<PRIVATE_REPO>

This is only possible if git has access to the credential helper.


About the Repository question specifically: I don't think it's going to leave anytime soon, as it's still the favored method in some scenarios. Most importantly, it remains the favored method as soon as we would like to use the Hub as a true git platform.

This can be the case when:

  • Collaborating on a project
  • Using revisions, such as tags
  • Doing complex workflows which you'd rather manage locally before pushing a single update

and additionally, but this could arguably be added to the HTTP methods as well if we decide it's worth it, but the fact that it can cleanly handle async stuff by using git/git-lfs under the hood makes it reliable.

I would also recommend reading #321 where considering non-git approaches wasn't 100% accepted.

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

huggingface-cli login to still have side-effects on non-python-runtime tasks.
(...)
However, I would expect the following workflow to work as well (and so do users, as we have had this issue in the past):

huggingface-cli login
git clone https://huggingface.co/<PRIVATE_REPO>

Thanks for reminding this here @LysandreJik ! TBH I completely forgot about this aspect. Dunno yet how to tackle the issue but good to know that the solution has to be through git credential.

@julien-c
Copy link
Member

julien-c commented Sep 15, 2022

I would challenge the fact that

huggingface-cli login
git clone https://huggingface.co/<PRIVATE_REPO>

should work.

(As a matter of fact, it doesn't work currently (i.e. with latest release, before #1053), in both Colab and anywhere a user doesn't have helper=store)

If no-one has complained about it not working, I think we should just drop it – but still support Repository, which works via a different mechanism (= overriding of the local remote-url actually, local credential helper)

If we insist on supporting the first use case, I would only do on Colab like was done in #1053

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

I just made some tests with the abstract level that git uses (git credential [fill|approve|reject])

If I run:

git credential approve
url=https://huggingface.co
protocol=https
host=huggingface.co
username=hf_user
password=hf_****

It will store the credentials in the default helper defined on the user's machine.

So what we could do is to simply change the current command (from write_to_credential_store)

git credential-store store
url=https://huggingface.co
username=hf_user
password=hf_****

with its abstract version. Same for deleting the token (currently using git credential-store erase).


Using the default git helper on the machine instead of store will be cleaner:

  • for macos users using xkeychain by default
  • for any users using a secure helper of their choice
  • for any users using store (doesn't change for them compared to the current situation)

And for users without any git helper configured (e.g. me until 1 week ago 👋), we either show a warning or help them configure a git helper though a CLI. This is already the case anyway.


EDIT: implemented workflow slightly differs from the one below (see description in #1138 (comment))

Side note from @julien-c: we need to think about the case where a token is already stored in the user helper (as we do not want to overwrite an existing value).

To summarize, here is a workflow I see:

  1. If huggingface-cli login:
    1. if git helper configured
      1. if a "huggingface.co" value is already stored: print a warning "A token is already stored in your git credential helper. If you want to update it, please delete it manually first." (alternatively, prompt "Do you want to erase current stored value ? [y/n]")
      2. if no existing value: add the entry using git credential approve
    2. if git helper is not configured
      1. warning message "hey you should use a git credential helper if you want to be fully integrated"
      2. alternatively: prompt "Set git-credential-store as default ? [y/n]" (or maybe propose a list: git-credential-cache, xkeychain, git credential manager,...)
      3. alternatively: make git-credential-store default without asking and store the value there
  2. If notebook_login() in a google colab
    1. Use git-credential-store as default. No warning message. It is usually run on a temporary machine anyway. Same as current main branch (since this recent PR Globally set git credential.helper to store in google colab #1053)
  3. If notebook_login() not in a colab: we assume this is a machine owned by the user so same as huggingface-cli login
    1. if git helper configured
      1. if a "huggingface.co" value is already stored: print a warning
      2. if no existing value: add the entry using git credential approve
    2. if git helper is not configured
      1. same as with huggingface-cli but if it starts to be complicated (menu/prompt something), just put a warning "hey, please use the CLI"

Does that cover every possible use case?

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

If no-one has complained about it not working, I think we should just drop it

I would be in favor of making huggingface-cli login / git clone ... work if we can (+help the user configure an helper).

For being a user that almost never used the git+https pattern (always used git+ssh), I found it annoying when I started using the Hub as I typed my username/password quite a lot. Having a huggingface-cli that handles as much as possible the git credentials or help the user to do so is still useful to lower the entry barrier IMO

(edit: my point is that not having complains doesn't mean some users doesn't struggle with it)

@julien-c
Copy link
Member

Your proposed workflow looks generally good to me @Wauplin, I would tweak it at the margins with:

  • 1.i.a. if a "huggingface.co" value is already stored: just do nothing
  • 1.ii. : i would do a. and/or b.
  • 3.i.a.: no need to print a warning

WDYT?

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

Ok in general. For 1.ii., let's do a.

I just wonder if a user already have an helper with a value and does huggingface-cli login to update a revoked token, he/she might expect the git credential to be updated as well (case 1.i.a and 3.i.a). So warning/prompt "erase? [y/N]" doesn't seem too absurd here in my opinion.

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 15, 2022

But I'm not too opinionated on those 2 corner cases if you think no warning is fine as well.

@julien-c
Copy link
Member

Yes, up to you. A prompt might be ok!

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 27, 2022

I started to work on this issue and it was nearly good until I found a quite annoying issue. The plan now is to use git credential approve and git credential fill to store/retrieve git credentials while using the helper defined by the user. It works quite well (straightforward to use).

My problem is when I try to read the credential when they are not set. What I'd like is to return None which would mean "user is not logged in". However, git credential fill guarantees to return a username/password pair no matter what. To do that, it asks all helpers and in the end, prompts the user to enter username/password. => annoying part is that this prompt is not caught by python subprocess.

Here is a standalone script to reproduce the problem. Any idea for a workaround for this ?

import subprocess


ENDPOINT = "https://random_endpoint.hf"

# Store credential for "toto"
with subprocess.Popen(
    ["git", "credential", "approve"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    encoding="utf-8",
) as process:
    process.stdin.write(f"url={ENDPOINT}\nusername=toto\npassword=secret\n\n")
    process.stdin.flush()

# Retrieve credentials for "toto" -> works
print("Get credentials when stored:")
with subprocess.Popen(
    ["git", "credential", "fill"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    encoding="utf-8",
) as process:
    process.stdin.write(f"url={ENDPOINT}\n\n")
    process.stdin.flush()
    print(process.stdout.read())

# Delete credentials for "toto"
with subprocess.Popen(
    ["git", "credential", "reject"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    encoding="utf-8",
) as process:
    process.stdin.write(f"url={ENDPOINT}\nusername=toto\n\n")
    process.stdin.flush()

# Retrieve credentials for "toto" -> prompt user !
print("Get credentials when not stored:")
with subprocess.Popen(
    ["git", "credential", "fill"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    encoding="utf-8",
) as process:
    process.stdin.write(f"url={ENDPOINT}\n\n")
    process.stdin.flush()
    print(process.stdout.read())

FYI, git_prompt is implemented here but it's written in C and didn't help me to find a solution. I tried to mock sys.stdin and sys.stdout from python as well but no success.

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 27, 2022

Context: I try to adapt the existing read_from_credential_store method that is currently public even though it is used nowhere. A quick solution is to only implement approve (store) and reject (erase) as those are the interesting ones. However I would not be able to implement the complete workflow we talked about (e.g. first check that the credential is not set, then prompt user before overwritting, ...)

@julien-c @LysandreJik any idea ?

@julien-c
Copy link
Member

julien-c commented Oct 27, 2022

disclaimer: 😈

Should we just remove this feature and see what happens?

I suspect not a long of stuff would break, and stuff that breaks we can override the remote to include token, or something

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 28, 2022

So we are back to the question "do we want to support huggingface-cli login ... git clone https://huggingface.co/<PRIVATE_REPO> use case ?"

Because otherwise we can also have a checkbox/prompt "Save token as git credential ?" and let the user decide. No need to know if we are overwriting something or not, it is up to the user. I feel that this checkbox is mainly designed for your use case @julien-c (:roll_eyes:) but at least we would still support git credential and in a better way than what's done with the current git credential store.

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 28, 2022

I don't like the idea of removing the feature completely with @LysandreJik specifically advocating on the fact that it caused issues by the past (#1051 (comment)).

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 28, 2022

I just created a draft PR and proposed an adapted workflow in #1138 (comment).

@julien-c
Copy link
Member

Haven't read the other PR yet, but yeah asking Save token as git credential ? sounds good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants