[Python 3] Basic auth relying on utf-8 encoding by default #4564

Closed
princess-entrapta opened this issue Mar 28, 2018 · 13 comments

Comments

@princess-entrapta

princess-entrapta commented Mar 28, 2018

I was misled by the default requests behavior for auth headers.

When I used a password containing non-ASCII characters, the basic auth method silently encoded it as latin1, which caused authentication to fail because the server expected UTF-8.

UTF-8 is the default encoding in modern terminals and for Python 3 strings. As a result, requests' auth behavior is inconsistent with curl's and surprising when authentication servers decode the header. I would expect requests either to fail on non-bytes input or to follow the web convention of encoding text strings as UTF-8.

As a workaround, I encoded the string as UTF-8 before authenticating with requests.

Reproduction Steps

import requests
requests.get('https://example.com', auth=('user', 'àéïòù'))

Expected Result

The Basic auth header is the base64-encoded form of "user:àéïòù".encode('utf-8')

Actual Result

The Basic auth header is the base64-encoded form of "user:àéïòù".encode('latin1')
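
To illustrate why the two results are incompatible, here is a minimal standard-library sketch. It does not call requests; it only assumes, as reported above, that the client encodes the credentials as latin1 while a hypothetical server decodes them as UTF-8.

```python
import base64

credentials = "user:àéïòù"

# Token as described under "Actual Result" (latin1) and "Expected Result" (UTF-8).
sent = base64.b64encode(credentials.encode("latin1"))
expected = base64.b64encode(credentials.encode("utf-8"))

# Hypothetical server side: decode the token and interpret the bytes as UTF-8.
try:
    base64.b64decode(sent).decode("utf-8")
    server_accepts = True
except UnicodeDecodeError:
    # The latin1 bytes are not valid UTF-8, so the credentials cannot match.
    server_accepts = False

print(sent != expected, server_accepts)
```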

Workaround

requests.get('https://example.com', auth=('user', 'àéïòù'.encode('utf-8')))

@sigmavirus24
Contributor

We could extend https://github.com/requests/requests/blob/b66908e7b647689793e299edc111bf9910e93ad3/requests/auth.py#L79 to accept an encoding, which could be passed through to https://github.com/requests/requests/blob/b66908e7b647689793e299edc111bf9910e93ad3/requests/auth.py#L28, but the extra parameter there would need to default to latin-1, as it does now. Alternatively, this could be added to the toolbelt.
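
A rough sketch of what such an extra parameter might look like, written as a standalone function. The name `basic_auth_str` and the `encoding` parameter are hypothetical and are not the actual requests API; the latin-1 default and the pass-through of bytes follow what is described in this thread.

```python
import base64


def basic_auth_str(username, password, encoding="latin1"):
    """Hypothetical Basic auth helper with an explicit text encoding."""
    # Text is encoded with `encoding`; bytes are passed through untouched.
    if isinstance(username, str):
        username = username.encode(encoding)
    if isinstance(password, str):
        password = password.encode(encoding)
    token = base64.b64encode(b":".join((username, password)))
    return "Basic " + token.decode("ascii")


# The default keeps the latin-1 behavior; UTF-8 has to be requested explicitly.
print(basic_auth_str("user", "àéïòù"))
print(basic_auth_str("user", "àéïòù", encoding="utf-8"))
```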

@nateprewitt
Member

If we added an extra param, would that give us anything new? The current solution was provided in #3662 and lets the user choose by sending unambiguous bytes. It also prevents parameter creep across the many functions where we could accept an encoding. As for defaults, most of the other recent choices made around iso-8859-1 vs utf-8 suggest there isn’t a “right” choice.

As @sigmavirus24 said, we need to maintain the latin-1 default for backwards compat on Requests 2.X. For Requests 3.0, if we enforce a hard fail on a non-bytes value, does that provide a better user experience? I’m not sure it does. If we change the defaults for the whole library to UTF-8, does that put us more in line with the 2018 web? Probably, but it would be nice to have some solid numbers/specifications on that.

@princess-entrapta
Author

princess-entrapta commented Mar 29, 2018

@nateprewitt It seems to me that UTF-8 is a broadly accepted default for text encoding in web content, and it is also the default encoding of Python strings. On the other hand, what was the justification for using iso-8859-1 specifically, even in earlier Requests versions?

Either failing on non-bytes input or silently encoding as UTF-8 would, I believe, be more predictable from a user's perspective and more consistent with the general environment. If an arbitrary behavior needs to be maintained, I believe a clear warning in the documentation would also help. Personally, I had to read the source code to understand what was going on under the hood.

@sigmavirus24
Contributor

@arthur-hav Most of the web still implements HTTP/1.1 as defined in RFC 2616 with some exceptional cases following the revisions in 7230, 7231, 7232, 7233, 7234, and 7235. The default encoding for the web is latin-1. Presuming that encoding is the safest backwards-compatible solution. We've been plenty fair in offering other ways to handle this, but it seems like you want to make a backwards incompatible break without understanding the history of the library or the specification it implements.

@princess-entrapta
Author

princess-entrapta commented Mar 29, 2018

@sigmavirus24 As far as I understand from Wikipedia and the discussion of the web auth header at https://stackoverflow.com/questions/7242316/what-encoding-should-i-use-for-http-basic-authentication, the issue was never really settled for non-ASCII characters, although RFC 2616 did state that iso-8859-1 is the default encoding for headers, a fact I was unaware of.

I am not really set on "what I want" and certainly won't insist on making changes that are likely to break things. That being said, I would advocate making it easier to realize what is going on when trying to authenticate with passwords containing non-ASCII characters. Curl defaults to UTF-8, and apparently so do the major web browsers, so a fair share of servers can be expected to assume UTF-8 too and disregard the 1999 RFC, making this a likely common pitfall.

@CarliJoy

Hello,

Quite some time has passed since this issue was discussed.
I would also like to vote for using UTF-8 by default: in my case, I can't use my corporate proxy with pip with my actual password because of latin1.
This issue is related to pypa/pip#5801

@chrahunt writes there:

It is probably safe to assume that the encoding of credential fields should be UTF-8 as:

  1. Browsers use it (source)
  2. When servers want to request a specific charset, their only option is UTF-8. (source)
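
As a sketch of the second point: RFC 7617 lets a server add a `charset="UTF-8"` parameter to its Basic challenge. The helper below is hypothetical (requests is not claimed to do this) and uses a deliberately simplistic regex; the latin1 fallback is an assumption that mirrors the default discussed in this thread.

```python
import re


def credentials_encoding(www_authenticate, default="latin1"):
    """Pick an encoding for Basic credentials from a WWW-Authenticate value."""
    # Simplistic parsing: look for a charset parameter anywhere in the header.
    match = re.search(r'charset\s*=\s*"?([A-Za-z0-9_-]+)"?',
                      www_authenticate, re.IGNORECASE)
    # RFC 7617 only defines the value "UTF-8", matched case-insensitively.
    if match and match.group(1).lower() == "utf-8":
        return "utf-8"
    return default


print(credentials_encoding('Basic realm="foo", charset="UTF-8"'))
print(credentials_encoding('Basic realm="foo"'))
```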

Also, in the meantime Python 2 has become outdated, and Python 3 uses UTF-8 by default.

So there should be no need to enforce latin1 anymore.
If you still want to preserve backward compatibility, I would suggest this:
A (dirty) workaround: catch the UnicodeEncodeError and encode as UTF-8 in that case.
Because every string that latin1 can represent is still encoded as latin1, this is backward compatible and should work in probably 90% of cases, while still allowing pip users to use non-latin1 passwords.

    # We use latin1 for backwards compatibility by default but fall back
    # to UTF-8 when the value cannot be encoded as latin1
    if isinstance(username, str):
        try:
            username = username.encode('latin1')
        except UnicodeEncodeError:
            username = username.encode('utf8')

    if isinstance(password, str):
        try:
            password = password.encode('latin1')
        except UnicodeEncodeError:
            password = password.encode('utf8')

Please comment so it can be decided which way to go - pure UTF8 or dirty catching non latin1.

@CarliJoy

CarliJoy commented May 6, 2020

@sigmavirus24 Can I change your mind on this topic?
Because pip is built on requests, this actually breaks things for users.
Since all modern browsers now use UTF-8, it is cumbersome for users to be forced to use latin1 (there is no choice when using pip, as you can't put bytes in the config).

So from a user's perspective it is quite strange that auth fails with requests but works with all browsers and curl.

If people need backward compatibility, they could still pass bytes encoded as latin-1. For me these "numbers" (and the linked issues) would actually be enough to fall back to UTF-8 on Unicode errors at least, and to switch to a UTF-8 default in a new major release.

I know requests wants to continue supporting Python 2, so maybe someone could help adapt the snippet so that it works in Python 2 as well (I could not check so far).

@CarliJoy

CarliJoy commented May 6, 2020

Also note: shortly after the last comment before mine, Mozilla made the change to UTF-8 (following Chrome): https://www.fxsitecompat.dev/en-CA/docs/2018/basic-auth-credentials-are-now-encoded-in-utf-8-instead-of-iso-8859-1/

So this is a two-year-old issue, all major browsers and tools have used UTF-8 as the default for years now, and the new standard (RFC 7617) basically states that UTF-8 is the only option.
This might be the time to assume it is safe to switch ;-)
@sigmavirus24 @nateprewitt what do you think?

@sigmavirus24
Contributor

@CarliJoy I'm no longer a maintainer and the maintenance team is focusing on the bare minimum work to keep Requests secure at this point.

So this is a two-year-old issue, all major browsers and tools have used UTF-8 as the default for years now, and the new standard (RFC 7617) basically states that UTF-8 is the only option.

This is all great and well until you consider that there are servers out there that people still need to interact with that haven't been updated much past the era of RFC 2616 and suddenly sending UTF8 doesn't work for them.

Further no project attempting to follow SemVer can change this kind of behaviour in anything other than a major version release (e.g., requests 3.0) and that's unlikely to happen any time soon. While I'm not diametrically opposed to the behaviour, I also have no say in the matter. Please check next time before tagging someone in multiple comments to see if they're still relevant to the project.

@anderseknert

Having wasted a good hour of an otherwise fine evening on this before finding this - kind of surprising to see a modern library these days catering to a spec dating two decades(!) back, but anyway - easily worked around by manually setting the auth header:

import base64
import json
import requests

# username, password, url and payload must already be defined
b64bytes = base64.b64encode(f'{username}:{password}'.encode('utf-8'))
userpass_encoded = 'Basic ' + str(b64bytes, 'utf-8')
r = requests.post(url, data=json.dumps(payload), headers={'Authorization': userpass_encoded})

@CarliJoy

CarliJoy commented Aug 7, 2020

@anderseknert arthur-hav already suggested an easier workaround in the issue itself:
requests.get('https://example.com', auth=(username.encode('utf-8'), password.encode('utf-8')))

No need to manipulate the header itself.

@anderseknert

Thanks @CarliJoy - seems I missed that somehow - probably since I had the workaround in place already when coming here :) That's indeed better.

AdamWill added a commit to AdamWill/mwclient that referenced this issue Jan 27, 2024
As discussed upstream in
psf/requests#4564 , HTTP basic auth
usernames and passwords sent to requests as Python text strings
are encoded as latin1. This of course makes it impossible to
log in with a username or password containing characters not
represented in latin1, as the reporter of mwclient#315 found out.

To work around this rather old-fashioned default, let's intercept
string usernames and passwords and encode them as utf-8 before
sending them to requests.

Anyone dealing with a really old server that can't handle utf-8,
or something like that, can encode the username and password
appropriately and provide them as bytestrings.

Signed-off-by: Adam Williamson <[email protected]>
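
The interception described in this commit message could look roughly like the sketch below. The helper name `utf8_credentials` is hypothetical and this is not the actual mwclient code; it only encodes text as UTF-8 and leaves bytes alone, as the message describes.

```python
def utf8_credentials(username, password):
    """Return (username, password) as bytes, encoding text as UTF-8."""
    def to_bytes(value):
        # Callers who need another encoding can pass pre-encoded bytes.
        return value.encode("utf-8") if isinstance(value, str) else value

    return to_bytes(username), to_bytes(password)


auth = utf8_credentials("user", "àéïòù")
print(auth)
```

The resulting tuple could then be passed as `auth=` to requests, in the same way as the workaround in the original report.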
AdamWill added a commit to mwclient/mwclient that referenced this issue Jan 28, 2024
csm10495 added a commit to csm10495/pyhtcc that referenced this issue Apr 4, 2024
@sethmlarson
Member

Changing this requires a backwards incompatible change which IMO isn't worth the squeeze right now. Closing this unless others have stronger opinions.
