[Python 3] Basic auth relying on utf-8 encoding by default #4564
Comments
We could extend https://github.com/requests/requests/blob/b66908e7b647689793e299edc111bf9910e93ad3/requests/auth.py#L79 to accept an encoding and that could pass it through to https://github.com/requests/requests/blob/b66908e7b647689793e299edc111bf9910e93ad3/requests/auth.py#L28 but the extra parameter there would need to default to |
If we added an extra param, does that give us anything new? The current solution was provided in #3662 and gives the user the option to choose by sending unambiguous bytes. It also prevents parameter creep on the many functions where we could accept an encoding. As for defaults, most of the other recent choices made around iso-8859-1 vs utf-8 suggest there isn’t a “right” choice. As @sigmavirus24 said, we need to maintain the latin-1 default for backwards compat on Requests 2.X. For Requests 3.0, if we enforce a hard fail on a non-bytes value, does that provide a better user experience? I’m not sure it does. If we change the defaults for the whole library to UTF-8, does that put us more in line with the 2018 web? Probably, but it would be nice to have some solid numbers/specifications on that.
@nateprewitt It seems to me that utf-8 is a broadly accepted default for text encoding in web content; it is also the default encoding for Python 3 strings (str.encode() defaults to utf-8). On the other hand, what was the justification for using iso-8859-1 specifically, even in earlier Requests versions? Either failing on non-bytes or silently encoding as utf-8 is, I believe, closer to what users expect and more coherent with the general environment. If the current behavior needs to be maintained, I believe a proper warning in the documentation would also help. Personally, I had to browse the source code to understand what went on under the hood.
@arthur-hav Most of the web still implements HTTP/1.1 as defined in RFC 2616 with some exceptional cases following the revisions in 7230, 7231, 7232, 7233, 7234, and 7235. The default encoding for the web is latin-1. Presuming that encoding is the safest backwards-compatible solution. We've been plenty fair in offering other ways to handle this, but it seems like you want to make a backwards incompatible break without understanding the history of the library or the specification it implements.
@sigmavirus24 As far as I understand from Wikipedia and the discussion on the web auth header at https://stackoverflow.com/questions/7242316/what-encoding-should-i-use-for-http-basic-authentication the issue was never really settled for non-ascii characters, although (a fact I was unaware of) RFC 2616 did state iso-8859-1 to be the default encoding for headers. I am not really set on "what I want" and certainly won't insist on making changes that are likely to break things. That being said, I would certainly advocate making it easier to realize what is going on when trying to authenticate with passwords containing non-ascii characters. Curl defaults to utf-8, and apparently major web browsers do as well, so it is to be expected that a fair share of servers will expect utf-8 too and disregard the 1999 RFC, making this a likely common pitfall.
Hello, quite some time has passed since this issue was discussed. @chrahunt writes there:
Also, in the meantime Python 2 has become outdated and Python 3 uses UTF8 by default, so there should be no need to enforce latin1 anymore.
# We use latin1 for backwards compatibility by default but fall back
# to utf8 when a value cannot be encoded as latin1
if isinstance(username, str):
    try:
        username = username.encode('latin1')
    except UnicodeEncodeError:
        username = username.encode('utf8')
if isinstance(password, str):
    try:
        password = password.encode('latin1')
    except UnicodeEncodeError:
        password = password.encode('utf8')
Please comment so it can be decided which way to go - pure UTF8 or dirty catching non latin1.
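For readers who want to try the "latin1 first, UTF-8 on failure" idea outside of requests, here is a minimal, self-contained sketch. The helper name `encode_credential` and its parameters are hypothetical and are not part of the requests API; it only illustrates the fallback logic proposed above.

```python
def encode_credential(value, preferred="latin1", fallback="utf-8"):
    """Return bytes for a username or password.

    Hypothetical helper, not part of requests: bytes pass through
    untouched, text is encoded with `preferred` and only falls back
    to `fallback` when `preferred` cannot represent it.
    """
    if isinstance(value, bytes):
        return value
    try:
        return value.encode(preferred)
    except UnicodeEncodeError:
        return value.encode(fallback)


# "é" exists in latin1, so it is encoded as a single latin1 byte.
print(encode_credential("é"))
# "€" does not exist in latin1, so the whole string falls back to UTF-8.
print(encode_credential("é€"))
```

One consequence worth noting: with this fallback, the encoding of a given character depends on the rest of the string, so "é" is sent as one byte on its own but as two bytes once a character outside latin1 appears next to it.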
@sigmavirus24 Can I change your mind on this topic? From a user perspective it is quite strange that auth fails with requests but works with all browsers and curl. If people need backwards compatibility, they could still pass bytes encoded as latin-1. For me these "numbers" (and the issues linked) would actually be enough to switch to UTF-8 - at least as a fallback on Unicode errors, with utf8 becoming the default in a new major release. I know requests wants to continue supporting Python2 - so maybe someone could help adapt the snippet so it works in Python2 as well (I could not check so far).
Also note: shortly after the last comment (before mine), Mozilla made the change to UTF8 (following Chrome): https://www.fxsitecompat.dev/en-CA/docs/2018/basic-auth-credentials-are-now-encoded-in-utf-8-instead-of-iso-8859-1/ So this is a two-year-old issue, major browsers and tools have used UTF8 as the default for years now, and the new standard (RFC7617) basically states that UTF8 is the only option.
@CarliJoy I'm no longer a maintainer and the maintenance team is focusing on the bare minimum work to keep Requests secure at this point.
This is all great and well until you consider that there are servers out there that people still need to interact with that haven't been updated much past the era of RFC 2616 and suddenly sending UTF8 doesn't work for them. Further no project attempting to follow SemVer can change this kind of behaviour in anything other than a major version release (e.g., requests 3.0) and that's unlikely to happen any time soon. While I'm not diametrically opposed to the behaviour, I also have no say in the matter. Please check next time before tagging someone in multiple comments to see if they're still relevant to the project.
Having wasted a good hour of an otherwise fine evening on this before finding this - kind of surprising to see a modern library these days catering to a spec dating two decades(!) back, but anyway - easily worked around by manually setting the auth header:
import base64
import json
import requests

b64bytes = base64.b64encode(f'{username}:{password}'.encode('utf-8'))
userpass_encoded = 'Basic ' + str(b64bytes, 'utf-8')
r = requests.post(url, data=json.dumps(payload), headers={'Authorization': userpass_encoded})
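The manual-header approach can be wrapped in a small function so the encoding is an explicit choice. This is only a sketch using the standard library; `basic_auth_header` is a hypothetical name, not a requests function, and the credentials are the placeholder values from the issue.

```python
import base64


def basic_auth_header(username, password, encoding="utf-8"):
    """Build the value of an HTTP Basic Authorization header.

    Hypothetical helper: joins the credentials with ":", encodes them
    with the given encoding, and base64-encodes the result.
    """
    raw = f"{username}:{password}".encode(encoding)
    return "Basic " + base64.b64encode(raw).decode("ascii")


# The resulting dict can be passed as `headers=` to an HTTP client.
headers = {"Authorization": basic_auth_header("user", "àéïòù")}
print(headers)
```

Passing `encoding="latin1"` instead would reproduce the byte sequence that requests currently sends for text credentials, which can help when comparing the two against a particular server.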
@anderseknert arthur-hav suggested an easier workaround already in the issue itself (encode the password as utf-8 bytes before passing it to auth): no need to manipulate the header itself.
Thanks @CarliJoy - seems I missed that somehow - probably since I had the workaround in place already when coming here :) That's indeed better.
As discussed upstream in psf/requests#4564 , HTTP basic auth usernames and passwords sent to requests as Python text strings are encoded as latin1. This of course makes it impossible to log in with a username or password containing characters not represented in latin1, as the reporter of mwclient#315 found out. To work around this rather old-fashioned default, let's intercept string usernames and passwords and encode them as utf-8 before sending them to requests. Anyone dealing with a really old server that can't handle utf-8, or something like that, can encode the username and password appropriately and provide them as bytestrings. Signed-off-by: Adam Williamson <[email protected]>
…e passed along Works around psf/requests#4564
Changing this requires a backwards incompatible change which IMO isn't worth the squeeze right now. Closing this unless others have stronger opinions.
I was caught out by the default behavior of requests for auth headers.
When I used a password containing some non-ascii characters, the basic auth method silently encoded it as latin1, which caused authentication to fail because the server expected it to be utf-8 encoded.
Utf-8 is the default encoding in modern terminals and for python3 strings. As a result, the requests behavior for auth differs from curl and is surprising when the header is decoded by authentication servers. I would expect requests either to fail on non-bytes input or to follow the common web practice of encoding text strings as utf-8.
As a workaround, I encoded the string as utf-8 before authenticating with requests.
Reproduction Steps
Expected Result
Basic auth header is the base64-encoded version of "user:àéïòù".encode('utf-8')
Actual Result
Basic auth header is the base64-encoded version of "user:àéïòù".encode('latin1')
Workaround
requests.get('https://example.com', auth=('user', 'àéïòù'.encode('utf-8')))
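To make the mismatch concrete, the following sketch (standard library only) encodes the credentials from this report both ways and then decodes them the way a server that assumes utf-8 might. The function `decode_as_utf8` is a hypothetical stand-in for server-side handling, not code from any real server.

```python
import base64

credentials = "user:àéïòù"

# What gets sent for text credentials today vs. what the server expected.
as_latin1 = base64.b64encode(credentials.encode("latin1"))
as_utf8 = base64.b64encode(credentials.encode("utf-8"))


def decode_as_utf8(token):
    """Decode a Basic auth token assuming UTF-8 (hypothetical server side).

    Returns None when the payload is not valid UTF-8.
    """
    try:
        return base64.b64decode(token).decode("utf-8")
    except UnicodeDecodeError:
        return None


print(as_latin1 != as_utf8)       # the two headers carry different bytes
print(decode_as_utf8(as_utf8))    # round-trips to the original credentials
print(decode_as_utf8(as_latin1))  # the latin1 payload is not valid UTF-8
```

A server that does not reject the bytes outright would still end up comparing a different byte sequence against the stored password, so authentication fails either way.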