Unexpected urllib parse result #98022

Closed
planetA opened this issue Oct 7, 2022 · 9 comments
Labels
pending (The issue will be closed if no feedback is provided), type-bug (An unexpected behavior, bug, or error)

Comments

@planetA

planetA commented Oct 7, 2022

Subject

I parse an IP address with a port as a URL, but the resulting ParseResult does not look right.

Environment

Describe your environment.
At least, paste here the output of:

OS Linux-5.10.0-17-amd64-x86_64-with-glibc2.31
>>> print("Python", platform.python_version())
Python 3.9.2

Steps to Reproduce

from urllib.parse import urlparse
urlparse('192.168.1.132:16992')

Expected Behavior

ParseResult(scheme='', netloc='192.168.1.132:16992', path='', params='', query='', fragment='')

Actual Behavior

ParseResult(scheme='192.168.1.132', netloc='', path='16992', params='', query='', fragment='')

I guess the scheme should not be an IP address.

This issue seems to be related: #38644

planetA added the type-bug label Oct 7, 2022
@planetA
Author

planetA commented Oct 7, 2022

Adding scheme does not seem to help:

urlparse('1.2.3.4:80', 'http')
ParseResult(scheme='1.2.3.4', netloc='', path='80', params='', query='', fragment='')

@Jason-Y-Z
Contributor

I might be horribly wrong, but from the example given -

>>> from urllib.parse import urlparse
>>> urlparse("scheme://netloc/path;parameters?query#fragment")
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
            query='query', fragment='fragment')

I think it's treating the IP address of your string as the scheme because it's the first part of the URL string (before the colon).

@planetA
Author

planetA commented Oct 7, 2022

According to RFC3986 scheme should start with an alpha character.
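
For reference, RFC 3986 (section 3.1) defines the grammar as `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`. Below is a minimal sketch of that check, assuming a regular expression is an acceptable way to express it; the name `is_valid_scheme` is hypothetical and is not part of `urllib.parse`.

```python
import re

# Grammar from RFC 3986, section 3.1:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(candidate: str) -> bool:
    """Hypothetical helper: True if `candidate` matches the RFC 3986 scheme grammar."""
    return _SCHEME_RE.fullmatch(candidate) is not None


print(is_valid_scheme("http"))           # True
print(is_valid_scheme("192.168.1.132"))  # False: starts with a digit
print(is_valid_scheme("google.com"))     # True: syntactically a valid scheme
```

Note that this rule only rejects candidates such as IP addresses that start with a digit. A hostname like `google.com` still satisfies the scheme grammar, so the check alone cannot tell a bare `host:port` apart from `scheme:path`.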

@kwsp
Contributor

kwsp commented Oct 8, 2022

Quoting from the documentation of urlparse:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

To get your expected result, you need to prefix with // for urlparse to recognize the netloc:

>>> urlparse('//192.168.1.132:16992')
ParseResult(scheme='', netloc='192.168.1.132:16992', path='', params='', query='', fragment='')

Since your input doesn't start with a scheme or '//', urlparse should treat your input as a "relative URL". However, your input also contains :, which doesn't make sense in a relative path, so I guess the result you see is actually undefined behavior?
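
The `//` prefix described above could be wrapped in a small helper. This is only a sketch under stated assumptions: the input is assumed to be a bare `host` or `host:port` string with no scheme, and `split_host_port` and `default_port` are hypothetical names, not part of `urllib.parse`.

```python
from urllib.parse import urlsplit


def split_host_port(address, default_port=None):
    """Hypothetical helper: split a bare 'host[:port]' string.

    Assumes `address` carries no scheme and no leading '//'.
    """
    # Prefixing '//' makes urlsplit treat the text as a netloc.
    parts = urlsplit("//" + address)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port


print(split_host_port("192.168.1.132:16992"))  # ('192.168.1.132', 16992)
print(split_host_port("example.org", 80))      # ('example.org', 80)
```

The `hostname` and `port` attributes do the splitting of the netloc, so the helper does not need to search for the colon itself.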

hauntsaninja added the pending label Oct 9, 2022
@Kcchouette

Kcchouette commented Oct 23, 2022

Hello
I have the same problem in a recent version of Python:

>>> urlparse('google.com:443')
ParseResult(scheme='google.com', netloc='', path='443', params='', query='', fragment='')

Same if I specify the scheme in the function:

>>> urlparse('google.com:443', scheme="https")
ParseResult(scheme='google.com', netloc='', path='443', params='', query='', fragment='')

@kwsp
Contributor

kwsp commented Oct 23, 2022

I think the issue boils down to this:

urllib.parse.urlparse follows RFC1808 (Relative URL) religiously. However, RFC1808 is a companion to RFC1738 (URL), so it skips the description of net_loc entirely (Section 2.1 of RFC1808), since that is described in Section 3.1 of RFC1738. As a result, RFC1808 never describes the port in the net_loc, and it doesn't consider the case where the net_loc is presented by itself, without the // prefix. That case is undefined behaviour, as neither RFC1808 nor RFC1738 defines it.

I get the impression that nowadays most people do not know that, according to RFC1738, a net_loc must be prefixed by //. Using a netloc by itself (like the 192.168.1.132:16992 in the original issue) is very commonplace, so we should just support it in urlparse.

I will work on a patch for this.

@kwsp
Contributor

kwsp commented Oct 24, 2022

On second thought, and after reading the discussion on #38644, this is actually documented behaviour; see the second paragraph of https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse.

The only bug here then is as @planetA mentioned:

According to RFC3986 scheme should start with an alpha character.

I can look into a fix for this - maybe raise an error if the parsed scheme doesn't start with an alpha character?

Adding scheme does not seem to help:

urlparse('1.2.3.4:80', 'http')
ParseResult(scheme='1.2.3.4', netloc='', path='80', params='', query='', fragment='')

This is technically documented behaviour too:

The scheme argument gives the default addressing scheme, to be used only if the URL does not specify one. It should be the same type (text or bytes) as urlstring, except that the default value '' is always allowed, and is automatically converted to b'' if appropriate.

Since urlparse technically discovered a scheme in the input, it ignores the scheme argument.
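
To illustrate the "default only" role of the scheme argument quoted above, here is a short sketch. It uses the `//` prefix so that the input has a recognizable netloc but no scheme of its own.

```python
from urllib.parse import urlparse

# No scheme in the input: the default from the `scheme` argument is used.
with_default = urlparse("//1.2.3.4:80", scheme="http")
print(with_default.scheme, with_default.netloc)  # http 1.2.3.4:80

# The input already has a scheme: the `scheme` argument is ignored.
explicit = urlparse("https://1.2.3.4:80", scheme="http")
print(explicit.scheme)  # https
```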

@arhadthedev
Member

I can look into a fix for this

ping

@kwsp
Contributor

kwsp commented Mar 9, 2023

@arhadthedev Sorry I forgot about this.

Taking a look again, the behaviour I described above has already been fixed in issue #99418 and PR #99421.

We can close this issue now.
