Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleCookie() fails for json-like values with embedded double-quotes #92936

Open
pbuckner opened this issue May 18, 2022 · 12 comments
Open

SimpleCookie() fails for json-like values with embedded double-quotes #92936

pbuckner opened this issue May 18, 2022 · 12 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pbuckner
Copy link

pbuckner commented May 18, 2022

Bug report
http.cookies.SimpleCookie() takes a string and should return a dict-like parse of the result. On some parse errors, it returns an empty dict, or one sparsely populated with values, for example on success on a cookie with two name-value pairs:

>>> import http.cookies.SimpleCookie('a1=b;a2="c d";')
<SimpleCookie: a1='b' a2='c d'>

Cookies consist of name-value pairs, both of which have legal character subsets as defined in RFC2068 and RFC2109. Actual browser / server implementations are more lenient and cookies.py source includes such acknowledgement.

# Pattern for finding cookie
#
# This used to be strict parsing based on the RFC2109 and RFC2068
# specifications.  I have since discovered that MSIE 3.0x doesn't
# follow the character rules outlined in those specs.  As a
# result, the parsing rules here are less strict.

Cool. Modern times bring more exceptions. Specifically, Google's OAUTH implementation now includes a cookie (g_state) whose value appears to be JSON, and embedded double quotes cause SimpleCookie() to fail (or actually "succeed" in a useless way):

>>> import http.cookies.SimpleCookie('a1={"b":c};')
<SimpleCookie: >

Bug Fix/Modest Proposal
Rather than trying to get Google to change their cookie format (which is happily supported by common browsers), or requiring users of http.cookie module write their own parsers, I suggest simple augmentation of the regular expression used to "find cookies".

The change would simply allow any number of embedded double quotes. The following snippet adds two lines to the existing _CookiePattern as found in cpython/Lib/http/cookies.py:

_CookiePattern = re.compile(r"""
    \s*                            # Optional whitespace at start of cookie
    (?P<key>                       # Start of group 'key'
    [""" + _LegalKeyChars + r"""]+?   # Any word of at least one letter
    )                              # End of group 'key'
    (                              # Optional group: there may not be a value.
    \s*=\s*                          # Equal Sign
    (?P<val>                         # Start of group 'val'
    "(?:[^\\"]|\\.)*"                  # Any doublequoted string
    |                                  # or
    \w{3},\s[\w\d\s-]{9,11}\s[\d:]{8}\sGMT  # Special case for "expires" attr
    |                                  # or
    [""" + _LegalValueChars + r"""]*      # Any word or empty string
# additional clause vvvvv
    |                                                                                                                                               
    [""" + _LegalValueChars + r"""]+[""" + _LegalValueChars + r'"]*[' + _LegalValueChars + r"""]+      # Any word with internal quotes              
# end ^^^^
    )                                # End of group 'val'
    )?                             # End of optional value group
    \s*                            # Any number of spaces.
    (\s+|;|$)                      # Ending either at space, semicolon, or EOS.
    """, re.ASCII | re.VERBOSE)    # re.ASCII may be removed if safe.

The two added lines merely permit cookie values to contain any number of double quotes, as long as the first and last character of the value is not a double quote. No further interpretation of cookie value (such as json validation) is attempted or warranted.

Your environment
Observed initially in 3.8.10 (linux), confirmed 3.10.1 (Mac) and observed the code in cpython/Lib/http/cookies.py:437.

Linked PRs

@pbuckner pbuckner added the type-bug An unexpected behavior, bug, or error label May 18, 2022
@NewUserHa
Copy link
Contributor

how about

|
`\{(?:"(?:[^\\"]|\\.)*":\s?"(?:[^\\"]|\\.)*",?\s?)+?\}` # any key:value paried doublequoted string in { and }

@pbuckner
Copy link
Author

I think you're asking for trouble -- the idea is we cannot / should not attempt to interpret the 'value' part -- it's technically just a string. If the cookie value wants to be ("a","b","c") why shouldn't we return from python the string '("a","b","c")'.

The "rule" for value is:
token or quoted-string

quoted-string is double-quotes surrounding qdtext -- which is any TEXT except double quote (unless it is escaped)
So value="{"a":0}" is invalid, but value="{\"a\":0}" is valid as quoted string, (and is supported by SimpleCookie, if I recall correctly.)

Here, we're more concerned with the 'token' definition. It is supposed to be any TEXT except most punctuation included braces, parens, angle brackets etc. However SimpleCookie already makes exceptions for many of these -- I'm just suggesting we add in the double quote as being "legal" character unless it begin or ends the value (... in which case the double-quote behaves as part of quoted-string definition and does not contribute to the actual value of the cookie.

@NewUserHa
Copy link
Contributor

you mean treat value=123{{"a":b}} as a valid cookie?

@pbuckner
Copy link
Author

Yes, I would -- in python. If a browser doesn't like it, let the browser say so, not our problem.

Let the receiving application figure out what to do with it. We don't attempt to interpret numbers, value=3.14159 ==> str("3.14159"), so let's not try and interpret other values.

It's parsable -- we need to be more careful about spaces, semicolons and equal sign, as they're used to determine boundaries.

@NewUserHa
Copy link
Contributor

I wonder how browsers do
my last regex expression is to try to validate that the JSON within the cookies is valid when SimpleCookie loads a string from the users. and still reject invalid cookies as much as before.

@pbuckner
Copy link
Author

Understood... but you only think it's json. Maybe the cookie maker has other ideas (given there's no way to specify cookie format such as Content-Type.)

Note I can create a cookie (in browser):

document.cookie = 'a={"b,0);' + document.cookie;

and the string value is stored just like that... {"b,0)

@NewUserHa
Copy link
Contributor

ok..
that makes sense too.

@kbower
Copy link

kbower commented Aug 30, 2023

@pbuckner
Your patch nearly solved my issue, but it ultimately still couldn't handle a JSON-like cookie string which has a space embedded (protected) inside double-quotes.

We'd need that handled correctly, wouldn't we?

Example:

>>> Cookie._CookiePattern.match('a2={"site":"my value"}').groupdict()
{'key': 'a2', 'val': '{"site":"my'}

@pbuckner
Copy link
Author

double quote only works for the full value of the cookie (everything right of the equal sing). In your example, the value is not double quoted, but is {"site":"my value"}. The space, therefore, is not legal. ... Just like a=b c; isn't legal.

Yes, it's "obvious" that "my value" is a single construct, but only if you assume JSON-like semantics. Cookies aren't JSON.

(Personally, I have no problem making the python cookie parser much smarter & think there's little risk, as long as we don't add semantics like converting dates or data-types, adding or removing spaces to look more JSON-like, etc.)

In your case, adding a single character to my suggested added line:

[""" + _LegalValueChars + r"""]+[""" + _LegalValueChars + r' "]*[' + _LegalValueChars + r"""]+      # Any word with internal quotes              
                                                            ^

would suffice.

The difference is adding a single space in the middle clause, between the single-quote and double-quote:
r' "] instead of r'"])

@kbower
Copy link

kbower commented Aug 31, 2023

@pbuckner
Thanks for the regex! We also realized since we control the client we should just base64 encode the cookie.

@nburns
Copy link
Contributor

nburns commented Jan 3, 2024

Took a shot at implementing this by modifying the existing regex for a string wrapped in double quotes: #113663

@Dreamsorcerer
Copy link
Contributor

This discussion suggests that the safest option is probably just to split on semi-colon, then take the entire string (after the first =) and use that as the value.

In summary:

The latest RFC considers " to be an invalid character except when surrounding the entire value (foo="bar";), but doesn't specify any particular behaviour for the user-agent when it does encounter such invalid values.

The original netscape spec did allow " characters and it appears that most browsers still follow this practice.

There were 2 other specs in between those which introduce a quoted-string, but according to that discussion those specs were basically nonsense and nobody really implemented them. Note that both the original and the latest spec do not have any quoted-string (the latest simply allows " to appear around the value, but it doesn't change what the allowed characters can be).

So, for parsing values, it seems splitting on ; (an invalid value character in both original and latest specs) and then just returning the string with no further parsing seems like the safest approach. For anything producing cookies, they should follow the latest spec, which is the most restrictive, so any value following that spec will be valid according to all the previous specs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
Status: No status
Development

No branches or pull requests

6 participants