Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ampersands are escaped by default #46

Closed
signebedi opened this issue Jun 14, 2024 · 6 comments
Closed

Ampersands are escaped by default #46

signebedi opened this issue Jun 14, 2024 · 6 comments

Comments

@signebedi
Copy link

signebedi commented Jun 14, 2024

The default settings for the sanitizer result in ampersands getting escaped. Is this desired behavior?

I wrote an element_postprocessor that takes an element and converts escaped ampersands back to their original format, but this does not fix the issue. I looked at the code base and think there might be two places, both occurring after the element_postprocessors are called in sanitizer.py, that might be causing this issue. First, lxml.html.clean.Cleaner may be causing the issue, see:

lxml.html.clean.Cleaner(

Alternatively, lxml's html serialization might be the issue, see:

html = lxml.html.tostring(doc, encoding="unicode")

I think, between the two, it's the unicode encoding that is causing the issue. But that's just a hunch.

Right now, it might be worth either (1) adding a bool config to the sanitize method allowing ampersands to be unescaped before returning the HTML, (2) giving end users more control over encoding, or (3) making note of this in the documentation so users know to run .replace('&', '&') after sanitizing their HTML.

@matthiask
Copy link
Owner

Thanks. Do you have a testcase which shows the incorrect behavior? I have added 440d8b5 and the behavior seems correct to me.

@signebedi
Copy link
Author

signebedi commented Jun 14, 2024

Sure thing. To be fair, this could be an environment specific issue that I'm running into. I am running Python 3.10.12 on Ubuntu 22.04, and this is part of a project using, as far as is relevant here, FastAPI v.0.111.0 and TinyDB v.4.8.0. I am using html-sanitizer v.2.4.4, installed from pip.

I create a sanitizer object as follows:

from html_sanitizer import Sanitizer

sanitizer_config = {
    'tags': {'a', 'br', 'p', 'strong', 'em', 'ul', 'ol', 'li', 'b', 'i', 'u', 'span', 'div', 'img', 'h1', 'h2', 'h3', 'h4', 'h5'},
    'attributes': {
        'a': ['href', 'title'],
        'img': ['src', 'alt'], 
        'div': ['style'],  # Might mkae sense to sanitize style content separately
    },
    'empty': {'br', 'h1', 'h2', 'h3', 'h4', 'h5'},
    'separate': {'a', 'p', 'ul', 'ol', 'li', 'br', 'img'}, 
    'protocols': {'a': ['http', 'https', 'mailto'], 'img': ['http', 'https']},
}

sanitizer = Sanitizer(sanitizer_config)

Then, when I import this sanitizer and use it against some UTF-8 text, it escapes the ampersand, but other special characters are unaffected:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from path.to.library import sanitizer
>>> sanitizer.sanitize("Some Text & More Text")
'Some Text & More Text'
>>> # But other special characters do not seem to be affected
>>> sanitizer.sanitize("Some Text & More Text with @dditional $pecia! C#haracters )(*^%;")
'Some Text & More Text with @dditional $pecia! C#haracters )(*^%;'

This is also true when I create a sanitizer using the default config:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer()
>>> sanitizer.sanitize("Some Text & More Text with @dditional $pecia! C#haracters )(*^%;")
'Some Text & More Text with @dditional $pecia! C#haracters )(*^%;'

Although, it looks like < and > are also escaped. Again, if this is the desired behavior, that's fair.

>>> sanitizer.sanitize("& < >")
'&amp; &lt; &gt;'

Hope this is helpful. I'm sure most users are probably somewhat unbothered if ampersands are escaped, so my recommendation - which I think should be idempotent - would be to add a force_unescape_special_chars bool param to the sanitize() method, which defaults to False but, when True, replaces escaped special characters with unescaped ones right before returning. This isn't necessarily the most elegant solution, though... Maybe giving a little more control over the tostring() encoding options might also solve the issue. The result of these changes would be that the default behavior remains the same, but users can customize how they want to handle special characters a little more.

I don't know if you're open to PRs and how you like to receive them. Just let me know and I'm open to forking and playing around with things.

@matthiask
Copy link
Owner

Although, it looks like < and > are also escaped. Again, if this is the desired behavior, that's fair.

Yes, that's desired behavior; I'm not 100% sure what you're trying to do, but generally speaking ampersands and angle brackets have to be escaped, see https://www.w3.org/TR/xhtml1/guidelines.html#C_12

I don't know if you're open to PRs and how you like to receive them. Just let me know and I'm open to forking and playing around with things.

Yes, I'm definitely open to PRs! I like them a lot. In this case I don't see how generating invalid HTML would be an improvement. I don't quite understand why you want to use a HTML sanitizer when you want text in the end, not HTML?

Sometimes I want to post-process HTML to text as well, something like https://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/ or maybe https://pypi.org/project/html2text/ has worked well for those cases. Or, if the few .replace() calls are sufficient that's of course great! But perhaps a more complete solution would work better for you as well.

@signebedi
Copy link
Author

signebedi commented Jun 14, 2024

That is fair! I opened a PR, which you are more than welcome to reject for the very understandable reasons you have just expressed! If you are amenable to the PR, then maybe we rename the setting to something that makes it clear that it will result in incorrect HTML.

My use case is maybe a little unique because we want the data to be sometimes text, sometimes html, so if the answer is to handle these within my own application logic because it's out-of-scope/detrimental for this project, I can definitely do that. It is because my application (a FastAPI RESTful app with a Jinja2 UI built on top) manages data that may contain HTML elements, stores these in a JSON backend, and needs to both render this in the UI and pull the data from the API directly. Rendering the data in the UI has been fine; characters are rendered the way they should be even when properly escaped.

But programmatic users are struggling with the escaping because, among other reasons, it causes some issues with equivalency checks - especially against fields that have an enumerated set of permitted values defined at runtime. It initially made sense for us to sanitize the data when we write it to the database, but maybe we should have done so on reads instead, eg. when a sanitize param is passed to the API...

@matthiask
Copy link
Owner

Thanks for the excellent explanation!

Yeah, I'm going to reject it for this project because it's really geared towards outputting HTML; I definitely had to do the things you're describing too in the past, so I hope the links prove helpful to you!

@signebedi
Copy link
Author

Of course, this makes sense and thank you for taking the time to explain to me. Hopefully, the discussion here is useful to others in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants