Ampersands are escaped by default #46

signebedi · 2024-06-14T15:47:42Z

The default settings for the sanitizer result in ampersands getting escaped. Is this desired behavior?

I wrote an element_postprocessor that takes an element and converts escaped ampersands back to their original format, but this does not fix the issue. I looked at the code base and think there might be two places, both occurring after the element_postprocessors are called in sanitizer.py, that might be causing this issue. First, lxml.html.clean.Cleaner may be causing the issue, see:

html-sanitizer/html_sanitizer/sanitizer.py

Line 419 in 49f04c6

lxml.html.clean.Cleaner(

Alternatively, lxml's html serialization might be the issue, see:

html-sanitizer/html_sanitizer/sanitizer.py

Line 427 in 49f04c6

html = lxml.html.tostring(doc, encoding="unicode")

I think, between the two, it's the unicode encoding that is causing the issue. But that's just a hunch.

Right now, it might be worth either (1) adding a bool config to the sanitize method allowing ampersands to be unescaped before returning the HTML, (2) giving end users more control over encoding, or (3) making note of this in the documentation so users know to run .replace('&', '&') after sanitizing their HTML.

The text was updated successfully, but these errors were encountered:

matthiask · 2024-06-14T16:12:32Z

Thanks. Do you have a testcase which shows the incorrect behavior? I have added 440d8b5 and the behavior seems correct to me.

signebedi · 2024-06-14T18:21:16Z

Sure thing. To be fair, this could be an environment specific issue that I'm running into. I am running Python 3.10.12 on Ubuntu 22.04, and this is part of a project using, as far as is relevant here, FastAPI v.0.111.0 and TinyDB v.4.8.0. I am using html-sanitizer v.2.4.4, installed from pip.

I create a sanitizer object as follows:

from html_sanitizer import Sanitizer

sanitizer_config = {
    'tags': {'a', 'br', 'p', 'strong', 'em', 'ul', 'ol', 'li', 'b', 'i', 'u', 'span', 'div', 'img', 'h1', 'h2', 'h3', 'h4', 'h5'},
    'attributes': {
        'a': ['href', 'title'],
        'img': ['src', 'alt'], 
        'div': ['style'],  # Might mkae sense to sanitize style content separately
    },
    'empty': {'br', 'h1', 'h2', 'h3', 'h4', 'h5'},
    'separate': {'a', 'p', 'ul', 'ol', 'li', 'br', 'img'}, 
    'protocols': {'a': ['http', 'https', 'mailto'], 'img': ['http', 'https']},
}

sanitizer = Sanitizer(sanitizer_config)

Then, when I import this sanitizer and use it against some UTF-8 text, it escapes the ampersand, but other special characters are unaffected:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from path.to.library import sanitizer
>>> sanitizer.sanitize("Some Text & More Text")
'Some Text &amp; More Text'
>>> # But other special characters do not seem to be affected
>>> sanitizer.sanitize("Some Text & More Text with @dditional $pecia! C#haracters )(*^%;")
'Some Text &amp; More Text with @dditional $pecia! C#haracters )(*^%;'

This is also true when I create a sanitizer using the default config:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer()
>>> sanitizer.sanitize("Some Text & More Text with @dditional $pecia! C#haracters )(*^%;")
'Some Text &amp; More Text with @dditional $pecia! C#haracters )(*^%;'

Although, it looks like < and > are also escaped. Again, if this is the desired behavior, that's fair.

>>> sanitizer.sanitize("& < >")
'&amp; &lt; &gt;'

Hope this is helpful. I'm sure most users are probably somewhat unbothered if ampersands are escaped, so my recommendation - which I think should be idempotent - would be to add a force_unescape_special_chars bool param to the sanitize() method, which defaults to False but, when True, replaces escaped special characters with unescaped ones right before returning. This isn't necessarily the most elegant solution, though... Maybe giving a little more control over the tostring() encoding options might also solve the issue. The result of these changes would be that the default behavior remains the same, but users can customize how they want to handle special characters a little more.

I don't know if you're open to PRs and how you like to receive them. Just let me know and I'm open to forking and playing around with things.

matthiask · 2024-06-14T19:03:42Z

Although, it looks like < and > are also escaped. Again, if this is the desired behavior, that's fair.

Yes, that's desired behavior; I'm not 100% sure what you're trying to do, but generally speaking ampersands and angle brackets have to be escaped, see https://www.w3.org/TR/xhtml1/guidelines.html#C_12

I don't know if you're open to PRs and how you like to receive them. Just let me know and I'm open to forking and playing around with things.

Yes, I'm definitely open to PRs! I like them a lot. In this case I don't see how generating invalid HTML would be an improvement. I don't quite understand why you want to use a HTML sanitizer when you want text in the end, not HTML?

Sometimes I want to post-process HTML to text as well, something like https://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/ or maybe https://pypi.org/project/html2text/ has worked well for those cases. Or, if the few .replace() calls are sufficient that's of course great! But perhaps a more complete solution would work better for you as well.

signebedi · 2024-06-14T19:08:31Z

That is fair! I opened a PR, which you are more than welcome to reject for the very understandable reasons you have just expressed! If you are amenable to the PR, then maybe we rename the setting to something that makes it clear that it will result in incorrect HTML.

My use case is maybe a little unique because we want the data to be sometimes text, sometimes html, so if the answer is to handle these within my own application logic because it's out-of-scope/detrimental for this project, I can definitely do that. It is because my application (a FastAPI RESTful app with a Jinja2 UI built on top) manages data that may contain HTML elements, stores these in a JSON backend, and needs to both render this in the UI and pull the data from the API directly. Rendering the data in the UI has been fine; characters are rendered the way they should be even when properly escaped.

But programmatic users are struggling with the escaping because, among other reasons, it causes some issues with equivalency checks - especially against fields that have an enumerated set of permitted values defined at runtime. It initially made sense for us to sanitize the data when we write it to the database, but maybe we should have done so on reads instead, eg. when a sanitize param is passed to the API...

matthiask · 2024-06-14T19:18:49Z

Thanks for the excellent explanation!

Yeah, I'm going to reject it for this project because it's really geared towards outputting HTML; I definitely had to do the things you're describing too in the past, so I hope the links prove helpful to you!

signebedi · 2024-06-14T19:20:09Z

Of course, this makes sense and thank you for taking the time to explain to me. Hopefully, the discussion here is useful to others in the future.

This was referenced Jun 14, 2024

handle special characters in display of form submissions, so e.g. curly apostrophe does not show up as ' signebedi/libreforms-fastapi#245

Closed

Using html-sanitize still escapes ampersands signebedi/libreforms-fastapi#258

Closed

signebedi added a commit to signebedi/html-sanitizer that referenced this issue Jun 14, 2024

Added: setting for unescaping special chars (matthiask#46)

f3f71ea

signebedi mentioned this issue Jun 14, 2024

Added: setting for unescaping special chars (#46) #47

Open

signebedi added a commit to signebedi/html-sanitizer that referenced this issue Jun 14, 2024

Added: special character setting to docs (matthiask#46)

096e0bd

signebedi closed this as completed Jun 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ampersands are escaped by default #46

Ampersands are escaped by default #46

signebedi commented Jun 14, 2024 •

edited

Loading

matthiask commented Jun 14, 2024

signebedi commented Jun 14, 2024 •

edited

Loading

matthiask commented Jun 14, 2024

signebedi commented Jun 14, 2024 •

edited

Loading

matthiask commented Jun 14, 2024

signebedi commented Jun 14, 2024

Ampersands are escaped by default #46

Ampersands are escaped by default #46

Comments

signebedi commented Jun 14, 2024 • edited Loading

matthiask commented Jun 14, 2024

signebedi commented Jun 14, 2024 • edited Loading

matthiask commented Jun 14, 2024

signebedi commented Jun 14, 2024 • edited Loading

matthiask commented Jun 14, 2024

signebedi commented Jun 14, 2024

signebedi commented Jun 14, 2024 •

edited

Loading

signebedi commented Jun 14, 2024 •

edited

Loading

signebedi commented Jun 14, 2024 •

edited

Loading