-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ampersands are escaped by default #46
Comments
Thanks. Do you have a testcase which shows the incorrect behavior? I have added 440d8b5 and the behavior seems correct to me. |
Sure thing. To be fair, this could be an environment specific issue that I'm running into. I am running Python 3.10.12 on Ubuntu 22.04, and this is part of a project using, as far as is relevant here, FastAPI v.0.111.0 and TinyDB v.4.8.0. I am using html-sanitizer v.2.4.4, installed from pip. I create a sanitizer object as follows: from html_sanitizer import Sanitizer
sanitizer_config = {
'tags': {'a', 'br', 'p', 'strong', 'em', 'ul', 'ol', 'li', 'b', 'i', 'u', 'span', 'div', 'img', 'h1', 'h2', 'h3', 'h4', 'h5'},
'attributes': {
'a': ['href', 'title'],
'img': ['src', 'alt'],
'div': ['style'], # Might mkae sense to sanitize style content separately
},
'empty': {'br', 'h1', 'h2', 'h3', 'h4', 'h5'},
'separate': {'a', 'p', 'ul', 'ol', 'li', 'br', 'img'},
'protocols': {'a': ['http', 'https', 'mailto'], 'img': ['http', 'https']},
}
sanitizer = Sanitizer(sanitizer_config) Then, when I import this sanitizer and use it against some UTF-8 text, it escapes the ampersand, but other special characters are unaffected:
This is also true when I create a sanitizer using the default config:
Although, it looks like < and > are also escaped. Again, if this is the desired behavior, that's fair.
Hope this is helpful. I'm sure most users are probably somewhat unbothered if ampersands are escaped, so my recommendation - which I think should be idempotent - would be to add a I don't know if you're open to PRs and how you like to receive them. Just let me know and I'm open to forking and playing around with things. |
Yes, that's desired behavior; I'm not 100% sure what you're trying to do, but generally speaking ampersands and angle brackets have to be escaped, see https://www.w3.org/TR/xhtml1/guidelines.html#C_12
Yes, I'm definitely open to PRs! I like them a lot. In this case I don't see how generating invalid HTML would be an improvement. I don't quite understand why you want to use a HTML sanitizer when you want text in the end, not HTML? Sometimes I want to post-process HTML to text as well, something like https://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/ or maybe https://pypi.org/project/html2text/ has worked well for those cases. Or, if the few |
That is fair! I opened a PR, which you are more than welcome to reject for the very understandable reasons you have just expressed! If you are amenable to the PR, then maybe we rename the setting to something that makes it clear that it will result in incorrect HTML. My use case is maybe a little unique because we want the data to be sometimes text, sometimes html, so if the answer is to handle these within my own application logic because it's out-of-scope/detrimental for this project, I can definitely do that. It is because my application (a FastAPI RESTful app with a Jinja2 UI built on top) manages data that may contain HTML elements, stores these in a JSON backend, and needs to both render this in the UI and pull the data from the API directly. Rendering the data in the UI has been fine; characters are rendered the way they should be even when properly escaped. But programmatic users are struggling with the escaping because, among other reasons, it causes some issues with equivalency checks - especially against fields that have an enumerated set of permitted values defined at runtime. It initially made sense for us to sanitize the data when we write it to the database, but maybe we should have done so on reads instead, eg. when a |
Thanks for the excellent explanation! Yeah, I'm going to reject it for this project because it's really geared towards outputting HTML; I definitely had to do the things you're describing too in the past, so I hope the links prove helpful to you! |
Of course, this makes sense and thank you for taking the time to explain to me. Hopefully, the discussion here is useful to others in the future. |
The default settings for the sanitizer result in ampersands getting escaped. Is this desired behavior?
I wrote an element_postprocessor that takes an element and converts escaped ampersands back to their original format, but this does not fix the issue. I looked at the code base and think there might be two places, both occurring after the element_postprocessors are called in sanitizer.py, that might be causing this issue. First,
lxml.html.clean.Cleaner
may be causing the issue, see:html-sanitizer/html_sanitizer/sanitizer.py
Line 419 in 49f04c6
Alternatively, lxml's html serialization might be the issue, see:
html-sanitizer/html_sanitizer/sanitizer.py
Line 427 in 49f04c6
I think, between the two, it's the unicode encoding that is causing the issue. But that's just a hunch.
Right now, it might be worth either (1) adding a bool config to the sanitize method allowing ampersands to be unescaped before returning the HTML, (2) giving end users more control over encoding, or (3) making note of this in the documentation so users know to run
.replace('&', '&')
after sanitizing their HTML.The text was updated successfully, but these errors were encountered: