-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace vendored html5lib with stdlib #10291
Conversation
I have not read your implementation, but html5lib covers a ton of edge cases that |
While I do have some concerns here (mostly along the lines of "why did we vendor html5lib in the first place if that is all we use it for?") I will say that the replacement code (which I've only skimmed so far) is based on the stdlib In particular, the index format is intended to be relatively straightforward to parse, so I think it's reasonable to expect the stdlib parser to work¹. And honestly, I'm not worried about possible security vulnerabilities - if we're not entitled to assume that the Python stdlib has no security vulnerabilities, we're pretty much stuffed. In principle, I'm +1 on this change. (My other motive here is that from reading a couple of the html5lib issues, it feels like removing our dependency would relieve a certain amount of pressure on the html5lib maintainer(s) who don't really want to be "that person in https://xkcd.com/2347/") ¹ I'd go so far as to be OK with amending PEP 503 to say that an index page must be parsable using the stdlib parser, if that helped... |
The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML. It is merely looking for If there is specific HTML that we know will not be handled by stdlib but is handled by html5lib, then I think we should get it in test suite regardless of how this PR plays out. It would make proof-of-concepts easier in the future. Do you have one such example? I do not.
html5lib was first introduced in PR #973. This PR states: "Use html5lib to parse HTML instead of using regexs". So, it looks to me like stdlib was either less capable in 2013 or it was not considered at that time. |
Maybe we should ask @dstufft (IIRC he explained somewhere why html5lib is better a while ago)
IIRC html5lib already does that (the reason it’s called html5lib in the first place), and if we remove it we’ll need to do some validation ourselves. |
Sorry, I'm not sure I follow the point. In response to "The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML" you refer to validation, but I don't know what specific validation is a concern to pip. pip just needs to find Can you provide a specific example of invalid HTML document or fragment where pip with html5lib does the right but stdlib fails? I'd like to get such an example in the test suite. |
Perhaps we could do a release that switches to the stdlib parser by default and has |
The Simple Repository API specification (PEP 503) says the page must be a valid HTML5 document. Pip’s parsing logic only finds The other, easier route for pip is to forgo that validation part. This means pip will start to accept technically invalid PEP 503 inputs, which is similar to web browsers implements JavaScript not conforming to specs. Since pip is to Python packaging like Google Chrome is to web right now (or IE if you’re old like me), we deliberately accepting bad inputs would allow them be able to say “but pip accepts this” and put alternative PEP 503 clients in an awkward position. This is what I want to avoid. |
Another aspect to consider is performance. |
Honestly, my view is that it's not up to pip to do the validation, and we have every right to consider the behaviour if users supply a non-conformant index as being "undefined". It's a position we've been pushing relatively consistently (the idea that pip's behaviour is defined by the standards, not the other way around) so I think this fits that approach. I guess if it really bothers you, we could add def handle_decl(self, decl):
# Fail more gracefully here if you want
assert decl == "DOCTYPE html" It's not exactly difficult - and I'm definitely OK with blaming the user if their index claims to be HTML 5 but isn't... |
@uranusjr Thanks for the deeper explanation. I understand your POV better now. FWIW, html5lib is a non-validating parser. Here is an unrelated issue I filed back in 2016 to demonstrate: html5lib/html5lib-python#283 With some effort, you can find other such examples that parse in html5lib and do not pass an HTML5 validator such as https://validator.w3.org/nu/. |
Doesn't |
That would be an issue, but without specifics I'd like to assume that counts as a CPython bug (and quite possibly one that's been fixed by now). Also, while it's technically acceptable for an index to serve arbitrary HTML, realistically we're only likely to see relatively straightforward content. I think we can probably assume it's fine and deal with bug reports if they happen (although I do agree it might be reasonable to provide a deprecation period so we can tease out such cases - although my instinct is we won't see any). |
My 2 cents: Let's do this and keep html5lib-based parsing behind a flag for a release. Check that the I definitely empathise with the argument that "what pip accepts == what works" is how stuff works, but there's absolutely no reason to believe that index pages will have sufficent edge-cases to behave incorrectly. We should probably at least keep the html5lib codepath for 3 months (i.e. one release) behind a |
I can explore the deprecation route and incorporate the other suggestions. Moving to draft until I have some time to look into it. |
I’m suddenly remembering there’s actually already an implementation for this: https://github.com/brettcannon/mousebender Since iirc @brettcannon already expressed interest including this in I think we can even drop some more logic in pip if we use mousebender. |
There's pypa/packaging#424 to upstream the Simple Index code from mousebender to 'packaging', but it never got much traction. I'm still up for seeing that happen to get more support from folks and so I can experiment in other things in mousebender and to not have any dependency limitations. But if you all want to depend on mousebender directly then let me know and I can check with my co-maintainer on how he feels about that. We would probably switch to CalVer and minimize external dependencies in that instance. |
Changes:
I believe all feedback has been addressed, so this is ready for more review. Thanks! |
Thanks @uranusjr and @pradyunsg. I applied the latest suggestions. |
src/pip/_internal/index/collector.py
Outdated
raise ValueError( | ||
"HTML doctype missing or incorrect. Expected <!DOCTYPE html>.\n\n" | ||
"If you believe this error to be incorrect, try passing the " | ||
"command line option --use-deprecated=html5lib." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also tell the user to report this somewhere. (Let's do it when we decide this is ready for merge.)
FWIW, yet-another reason to do this is that the html5lib dependency is the blocker for us to drop chardet, which is the largest dependency that we vendor, and also helps progress toward #9824's motivating licensing concerns; now that requests supports using charset_normalizer. :) |
@jdufresne Could you rebase this on the current |
FYI: the tests aren't happy. |
Sorry about that. @pradyunsg Are you able to put the finishing touches on this PR? Sorry, but I no longer have the time to dedicate to this work. If that means it won't make the release, I understand. |
Neato. I'll try and pick this up in time for release. |
The html5lib library isn't strictly required as the same functionality can be achieved through the stdlib html.parser module. The html5lib is one of the largest uses of the six library. By dropping this unnecessary dependency, the pip project is closer to dropping the six library. Additionally, html5lib maintenance has slowed down and the project has rejected pull requests to drop Python 2 support. For now, the html5lib code remains, but is gated behind a command line option: `--use-deprecated=html5lib`. After a sufficient amount of time has passed without any reported bugs, the vendored library and this flag can be removed completely.
Ahahahah, looks like nearly everyone except PyPI is passing non-compliant pages to pip. :( |
Bumps [pip](https://github.com/pypa/pip) from 21.3.1 to 22.0.2. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/pypa/pip/blob/main/NEWS.rst">pip's changelog</a>.</em></p> <blockquote> <h1>22.0.2 (2022-01-30)</h1> <h2>Deprecations and Removals</h2> <ul> <li>Instead of failing on index pages that use non-compliant HTML 5, print a deprecation warning and fall back to <code>html5lib</code>-based parsing for now. This simplifies the migration for non-compliant index pages, by letting such indexes function with a warning. (<code>[#10847](pypa/pip#10847) <https://github.com/pypa/pip/issues/10847></code>_)</li> </ul> <h1>22.0.1 (2022-01-30)</h1> <h2>Bug Fixes</h2> <ul> <li>Accept lowercase <code><!doctype html></code> on index pages. (<code>[#10844](pypa/pip#10844) <https://github.com/pypa/pip/issues/10844></code>_)</li> <li>Properly handle links parsed by html5lib, when using <code>--use-deprecated=html5lib</code>. (<code>[#10846](pypa/pip#10846) <https://github.com/pypa/pip/issues/10846></code>_)</li> </ul> <h1>22.0 (2022-01-29)</h1> <h2>Process</h2> <ul> <li>Completely replace :pypi:<code>tox</code> in our development workflow, with :pypi:<code>nox</code>.</li> </ul> <h2>Deprecations and Removals</h2> <ul> <li> <p>Deprecate alternative progress bar styles, leaving only <code>on</code> and <code>off</code> as available choices. (<code>[#10462](pypa/pip#10462) <https://github.com/pypa/pip/issues/10462></code>_)</p> </li> <li> <p>Drop support for Python 3.6. (<code>[#10641](pypa/pip#10641) <https://github.com/pypa/pip/issues/10641></code>_)</p> </li> <li> <p>Disable location mismatch warnings on Python versions prior to 3.10.</p> <p>These warnings were helping identify potential issues as part of the sysconfig -> distutils transition, and we no longer need to rely on reports from older Python versions for information on the transition. (<code>[#10840](pypa/pip#10840) <https://github.com/pypa/pip/issues/10840></code>_)</p> </li> </ul> <h2>Features</h2> <ul> <li> <p>Changed <code>PackageFinder</code> to parse HTML documents using the stdlib :class:<code>html.parser.HTMLParser</code> class instead of the <code>html5lib</code> package.</p> <p>For now, the deprecated <code>html5lib</code> code remains and can be used with the <code>--use-deprecated=html5lib</code> command line option. However, it will be removed in a future pip release. (<code>[#10291](pypa/pip#10291) <https://github.com/pypa/pip/issues/10291></code>_)</p> </li> <li> <p>Utilise <code>rich</code> for presenting pip's default download progress bar. (<code>[#10462](pypa/pip#10462) <https://github.com/pypa/pip/issues/10462></code>_)</p> </li> <li> <p>Present a better error message when an invalid wheel file is encountered, providing more context where the invalid wheel file is. (<code>[#10535](pypa/pip#10535) <https://github.com/pypa/pip/issues/10535></code>_)</p> </li> <li> <p>Documents the <code>--require-virtualenv</code> flag for <code>pip install</code>. (<code>[#10588](pypa/pip#10588) <https://github.com/pypa/pip/issues/10588></code>_)</p> </li> <li> <p><code>pip install <tab></code> autocompletes paths. (<code>[#10646](pypa/pip#10646) <https://github.com/pypa/pip/issues/10646></code>_)</p> </li> <li> <p>Allow Python distributors to opt-out from or opt-in to the <code>sysconfig</code> installation scheme backend by setting <code>sysconfig._PIP_USE_SYSCONFIG</code> to <code>True</code> or <code>False</code>. (<code>[#10647](pypa/pip#10647) <https://github.com/pypa/pip/issues/10647></code>_)</p> </li> <li> <p>Make it possible to deselect tests requiring cryptography package on systems where it cannot be installed. (<code>[#10686](pypa/pip#10686) <https://github.com/pypa/pip/issues/10686></code>_)</p> </li> <li> <p>Start using Rich for presenting error messages in a consistent format. (<code>[#10703](pypa/pip#10703) <https://github.com/pypa/pip/issues/10703></code>_)</p> </li> <li> <p>Improve presentation of errors from subprocesses. (<code>[#10705](pypa/pip#10705) <https://github.com/pypa/pip/issues/10705></code>_)</p> </li> </ul> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/pypa/pip/commit/c721f03190ef888711e8e9efe6ca8345ec6464f3"><code>c721f03</code></a> Bump for release</li> <li><a href="https://github.com/pypa/pip/commit/844b799c9cf68629cc3c45b75c6e3b7f41086d49"><code>844b799</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10847">#10847</a> from pradyunsg/better-html5lib-fallback</li> <li><a href="https://github.com/pypa/pip/commit/a78845ab3387adfe590e2a0ae044a6d3b20ada55"><code>a78845a</code></a> Pacify functional tests that don't start with <code>\<!doctype html></code></li> <li><a href="https://github.com/pypa/pip/commit/c3a42f0679d06bfdb3475801618273be5bbce1e8"><code>c3a42f0</code></a> 📰</li> <li><a href="https://github.com/pypa/pip/commit/c01b0b2729ee096c73416059de825e2f2f01bed9"><code>c01b0b2</code></a> Gracefully fallback to html5lib for parsing non-compliant index pages</li> <li><a href="https://github.com/pypa/pip/commit/cc35c930b2d7096babb267ddce9382c30c58c7e1"><code>cc35c93</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10850">#10850</a> from pradyunsg/release/22.0.1</li> <li><a href="https://github.com/pypa/pip/commit/1b6ef5d0b387d84b1909799d1e18d0b4431038c3"><code>1b6ef5d</code></a> Bump for development</li> <li><a href="https://github.com/pypa/pip/commit/c73ac8d6bcf4f64041cafeccd2125cca052abed9"><code>c73ac8d</code></a> Bump for release</li> <li><a href="https://github.com/pypa/pip/commit/9a9c1def6e3bba1f2a860874c8c73b5c55f7f43c"><code>9a9c1de</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10846">#10846</a> from pradyunsg/fix-html5lib-fallback</li> <li><a href="https://github.com/pypa/pip/commit/80609e8c20a8db26c97037f252b29307ab44b0e2"><code>80609e8</code></a> Properly yield results from <code>html5lib</code> parsing</li> <li>Additional commits viewable in <a href="https://github.com/pypa/pip/compare/21.3.1...22.0.2">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pip&package-manager=pip&previous-version=21.3.1&new-version=22.0.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details>
The html5lib library isn't strictly required as the same functionality
can be achieved through the stdlib html.parser module.
The html5lib is one of the largest uses of the six library. By dropping
this unnecessary dependency, the pip project is closer to dropping the
six library.
Additionally, html5lib maintenance has slowed down and the project has
rejected pull requests to drop Python 2 support.
For now, the html5lib code remains, but is hidden behind the command
line options: --use-deprecated=html5lib. After a sufficient amount of
time has passed without any reported bugs, the vendored library can be
removed completely.