Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace vendored html5lib with stdlib #10291

Merged
merged 2 commits into from
Jan 28, 2022
Merged

Replace vendored html5lib with stdlib #10291

merged 2 commits into from
Jan 28, 2022

Conversation

jdufresne
Copy link
Contributor

@jdufresne jdufresne commented Aug 8, 2021

The html5lib library isn't strictly required as the same functionality
can be achieved through the stdlib html.parser module.

The html5lib is one of the largest uses of the six library. By dropping
this unnecessary dependency, the pip project is closer to dropping the
six library.

Additionally, html5lib maintenance has slowed down and the project has
rejected pull requests to drop Python 2 support.

For now, the html5lib code remains, but is hidden behind the command
line options: --use-deprecated=html5lib. After a sufficient amount of
time has passed without any reported bugs, the vendored library can be
removed completely.

@uranusjr
Copy link
Member

uranusjr commented Aug 9, 2021

the same functionality can be achieved through the stdlib html.parser module

I have not read your implementation, but html5lib covers a ton of edge cases that html doesn’t that we’ll need to replicate. If we’re dropping html5lib, I’d prefer doing it with a new library that is able toe replicate html5lib (e.g. passing its test suite or some kind of HTML5 spec validator) and vendor it in. HTML is a very hairy format to parse.

@pfmoore
Copy link
Member

pfmoore commented Aug 9, 2021

While I do have some concerns here (mostly along the lines of "why did we vendor html5lib in the first place if that is all we use it for?") I will say that the replacement code (which I've only skimmed so far) is based on the stdlib html.parser module and I think it's entirely reasonable for us to assume that's a valid HTML parser (it claims to be a "simple HTML and XHTML parser", but I don't see why "simple" should be taken to mean "potentially buggy").

In particular, the index format is intended to be relatively straightforward to parse, so I think it's reasonable to expect the stdlib parser to work¹. And honestly, I'm not worried about possible security vulnerabilities - if we're not entitled to assume that the Python stdlib has no security vulnerabilities, we're pretty much stuffed.

In principle, I'm +1 on this change.

(My other motive here is that from reading a couple of the html5lib issues, it feels like removing our dependency would relieve a certain amount of pressure on the html5lib maintainer(s) who don't really want to be "that person in https://xkcd.com/2347/")

¹ I'd go so far as to be OK with amending PEP 503 to say that an index page must be parsable using the stdlib parser, if that helped...

@jdufresne
Copy link
Contributor Author

jdufresne commented Aug 10, 2021

I have not read your implementation, but html5lib covers a ton of edge cases that html doesn’t that we’ll need to replicate. If we’re dropping html5lib, I’d prefer doing it with a new library that is able toe replicate html5lib (e.g. passing its test suite or some kind of HTML5 spec validator) and vendor it in. HTML is a very hairy format to parse.

The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML. It is merely looking for <a> and <base> elements which the stdlib HTML parser is suited to doing. If those <a> elements happen to exist in an illegal context (as far as WHATWG is concerned) then I don't see why pip really cares, just read the HREF.

If there is specific HTML that we know will not be handled by stdlib but is handled by html5lib, then I think we should get it in test suite regardless of how this PR plays out. It would make proof-of-concepts easier in the future. Do you have one such example? I do not.

why did we vendor html5lib in the first place if that is all we use it for?

html5lib was first introduced in PR #973. This PR states: "Use html5lib to parse HTML instead of using regexs". So, it looks to me like stdlib was either less capable in 2013 or it was not considered at that time.

@uranusjr
Copy link
Member

Maybe we should ask @dstufft (IIRC he explained somewhere why html5lib is better a while ago)

The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML.

IIRC html5lib already does that (the reason it’s called html5lib in the first place), and if we remove it we’ll need to do some validation ourselves.

@jdufresne
Copy link
Contributor Author

The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML.

IIRC html5lib already does that (the reason it’s called html5lib in the first place), and if we remove it we’ll need to do some validation ourselves.

Sorry, I'm not sure I follow the point. In response to "The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML" you refer to validation, but I don't know what specific validation is a concern to pip. pip just needs to find a elements. I appreciate that html5lib is more sophisticated, I'm just not sure that matters to pip. Where does that behavior come into play for merely finding a elements?

Can you provide a specific example of invalid HTML document or fragment where pip with html5lib does the right but stdlib fails? I'd like to get such an example in the test suite.

@sbidoul
Copy link
Member

sbidoul commented Aug 11, 2021

Perhaps we could do a release that switches to the stdlib parser by default and has --use-deprecated=html5lib (or some better flag name that does not make people think that html5lib itself is deprecated).

@uranusjr
Copy link
Member

Sorry, I'm not sure I follow the point. In response to "The code using html5lib doesn't care about standard compliant HTML5 so long as it is parsable as HTML" you refer to validation, but I don't know what specific validation is a concern to pip. pip just needs to find a elements. I appreciate that html5lib is more sophisticated, I'm just not sure that matters to pip. Where does that behaviour come into play for merely finding a elements?

The Simple Repository API specification (PEP 503) says the page must be a valid HTML5 document. Pip’s parsing logic only finds <a> tags because html5lib already performs that validation for us, so when pip’s code is accessed, we already know we’re dealing with a valid HTML5 and can simply proceed onto other parts of the API. So if we drop html5lib, pip has two choices: The first is to stick to the specification and re-implement those HTML5 validations. This is not technically difficult, but also arguably should be a separate lib.

The other, easier route for pip is to forgo that validation part. This means pip will start to accept technically invalid PEP 503 inputs, which is similar to web browsers implements JavaScript not conforming to specs. Since pip is to Python packaging like Google Chrome is to web right now (or IE if you’re old like me), we deliberately accepting bad inputs would allow them be able to say “but pip accepts this” and put alternative PEP 503 clients in an awkward position. This is what I want to avoid.

@sbidoul
Copy link
Member

sbidoul commented Aug 11, 2021

Another aspect to consider is performance.

@pfmoore
Copy link
Member

pfmoore commented Aug 11, 2021

The other, easier route for pip is to forgo that validation part.

Honestly, my view is that it's not up to pip to do the validation, and we have every right to consider the behaviour if users supply a non-conformant index as being "undefined". It's a position we've been pushing relatively consistently (the idea that pip's behaviour is defined by the standards, not the other way around) so I think this fits that approach.

I guess if it really bothers you, we could add

def handle_decl(self, decl):
    # Fail more gracefully here if you want
    assert decl == "DOCTYPE html"

It's not exactly difficult - and I'm definitely OK with blaming the user if their index claims to be HTML 5 but isn't...

@jdufresne
Copy link
Contributor Author

jdufresne commented Aug 11, 2021

@uranusjr Thanks for the deeper explanation. I understand your POV better now.

FWIW, html5lib is a non-validating parser. Here is an unrelated issue I filed back in 2016 to demonstrate: html5lib/html5lib-python#283

With some effort, you can find other such examples that parse in html5lib and do not pass an HTML5 validator such as https://validator.w3.org/nu/.

@dstufft
Copy link
Member

dstufft commented Aug 11, 2021

Doesn't html.parser fail on some well formed HTML5 output? I could be remembering wrong, but I thought that was a problem with it.

@pfmoore
Copy link
Member

pfmoore commented Aug 11, 2021

That would be an issue, but without specifics I'd like to assume that counts as a CPython bug (and quite possibly one that's been fixed by now). Also, while it's technically acceptable for an index to serve arbitrary HTML, realistically we're only likely to see relatively straightforward content.

I think we can probably assume it's fine and deal with bug reports if they happen (although I do agree it might be reasonable to provide a deprecation period so we can tease out such cases - although my instinct is we won't see any).

@pradyunsg
Copy link
Member

pradyunsg commented Aug 11, 2021

My 2 cents: Let's do this and keep html5lib-based parsing behind a flag for a release.

Check that the decl is correct should be sufficient IMO. I don't expect we'd see any show-stoppers, and if someone is passing horrible pages to pip, well, IMO that's a problem we can deal with when it happens.

I definitely empathise with the argument that "what pip accepts == what works" is how stuff works, but there's absolutely no reason to believe that index pages will have sufficent edge-cases to behave incorrectly.

We should probably at least keep the html5lib codepath for 3 months (i.e. one release) behind a --use-deprecated=... to act as an escape hatch for someone affected by this. That way, we'd at least have a vent for letting the not-so-noisy users pass through without pain.

@jdufresne jdufresne marked this pull request as draft August 11, 2021 19:43
@jdufresne
Copy link
Contributor Author

I can explore the deprecation route and incorporate the other suggestions. Moving to draft until I have some time to look into it.

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Aug 15, 2021
@uranusjr
Copy link
Member

I’m suddenly remembering there’s actually already an implementation for this: https://github.com/brettcannon/mousebender

Since iirc @brettcannon already expressed interest including this in packaging, I think he’s probably not that too worried being that person living in Nebraska 🙂

I think we can even drop some more logic in pip if we use mousebender.

@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Aug 20, 2021
@brettcannon
Copy link
Member

There's pypa/packaging#424 to upstream the Simple Index code from mousebender to 'packaging', but it never got much traction. I'm still up for seeing that happen to get more support from folks and so I can experiment in other things in mousebender and to not have any dependency limitations.

But if you all want to depend on mousebender directly then let me know and I can check with my co-maintainer on how he feels about that. We would probably switch to CalVer and minimize external dependencies in that instance.

@jdufresne jdufresne marked this pull request as ready for review August 26, 2021 20:30
@jdufresne
Copy link
Contributor Author

Changes:

  • Rebased on latest main branch
  • Added the doctype checker as suggested in Replace vendored html5lib with stdlib #10291 (comment)
  • Added the --use-deprecated=html5lib command line option to allow a deprecation period
  • Added some additional tests around the new changes
  • Added a news entry

I believe all feedback has been addressed, so this is ready for more review. Thanks!

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Sep 5, 2021
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Sep 5, 2021
@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Sep 26, 2021
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Sep 26, 2021
@pradyunsg pradyunsg added C: finder PackageFinder and index related code type: deprecation Related to deprecation / removal. type: enhancement Improvements to functionality labels Sep 26, 2021
@jdufresne
Copy link
Contributor Author

Thanks @uranusjr and @pradyunsg. I applied the latest suggestions.

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Oct 2, 2021
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Oct 2, 2021
Comment on lines 426 to 431
raise ValueError(
"HTML doctype missing or incorrect. Expected <!DOCTYPE html>.\n\n"
"If you believe this error to be incorrect, try passing the "
"command line option --use-deprecated=html5lib."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also tell the user to report this somewhere. (Let's do it when we decide this is ready for merge.)

@pradyunsg
Copy link
Member

pradyunsg commented Oct 16, 2021

FWIW, yet-another reason to do this is that the html5lib dependency is the blocker for us to drop chardet, which is the largest dependency that we vendor, and also helps progress toward #9824's motivating licensing concerns; now that requests supports using charset_normalizer. :)

@pradyunsg
Copy link
Member

@jdufresne Could you rebase this on the current main and include a link to #10825, in the spot requested by @uranusjr?

@pradyunsg
Copy link
Member

FYI: the tests aren't happy.

@jdufresne
Copy link
Contributor Author

jdufresne commented Jan 26, 2022

Sorry about that.

@pradyunsg Are you able to put the finishing touches on this PR? Sorry, but I no longer have the time to dedicate to this work. If that means it won't make the release, I understand.

@pradyunsg
Copy link
Member

Neato. I'll try and pick this up in time for release.

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Jan 27, 2022
The html5lib library isn't strictly required as the same functionality
can be achieved through the stdlib html.parser module.

The html5lib is one of the largest uses of the six library. By dropping
this unnecessary dependency, the pip project is closer to dropping the
six library.

Additionally, html5lib maintenance has slowed down and the project has
rejected pull requests to drop Python 2 support.

For now, the html5lib code remains, but is gated behind a command
line option: `--use-deprecated=html5lib`. After a sufficient amount of
time has passed without any reported bugs, the vendored library and this
flag can be removed completely.
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Jan 28, 2022
@pradyunsg pradyunsg merged commit 649048b into pypa:main Jan 28, 2022
@pradyunsg
Copy link
Member

Check that the decl is correct should be sufficient IMO. I don't expect we'd see any show-stoppers, and if someone is passing horrible pages to pip, well, IMO that's a problem we can deal with when it happens.

I definitely empathise with the argument that "what pip accepts == what works" is how stuff works, but there's absolutely no reason to believe that index pages will have sufficent edge-cases to behave incorrectly.

Ahahahah, looks like nearly everyone except PyPI is passing non-compliant pages to pip. :(

mergify bot pushed a commit to andrewbolster/bolster that referenced this pull request Jan 31, 2022
Bumps [pip](https://github.com/pypa/pip) from 21.3.1 to 22.0.2.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/pip/blob/main/NEWS.rst">pip's changelog</a>.</em></p>
<blockquote>
<h1>22.0.2 (2022-01-30)</h1>
<h2>Deprecations and Removals</h2>
<ul>
<li>Instead of failing on index pages that use non-compliant HTML 5, print a deprecation warning and fall back to <code>html5lib</code>-based parsing for now. This simplifies the migration for non-compliant index pages, by letting such indexes function with a warning. (<code>[#10847](pypa/pip#10847) &lt;https://github.com/pypa/pip/issues/10847&gt;</code>_)</li>
</ul>
<h1>22.0.1 (2022-01-30)</h1>
<h2>Bug Fixes</h2>
<ul>
<li>Accept lowercase <code>&lt;!doctype html&gt;</code> on index pages. (<code>[#10844](pypa/pip#10844) &lt;https://github.com/pypa/pip/issues/10844&gt;</code>_)</li>
<li>Properly handle links parsed by html5lib, when using <code>--use-deprecated=html5lib</code>. (<code>[#10846](pypa/pip#10846) &lt;https://github.com/pypa/pip/issues/10846&gt;</code>_)</li>
</ul>
<h1>22.0 (2022-01-29)</h1>
<h2>Process</h2>
<ul>
<li>Completely replace :pypi:<code>tox</code> in our development workflow, with :pypi:<code>nox</code>.</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Deprecate alternative progress bar styles, leaving only <code>on</code> and <code>off</code> as available choices. (<code>[#10462](pypa/pip#10462) &lt;https://github.com/pypa/pip/issues/10462&gt;</code>_)</p>
</li>
<li>
<p>Drop support for Python 3.6. (<code>[#10641](pypa/pip#10641) &lt;https://github.com/pypa/pip/issues/10641&gt;</code>_)</p>
</li>
<li>
<p>Disable location mismatch warnings on Python versions prior to 3.10.</p>
<p>These warnings were helping identify potential issues as part of the sysconfig -&gt; distutils transition, and we no longer need to rely on reports from older Python versions for information on the transition. (<code>[#10840](pypa/pip#10840) &lt;https://github.com/pypa/pip/issues/10840&gt;</code>_)</p>
</li>
</ul>
<h2>Features</h2>
<ul>
<li>
<p>Changed <code>PackageFinder</code> to parse HTML documents using the stdlib :class:<code>html.parser.HTMLParser</code> class instead of the <code>html5lib</code> package.</p>
<p>For now, the deprecated <code>html5lib</code> code remains and can be used with the <code>--use-deprecated=html5lib</code> command line option. However, it will be removed in a future pip release. (<code>[#10291](pypa/pip#10291) &lt;https://github.com/pypa/pip/issues/10291&gt;</code>_)</p>
</li>
<li>
<p>Utilise <code>rich</code> for presenting pip's default download progress bar. (<code>[#10462](pypa/pip#10462) &lt;https://github.com/pypa/pip/issues/10462&gt;</code>_)</p>
</li>
<li>
<p>Present a better error message when an invalid wheel file is encountered, providing more context where the invalid wheel file is. (<code>[#10535](pypa/pip#10535) &lt;https://github.com/pypa/pip/issues/10535&gt;</code>_)</p>
</li>
<li>
<p>Documents the <code>--require-virtualenv</code> flag for <code>pip install</code>. (<code>[#10588](pypa/pip#10588) &lt;https://github.com/pypa/pip/issues/10588&gt;</code>_)</p>
</li>
<li>
<p><code>pip install &lt;tab&gt;</code> autocompletes paths. (<code>[#10646](pypa/pip#10646) &lt;https://github.com/pypa/pip/issues/10646&gt;</code>_)</p>
</li>
<li>
<p>Allow Python distributors to opt-out from or opt-in to the <code>sysconfig</code> installation scheme backend by setting <code>sysconfig._PIP_USE_SYSCONFIG</code> to <code>True</code> or <code>False</code>. (<code>[#10647](pypa/pip#10647) &lt;https://github.com/pypa/pip/issues/10647&gt;</code>_)</p>
</li>
<li>
<p>Make it possible to deselect tests requiring cryptography package on systems where it cannot be installed. (<code>[#10686](pypa/pip#10686) &lt;https://github.com/pypa/pip/issues/10686&gt;</code>_)</p>
</li>
<li>
<p>Start using Rich for presenting error messages in a consistent format. (<code>[#10703](pypa/pip#10703) &lt;https://github.com/pypa/pip/issues/10703&gt;</code>_)</p>
</li>
<li>
<p>Improve presentation of errors from subprocesses. (<code>[#10705](pypa/pip#10705) &lt;https://github.com/pypa/pip/issues/10705&gt;</code>_)</p>
</li>
</ul>

</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/pypa/pip/commit/c721f03190ef888711e8e9efe6ca8345ec6464f3"><code>c721f03</code></a> Bump for release</li>
<li><a href="https://github.com/pypa/pip/commit/844b799c9cf68629cc3c45b75c6e3b7f41086d49"><code>844b799</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10847">#10847</a> from pradyunsg/better-html5lib-fallback</li>
<li><a href="https://github.com/pypa/pip/commit/a78845ab3387adfe590e2a0ae044a6d3b20ada55"><code>a78845a</code></a> Pacify functional tests that don't start with <code>\&lt;!doctype html&gt;</code></li>
<li><a href="https://github.com/pypa/pip/commit/c3a42f0679d06bfdb3475801618273be5bbce1e8"><code>c3a42f0</code></a> 📰</li>
<li><a href="https://github.com/pypa/pip/commit/c01b0b2729ee096c73416059de825e2f2f01bed9"><code>c01b0b2</code></a> Gracefully fallback to html5lib for parsing non-compliant index pages</li>
<li><a href="https://github.com/pypa/pip/commit/cc35c930b2d7096babb267ddce9382c30c58c7e1"><code>cc35c93</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10850">#10850</a> from pradyunsg/release/22.0.1</li>
<li><a href="https://github.com/pypa/pip/commit/1b6ef5d0b387d84b1909799d1e18d0b4431038c3"><code>1b6ef5d</code></a> Bump for development</li>
<li><a href="https://github.com/pypa/pip/commit/c73ac8d6bcf4f64041cafeccd2125cca052abed9"><code>c73ac8d</code></a> Bump for release</li>
<li><a href="https://github.com/pypa/pip/commit/9a9c1def6e3bba1f2a860874c8c73b5c55f7f43c"><code>9a9c1de</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10846">#10846</a> from pradyunsg/fix-html5lib-fallback</li>
<li><a href="https://github.com/pypa/pip/commit/80609e8c20a8db26c97037f252b29307ab44b0e2"><code>80609e8</code></a> Properly yield results from <code>html5lib</code> parsing</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/pip/compare/21.3.1...22.0.2">compare view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pip&package-manager=pip&previous-version=21.3.1&new-version=22.0.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)


</details>
@q0w q0w mentioned this pull request Feb 5, 2022
2 tasks
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C: finder PackageFinder and index related code type: deprecation Related to deprecation / removal. type: enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants