Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully fallback to html5lib for parsing non-compliant index pages #10847

Merged
merged 3 commits into from
Jan 30, 2022

Conversation

pradyunsg
Copy link
Member

@pradyunsg pradyunsg commented Jan 30, 2022

Builds upon #10846

Toward #10825, since it looks like using non-compliant HTML 5 documents is really common across the entire ecosystem outside of PyPI.

@pradyunsg
Copy link
Member Author

Huh. I don't understand the test failure, and am unable to reproduce this locally.

@pradyunsg pradyunsg force-pushed the better-html5lib-fallback branch from 48f5bea to 1b10b91 Compare January 30, 2022 18:26
@pradyunsg pradyunsg added the skip news Does not need a NEWS file entry (eg: trivial changes) label Jan 30, 2022
@notatallshaw
Copy link
Member

Huh. I don't understand the test failure, and am unable to reproduce this locally.

Tried to help but I also couldn't reproduce test failures locally (tried on Windows and Ubuntu)

@pradyunsg pradyunsg modified the milestones: 22.0.1, 22.0.2 Jan 30, 2022
@pradyunsg
Copy link
Member Author

I guess the computer gods don't want this to be changed. 🤷🏽

@pradyunsg
Copy link
Member Author

❯ pip index versions simple -i localhost:53495
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
DEPRECATION: The HTML index page being used (file:///Users/pradyunsg/Developer/pip/tests/data/indexes/yanked/simple/index.html) is not a proper HTML 5 document. This is in violation of PEP 503 which requires these pages to be well-formed HTML 5 documents. Please reach out to the owners of this index page, and ask them to update this index page to a valid HTML 5 document. pip 22.2 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/10825
simple (2.0)
Available versions: 2.0, 1.0

I'm very confused, because this definitely behaves correctly for me. I'm very confused what the CI is upto.

This reworks the HTML parsing logic, to gracefully use `html5lib` on
non-compliant HTML 5 documents. This warning softens the failure mode
for users who are using commercial package index solutions that do not
follow the requisite standards and serve malformed HTML documents.
@pradyunsg pradyunsg force-pushed the better-html5lib-fallback branch from 213dcce to c01b0b2 Compare January 30, 2022 21:21
@pradyunsg
Copy link
Member Author

The test failures were due to cache-related shenenigans: https://github.com/pypa/pip/runs/4998934472?check_suite_focus=true#step:5:229

@pradyunsg pradyunsg marked this pull request as ready for review January 30, 2022 21:24
@pradyunsg pradyunsg added C: finder PackageFinder and index related code type: deprecation Related to deprecation / removal. and removed skip news Does not need a NEWS file entry (eg: trivial changes) labels Jan 30, 2022
@pradyunsg pradyunsg requested a review from a team January 30, 2022 21:27
# Check if the page starts with a valid doctype, to decide whether to use
# http.parser or (deprecated) html5lib for parsing -- unless explicitly
# requested to use html5lib.
if not use_deprecated_html5lib:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess --use-deprecated=html5lib is also "suppress the warning" flag now, in addition to being a "oh no, the new parser doesn't work for me and I need something NOW" flag.

@pradyunsg
Copy link
Member Author

pradyunsg commented Jan 30, 2022

The relevant tests have passed, so I'm gonna say that this is gonna end up green. I'm not merging without an OK from at least one other member of @pypa/pip-committers.

If folks are fine with this, this is an easy 22.0.2 release. :)

Copy link
Member

@sbidoul sbidoul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Out of curiosity why are we so strict about the doctype declaration specifically ?
It looks like html.parser is going to accept many other flavors of invalid html5 anyway.

@pradyunsg
Copy link
Member Author

Based on the discussion in #10291 (comment), it is intended to be stricter in general in what index HTML can be; to push indexes to use standards-compliant documents which make it easier for other PEP 503 clients to support them.

@sbidoul
Copy link
Member

sbidoul commented Jan 30, 2022

I see. Maybe a warning would be enough to achieve the goal of pushing the ecosystem towards compliant indexes ? What itches me is that the page could be a complete tag soup under a valid doctype declaration, and pip will accept it, since html.parser seems to be very lenient.

@pradyunsg
Copy link
Member Author

Maybe a warning would be enough to achieve the goal of pushing the ecosystem towards compliant indexes ?

Yup, which is what this PR brings us to. :)

@pradyunsg pradyunsg merged commit 844b799 into pypa:main Jan 30, 2022
@pradyunsg pradyunsg deleted the better-html5lib-fallback branch January 30, 2022 22:27
@ps-jay
Copy link

ps-jay commented Jan 30, 2022

Amazing work @pradyunsg, thanks

mergify bot pushed a commit to andrewbolster/bolster that referenced this pull request Jan 31, 2022
Bumps [pip](https://github.com/pypa/pip) from 21.3.1 to 22.0.2.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/pip/blob/main/NEWS.rst">pip's changelog</a>.</em></p>
<blockquote>
<h1>22.0.2 (2022-01-30)</h1>
<h2>Deprecations and Removals</h2>
<ul>
<li>Instead of failing on index pages that use non-compliant HTML 5, print a deprecation warning and fall back to <code>html5lib</code>-based parsing for now. This simplifies the migration for non-compliant index pages, by letting such indexes function with a warning. (<code>[#10847](pypa/pip#10847) &lt;https://github.com/pypa/pip/issues/10847&gt;</code>_)</li>
</ul>
<h1>22.0.1 (2022-01-30)</h1>
<h2>Bug Fixes</h2>
<ul>
<li>Accept lowercase <code>&lt;!doctype html&gt;</code> on index pages. (<code>[#10844](pypa/pip#10844) &lt;https://github.com/pypa/pip/issues/10844&gt;</code>_)</li>
<li>Properly handle links parsed by html5lib, when using <code>--use-deprecated=html5lib</code>. (<code>[#10846](pypa/pip#10846) &lt;https://github.com/pypa/pip/issues/10846&gt;</code>_)</li>
</ul>
<h1>22.0 (2022-01-29)</h1>
<h2>Process</h2>
<ul>
<li>Completely replace :pypi:<code>tox</code> in our development workflow, with :pypi:<code>nox</code>.</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Deprecate alternative progress bar styles, leaving only <code>on</code> and <code>off</code> as available choices. (<code>[#10462](pypa/pip#10462) &lt;https://github.com/pypa/pip/issues/10462&gt;</code>_)</p>
</li>
<li>
<p>Drop support for Python 3.6. (<code>[#10641](pypa/pip#10641) &lt;https://github.com/pypa/pip/issues/10641&gt;</code>_)</p>
</li>
<li>
<p>Disable location mismatch warnings on Python versions prior to 3.10.</p>
<p>These warnings were helping identify potential issues as part of the sysconfig -&gt; distutils transition, and we no longer need to rely on reports from older Python versions for information on the transition. (<code>[#10840](pypa/pip#10840) &lt;https://github.com/pypa/pip/issues/10840&gt;</code>_)</p>
</li>
</ul>
<h2>Features</h2>
<ul>
<li>
<p>Changed <code>PackageFinder</code> to parse HTML documents using the stdlib :class:<code>html.parser.HTMLParser</code> class instead of the <code>html5lib</code> package.</p>
<p>For now, the deprecated <code>html5lib</code> code remains and can be used with the <code>--use-deprecated=html5lib</code> command line option. However, it will be removed in a future pip release. (<code>[#10291](pypa/pip#10291) &lt;https://github.com/pypa/pip/issues/10291&gt;</code>_)</p>
</li>
<li>
<p>Utilise <code>rich</code> for presenting pip's default download progress bar. (<code>[#10462](pypa/pip#10462) &lt;https://github.com/pypa/pip/issues/10462&gt;</code>_)</p>
</li>
<li>
<p>Present a better error message when an invalid wheel file is encountered, providing more context where the invalid wheel file is. (<code>[#10535](pypa/pip#10535) &lt;https://github.com/pypa/pip/issues/10535&gt;</code>_)</p>
</li>
<li>
<p>Documents the <code>--require-virtualenv</code> flag for <code>pip install</code>. (<code>[#10588](pypa/pip#10588) &lt;https://github.com/pypa/pip/issues/10588&gt;</code>_)</p>
</li>
<li>
<p><code>pip install &lt;tab&gt;</code> autocompletes paths. (<code>[#10646](pypa/pip#10646) &lt;https://github.com/pypa/pip/issues/10646&gt;</code>_)</p>
</li>
<li>
<p>Allow Python distributors to opt-out from or opt-in to the <code>sysconfig</code> installation scheme backend by setting <code>sysconfig._PIP_USE_SYSCONFIG</code> to <code>True</code> or <code>False</code>. (<code>[#10647](pypa/pip#10647) &lt;https://github.com/pypa/pip/issues/10647&gt;</code>_)</p>
</li>
<li>
<p>Make it possible to deselect tests requiring cryptography package on systems where it cannot be installed. (<code>[#10686](pypa/pip#10686) &lt;https://github.com/pypa/pip/issues/10686&gt;</code>_)</p>
</li>
<li>
<p>Start using Rich for presenting error messages in a consistent format. (<code>[#10703](pypa/pip#10703) &lt;https://github.com/pypa/pip/issues/10703&gt;</code>_)</p>
</li>
<li>
<p>Improve presentation of errors from subprocesses. (<code>[#10705](pypa/pip#10705) &lt;https://github.com/pypa/pip/issues/10705&gt;</code>_)</p>
</li>
</ul>

</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/pypa/pip/commit/c721f03190ef888711e8e9efe6ca8345ec6464f3"><code>c721f03</code></a> Bump for release</li>
<li><a href="https://github.com/pypa/pip/commit/844b799c9cf68629cc3c45b75c6e3b7f41086d49"><code>844b799</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10847">#10847</a> from pradyunsg/better-html5lib-fallback</li>
<li><a href="https://github.com/pypa/pip/commit/a78845ab3387adfe590e2a0ae044a6d3b20ada55"><code>a78845a</code></a> Pacify functional tests that don't start with <code>\&lt;!doctype html&gt;</code></li>
<li><a href="https://github.com/pypa/pip/commit/c3a42f0679d06bfdb3475801618273be5bbce1e8"><code>c3a42f0</code></a> 📰</li>
<li><a href="https://github.com/pypa/pip/commit/c01b0b2729ee096c73416059de825e2f2f01bed9"><code>c01b0b2</code></a> Gracefully fallback to html5lib for parsing non-compliant index pages</li>
<li><a href="https://github.com/pypa/pip/commit/cc35c930b2d7096babb267ddce9382c30c58c7e1"><code>cc35c93</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10850">#10850</a> from pradyunsg/release/22.0.1</li>
<li><a href="https://github.com/pypa/pip/commit/1b6ef5d0b387d84b1909799d1e18d0b4431038c3"><code>1b6ef5d</code></a> Bump for development</li>
<li><a href="https://github.com/pypa/pip/commit/c73ac8d6bcf4f64041cafeccd2125cca052abed9"><code>c73ac8d</code></a> Bump for release</li>
<li><a href="https://github.com/pypa/pip/commit/9a9c1def6e3bba1f2a860874c8c73b5c55f7f43c"><code>9a9c1de</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10846">#10846</a> from pradyunsg/fix-html5lib-fallback</li>
<li><a href="https://github.com/pypa/pip/commit/80609e8c20a8db26c97037f252b29307ab44b0e2"><code>80609e8</code></a> Properly yield results from <code>html5lib</code> parsing</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/pip/compare/21.3.1...22.0.2">compare view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pip&package-manager=pip&previous-version=21.3.1&new-version=22.0.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)


</details>
inmantaci pushed a commit to inmanta/inmanta-core that referenced this pull request Feb 14, 2022
Bumps [pip](https://github.com/pypa/pip) from 21.3.1 to 22.0.3.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/pip/blob/main/NEWS.rst">pip's changelog</a>.</em></p>
<blockquote>
<h1>22.0.3 (2022-02-03)</h1>
<h2>Features</h2>
<ul>
<li>Print the exception via <code>rich.traceback</code>, when running with <code>--debug</code>. (<code>[#10791](pypa/pip#10791) &lt;https://github.com/pypa/pip/issues/10791&gt;</code>_)</li>
</ul>
<h2>Bug Fixes</h2>
<ul>
<li>
<p>Only calculate topological installation order, for packages that are going to be installed/upgraded.</p>
<p>This fixes an <code>AssertionError</code> that occured when determining installation order, for a very specific combination of upgrading-already-installed-package + change of dependencies + fetching some packages from a package index. This combination was especially common in Read the Docs' builds. (<code>[#10851](pypa/pip#10851) &lt;https://github.com/pypa/pip/issues/10851&gt;</code>_)</p>
</li>
<li>
<p>Use <code>html.parser</code> by default, instead of falling back to <code>html5lib</code> when <code>--use-deprecated=html5lib</code> is not passed. (<code>[#10869](pypa/pip#10869) &lt;https://github.com/pypa/pip/issues/10869&gt;</code>_)</p>
</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Clarify that using per-requirement overrides disables the usage of wheels. (<code>[#9674](pypa/pip#9674) &lt;https://github.com/pypa/pip/issues/9674&gt;</code>_)</li>
</ul>
<h1>22.0.2 (2022-01-30)</h1>
<h2>Deprecations and Removals</h2>
<ul>
<li>Instead of failing on index pages that use non-compliant HTML 5, print a deprecation warning and fall back to <code>html5lib</code>-based parsing for now. This simplifies the migration for non-compliant index pages, by letting such indexes function with a warning. (<code>[#10847](pypa/pip#10847) &lt;https://github.com/pypa/pip/issues/10847&gt;</code>_)</li>
</ul>
<h1>22.0.1 (2022-01-30)</h1>
<h2>Bug Fixes</h2>
<ul>
<li>Accept lowercase <code>&lt;!doctype html&gt;</code> on index pages. (<code>[#10844](pypa/pip#10844) &lt;https://github.com/pypa/pip/issues/10844&gt;</code>_)</li>
<li>Properly handle links parsed by html5lib, when using <code>--use-deprecated=html5lib</code>. (<code>[#10846](pypa/pip#10846) &lt;https://github.com/pypa/pip/issues/10846&gt;</code>_)</li>
</ul>
<h1>22.0 (2022-01-29)</h1>
<h2>Process</h2>
<ul>
<li>Completely replace :pypi:<code>tox</code> in our development workflow, with :pypi:<code>nox</code>.</li>
</ul>
<p>Deprecations and Removals</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/pypa/pip/commit/44018de50cafba25445a225c1a1986d6312e1ef3"><code>44018de</code></a> Bump for release</li>
<li><a href="https://github.com/pypa/pip/commit/65f096c432d60d5f0214793becd592e1c1c3b624"><code>65f096c</code></a> Update AUTHORS.txt</li>
<li><a href="https://github.com/pypa/pip/commit/7d50964bcb1b25f9fe2c49fe447ab58aad2b4247"><code>7d50964</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10876">#10876</a> from mbacchi/vcs_support_typo</li>
<li><a href="https://github.com/pypa/pip/commit/ff8dbb458a59905c5462d339a63536257aad497a"><code>ff8dbb4</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10867">#10867</a> from mauritsvanrees/maurits-topoligical-weights-req...</li>
<li><a href="https://github.com/pypa/pip/commit/b3f5cad73241e25a25ce7d50eb9175dbafcfd8db"><code>b3f5cad</code></a> Update news/10851.bugfix.rst</li>
<li><a href="https://github.com/pypa/pip/commit/cf4655f474cb8a04fa6b274ee0edaf774546a79b"><code>cf4655f</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/pypa/pip/issues/10869">#10869</a> from pradyunsg/put-html5lib-behind-flag</li>
<li><a href="https://github.com/pypa/pip/commit/3608b42ef0ab39a2d50335356644f8f3464f651a"><code>3608b42</code></a> Fix minor typo in vcs support doc</li>
<li><a href="https://github.com/pypa/pip/commit/6c92a33b6e22f099edac8f4df594ffe6a18eb6e2"><code>6c92a33</code></a> Place the link as &quot;context&quot; instead of &quot;Link:&quot;</li>
<li><a href="https://github.com/pypa/pip/commit/7a3b0f1ae1cc59ae6566694e47887728a7976ab9"><code>7a3b0f1</code></a> 📰</li>
<li><a href="https://github.com/pypa/pip/commit/d7fed8fe9382c4f4442d7aa6216f41c8ed6f1ea3"><code>d7fed8f</code></a> Use rich.traceback with debug mode (<a href="https://github-redirect.dependabot.com/pypa/pip/issues/10832">#10832</a>)</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/pip/compare/21.3.1...22.0.3">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pip&package-manager=pip&previous-version=21.3.1&new-version=22.0.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

</details>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C: finder PackageFinder and index related code type: deprecation Related to deprecation / removal.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants