Fix #5874: Filter out candidates with hash mismatches. #6464

alexbecker · 2019-05-05T23:31:57Z

Implements @di's proposed solution to #5874.

Changes the behavior of pip install in hash-checking mode to filter out any candidates whose hashes (obtained via URL) do not match the hashes provided. This prevents a HashMismatch error when a more preferred binary distribution is upload for a release after a user pins the hashes for that release. Instead a warning is logged when a distribution is skipped due to hash mismatches:

Collecting tox==3.9.0 (from -r requirements.txt (line 1))               
  WARNING: candidate tox-3.9.0-py2.py3-none-any.whl ignored: hash sha256:1b166b93d2ce66bb7b253ba944d2be89e0c9d432d49eeb9da2988b4902a4684e not among provided hashes
  WARNING: candidate tox-3.9.0.tar.gz ignored: hash sha256:665cbdd99f5c196dd80d1d8db8c8cf5d48b1ae1f778bccd1bdf14d5aaf4ca0fc not among provided hashes

Note that a second hash comparison is performed when the candidate is downloaded. This is important because the first check is not secure: it trusts that the hash in the URL is the same as the hash in the content, and it also does not error when the user is in hash-checking mode but has not provided a hash.

This still needs tests, but I would like to check that this is the right approach before adding them.

Changes the behavior of `pip install` in hash-checking mode to filter out any candidates whose hashes (obtained via URL) do not match the hashes provided. This prevents a HashMismatch error when a more preferred binary distribution is upload for a release after a user pins the hashes for that release. Note that a second hash comparison is performed when the candidate is downloaded. This is important because the first check is not secure: it trusts that the hash in the URL is the same as the hash in the content, and it also does not error when the user is in hash-checking mode but has not provided a hash.

uranusjr · 2019-05-08T22:54:31Z

src/pip/_internal/index.py

@@ -764,7 +787,9 @@ def find_requirement(self, req, upgrade):
        Raises DistributionNotFound or BestVersionAlreadyInstalled otherwise
        """
        candidates = self.find_candidates(req.name, req.specifier)
-        best_candidate = candidates.get_best()
+        # Get any hashes supplied by the user to filter candidates.
+        hashes = req.hashes(trust_internet=False)


Conceptually, there should be a difference between supplying --require-hashes (i.e. hash mode with an empty hash list) and not (not hash mode). candidates.get_best() should only receieve the hash comparer in hash mode, and None otherwise (and get_best probably needs to check for None instead of a falsy hash comparer).

I think get_best should not filter on falsy hashes. If I supply --require-hashes without any hashes and filter on hashes, this filters out all candidates and results in the following error:

ERROR: Could not find a version that satisfies the requirement <package_name> (from versions: ...)

If I don't filter on falsy hashes then using --require-hashes without any hashes results in:

ERROR: Hashes are required in --require-hashes mode, but they are missing from some requirements. Here is a list of those requirements along with the hashes their downloaded archives actually had. Add lines like these to your requirements files to prevent tampering. (If you did not enable --require-hashes manually, note that it turns on automatically when any package has a hash.) tox==3.0 --hash=sha256:9ee7de958a43806402a38c0d2aa07fa8553f4d2c20a15b140e9f771c2afeade0

I think the latter is much more helpful, and clearly some work went into it via the MissingHashes class.

Although perhaps this is a better place to implement MissingHashes logic?

Tried replacing MissingHashes with a raise in get_best. It works, except that it makes the error about missing hashes take precedence over the error about not having versions pinned. Currently not having versions pinned in hash mode takes precedence. I am not sure whether this order of precedence is intentional.

Maybe it’s a good idea to use MissingHashes here (but I don’t think you can just move it here?)

I think I would prefer to not filter, with a comment about how if hashes is None, an error will be raised in unpack_url which explains what hash to pin to. If desired, in a future PR I could refactor away MissingHashes and raise the same error in get_best instead (which is what I meant by "perhaps this is a better place to implement MissingHashes logic"). But I'd rather not mix refactoring with feature work, and it would require some care to get the order of precedence for error messages right (there's a lot of error handling logic right now in RequirementPreparer.prepare_linked_requirement which would need to get moved around).

Do you have a plan how the refactoring would be done? If so, maybe it would be better to refactor first, and implement the feature on top of that.

It turns out that such a refactoring has a downside the current implementation of MissingHashes will provide the correct hash to the user even if there is no remote hash, whereas if I were to move the raise into get_best it would happen before the candidate is downloaded and so I could only provide the hash to the user if it was provided by the index. So I think it is not worth doing.

Alright then, let’s work on what we have and improve (if possible) in the future.

uranusjr · 2019-05-08T22:56:14Z

src/pip/_internal/index.py

+                    )
+                return is_match
+
+            candidates = [c for c in candidates if test_against_hashes(c)]


This combined with the list() call above unnecessarily loops through the candidate list twice.

I think instead of filtering, I want to only test the best value (in the common case where the best value's hash is correct). This would be both more performant and avoid spammy warnings.

alexbecker · 2019-05-09T14:47:06Z

I updated the PR to:

Look through the candidates in the reverse order of preference, and only warn when a candidate that would otherwise be preferred is skipped due to missing hashes.
Get rid of the local test_againt_hashes function, which I don't think adds anything now that I don't need it for a list comprehension.
Comment on why I don't filter out all candidates if we are in hash-checking mode but no hashes were provided.

cjerdonek · 2019-05-17T06:35:13Z

Some high-level comments after looking at this quickly:

It doesn't seem like a warning should be logged every time there is a hash mismatch -- only if the file isn't being used that would otherwise be preferred if hashes weren't being checked.
I think the hash-checking logic would be better located in CandidateEvaluator rather than FoundCandidates. (FoundCandidates should be viewed more as a container return value rather than implementing actual logic.)
The logic regarding prereleases also needs to be thought through, to make sure no changes are needed there.
It seems like some unnecessary code-branching / duplication is getting added in the PR (e.g. CandidateEvaluator.get_best_candidates() getting introduced alongside CandidateEvaluator.get_best_candidate(), and if-branching inside FoundCandidates.get_best()), which leads to more cases that need to be tested. It would be better if the same code path is used, when possible.

alexbecker · 2019-05-18T14:25:48Z

@cjerdonek Thanks for looking! Point-by-point responses:

Since this PR traverses the candidates in reverse order of preference, it only emits a warning when a more preferred candidate is skipped due to hash mismatches.
I can move the check in to get_best_candidate, eliminating the need for a new get_best_candidates function. The only downside is that's another level I have to pass hashes through.
I don't see how pre-releases are any different here, but I'm not very familiar with pip's special-case handling of pre-releases. Can you expand on your concerns?
I think moving the check into CandidateEvaluator mostly fixes this concern. I introduced the duplication so that we could continue to use max (which is O(n)) instead of sorted (which is O(n log n)) to find the best candidate when not in hash-checking mode. But the performance impact should be so minimal that it's not worth having two code paths.

uranusjr · 2019-05-20T06:57:55Z

Regarding the complexity of sorted and max, the practical difference is small enough to ignore IMO. But I also think it’s better to put the logic inside CandidateEvaluator instead, and this would avoid sorted altogether. One way to do that would be to add hashes=None to get_best_candidate(), and filter the list before passing it to max.

cjerdonek · 2019-05-20T14:51:15Z

Since this PR traverses the candidates in reverse order of preference, it only emits a warning when a more preferred candidate is skipped due to hash mismatches.

A couple things related to this:

First, I'm not sure this should necessarily be a warning, but perhaps just an info() message. Also, I think it might be better for such a message to occur only once (for the first / most preferred candidate). After that, new information isn't really being added. So subsequent ones can probably occur as debug messages.

Second, I would also take a look at @dstufft's original comment here to make sure you're following what he suggested:

I think it's a good idea to do the solution described here, to limit our installation to only things we have hashes for. Ideally I think we'd print a warning of some kind if we're not installing the preferred file due to it not having a hash listed though.

He only mentioned a warning if the hash isn't listed -- not if it's different. (If it's not listed, I guess the issue is skipping over a possible correct match?)

I don't see how pre-releases are any different here, but I'm not very familiar with pip's special-case handling of pre-releases. Can you expand on your concerns?

It wasn't so much a concern as a suggestion for you to look at that code, since you're working on a PR. For example, the hash-checking docs here say, "Requirements that take the form of project names (rather than URLs or local filesystem paths) must be pinned to a specific version using ==." So I'm wondering if it's possible for someone to allow pre-releases if checking hashes or if that combination should be disallowed (or maybe it's already disallowed somewhere). It looks like the filter() methods in packaging/specifiers.py might have logic to yield pre-releases if no version matches. I'm also wondering if hash-checking means that the set of version strings from installation candidates will always have length one, or if that should be enforced (or if it's already enforced somewhere).

Re: sorting, my suggestion was more about keeping the logic together in CandidateEvaluator. If there is an early return / fast path that uses max, that could certainly be okay. We just have to see what the code looks like first.

alexbecker · 2019-05-21T15:19:14Z

@uranusjr thanks, I'll start moving the code.

@cjerdonek responses inline:

First, I'm not sure this should necessarily be a warning, but perhaps just an info() message.

I think it should be a warning if it's something we want the user to act on, and an info message if not. So to me, the question boils down to: if a more preferred distribution is published after you pin hashes, should you switch to it? I think the answer is yes, because even with this fix you can't assume that every tool you use will skip the more preferred release (e.g. older versions of pip, or tools that vendor pip code).

Also, I think it might be better for such a message to occur only once (for the first / most preferred candidate). After that, new information isn't really being added. So subsequent ones can probably occur as debug messages.

I could do that, but I'm not sure it's worth the (admittedly small) additional complexity. This warning is generally triggered when a platform-specific wheel gets published for an old release, and I haven't run across a case where more than one such wheel would be preferred over all existing distributions.

Actually, it might be that the most likely cause of multiple skipping multiple candidates is a MitM or compromise of pypi.org, in which case a warning seems desirable.

He only mentioned a warning if the hash isn't listed -- not if it's different.

I don't think there's any way to distinguish between a hash not being listed and being different, since pinned hashes do not specify what distribution they are for. Note that this PR changes the error a user will observe if pypi.org is MitM-ed/compromised, from a "Hash mismatch" error to a series of warnings about skipping distributions due to hash mismatch, followed by a "No matching candidates" error. (If just pythonhosted.org is MitM-ed/compromised, users will still see a "Hash mismatch" error.)

(If it's not listed, I guess the issue is skipping over a possible correct match?)

Are you asking what the undesirable behavior in #5874 is? (It's that, if the hash is not listed, pip will download the distribution anyway and then raise a "Hash mismatch" error.)

So I'm wondering if it's possible for someone to allow pre-releases if checking hashes or if that combination should be disallowed (or maybe it's already disallowed somewhere).

I believe it's possible but might require an additional flag. At any rate it isn't affected by this PR.

I'm also wondering if hash-checking means that the set of version strings from installation candidates will always have length one, or if that should be enforced (or if it's already enforced somewhere).

It will. This is enforced in operations/prepare.py, L298-316.

uranusjr · 2019-05-21T19:20:23Z

I'm also wondering if hash-checking means that the set of version strings from installation candidates will always have length one, or if that should be enforced (or if it's already enforced somewhere).

It will. This is enforced in operations/prepare.py, L298-316.

Requirements in hash-checking mode must be pinned with == (or ===), othereise you get this:

ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
These do not:
...

It is still possible to have matches of length zero. But that would just cause an error in a later stage.

cjerdonek · 2019-05-21T20:49:34Z

(If it's not listed, I guess the issue is skipping over a possible correct match?)

Are you asking what the undesirable behavior in #5874 is? (It's that, if the hash is not listed, pip will download the distribution anyway and then raise a "Hash mismatch" error.)

To clarify, I meant if the hash isn't listed in the URL. In your PR, you return those as matches immediately, without examining later candidates:

for c in candidates:
    link = c.location
    if not link.hash:
        # We can only filter candidates with hashes in their URLs.
        return c
    ...

Wouldn't you want to treat this as a non-match (at least initially) and continue looking so you have the chance to find a matching hash that might appear later?

alexbecker · 2019-05-21T21:54:27Z

@cjerdonek Ah, so you are suggesting that, if some but not all of the candidates have hashes in their URLs, we should prefer candidates with hashes in their URLs that match the pinned hashes?

This is possible, but I think it's undesirable for a few reasons:

Candidates without hashes in the URL are still probably matches for our pinned URLs, the index is just not advertising that fact.
Thanks to 1, there could be cases where pip was installing packages in hash-checking mode just fine before, but then after this change it starts installing a less-preferred distribution (presumably an sdist), which may fail.
I believe it's very rare for some but not all candidates to have hashes in their URLs.

cjerdonek · 2019-05-22T06:42:18Z

Ah, so you are suggesting that, if some but not all of the candidates have hashes in their URLs, we should prefer candidates with hashes in their URLs that match the pinned hashes?

Yeah, I was suggesting choosing the first candidate that has a matching hash in the URL (and going back to the first if none match). That's how I interpreted the original proposal. This is also how I interpreted the suggestion I quoted above (i.e. if the first file doesn't have a hash in the URL):

Ideally I think we'd print a warning of some kind if we're not installing the preferred file due to it not having a hash listed though.

Otherwise, if a matching hash occurs later in the list, but you choose the first that doesn't have a hash in the URL, what happens if the latter winds up not matching -- would it fail?

alexbecker · 2019-05-22T07:00:54Z

Otherwise, if a matching hash occurs later in the list, but you choose the first that doesn't have a hash in the URL, what happens if the latter winds up not matching -- would it fail?

It would fail (specifically it would raise a HashMismatch error). This is the same as the current behavior. I agree that in this case choosing a different candidate with a matching hash is a better experience, but for the 3 reasons I outlined above I don't think the trade-off is worth it (I really want to avoid breaking anyone's builds with this PR that are currently working).

cjerdonek · 2019-05-22T07:18:37Z

It would fail (specifically it would raise a HashMismatch error). This is the same as the current behavior.

Right, and this is also the thing that #5874 wants to fix! The logic seems straightforward to me. If you're worried about making an error, I and I'm sure others would be happy to look it over for correctness.

alexbecker · 2019-05-22T15:19:34Z

@cjerdonek I think you misunderstand my concern. Currently, if someone is using hash-checking mode and not getting errors, this PR will never change what distributions they install. I.e. this PR will never break a build that isn't already broken. However, if I prefer candidates with hashes over those without, someone who is using hash-checking mode with no errors currently may start installing a different distribution then they were previously, which could potentially fail to install.

It may be worth doing what you're suggesting, but it's a much more radical change than what I've implemented, and I think it requires a broader discussion before implementing.

uranusjr · 2019-05-22T17:16:12Z

I have a feeling you two are talking past each other 🤔 What Chris meant was specifically to the for loop + early return:

for c in candidates:
    link = c.location
    if not link.hash:
        # We can only filter candidates with hashes in their URLs.
        return c
    ...

This does not change the (current) behaviour of candidate selection, but it changes how that is done. I believe what Chris was implying to is that self._evaluator.get_best_candidate() should always be used to select the candidate to return. Instead of implementing a new code path to do the selecting, the if hashes: block should instead filter the candidates iterable (emitting warning messages when needed), and pass the filtered result to the evaluator for the final selection.

alexbecker · 2019-05-25T22:16:38Z

@uranusjr I've moved the logic into get_best_candidate and removed get_best_candidates. I don't think this is quite in line with your most recent comment. The problem with filtering the candidates iterable before passing to the evaluator is that (at least to my understanding) we only want to warn when a preferred candidate is skipped, not when any candidate is skipped. So the check against hashes needs to happen either during or post evaluation.

@cjerdonek I think I've been misunderstanding the change you want, partly because I was trying to understand them relative to the changes I had already planned but not committed. Can you suggest the change you're thinking of against this updated code?

cjerdonek · 2019-05-25T22:42:17Z

I have a feeling you two are talking past each other

@uranusjr I'm pretty sure @alexbecker and I are discussing a difference in how we think things should behave rather than just an implementation detail.

@alexbecker I think you understood what we were discussing correctly, but @uranusjr's message might have caused you to second guess your interpretation. Regarding your next-to-last message, here's a question to consider about the two scenarios, for comparison:

In your hypothetical, with my approach of choosing a later file whose hash matches exactly over a first file with no hash listed, if that choice were to fail for some reason, what would be the corrective action the user should take?

Similarly, with your approach of choosing a first file with no hash listed over a later file whose hash matches exactly, if that choice were to fail for some reason, what would be the corrective action the user should take?

I'm happy to ask / get confirmation on the other thread about how things should behave.

alexbecker · 2019-05-25T23:11:35Z

@cjerdonek That's a good question. I'll go through what I would do in each case (ignoring questions about auditing packages in response to hash changes), which is the closest I can get to the correct course of action:

In your approach, if a less-preferred file is chosen because it has a provided hash that matches exactly, but that install fails, the user would see:

A warning from the hash-filtering code that a preferred distribution had been skipped due to missing hashes.
A bunch of gcc/other tool errors relating to the distribution failing to build.

In this case, the user should pip download the failed package directly, run pip hash on the downloaded file, then replace the hash in their requirements.txt file (or wherever else they invoke pip from). However, I do not think most users would know to do this, since I think 1) most Python users don't really understand the difference between different kinds of distributions, and 2) there would be potentially pages of errors from the installation that seem more directly relevant.

In my approach, if the file with no hashes provided in the URL does not actually match the hashes provided to pip, the user would see an error like:

THESE PACKAGES DO NOT MATCH THE HASHES
    tracefront==0.1:
        Expected sha256 123451234512345123451234512345123451234512345
        Got sha256 bcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdef

In this case the user should replace the hash in requirements.txt with bcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdef.

cjerdonek · 2019-05-25T23:36:38Z

In this case, the user should pip download the failed package directly, run pip hash on the downloaded file, then replace the hash in their requirements.txt file (or wherever else they invoke pip from).

Couldn't the user also just delete the hash that resulted in the failure? In the hypothetical scenario you described, you were worried about a situation where things were working for a user when it selected a first file with no hash listed, but then things not working if pip were to change to choosing a later file that does list a hash and that matches.

For the second case I described, I was asking about a situation where the user doesn't want the first file but wants the later one they have the matching hash for. With the PR that you're proposing, what corrective action could the user take to get pip to select the later file that matches their hash exactly?

alexbecker · 2019-05-26T00:07:41Z

I guess they could delete the hash of the distribution they're having trouble installing from the list of hashes for that package in their requirements.txt (which is how I understand your suggestion). This seems more brittle to me than adding the correct hash, so I wouldn't recommend it, and I also don't think it would be obvious to a user encountering this problem.

In the second case, if the user doesn't want the file that pip prefers, the course of action is the same as any other time you want a non-preferred file. Most often this means you want an sdist, so you should supply the --no-binary <package_name> flag. If you want a binary distribution that is not the most preferred distribution... pip does not have a good story for that AFAIK (see e.g. #6523), but I don't think hash-checking mode should be how you define which binary distribution you want.

cjerdonek · 2019-05-26T00:21:57Z

This seems more brittle to me than adding the correct hash, so I wouldn't recommend it,

Wouldn't they already have to have it listed though? In the hypothetical scenario you described, you were worried about a case where things were originally working for a user in hash-checking mode when it selected a first file with no hash listed, but then not working if choosing a later file whose hash does match one of the hashes provided by the user. I'm just trying to understand better the case that you were worried about (which is the reason for my questions).

alexbecker · 2019-05-26T00:28:04Z

I'm worried about a case where pip selects a file with no hash in the URL on PyPI (i.e. link.hash is None), but where the actual hash of the file (computed after download) matches what they have provided in requirements.txt.

cjerdonek · 2019-05-26T00:37:45Z

I'm worried about a case where pip selects a file with no hash in the URL on PyPI (i.e. link.hash is None), but where the actual hash of the file (computed after download) matches what they have provided in requirements.txt.

Yes, I understand that. So won't that continue to work with the approach I described as long as the user doesn't have additional hashes in their requirements.txt that match less-preferred files whose hashes are listed in the URL's on PyPI?

alexbecker · 2019-05-26T02:13:17Z

Ah, you are right that they would have to have the correct hash (by which I mean the hash for the more preferred file, which up until this change they have been installing without issue) in their requirements.txt. And my suggestion of adding that hash makes no sense because that doesn't change the filtering behavior. The only way to change the filtering behavior would be to remove the hashes of every distribution that does have a hash in the URL on the index server. I guess I would recommend they pip download and pip hash the file they actually want, and remove all other hashes. If anything, I think this is a more confusing process for the user to go through IMO.

BrownTruck · 2019-06-27T12:30:06Z

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

cjerdonek · 2019-07-24T03:55:12Z

Closing as this was instead implemented by PR #6699. Thanks for your initial work on this, though, @alexbecker!

alexbecker added 4 commits May 5, 2019 16:27

handle candidates with no hash in their URL

bcc8de3

fix fake hash causing broken test

b704238

fix mypy import error

a972286

alexbecker mentioned this pull request May 8, 2019

Hash-checking mode breaks after new distributions are added #5874

Closed

uranusjr reviewed May 8, 2019

View reviewed changes

alexbecker added 2 commits May 9, 2019 07:21

only warn when best candidate is skipped due to hash check

2785d08

remove local function test_against_hashes, it does not add clarity

289e75e

cjerdonek added the type: enhancement Improvements to functionality label May 17, 2019

move filtering into get_best_candidate; remove get_best_candidates

a6ad9a6

BrownTruck added the needs rebase or merge PR has conflicts with current master label Jun 27, 2019

minrk mentioned this pull request Jun 28, 2019

18.0.1 files re-released on Jun 27 with new hashes zeromq/pyzmq#1304

Closed

cjerdonek mentioned this pull request Jul 7, 2019

Add PackageFinder.make_candidate_evaluator() #6687

Merged

uranusjr mentioned this pull request Jul 10, 2019

To be truly deterministic, a maintainer of a Pipfile should be able to specify whether to use binary or source pypa/pipenv#3782

Closed

cjerdonek mentioned this pull request Jul 11, 2019

Address #5874: Prefer candidates with allowed hashes #6699

Merged

cjerdonek closed this Jul 24, 2019

lock bot added the auto-locked Outdated issues that have been locked by automation label Aug 23, 2019

lock bot locked as resolved and limited conversation to collaborators Aug 23, 2019

Fix #5874: Filter out candidates with hash mismatches. #6464

Fix #5874: Filter out candidates with hash mismatches. #6464

Conversation

alexbecker commented May 5, 2019

Choose a reason for hiding this comment

alexbecker May 9, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexbecker commented May 9, 2019

cjerdonek commented May 17, 2019

alexbecker commented May 18, 2019

uranusjr commented May 20, 2019 • edited Loading

cjerdonek commented May 20, 2019

alexbecker commented May 21, 2019

uranusjr commented May 21, 2019 • edited Loading

cjerdonek commented May 21, 2019

alexbecker commented May 21, 2019

cjerdonek commented May 22, 2019

alexbecker commented May 22, 2019

cjerdonek commented May 22, 2019

alexbecker commented May 22, 2019

uranusjr commented May 22, 2019

alexbecker commented May 25, 2019 • edited Loading

cjerdonek commented May 25, 2019 • edited Loading

alexbecker commented May 25, 2019

cjerdonek commented May 25, 2019

alexbecker commented May 26, 2019

cjerdonek commented May 26, 2019

alexbecker commented May 26, 2019

cjerdonek commented May 26, 2019

alexbecker commented May 26, 2019

BrownTruck commented Jun 27, 2019

cjerdonek commented Jul 24, 2019

alexbecker May 9, 2019 •

edited

Loading

uranusjr commented May 20, 2019 •

edited

Loading

uranusjr commented May 21, 2019 •

edited

Loading

alexbecker commented May 25, 2019 •

edited

Loading

cjerdonek commented May 25, 2019 •

edited

Loading