Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #5874: Filter out candidates with hash mismatches. #6464

Closed
wants to merge 7 commits into from

Conversation

alexbecker
Copy link

Implements @di's proposed solution to #5874.

Changes the behavior of pip install in hash-checking mode to filter out any candidates whose hashes (obtained via URL) do not match the hashes provided. This prevents a HashMismatch error when a more preferred binary distribution is upload for a release after a user pins the hashes for that release. Instead a warning is logged when a distribution is skipped due to hash mismatches:

Collecting tox==3.9.0 (from -r requirements.txt (line 1))               
  WARNING: candidate tox-3.9.0-py2.py3-none-any.whl ignored: hash sha256:1b166b93d2ce66bb7b253ba944d2be89e0c9d432d49eeb9da2988b4902a4684e not among provided hashes
  WARNING: candidate tox-3.9.0.tar.gz ignored: hash sha256:665cbdd99f5c196dd80d1d8db8c8cf5d48b1ae1f778bccd1bdf14d5aaf4ca0fc not among provided hashes

Note that a second hash comparison is performed when the candidate is downloaded. This is important because the first check is not secure: it trusts that the hash in the URL is the same as the hash in the content, and it also does not error when the user is in hash-checking mode but has not provided a hash.

This still needs tests, but I would like to check that this is the right approach before adding them.

alexbecker added 4 commits May 5, 2019 16:27
Changes the behavior of `pip install` in hash-checking mode to filter
out any candidates whose hashes (obtained via URL) do not match the
hashes provided. This prevents a HashMismatch error when a more
preferred binary distribution is upload for a release after a user
pins the hashes for that release.

Note that a second hash comparison is performed when the candidate is
downloaded. This is important because the first check is not secure: it
trusts that the hash in the URL is the same as the hash in the content,
and it also does not error when the user is in hash-checking mode but
has not provided a hash.
@@ -764,7 +787,9 @@ def find_requirement(self, req, upgrade):
Raises DistributionNotFound or BestVersionAlreadyInstalled otherwise
"""
candidates = self.find_candidates(req.name, req.specifier)
best_candidate = candidates.get_best()
# Get any hashes supplied by the user to filter candidates.
hashes = req.hashes(trust_internet=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, there should be a difference between supplying --require-hashes (i.e. hash mode with an empty hash list) and not (not hash mode). candidates.get_best() should only receieve the hash comparer in hash mode, and None otherwise (and get_best probably needs to check for None instead of a falsy hash comparer).

Copy link
Author

@alexbecker alexbecker May 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think get_best should not filter on falsy hashes. If I supply --require-hashes without any hashes and filter on hashes, this filters out all candidates and results in the following error:

ERROR: Could not find a version that satisfies the requirement <package_name> (from versions: ...)

If I don't filter on falsy hashes then using --require-hashes without any hashes results in:

ERROR: Hashes are required in --require-hashes mode, but they are missing from some requirements. Here is a list of those requirements along with the hashes their downloaded archives actually had. Add lines like these to your requirements files to prevent tampering. (If you did not enable --require-hashes manually, note that it turns on automatically when any package has a hash.)
    tox==3.0 --hash=sha256:9ee7de958a43806402a38c0d2aa07fa8553f4d2c20a15b140e9f771c2afeade0

I think the latter is much more helpful, and clearly some work went into it via the MissingHashes class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although perhaps this is a better place to implement MissingHashes logic?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried replacing MissingHashes with a raise in get_best. It works, except that it makes the error about missing hashes take precedence over the error about not having versions pinned. Currently not having versions pinned in hash mode takes precedence. I am not sure whether this order of precedence is intentional.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it’s a good idea to use MissingHashes here (but I don’t think you can just move it here?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer to not filter, with a comment about how if hashes is None, an error will be raised in unpack_url which explains what hash to pin to. If desired, in a future PR I could refactor away MissingHashes and raise the same error in get_best instead (which is what I meant by "perhaps this is a better place to implement MissingHashes logic"). But I'd rather not mix refactoring with feature work, and it would require some care to get the order of precedence for error messages right (there's a lot of error handling logic right now in RequirementPreparer.prepare_linked_requirement which would need to get moved around).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a plan how the refactoring would be done? If so, maybe it would be better to refactor first, and implement the feature on top of that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that such a refactoring has a downside the current implementation of MissingHashes will provide the correct hash to the user even if there is no remote hash, whereas if I were to move the raise into get_best it would happen before the candidate is downloaded and so I could only provide the hash to the user if it was provided by the index. So I think it is not worth doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright then, let’s work on what we have and improve (if possible) in the future.

)
return is_match

candidates = [c for c in candidates if test_against_hashes(c)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combined with the list() call above unnecessarily loops through the candidate list twice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of filtering, I want to only test the best value (in the common case where the best value's hash is correct). This would be both more performant and avoid spammy warnings.

@alexbecker
Copy link
Author

I updated the PR to:

  • Look through the candidates in the reverse order of preference, and only warn when a candidate that would otherwise be preferred is skipped due to missing hashes.
  • Get rid of the local test_againt_hashes function, which I don't think adds anything now that I don't need it for a list comprehension.
  • Comment on why I don't filter out all candidates if we are in hash-checking mode but no hashes were provided.

@cjerdonek cjerdonek added the type: enhancement Improvements to functionality label May 17, 2019
@cjerdonek
Copy link
Member

Some high-level comments after looking at this quickly:

  1. It doesn't seem like a warning should be logged every time there is a hash mismatch -- only if the file isn't being used that would otherwise be preferred if hashes weren't being checked.
  2. I think the hash-checking logic would be better located in CandidateEvaluator rather than FoundCandidates. (FoundCandidates should be viewed more as a container return value rather than implementing actual logic.)
  3. The logic regarding prereleases also needs to be thought through, to make sure no changes are needed there.
  4. It seems like some unnecessary code-branching / duplication is getting added in the PR (e.g. CandidateEvaluator.get_best_candidates() getting introduced alongside CandidateEvaluator.get_best_candidate(), and if-branching inside FoundCandidates.get_best()), which leads to more cases that need to be tested. It would be better if the same code path is used, when possible.

@alexbecker
Copy link
Author

@cjerdonek Thanks for looking! Point-by-point responses:

  1. Since this PR traverses the candidates in reverse order of preference, it only emits a warning when a more preferred candidate is skipped due to hash mismatches.
  2. I can move the check in to get_best_candidate, eliminating the need for a new get_best_candidates function. The only downside is that's another level I have to pass hashes through.
  3. I don't see how pre-releases are any different here, but I'm not very familiar with pip's special-case handling of pre-releases. Can you expand on your concerns?
  4. I think moving the check into CandidateEvaluator mostly fixes this concern. I introduced the duplication so that we could continue to use max (which is O(n)) instead of sorted (which is O(n log n)) to find the best candidate when not in hash-checking mode. But the performance impact should be so minimal that it's not worth having two code paths.

@uranusjr
Copy link
Member

uranusjr commented May 20, 2019

Regarding the complexity of sorted and max, the practical difference is small enough to ignore IMO. But I also think it’s better to put the logic inside CandidateEvaluator instead, and this would avoid sorted altogether. One way to do that would be to add hashes=None to get_best_candidate(), and filter the list before passing it to max.

@cjerdonek
Copy link
Member

  1. Since this PR traverses the candidates in reverse order of preference, it only emits a warning when a more preferred candidate is skipped due to hash mismatches.

A couple things related to this:

First, I'm not sure this should necessarily be a warning, but perhaps just an info() message. Also, I think it might be better for such a message to occur only once (for the first / most preferred candidate). After that, new information isn't really being added. So subsequent ones can probably occur as debug messages.

Second, I would also take a look at @dstufft's original comment here to make sure you're following what he suggested:

I think it's a good idea to do the solution described here, to limit our installation to only things we have hashes for. Ideally I think we'd print a warning of some kind if we're not installing the preferred file due to it not having a hash listed though.

He only mentioned a warning if the hash isn't listed -- not if it's different. (If it's not listed, I guess the issue is skipping over a possible correct match?)

  1. I don't see how pre-releases are any different here, but I'm not very familiar with pip's special-case handling of pre-releases. Can you expand on your concerns?

It wasn't so much a concern as a suggestion for you to look at that code, since you're working on a PR. For example, the hash-checking docs here say, "Requirements that take the form of project names (rather than URLs or local filesystem paths) must be pinned to a specific version using ==." So I'm wondering if it's possible for someone to allow pre-releases if checking hashes or if that combination should be disallowed (or maybe it's already disallowed somewhere). It looks like the filter() methods in packaging/specifiers.py might have logic to yield pre-releases if no version matches. I'm also wondering if hash-checking means that the set of version strings from installation candidates will always have length one, or if that should be enforced (or if it's already enforced somewhere).

Re: sorting, my suggestion was more about keeping the logic together in CandidateEvaluator. If there is an early return / fast path that uses max, that could certainly be okay. We just have to see what the code looks like first.

@alexbecker
Copy link
Author

@uranusjr thanks, I'll start moving the code.

@cjerdonek responses inline:

First, I'm not sure this should necessarily be a warning, but perhaps just an info() message.

I think it should be a warning if it's something we want the user to act on, and an info message if not. So to me, the question boils down to: if a more preferred distribution is published after you pin hashes, should you switch to it? I think the answer is yes, because even with this fix you can't assume that every tool you use will skip the more preferred release (e.g. older versions of pip, or tools that vendor pip code).

Also, I think it might be better for such a message to occur only once (for the first / most preferred candidate). After that, new information isn't really being added. So subsequent ones can probably occur as debug messages.

I could do that, but I'm not sure it's worth the (admittedly small) additional complexity. This warning is generally triggered when a platform-specific wheel gets published for an old release, and I haven't run across a case where more than one such wheel would be preferred over all existing distributions.

Actually, it might be that the most likely cause of multiple skipping multiple candidates is a MitM or compromise of pypi.org, in which case a warning seems desirable.

He only mentioned a warning if the hash isn't listed -- not if it's different.

I don't think there's any way to distinguish between a hash not being listed and being different, since pinned hashes do not specify what distribution they are for. Note that this PR changes the error a user will observe if pypi.org is MitM-ed/compromised, from a "Hash mismatch" error to a series of warnings about skipping distributions due to hash mismatch, followed by a "No matching candidates" error. (If just pythonhosted.org is MitM-ed/compromised, users will still see a "Hash mismatch" error.)

(If it's not listed, I guess the issue is skipping over a possible correct match?)

Are you asking what the undesirable behavior in #5874 is? (It's that, if the hash is not listed, pip will download the distribution anyway and then raise a "Hash mismatch" error.)

So I'm wondering if it's possible for someone to allow pre-releases if checking hashes or if that combination should be disallowed (or maybe it's already disallowed somewhere).

I believe it's possible but might require an additional flag. At any rate it isn't affected by this PR.

I'm also wondering if hash-checking means that the set of version strings from installation candidates will always have length one, or if that should be enforced (or if it's already enforced somewhere).

It will. This is enforced in operations/prepare.py, L298-316.

@uranusjr
Copy link
Member

uranusjr commented May 21, 2019

I'm also wondering if hash-checking means that the set of version strings from installation candidates will always have length one, or if that should be enforced (or if it's already enforced somewhere).

It will. This is enforced in operations/prepare.py, L298-316.

Requirements in hash-checking mode must be pinned with == (or ===), othereise you get this:

ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
These do not:
...

It is still possible to have matches of length zero. But that would just cause an error in a later stage.

@cjerdonek
Copy link
Member

(If it's not listed, I guess the issue is skipping over a possible correct match?)

Are you asking what the undesirable behavior in #5874 is? (It's that, if the hash is not listed, pip will download the distribution anyway and then raise a "Hash mismatch" error.)

To clarify, I meant if the hash isn't listed in the URL. In your PR, you return those as matches immediately, without examining later candidates:

for c in candidates:
    link = c.location
    if not link.hash:
        # We can only filter candidates with hashes in their URLs.
        return c
    ...

Wouldn't you want to treat this as a non-match (at least initially) and continue looking so you have the chance to find a matching hash that might appear later?

@alexbecker
Copy link
Author

@cjerdonek Ah, so you are suggesting that, if some but not all of the candidates have hashes in their URLs, we should prefer candidates with hashes in their URLs that match the pinned hashes?

This is possible, but I think it's undesirable for a few reasons:

  1. Candidates without hashes in the URL are still probably matches for our pinned URLs, the index is just not advertising that fact.
  2. Thanks to 1, there could be cases where pip was installing packages in hash-checking mode just fine before, but then after this change it starts installing a less-preferred distribution (presumably an sdist), which may fail.
  3. I believe it's very rare for some but not all candidates to have hashes in their URLs.

@cjerdonek
Copy link
Member

Ah, so you are suggesting that, if some but not all of the candidates have hashes in their URLs, we should prefer candidates with hashes in their URLs that match the pinned hashes?

Yeah, I was suggesting choosing the first candidate that has a matching hash in the URL (and going back to the first if none match). That's how I interpreted the original proposal. This is also how I interpreted the suggestion I quoted above (i.e. if the first file doesn't have a hash in the URL):

Ideally I think we'd print a warning of some kind if we're not installing the preferred file due to it not having a hash listed though.

Otherwise, if a matching hash occurs later in the list, but you choose the first that doesn't have a hash in the URL, what happens if the latter winds up not matching -- would it fail?

@alexbecker
Copy link
Author

Otherwise, if a matching hash occurs later in the list, but you choose the first that doesn't have a hash in the URL, what happens if the latter winds up not matching -- would it fail?

It would fail (specifically it would raise a HashMismatch error). This is the same as the current behavior. I agree that in this case choosing a different candidate with a matching hash is a better experience, but for the 3 reasons I outlined above I don't think the trade-off is worth it (I really want to avoid breaking anyone's builds with this PR that are currently working).

@cjerdonek
Copy link
Member

It would fail (specifically it would raise a HashMismatch error). This is the same as the current behavior.

Right, and this is also the thing that #5874 wants to fix! The logic seems straightforward to me. If you're worried about making an error, I and I'm sure others would be happy to look it over for correctness.

@alexbecker
Copy link
Author

@cjerdonek I think you misunderstand my concern. Currently, if someone is using hash-checking mode and not getting errors, this PR will never change what distributions they install. I.e. this PR will never break a build that isn't already broken. However, if I prefer candidates with hashes over those without, someone who is using hash-checking mode with no errors currently may start installing a different distribution then they were previously, which could potentially fail to install.

It may be worth doing what you're suggesting, but it's a much more radical change than what I've implemented, and I think it requires a broader discussion before implementing.

@uranusjr
Copy link
Member

I have a feeling you two are talking past each other 🤔 What Chris meant was specifically to the for loop + early return:

for c in candidates:
    link = c.location
    if not link.hash:
        # We can only filter candidates with hashes in their URLs.
        return c
    ...

This does not change the (current) behaviour of candidate selection, but it changes how that is done. I believe what Chris was implying to is that self._evaluator.get_best_candidate() should always be used to select the candidate to return. Instead of implementing a new code path to do the selecting, the if hashes: block should instead filter the candidates iterable (emitting warning messages when needed), and pass the filtered result to the evaluator for the final selection.

@alexbecker
Copy link
Author

alexbecker commented May 25, 2019

@uranusjr I've moved the logic into get_best_candidate and removed get_best_candidates. I don't think this is quite in line with your most recent comment. The problem with filtering the candidates iterable before passing to the evaluator is that (at least to my understanding) we only want to warn when a preferred candidate is skipped, not when any candidate is skipped. So the check against hashes needs to happen either during or post evaluation.

@cjerdonek I think I've been misunderstanding the change you want, partly because I was trying to understand them relative to the changes I had already planned but not committed. Can you suggest the change you're thinking of against this updated code?

@cjerdonek
Copy link
Member

cjerdonek commented May 25, 2019

I have a feeling you two are talking past each other

@uranusjr I'm pretty sure @alexbecker and I are discussing a difference in how we think things should behave rather than just an implementation detail.

@alexbecker I think you understood what we were discussing correctly, but @uranusjr's message might have caused you to second guess your interpretation. Regarding your next-to-last message, here's a question to consider about the two scenarios, for comparison:

In your hypothetical, with my approach of choosing a later file whose hash matches exactly over a first file with no hash listed, if that choice were to fail for some reason, what would be the corrective action the user should take?

Similarly, with your approach of choosing a first file with no hash listed over a later file whose hash matches exactly, if that choice were to fail for some reason, what would be the corrective action the user should take?

I'm happy to ask / get confirmation on the other thread about how things should behave.

@alexbecker
Copy link
Author

@cjerdonek That's a good question. I'll go through what I would do in each case (ignoring questions about auditing packages in response to hash changes), which is the closest I can get to the correct course of action:

In your approach, if a less-preferred file is chosen because it has a provided hash that matches exactly, but that install fails, the user would see:

  1. A warning from the hash-filtering code that a preferred distribution had been skipped due to missing hashes.
  2. A bunch of gcc/other tool errors relating to the distribution failing to build.

In this case, the user should pip download the failed package directly, run pip hash on the downloaded file, then replace the hash in their requirements.txt file (or wherever else they invoke pip from). However, I do not think most users would know to do this, since I think 1) most Python users don't really understand the difference between different kinds of distributions, and 2) there would be potentially pages of errors from the installation that seem more directly relevant.

In my approach, if the file with no hashes provided in the URL does not actually match the hashes provided to pip, the user would see an error like:

THESE PACKAGES DO NOT MATCH THE HASHES
    tracefront==0.1:
        Expected sha256 123451234512345123451234512345123451234512345
        Got sha256 bcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdef

In this case the user should replace the hash in requirements.txt with bcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdefbcdef.

@cjerdonek
Copy link
Member

In this case, the user should pip download the failed package directly, run pip hash on the downloaded file, then replace the hash in their requirements.txt file (or wherever else they invoke pip from).

Couldn't the user also just delete the hash that resulted in the failure? In the hypothetical scenario you described, you were worried about a situation where things were working for a user when it selected a first file with no hash listed, but then things not working if pip were to change to choosing a later file that does list a hash and that matches.

For the second case I described, I was asking about a situation where the user doesn't want the first file but wants the later one they have the matching hash for. With the PR that you're proposing, what corrective action could the user take to get pip to select the later file that matches their hash exactly?

@alexbecker
Copy link
Author

I guess they could delete the hash of the distribution they're having trouble installing from the list of hashes for that package in their requirements.txt (which is how I understand your suggestion). This seems more brittle to me than adding the correct hash, so I wouldn't recommend it, and I also don't think it would be obvious to a user encountering this problem.

In the second case, if the user doesn't want the file that pip prefers, the course of action is the same as any other time you want a non-preferred file. Most often this means you want an sdist, so you should supply the --no-binary <package_name> flag. If you want a binary distribution that is not the most preferred distribution... pip does not have a good story for that AFAIK (see e.g. #6523), but I don't think hash-checking mode should be how you define which binary distribution you want.

@cjerdonek
Copy link
Member

This seems more brittle to me than adding the correct hash, so I wouldn't recommend it,

Wouldn't they already have to have it listed though? In the hypothetical scenario you described, you were worried about a case where things were originally working for a user in hash-checking mode when it selected a first file with no hash listed, but then not working if choosing a later file whose hash does match one of the hashes provided by the user. I'm just trying to understand better the case that you were worried about (which is the reason for my questions).

@alexbecker
Copy link
Author

I'm worried about a case where pip selects a file with no hash in the URL on PyPI (i.e. link.hash is None), but where the actual hash of the file (computed after download) matches what they have provided in requirements.txt.

@cjerdonek
Copy link
Member

I'm worried about a case where pip selects a file with no hash in the URL on PyPI (i.e. link.hash is None), but where the actual hash of the file (computed after download) matches what they have provided in requirements.txt.

Yes, I understand that. So won't that continue to work with the approach I described as long as the user doesn't have additional hashes in their requirements.txt that match less-preferred files whose hashes are listed in the URL's on PyPI?

@alexbecker
Copy link
Author

Ah, you are right that they would have to have the correct hash (by which I mean the hash for the more preferred file, which up until this change they have been installing without issue) in their requirements.txt. And my suggestion of adding that hash makes no sense because that doesn't change the filtering behavior. The only way to change the filtering behavior would be to remove the hashes of every distribution that does have a hash in the URL on the index server. I guess I would recommend they pip download and pip hash the file they actually want, and remove all other hashes. If anything, I think this is a more confusing process for the user to go through IMO.

@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@cjerdonek
Copy link
Member

Closing as this was instead implemented by PR #6699. Thanks for your initial work on this, though, @alexbecker!

@cjerdonek cjerdonek closed this Jul 24, 2019
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Aug 23, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Aug 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation needs rebase or merge PR has conflicts with current master type: enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants