Fix #5874: Filter out candidates with hash mismatches. #6464
Changes the behavior of `pip install` in hash-checking mode to filter out any candidates whose hashes (obtained via URL) do not match the hashes provided. This prevents a HashMismatch error when a more preferred binary distribution is uploaded for a release after a user pins the hashes for that release. Note that a second hash comparison is performed when the candidate is downloaded. This is important because the first check is not secure: it trusts that the hash in the URL is the same as the hash in the content, and it also does not error when the user is in hash-checking mode but has not provided a hash.
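A minimal sketch of the first, URL-based check described above, assuming the index advertises the hash as a `#<algorithm>=<hexdigest>` URL fragment. The helper names `hash_from_url` and `url_hash_matches` and the `example.invalid` URLs are hypothetical and are not pip's internal API.

```python
import hashlib
from typing import Dict, Optional, Set, Tuple
from urllib.parse import urlsplit


def hash_from_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (algorithm, hexdigest) advertised in the URL fragment, if any.

    Hypothetical helper: only handles a fragment of the exact form
    "<algorithm>=<hexdigest>" and ignores anything else (e.g. "egg=...").
    """
    fragment = urlsplit(url).fragment
    name, sep, digest = fragment.partition("=")
    if not sep or not digest or name not in hashlib.algorithms_guaranteed:
        return None
    return name, digest


def url_hash_matches(url: str, pinned: Dict[str, Set[str]]) -> Optional[bool]:
    """Compare the advertised hash against the user's pinned hashes.

    Returns None when the URL advertises no hash, so the caller can tell
    "unknown" apart from "mismatch". This check trusts the index; the
    downloaded content still has to be hashed separately.
    """
    found = hash_from_url(url)
    if found is None:
        return None
    name, digest = found
    return digest in pinned.get(name, set())
```

The three-valued result keeps the "no hash in the URL" case distinct, which matters for the selection questions raised later in the review.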
@@ -764,7 +787,9 @@ def find_requirement(self, req, upgrade):
Raises DistributionNotFound or BestVersionAlreadyInstalled otherwise
"""
candidates = self.find_candidates(req.name, req.specifier)
best_candidate = candidates.get_best()
# Get any hashes supplied by the user to filter candidates.
hashes = req.hashes(trust_internet=False)
Conceptually, there should be a difference between supplying `--require-hashes` (i.e. hash mode with an empty hash list) and not (not hash mode). `candidates.get_best()` should only receive the hash comparer in hash mode, and `None` otherwise (and `get_best` probably needs to check for `None` instead of a falsy hash comparer).
I think `get_best` should not filter on falsy `hashes`. If I supply `--require-hashes` without any hashes and filter on `hashes`, this filters out all candidates and results in the following error:
ERROR: Could not find a version that satisfies the requirement <package_name> (from versions: ...)
If I don't filter on falsy `hashes` then using `--require-hashes` without any hashes results in:
ERROR: Hashes are required in --require-hashes mode, but they are missing from some requirements. Here is a list of those requirements along with the hashes their downloaded archives actually had. Add lines like these to your requirements files to prevent tampering. (If you did not enable --require-hashes manually, note that it turns on automatically when any package has a hash.)
tox==3.0 --hash=sha256:9ee7de958a43806402a38c0d2aa07fa8553f4d2c20a15b140e9f771c2afeade0
I think the latter is much more helpful, and clearly some work went into it via the `MissingHashes` class.
Although perhaps this is a better place to implement `MissingHashes` logic?
Tried replacing `MissingHashes` with a `raise` in `get_best`. It works, except that it makes the error about missing hashes take precedence over the error about not having versions pinned. Currently not having versions pinned in hash mode takes precedence. I am not sure whether this order of precedence is intentional.
Maybe it’s a good idea to use `MissingHashes` here (but I don’t think you can just move it here?)
I think I would prefer to not filter, with a comment about how if `hashes` is `None`, an error will be raised in `unpack_url` which explains what hash to pin to. If desired, in a future PR I could refactor away `MissingHashes` and raise the same error in `get_best` instead (which is what I meant by "perhaps this is a better place to implement `MissingHashes` logic"). But I'd rather not mix refactoring with feature work, and it would require some care to get the order of precedence for error messages right (there's a lot of error handling logic right now in `RequirementPreparer.prepare_linked_requirement` which would need to get moved around).
Do you have a plan how the refactoring would be done? If so, maybe it would be better to refactor first, and implement the feature on top of that.
It turns out that such a refactoring has a downside: the current implementation of `MissingHashes` will provide the correct hash to the user even if there is no remote hash, whereas if I were to move the `raise` into `get_best` it would happen before the candidate is downloaded and so I could only provide the hash to the user if it was provided by the index. So I think it is not worth doing.
Alright then, let’s work on what we have and improve (if possible) in the future.
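To make the outcome of this thread concrete, here is a minimal sketch of why filtering on an empty hash set is unhelpful, assuming candidates carry an optional hash taken from their URL. `Candidate`, `naive_filter` and `guarded_filter` are hypothetical names for illustration, not pip's classes or functions.

```python
from typing import List, NamedTuple, Optional, Set


class Candidate(NamedTuple):
    """Hypothetical stand-in for an installation candidate."""
    filename: str
    url_hash: Optional[str]  # hex digest advertised in the URL, or None


def naive_filter(candidates: List[Candidate], hashes: Set[str]) -> List[Candidate]:
    # With an empty set nothing can match, so every candidate is dropped
    # and the user would only see a "could not find a version" style error.
    return [c for c in candidates if c.url_hash in hashes]


def guarded_filter(
    candidates: List[Candidate], hashes: Optional[Set[str]]
) -> List[Candidate]:
    # No hashes supplied (None or empty): do not filter at all, so a later
    # stage can download the file and report the hash the user should pin.
    if not hashes:
        return list(candidates)
    # Otherwise drop only candidates whose advertised hash is present and wrong.
    return [c for c in candidates if c.url_hash is None or c.url_hash in hashes]
```

With `hashes=set()`, `naive_filter` returns an empty list while `guarded_filter` returns every candidate, which corresponds to the more helpful missing-hashes error quoted above.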
src/pip/_internal/index.py
)
return is_match

candidates = [c for c in candidates if test_against_hashes(c)]
This combined with the `list()` call above unnecessarily loops through the candidate list twice.
I think instead of filtering, I want to only test the best value (in the common case where the best value's hash is correct). This would be both more performant and avoid spammy warnings.
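A minimal sketch of the idea of testing only the best value first, assuming the candidates are already sorted from most to least preferred. The function name `first_acceptable`, the logger name and the `is_acceptable` predicate are hypothetical.

```python
import logging
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("hash_sketch")  # hypothetical logger name


def first_acceptable(
    candidates: Iterable[T], is_acceptable: Callable[[T], bool]
) -> Optional[T]:
    """Return the first candidate that passes the check, or None.

    Assumes `candidates` is ordered most preferred first. The loop stops
    at the first acceptable candidate, so later candidates are never
    tested and never produce a warning.
    """
    for candidate in candidates:
        if is_acceptable(candidate):
            return candidate
        logger.warning("Skipping %s due to hash mismatch", candidate)
    return None
```

In the common case where the most preferred candidate already matches, the predicate runs once and nothing is logged, whereas a list comprehension over all candidates would test (and potentially warn about) every entry.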
I updated the PR to:
Some high-level comments after looking at this quickly:
@cjerdonek Thanks for looking! Point-by-point responses:
Regarding the complexity of |
A couple things related to this: First, I'm not sure this should necessarily be a warning, but perhaps just an Second, I would also take a look at @dstufft's original comment here to make sure you're following what he suggested:
He only mentioned a warning if the hash isn't listed -- not if it's different. (If it's not listed, I guess the issue is skipping over a possible correct match?)
It wasn't so much a concern as a suggestion for you to look at that code, since you're working on a PR. For example, the hash-checking docs here say, "Requirements that take the form of project names (rather than URLs or local filesystem paths) must be pinned to a specific version using ==." So I'm wondering if it's possible for someone to allow pre-releases if checking hashes or if that combination should be disallowed (or maybe it's already disallowed somewhere). It looks like the Re: sorting, my suggestion was more about keeping the logic together in |
@uranusjr thanks, I'll start moving the code. @cjerdonek responses inline:
I think it should be a warning if it's something we want the user to act on, and an info message if not. So to me, the question boils down to: if a more preferred distribution is published after you pin hashes, should you switch to it? I think the answer is yes, because even with this fix you can't assume that every tool you use will skip the more preferred release (e.g. older versions of pip, or tools that vendor pip code).
I could do that, but I'm not sure it's worth the (admittedly small) additional complexity. This warning is generally triggered when a platform-specific wheel gets published for an old release, and I haven't run across a case where more than one such wheel would be preferred over all existing distributions. Actually, it might be that the most likely cause of skipping multiple candidates is a MitM or compromise of pypi.org, in which case a warning seems desirable.
I don't think there's any way to distinguish between a hash not being listed and being different, since pinned hashes do not specify what distribution they are for. Note that this PR changes the error a user will observe if pypi.org is MitM-ed/compromised, from a "Hash mismatch" error to a series of warnings about skipping distributions due to hash mismatch, followed by a "No matching candidates" error. (If just pythonhosted.org is MitM-ed/compromised, users will still see a "Hash mismatch" error.)
Are you asking what the undesirable behavior in #5874 is? (It's that, if the hash is not listed, pip will download the distribution anyway and then raise a "Hash mismatch" error.)
I believe it's possible but might require an additional flag. At any rate it isn't affected by this PR.
It will. This is enforced in |
Requirements in hash-checking mode must be pinned with
It is still possible to have matches of length zero. But that would just cause an error in a later stage. |
To clarify, I meant if the hash isn't listed in the URL. In your PR, you return those as matches immediately, without examining later candidates:

    for c in candidates:
        link = c.location
        if not link.hash:
            # We can only filter candidates with hashes in their URLs.
            return c
        ...

Wouldn't you want to treat this as a non-match (at least initially) and continue looking so you have the chance to find a matching hash that might appear later?
@cjerdonek Ah, so you are suggesting that, if some but not all of the candidates have hashes in their URLs, we should prefer candidates with hashes in their URLs that match the pinned hashes? This is possible, but I think it's undesirable for a few reasons:
Yeah, I was suggesting choosing the first candidate that has a matching hash in the URL (and going back to the first if none match). That's how I interpreted the original proposal. This is also how I interpreted the suggestion I quoted above (i.e. if the first file doesn't have a hash in the URL):
Otherwise, if a matching hash occurs later in the list, but you choose the first that doesn't have a hash in the URL, what happens if the latter winds up not matching -- would it fail? |
It would fail (specifically it would raise a |
Right, and this is also the thing that #5874 wants to fix! The logic seems straightforward to me. If you're worried about making an error, I and I'm sure others would be happy to look it over for correctness. |
@cjerdonek I think you misunderstand my concern. Currently, if someone is using hash-checking mode and not getting errors, this PR will never change what distributions they install. I.e. this PR will never break a build that isn't already broken. However, if I prefer candidates with hashes over those without, someone who is using hash-checking mode with no errors currently may start installing a different distribution than they were previously, which could potentially fail to install. It may be worth doing what you're suggesting, but it's a much more radical change than what I've implemented, and I think it requires a broader discussion before implementing.
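The two selection behaviors being debated can be sketched side by side. This is an illustration under assumptions, not either author's code: `Candidate`, `pick_first_unrejected` and `pick_prefer_match` are hypothetical names, and candidates are assumed to be sorted most preferred first.

```python
from typing import List, NamedTuple, Optional, Set


class Candidate(NamedTuple):
    """Hypothetical stand-in for an installation candidate."""
    filename: str
    url_hash: Optional[str]  # hex digest advertised in the URL, or None


def pick_first_unrejected(
    candidates: List[Candidate], pinned: Set[str]
) -> Optional[Candidate]:
    """Skip only candidates whose advertised hash is present and wrong.

    A candidate with no hash in its URL is returned immediately.
    """
    for c in candidates:
        if c.url_hash is None or c.url_hash in pinned:
            return c
    return None


def pick_prefer_match(
    candidates: List[Candidate], pinned: Set[str]
) -> Optional[Candidate]:
    """Prefer the first candidate whose advertised hash matches.

    Falls back to the most preferred candidate if none match.
    """
    for c in candidates:
        if c.url_hash is not None and c.url_hash in pinned:
            return c
    return candidates[0] if candidates else None
```

Given a preferred wheel with no hash in its URL followed by an sdist whose hash is pinned, the first function returns the wheel and the second returns the sdist. That difference is the behavior change under discussion.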
I have a feeling you two are talking past each other 🤔 What Chris meant was specifically about the for loop + early return:

    for c in candidates:
        link = c.location
        if not link.hash:
            # We can only filter candidates with hashes in their URLs.
            return c
        ...

This does not change the (current) behaviour of candidate selection, but it changes how that is done. I believe what Chris was implying is that
@uranusjr I've moved the logic into @cjerdonek I think I've been misunderstanding the change you want, partly because I was trying to understand them relative to the changes I had already planned but not committed. Can you suggest the change you're thinking of against this updated code? |
@uranusjr I'm pretty sure @alexbecker and I are discussing a difference in how we think things should behave rather than just an implementation detail. @alexbecker I think you understood what we were discussing correctly, but @uranusjr's message might have caused you to second guess your interpretation. Regarding your next-to-last message, here's a question to consider about the two scenarios, for comparison: In your hypothetical, with my approach of choosing a later file whose hash matches exactly over a first file with no hash listed, if that choice were to fail for some reason, what would be the corrective action the user should take? Similarly, with your approach of choosing a first file with no hash listed over a later file whose hash matches exactly, if that choice were to fail for some reason, what would be the corrective action the user should take? I'm happy to ask / get confirmation on the other thread about how things should behave. |
@cjerdonek That's a good question. I'll go through what I would do in each case (ignoring questions about auditing packages in response to hash changes), which is the closest I can get to the correct course of action: In your approach, if a less-preferred file is chosen because it has a provided hash that matches exactly, but that install fails, the user would see:
In this case, the user should In my approach, if the file with no hashes provided in the URL does not actually match the hashes provided to
In this case the user should replace the hash in |
Couldn't the user also just delete the hash that resulted in the failure? In the hypothetical scenario you described, you were worried about a situation where things were working for a user when it selected a first file with no hash listed, but then things not working if pip were to change to choosing a later file that does list a hash and that matches. For the second case I described, I was asking about a situation where the user doesn't want the first file but wants the later one they have the matching hash for. With the PR that you're proposing, what corrective action could the user take to get pip to select the later file that matches their hash exactly? |
I guess they could delete the hash of the distribution they're having trouble installing from the list of hashes for that package in their In the second case, if the user doesn't want the file that |
Wouldn't they already have to have it listed though? In the hypothetical scenario you described, you were worried about a case where things were originally working for a user in hash-checking mode when it selected a first file with no hash listed, but then not working if choosing a later file whose hash does match one of the hashes provided by the user. I'm just trying to understand better the case that you were worried about (which is the reason for my questions). |
I'm worried about a case where |
Yes, I understand that. So won't that continue to work with the approach I described as long as the user doesn't have additional hashes in their requirements.txt that match less-preferred files whose hashes are listed in the URL's on PyPI? |
Ah, you are right that they would have to have the correct hash (by which I mean the hash for the more preferred file, which up until this change they have been installing without issue) in their |
Closing as this was instead implemented by PR #6699. Thanks for your initial work on this, though, @alexbecker!
Implements @di's proposed solution to #5874.
Changes the behavior of `pip install` in hash-checking mode to filter out any candidates whose hashes (obtained via URL) do not match the hashes provided. This prevents a HashMismatch error when a more preferred binary distribution is uploaded for a release after a user pins the hashes for that release. Instead a warning is logged when a distribution is skipped due to hash mismatches:
Note that a second hash comparison is performed when the candidate is downloaded. This is important because the first check is not secure: it trusts that the hash in the URL is the same as the hash in the content, and it also does not error when the user is in hash-checking mode but has not provided a hash.
This still needs tests, but I would like to check that this is the right approach before adding them.
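For illustration, the second comparison on the downloaded content could look like the following minimal sketch. It assumes SHA-256 pins only; `verify_download` is a hypothetical helper, and pip's real check raises its own error types rather than `ValueError`.

```python
import hashlib
from typing import Set


def verify_download(content: bytes, pinned_sha256: Set[str]) -> str:
    """Hash the downloaded bytes and compare against the pinned digests.

    Unlike the URL-based check, this does not trust the index: the digest
    is computed from the content that would actually be installed.
    """
    actual = hashlib.sha256(content).hexdigest()
    if actual not in pinned_sha256:
        # Hypothetical error type chosen for this sketch.
        raise ValueError(f"hash mismatch: got sha256:{actual}")
    return actual
```

Because the digest comes from the bytes themselves, this check still catches a file whose URL advertised a matching hash but whose content differs.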