This repository has been archived by the owner on Aug 27, 2023. It is now read-only.

pip package hash support #222

Closed
thehesiod opened this issue Nov 12, 2019 · 14 comments

Comments

@thehesiod
Contributor

Tools like poetry generate lock files containing the hashes of each package. I've noticed that when poetry hits our pypicloud server it doesn't record any hashes (python-poetry/poetry#1553), so presumably pypicloud is missing some feature. I haven't had a chance to look into why yet; presumably some metadata API or part of the response is missing.

@stevearc
Owner

I'm not familiar with poetry. If you can track down what the missing metadata or API is, I can get an estimate for how difficult it would be to add it.

@thehesiod
Contributor Author

I'll try to track this down

@Andor

Andor commented Mar 25, 2020

@stevearc Looks like there are some API examples here, but I can't find the exact spec.

@Andor

Andor commented Mar 25, 2020

Here it is:

The URL SHOULD include a hash in the form of a URL fragment with the following syntax: #<hashname>=<hashvalue>, where <hashname> is the lowercase name of the hash function (such as sha256) and <hashvalue> is the hex encoded digest.
Repositories SHOULD choose a hash function from one of the ones guaranteed to be available via the hashlib module in the Python standard library (currently md5, sha1, sha224, sha256, sha384, sha512). The current recommendation is to use sha256.
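
For illustration, that fragment is straightforward to construct by hand. A minimal sketch (the base URL and filename below are made up, not pypicloud's actual layout):

```python
import hashlib


def link_with_hash(base_url: str, filename: str, package_bytes: bytes) -> str:
    """Return a simple-index style href of the form .../<filename>#sha256=<hexdigest>."""
    digest = hashlib.sha256(package_bytes).hexdigest()
    return f"{base_url}/{filename}#sha256={digest}"


# Example (hypothetical URL and filename):
# link_with_hash("https://pypi.example.com/packages",
#                "funcy-1.15-py2.py3-none-any.whl", data)
# -> "https://pypi.example.com/packages/funcy-1.15-py2.py3-none-any.whl#sha256=<64 hex chars>"
```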

@terricain
Contributor

@stevearc Calculating the hashes would mean reading the whole package at upload time so the digests can be added to the package metadata. Once the hash is in the metadata, providing it to the user is pretty easy. Assuming you're OK with that, I can open a PR adding it to the base cache upload function, and probably add some opportunistic hash generation for old packages on cache rebuild (though that would require fetching and re-putting packages whose hash is missing).
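
A rough sketch of that idea (`storage.put` and the metadata key names are placeholders, not pypicloud's real interfaces):

```python
import hashlib


def upload_package(storage, filename: str, data: bytes, metadata: dict) -> None:
    # Compute the digests once, at upload time, and persist them with the rest
    # of the package metadata so they never need to be recalculated.
    metadata["hash_sha256"] = hashlib.sha256(data).hexdigest()
    metadata["hash_md5"] = hashlib.md5(data).hexdigest()
    storage.put(filename, data, metadata=metadata)
```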

@stevearc
Owner

Calculating the hash upon upload and storing it in the metadata would be great! You could also do it when we fetch the packages from the upstream repo. I'm a bit more wary of doing it during cache rebuild. Some people using pypicloud have large numbers of packages, and that could cause some significant problems during startup. If you want to do something with existing packages, I think the best way to go about it would be to have a script perform the migration.

@terricain
Contributor

Ah true. Since hashing the file is standard, I could whip up a simple migration script for S3/Azure that does get-hash-put.
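
A rough sketch of what such a get-hash-put pass could look like for S3 (bucket name, prefix, and metadata key are illustrative; note that S3 object metadata can only be changed by copying the object onto itself):

```python
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-pypicloud-bucket"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="packages/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        head = s3.head_object(Bucket=BUCKET, Key=key)
        meta = head.get("Metadata", {})
        if "hash_sha256" in meta:
            continue  # already migrated

        # Download the package, hash it, and write the digest back as metadata.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        meta["hash_sha256"] = hashlib.sha256(body).hexdigest()
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            Metadata=meta,
            MetadataDirective="REPLACE",
        )
```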

@thehesiod
Contributor Author

How about having a background "groomer" that only launches if there are missing hashes? It could disable itself once all packages have hashes, since uploads would then set them. Also, I don't think you need to load the whole file into memory; you can read the file in chunks.
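
For example, a streaming hash with hashlib keeps only the running digest state in memory (a minimal, self-contained sketch):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file on disk in chunks so large packages never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```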

@terricain
Contributor

@thehesiod I could read the file in chunks, but I'd need the hashes up front to add them to the file's metadata. My thinking is that if the hashes are part of the storage metadata, they don't need to be recalculated if the database is lost.

@thehesiod
Contributor Author

A simplification could be to just use the S3 ETag; for regular uploads the ETag is the MD5 of the file. For multipart uploads it's a little more complicated. Here are my notes about ETags:

# ETags
# ------------------------------
# There are two types of ETags (MD5 checksums): regular + multi-part
#
#   Regular
#   -------
#     I've verified that ETags for default (non-multipart) uploads are the same under the following scenarios:
#      - across different regions (Oregon + Ireland)
#      - same file uploaded with different names and different folders have the same ETag
#      - locally generated MD5 matches AWS ETag
#   Multipart
#   ---------
#   Multipart-uploads have ETags that contain a "-" and we need to support them based on existing data
#   You can calculate the multipart E-Tag locally as follows:
#      http://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb/19896823#19896823

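A sketch of the calculation described in that Stack Overflow answer (the part size must match whatever the uploader used; 8 MiB is a common default but it is not guaranteed):

```python
import hashlib


def multipart_etag(path: str, part_size: int = 8 * 1024 * 1024) -> str:
    """Reproduce S3's multipart ETag: MD5 each part, MD5 the concatenated
    part digests, then append the part count."""
    part_digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(part_size), b""):
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) == 1:
        # A single-part upload's ETag is just the plain MD5 hex digest.
        return part_digests[0].hex()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```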
@terricain
Contributor

That might work for S3, but it doesn't provide a sha256 hash (PyPI returns both md5 and sha256), and Azure Blob Storage's ETag is completely useless as it's a random int. I think calculating the hash up front is the simplest solution for now.

@thehesiod
Contributor Author

That spec says it doesn't need to be both, right? It says above to choose one.

@terricain
Contributor

terricain commented May 19, 2020

PyPI reports both md5 and sha256 in the JSON API, so I thought it best to match the real PyPI. Only sha256 is added to the URL fragment.
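
For reference, both digests are easy to see in PyPI's JSON API (an illustrative snippet using the requests package and funcy as an example project):

```python
import requests

# Fetch release metadata from PyPI's JSON API and print the per-file digests.
resp = requests.get("https://pypi.org/pypi/funcy/json")
resp.raise_for_status()
for release_file in resp.json()["urls"]:
    digests = release_file["digests"]
    print(release_file["filename"], digests["md5"], digests["sha256"])
```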

@beaugunderson

beaugunderson commented Feb 11, 2021

We're hitting this as well. Our tool is hitting /simple/pypi/funcy/json, which gets redirected to pypi.org/simple/funcy, which is not the JSON endpoint it expects (it should be directed to pypi.org/pypi/funcy).
