This repository has been archived by the owner on Aug 27, 2023. It is now read-only.

pip package hash support #222

Closed
thehesiod opened this issue Nov 12, 2019 · 14 comments

Comments

@thehesiod
Contributor

Tools like poetry generate lock files containing the hashes of each package. I've noticed that when poetry hits our pypicloud server it doesn't record any hashes (python-poetry/poetry#1553), so presumably pypicloud is missing some feature. I haven't had a chance to look into why yet; presumably some metadata API or part of the response is missing.

@stevearc
Owner

I'm not familiar with poetry. If you can track down what the missing metadata or API is, I can get an estimate for how difficult it would be to add it.

@thehesiod
Contributor Author

I'll try to track this down

@Andor

Andor commented Mar 25, 2020

@stevearc Looks like there are some API examples here, but I can't find the exact spec.

@Andor

Andor commented Mar 25, 2020

Here it is:

The URL SHOULD include a hash in the form of a URL fragment with the following syntax: #<hashname>=<hashvalue>, where <hashname> is the lowercase name of the hash function (such as sha256) and <hashvalue> is the hex encoded digest.
Repositories SHOULD choose a hash function from one of the ones guaranteed to be available via the hashlib module in the Python standard library (currently md5, sha1, sha224, sha256, sha384, sha512). The current recommendation is to use sha256.
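
For illustration, that fragment is straightforward to construct by hand. A minimal sketch (the base URL and filename below are made up, not pypicloud's actual layout):

```python
import hashlib


def link_with_hash(base_url: str, filename: str, package_bytes: bytes) -> str:
    """Return a simple-index style href of the form .../<filename>#sha256=<hexdigest>."""
    digest = hashlib.sha256(package_bytes).hexdigest()
    return f"{base_url}/{filename}#sha256={digest}"


# Example (hypothetical URL and filename):
# link_with_hash("https://pypi.example.com/packages",
#                "funcy-1.15-py2.py3-none-any.whl", data)
# -> "https://pypi.example.com/packages/funcy-1.15-py2.py3-none-any.whl#sha256=<64 hex chars>"
```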

@terricain
Contributor

@stevearc Calculating the hashes would mean reading the whole package at upload time so the digests can be added to the package metadata. Once the hash is in the metadata, providing it to the user is pretty easy. Assuming you're OK with that, I can open a PR adding it to the base cache upload function, and probably add some opportunistic hash generation for old packages on cache rebuild (though that would require fetching and re-putting packages whose hash is missing).
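
A rough sketch of that idea (`storage.put` and the metadata key names are placeholders, not pypicloud's real interfaces):

```python
import hashlib


def upload_package(storage, filename: str, data: bytes, metadata: dict) -> None:
    # Compute the digests once, at upload time, and persist them with the rest
    # of the package metadata so they never need to be recalculated.
    metadata["hash_sha256"] = hashlib.sha256(data).hexdigest()
    metadata["hash_md5"] = hashlib.md5(data).hexdigest()
    storage.put(filename, data, metadata=metadata)
```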

@stevearc
Owner

Calculating the hash upon upload and storing it in the metadata would be great! You could also do it when we fetch the packages from the upstream repo. I'm a bit more wary of doing it during cache rebuild. Some people using pypicloud have large numbers of packages, and that could cause some significant problems during startup. If you want to do something with existing packages, I think the best way to go about it would be to have a script perform the migration.

@terricain
Contributor

Ah true. Since hashing the file is standard, I could whip up a simple migration script for S3/Azure that does get-hash-put.
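
A rough sketch of what such a get-hash-put pass could look like for S3 (bucket name, prefix, and metadata key are illustrative; note that S3 object metadata can only be changed by copying the object onto itself):

```python
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-pypicloud-bucket"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="packages/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        head = s3.head_object(Bucket=BUCKET, Key=key)
        meta = head.get("Metadata", {})
        if "hash_sha256" in meta:
            continue  # already migrated

        # Download the package, hash it, and write the digest back as metadata.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        meta["hash_sha256"] = hashlib.sha256(body).hexdigest()
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            Metadata=meta,
            MetadataDirective="REPLACE",
        )
```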

@thehesiod
Contributor Author

How about having a background "groomer" that only launches if there are missing hashes? It could disable itself once all packages have hashes, since uploads would then set them. Also, I don't think you need to load the whole file into memory; you can read the file in chunks.
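
For example, a streaming hash with hashlib keeps only the running digest state in memory (a minimal, self-contained sketch):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file on disk in chunks so large packages never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```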

@terricain
Contributor

@thehesiod I could read the file in chunks, but I'd need the hashes up front to add them to the file's metadata. My thinking is that if the hashes are part of the storage metadata, they don't need to be recalculated if the database is lost.

@thehesiod
Contributor Author

A simplification could be to just use the S3 ETag; for regular uploads the ETag is the MD5 of the file. For multipart uploads it's a little more complicated. Here are my notes about ETags:

# ETags
# ------------------------------
# There are two types of ETags (MD5 checksums): regular + multi-part
#
#   Regular
#   -------
#     I've verified that ETags for default (non-multipart) uploads are the same under the following scenarios:
#      - across different regions (Oregon + Ireland)
#      - same file uploaded with different names and different folders have the same ETag
#      - locally generated MD5 matches AWS ETag
#   Multipart
#   ---------
#   Multipart-uploads have ETags that contain a "-" and we need to support them based on existing data
#   You can calculate the multipart E-Tag locally as follows:
#      http://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb/19896823#19896823

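A sketch of the calculation described in that Stack Overflow answer (the part size must match whatever the uploader used; 8 MiB is a common default but it is not guaranteed):

```python
import hashlib


def multipart_etag(path: str, part_size: int = 8 * 1024 * 1024) -> str:
    """Reproduce S3's multipart ETag: MD5 each part, MD5 the concatenated
    part digests, then append the part count."""
    part_digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(part_size), b""):
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) == 1:
        # A single-part upload's ETag is just the plain MD5 hex digest.
        return part_digests[0].hex()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```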
@terricain
Contributor

That might work for S3, but it doesn't provide a sha256 hash (PyPI returns both md5 and sha256), and Azure Blob Storage's ETag is completely useless as it's a random int. I think calculating the hash up front is the simplest solution for now.

@thehesiod
Contributor Author

That spec says it doesn't need to be both, right? It says above to choose one.

@terricain
Contributor

terricain commented May 19, 2020

PyPI reports both md5 and sha256 in the JSON API, so I thought it best to match the real PyPI. Only sha256 is added to the URL fragment.
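
For reference, both digests are easy to see in PyPI's JSON API (an illustrative snippet using the requests package and funcy as an example project):

```python
import requests

# Fetch release metadata from PyPI's JSON API and print the per-file digests.
resp = requests.get("https://pypi.org/pypi/funcy/json")
resp.raise_for_status()
for release_file in resp.json()["urls"]:
    digests = release_file["digests"]
    print(release_file["filename"], digests["md5"], digests["sha256"])
```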

@beaugunderson

beaugunderson commented Feb 11, 2021

We're hitting this as well. Our tool is hitting /simple/pypi/funcy/json, which gets redirected to pypi.org/simple/funcy, which is not the JSON endpoint it expects (it should be directed to pypi.org/pypi/funcy).
