pip package hash support #222
Comments
I'm not familiar with poetry. If you can track down what the missing metadata or API is, I can get an estimate for how difficult it would be to add it.
I'll try to track this down.
@stevearc Calculating the hashes would require reading the whole package into memory to compute them, so that the hashes could be added to the package metadata. Once the hash is in the metadata, providing it to the user is pretty easy. Assuming you are OK with that, I can PR adding it to the base cache upload function, and probably add some opportunistic hash generation of old packages on cache rebuild (though that would require fetching and re-putting packages whose hash is missing).
Calculating the hash upon upload and storing it in the metadata would be great! You could also do it when we fetch the packages from the upstream repo. I'm a bit more wary of doing it during cache rebuild. Some people using pypicloud have large numbers of packages, and that could cause significant problems during startup. If you want to do something with existing packages, I think the best way to go about it would be to have a script perform the migration.
Ah true. Since hashing the file is standard, I suppose I could whip up a simple migration script for S3/Azure that does get-hash-put.
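A rough sketch of that get-hash-put migration, assuming an S3 backend via boto3; the bucket name and metadata keys are illustrative, and a real script would need an Azure equivalent:

```python
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-pypicloud-bucket"  # placeholder bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        head = s3.head_object(Bucket=BUCKET, Key=key)
        meta = head["Metadata"]
        if "sha256" in meta:
            continue  # already has a hash; nothing to do

        # Stream the object and hash it in chunks.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
        md5, sha256 = hashlib.md5(), hashlib.sha256()
        for chunk in iter(lambda: body.read(1024 * 1024), b""):
            md5.update(chunk)
            sha256.update(chunk)
        meta.update({"md5": md5.hexdigest(), "sha256": sha256.hexdigest()})

        # A server-side copy onto itself rewrites the metadata without
        # re-uploading the bytes (works for objects up to 5 GB).
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            Metadata=meta,
            MetadataDirective="REPLACE",
            ContentType=head.get("ContentType", "application/octet-stream"),
        )
```

Using a server-side copy for the "put" half avoids pushing the package bytes back up from the migration host.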
How about having a background "groomer" that only launches if there are missing hashes? It could disable itself once all packages have hashes, since uploads would then set them. Also, I don't think you need to load the whole file into memory; you can read the files in chunks.
@thehesiod I could read the file in chunks, but I'd need the hashes up front to add to the file's metadata. My thinking is that if the hashes are part of the metadata, then if the database is lost they don't need to be recalculated.
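A minimal sketch of that chunked, two-digest pass (the path and chunk size are arbitrary), computing both hashes up front so they can be attached as metadata at upload time:

```python
import hashlib

def compute_hashes(path, chunk_size=1024 * 1024):
    # One pass over the file feeds both digests, so the whole
    # package never has to sit in memory at once.
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()
```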
A simplification could be to just use the S3 ETag, since the ETag is the MD5 of the file. For multipart uploads it's a little more complicated. Here's my notes about ETags:
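For illustration (not the notes themselves), the distinction usually comes down to whether the ETag carries a part-count suffix; a small boto3 sketch with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")
etag = s3.head_object(Bucket="my-bucket", Key="pkg-1.0.tar.gz")["ETag"].strip('"')

if "-" in etag:
    # Multipart upload: the ETag is the MD5 of the concatenated part
    # MD5s plus "-<part count>", so it is NOT the file's MD5.
    print("multipart ETag, not a usable MD5:", etag)
else:
    # Single-part upload: the ETag is the plain MD5 hex digest.
    print("ETag equals the file MD5:", etag)
```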
That might work for S3, but it doesn't provide a sha256 hash (PyPI returns both md5 and sha256), and Azure Blob Storage's ETag is completely useless since it's a random int. I think calculating it upfront is the simplest solution for now.
That spec was saying it doesn't need to be both, right? It said above to choose one.
PyPI reports both md5 and sha256 in the JSON API, so I thought it best to match the real PyPI. Only sha256 is added to the URL fragment.
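Concretely, per PEP 503 the simple-index anchor carries the digest in its URL fragment; a hypothetical helper just to show the shape of the link:

```python
def index_link(filename: str, sha256_hex: str) -> str:
    # pip reads the hash from the fragment of the anchor's href.
    return f'<a href="{filename}#sha256={sha256_hex}">{filename}</a>'

print(index_link("mypkg-1.0.tar.gz", "0" * 64))  # placeholder digest
```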
We're hitting this as well--our tool is hitting
Tools like poetry generate lock files with the hashes of the modules. I've noticed that when poetry hits our pypicloud server it doesn't record any hashes (python-poetry/poetry#1553), so presumably pypicloud is missing some feature. I haven't had a chance to look into why yet; presumably some metadata API or part of a response is missing.
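For comparison, the digests that lock-file tools consume are visible in PyPI's JSON API; a quick probe (the package name is just an example):

```python
import requests

resp = requests.get("https://pypi.org/pypi/requests/json", timeout=10)
resp.raise_for_status()
for file_info in resp.json()["urls"]:
    # Each release file carries a digests dict with md5 and sha256.
    print(file_info["filename"], file_info["digests"])
```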