WeightsEnum: use checksums #7210
Comments
Thanks for the proposal @adamjstewart, the request definitely seems reasonable. Changes to torchhub can be in scope too. I won't be able to get to this in the next 2-3 weeks because we'll be very busy with the release. If I don't come back to this in ~1 month, please feel free to ping me again! |
Are the SHA256 hash prefixes insufficient/unsuitable? |
That works if the filename follows the very specific This brings up the question of whether or not to retire MD5 and use SHA256 for all datasets. MD5 isn't exactly secure, but it depends on how high priority security is. But that's probably a conversation for a different PR. |
I see. I think it'd be nice if On a side note, I'm actually quite surprised that
I'm not knowledgeable about whether the datasets make use of the |
Datasets are also security risks because tarfile has known vulnerabilities. See #7039 for a discussion on this. It was noted there that MD5 isn't perfect but it's probably good enough. |
I wouldn't rely on MD5 for any new functionality. In fact, we used SHA256 for the datasets v2:
The only thing stopping us from backporting that is that our v2 datasets don't cover everything in v1 yet and thus we don't have the SHA256 checksums for everything. And I'm not keen on downloading a gazillion GB just for computing a checksum. |
🚀 The feature
All weight enums should store and use the MD5 checksum in order to verify the integrity of the download. We already do the same thing for datasets, and weights should be no different.
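A minimal sketch of what "store and use a checksum" could look like, assuming a simplified stand-in for a weights enum. `WeightsSketch`, `ExampleWeights`, the placeholder URL, and the `matches` helper are all hypothetical names introduced for illustration; they are not torchvision's actual classes. The algorithm is a parameter because the issue proposes MD5 while the comments lean toward SHA256.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class WeightsSketch:
    # Hypothetical container, not torchvision's real Weights class.
    url: str
    checksum: str              # expected hex digest of the downloaded file
    algorithm: str = "sha256"  # could be "md5", as the issue proposes

    def matches(self, data: bytes) -> bool:
        # Recompute the digest of the received bytes and compare it
        # with the value stored next to the URL.
        digest = hashlib.new(self.algorithm, data).hexdigest()
        return digest == self.checksum


class ExampleWeights(Enum):
    # Placeholder URL and a digest of placeholder bytes, for illustration only.
    DEFAULT = WeightsSketch(
        url="https://example.invalid/model.pth",
        checksum=hashlib.sha256(b"fake weights").hexdigest(),
    )
```

With this shape, the check is available wherever the enum member is, for example `ExampleWeights.DEFAULT.value.matches(downloaded_bytes)`.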
Motivation, pitch
When weight enums are passed to a model, torchvision calls WeightsEnum.get_state_dict, which calls torch.hub.load_state_dict_from_url, which downloads the weights and calls torch.load, which uses pickle to load the data. However, pickle's documentation notes that the pickle format is not secure and can allow arbitrary code execution. For security reasons, it's recommended to verify the checksum of any pickled file before unpickling it.
Alternatives
No response
Additional context
It's unclear exactly where the checksum should be used. Honestly, we should really add an MD5 check directly in torch.hub.load_state_dict_from_url. Otherwise, we'll have to split the download and the load into two separate steps in torchvision so that we can verify the checksum in the middle.
@NicolasHug
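A rough sketch of the fallback described above, where the download and the load are separate steps with the checksum verified in between. It uses only the standard library so that it runs on its own; `file_digest` and `verify_then_load` are hypothetical helper names, and the `loader` argument stands in for whatever deserializes the file (in torchvision's case that would presumably be `torch.load`).

```python
import hashlib
from pathlib import Path
from typing import Any, Callable


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_then_load(
    path: Path,
    expected: str,
    loader: Callable[[Path], Any],
    algorithm: str = "sha256",
) -> Any:
    """Call the (potentially unsafe) loader only if the digest matches."""
    actual = file_digest(path, algorithm)
    if actual != expected.lower():
        raise RuntimeError(
            f"checksum mismatch for {path}: expected {expected}, got {actual}"
        )
    return loader(path)
```

The point of the design is ordering: the file is already on disk, the digest is computed from the raw bytes, and the unpickling step is never reached when the digest differs. The download itself is left out here; it could be any step that writes the file to `path` before `verify_then_load` is called.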