Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WeightsEnum: use checksums #7210

Closed
adamjstewart opened this issue Feb 9, 2023 · 6 comments · Fixed by #7219
Closed

WeightsEnum: use checksums #7210

adamjstewart opened this issue Feb 9, 2023 · 6 comments · Fixed by #7219

Comments

@adamjstewart
Copy link
Contributor

🚀 The feature

All weight enums should store and use the MD5 checksum in order to verify the integrity of the download. We already do the same thing for datasets, and weights should be no different.

Motivation, pitch

When weight enums are passed to a model, torchvision calls WeightsEnum.get_state_dict, which calls torch.hub.load_state_dict_from_url, which downloads the weights and calls torch.load, which uses pickle to load the data. However, pickle's documentation notes that the pickle format is not secure and the pickle format can allow for arbitrary code execution. For security reasons, it's recommended to first verify the checksum of any pickled files before unpickling.

Alternatives

No response

Additional context

It's unclear exactly where the checksum should be used. Honestly, we should really add a MD5 check directly in torch.hub.load_state_dict_from_url. Otherwise, we'll have to split the download and the load into two separate steps in torchvision so that we can verify the checksum in the middle.

@NicolasHug

@NicolasHug
Copy link
Member

Thanks for the proposal @adamjstewart , the request definitely seems reasonable. Changes to tochhub can be in scope too. I won't be able to get to this in the next 2-3 weeks because we'll be very busy with the release. If I don't come back to this in ~1 month, please feel free to ping me again!

@nps1ngh
Copy link
Contributor

nps1ngh commented Feb 9, 2023

Are the SHA256 hash prefixes insufficient/unsuitable? torch.hub.load_state_dict_from_url also seems to have a check_hash argument.
But maybe I'm overlooking something?

@adamjstewart
Copy link
Contributor Author

That works if the filename follows the very specific filename-<sha256>.<ext> format, but doesn't work in general. TorchGeo weights don't follow that convention, but if they need to in order to simplify the process we're more than happy to change that. I was curious what that hash was and it wasn't long enough to be a full sha256 so I just assumed it was some random git commit hash for uniqueness, not for security.

This brings up the question of whether or not to retire MD5 and use SHA256 for all datasets. MD5 isn't exactly secure, but it depends on how high priority security is. But that's probably a conversation for a different PR.

@nps1ngh
Copy link
Contributor

nps1ngh commented Feb 9, 2023

I see. I think it'd be nice if torch.hub.load_state_dict_from_url also supported checking a passed in hash like how torch.hub.download_url_to_file does.
This could also make it easier for you guys @adamjstewart. However, if you would need to calculate the hashes of the weights anyway (I don't see any MD5 ones 👀), then it might just be worth it to rename everything? I'm not sure tbh what would be easier.

On a side note, I'm actually quite surprised that WeightsEnum does not set check_hash to True (see here).


This brings up the question of whether or not to retire MD5 and use SHA256 for all datasets. MD5 isn't exactly secure, but it depends on how high priority security is. But that's probably a conversation for a different PR.

I'm not knowledgeable about whether the datasets make use of the pickle module or not, but since the weights do, then I'd certainly prefer to use SHA256 over MD5. Especially, since security is the main motivation to check the hashes and the logic to check the hashes already supports SHA256.

@adamjstewart
Copy link
Contributor Author

Datasets are also security risks because tarfile has known vulnerabilities. See #7039 for a discussion on this. It was noted there that MD5 isn't perfect but it's probably good enough.

@pmeier
Copy link
Collaborator

pmeier commented Feb 10, 2023

I wouldn't rely on MD5 for any new functionality. In fact, we used SHA256 for the datasets v2:

sha256="af6ece2f339791ca20f855943d8b55dd60892c0a25105fcd631ee3d6430f9926",

The only thing stopping us from backporting that is that our v2 datasets don't cover everything in v1 yet and thus we don't have the SHA256 checksums for everything. And I'm not keen on downloading a gazillion GB just for computing a checksum.

@adamjstewart adamjstewart changed the title WeightsEnum: add MD5 checksums WeightsEnum: use checksums Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants