WeightsEnum: use checksums #7210

adamjstewart · 2023-02-09T17:09:46Z

🚀 The feature

All weight enums should store and use the MD5 checksum in order to verify the integrity of the download. We already do the same thing for datasets, and weights should be no different.

Motivation, pitch

When weight enums are passed to a model, torchvision calls WeightsEnum.get_state_dict, which calls torch.hub.load_state_dict_from_url, which downloads the weights and calls torch.load, which uses pickle to load the data. However, pickle's documentation notes that the pickle format is not secure and the pickle format can allow for arbitrary code execution. For security reasons, it's recommended to first verify the checksum of any pickled files before unpickling.

Alternatives

No response

Additional context

It's unclear exactly where the checksum should be used. Honestly, we should really add a MD5 check directly in torch.hub.load_state_dict_from_url. Otherwise, we'll have to split the download and the load into two separate steps in torchvision so that we can verify the checksum in the middle.

@NicolasHug

The text was updated successfully, but these errors were encountered:

NicolasHug · 2023-02-09T17:23:34Z

Thanks for the proposal @adamjstewart , the request definitely seems reasonable. Changes to tochhub can be in scope too. I won't be able to get to this in the next 2-3 weeks because we'll be very busy with the release. If I don't come back to this in ~1 month, please feel free to ping me again!

nps1ngh · 2023-02-09T21:10:01Z

Are the SHA256 hash prefixes insufficient/unsuitable? torch.hub.load_state_dict_from_url also seems to have a check_hash argument.
But maybe I'm overlooking something?

adamjstewart · 2023-02-09T21:46:29Z

That works if the filename follows the very specific filename-<sha256>.<ext> format, but doesn't work in general. TorchGeo weights don't follow that convention, but if they need to in order to simplify the process we're more than happy to change that. I was curious what that hash was and it wasn't long enough to be a full sha256 so I just assumed it was some random git commit hash for uniqueness, not for security.

This brings up the question of whether or not to retire MD5 and use SHA256 for all datasets. MD5 isn't exactly secure, but it depends on how high priority security is. But that's probably a conversation for a different PR.

nps1ngh · 2023-02-09T22:26:09Z

I see. I think it'd be nice if torch.hub.load_state_dict_from_url also supported checking a passed in hash like how torch.hub.download_url_to_file does.
This could also make it easier for you guys @adamjstewart. However, if you would need to calculate the hashes of the weights anyway (I don't see any MD5 ones 👀), then it might just be worth it to rename everything? I'm not sure tbh what would be easier.

On a side note, I'm actually quite surprised that WeightsEnum does not set check_hash to True (see here).

This brings up the question of whether or not to retire MD5 and use SHA256 for all datasets. MD5 isn't exactly secure, but it depends on how high priority security is. But that's probably a conversation for a different PR.

I'm not knowledgeable about whether the datasets make use of the pickle module or not, but since the weights do, then I'd certainly prefer to use SHA256 over MD5. Especially, since security is the main motivation to check the hashes and the logic to check the hashes already supports SHA256.

adamjstewart · 2023-02-09T22:33:40Z

Datasets are also security risks because tarfile has known vulnerabilities. See #7039 for a discussion on this. It was noted there that MD5 isn't perfect but it's probably good enough.

pmeier · 2023-02-10T08:50:37Z

I wouldn't rely on MD5 for any new functionality. In fact, we used SHA256 for the datasets v2:

vision/torchvision/prototype/datasets/_builtin/caltech.py

Line 51 in c974742

sha256="af6ece2f339791ca20f855943d8b55dd60892c0a25105fcd631ee3d6430f9926",

The only thing stopping us from backporting that is that our v2 datasets don't cover everything in v1 yet and thus we don't have the SHA256 checksums for everything. And I'm not keen on downloading a gazillion GB just for computing a checksum.

NicolasHug self-assigned this Feb 9, 2023

NicolasHug added enhancement weights module: models module: hub labels Feb 9, 2023

adamjstewart changed the title ~~WeightsEnum: add MD5 checksums~~ WeightsEnum: use checksums Feb 10, 2023

This was referenced Feb 10, 2023

Check sha256 of weights #7219

Merged

Weights: add sha256 suffix microsoft/torchgeo#1105

Merged

NicolasHug closed this as completed in #7219 May 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WeightsEnum: use checksums #7210

WeightsEnum: use checksums #7210

adamjstewart commented Feb 9, 2023

NicolasHug commented Feb 9, 2023

nps1ngh commented Feb 9, 2023

adamjstewart commented Feb 9, 2023

nps1ngh commented Feb 9, 2023 •

edited

Loading

adamjstewart commented Feb 9, 2023

pmeier commented Feb 10, 2023

WeightsEnum: use checksums #7210

WeightsEnum: use checksums #7210

Comments

adamjstewart commented Feb 9, 2023

🚀 The feature

Motivation, pitch

Alternatives

Additional context

NicolasHug commented Feb 9, 2023

nps1ngh commented Feb 9, 2023

adamjstewart commented Feb 9, 2023

nps1ngh commented Feb 9, 2023 • edited Loading

adamjstewart commented Feb 9, 2023

pmeier commented Feb 10, 2023

nps1ngh commented Feb 9, 2023 •

edited

Loading