add Video feature and kernels #6667
Conversation
I saw you put comments in for us, so I assume you want us to take a look. If that's not the case, feel free to ignore.
One high-level comment: at the moment, if I understand it correctly, the Video feature class only deals with sequence-of-images data. How hard (or easy) would it be to extend this to additional video modalities (say, audio)? As is, see vision/torchvision/datasets/kinetics.py, lines 72 to 73 at 07ae61b. (Noob question: we don't see …)

Could you be more specific about what kind of transformations on video with audio you have in mind?
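For context, a minimal sketch of what the new feature holds; the shape and dtype choices follow the tests in this PR, and the exact constructor signature is an assumption:

```python
import torch
from torchvision.prototype import features  # prototype API targeted by this PR

# a clip of 8 RGB frames laid out as (T, C, H, W); the Video feature
# requires at least 4 dimensions and the tests use uint8 data
clip = torch.randint(0, 256, (8, 3, 64, 64), dtype=torch.uint8)
video = features.Video(clip)  # a Tensor subclass, so dispatchers can type-check it
```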
Overall looks good. A few comments below. We also need to adapt the AutoAugment classes.

One concern I have is whether there is a specific transform that doesn't support videos (squashes dimensions, for example) or does something unexpected. I'm happy to go over the whole API and check for things we still need to do or missed. Another thing we need after this PR, as clean-up, is to remove the image wording (both from internal method variables and from arguments like image_size).
Ah, OK, makes sense.

Yeah, basically audio should be handled separately. Usually the transforms are handled separately (when I was last working in the multimodal video field, we'd apply things separately); so, for example, each modality would get its own transform in the dataset.

This all makes sense though. Let's worry about audio (and whether we'd even need it) down the line.
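A hypothetical sketch of what "handled separately" could look like at the dataset level; the helper name and the (video, audio, label) sample layout (as in the Kinetics dataset) are illustrative, not part of this PR:

```python
from typing import Callable, Tuple
import torch

def apply_per_modality(
    sample: Tuple[torch.Tensor, torch.Tensor, int],
    video_transform: Callable[[torch.Tensor], torch.Tensor],
    audio_transform: Callable[[torch.Tensor], torch.Tensor],
) -> Tuple[torch.Tensor, torch.Tensor, int]:
    # each modality gets its own transform instead of one transform
    # having to understand both video and audio tensors
    video, audio, label = sample
    return video_transform(video), audio_transform(audio), label
```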
LGTM after addressing some of the comments below. I'll have another look at the whole API immediately after merging this to ensure we didn't miss anything.
```diff
-    if isinstance(inpt, torch.Tensor) and (torch.jit.is_scripting() or not isinstance(inpt, features.Image)):
+) -> features.ImageOrVideoTypeJIT:
+    if isinstance(inpt, torch.Tensor) and (
+        torch.jit.is_scripting() or not isinstance(inpt, (features.Image, features.Video))
+    ):
```
This specific isinstance check for image or video is correct, but I think we missed it in all the other places in the dispatchers. I think we need to add this everywhere.
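For reference, a sketch of the dispatcher pattern this check enables; the dispatcher and kernel here are stand-ins, not the actual torchvision functions:

```python
import torch
from torchvision.prototype import features

def _kernel_image_tensor(t: torch.Tensor) -> torch.Tensor:
    # stand-in for a real *_image_tensor kernel
    return t.flip(-1)

def dispatcher(inpt):
    # plain tensors (and anything reached under torch.jit scripting) take the
    # raw tensor path; Image / Video inputs keep their feature type
    if isinstance(inpt, torch.Tensor) and (
        torch.jit.is_scripting() or not isinstance(inpt, (features.Image, features.Video))
    ):
        return _kernel_image_tensor(inpt)
    elif isinstance(inpt, (features.Image, features.Video)):
        output = _kernel_image_tensor(inpt.as_subclass(torch.Tensor))
        return type(inpt)(output)  # re-wrap to preserve the feature type
    else:  # e.g. PIL images, handled by a separate kernel in the real dispatchers
        raise TypeError(f"unsupported input type {type(inpt)}")
```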
I went through the whole API again and found a few places where we need to decide if we want video support there:

- classification transforms: `MixUp`, `CutMix`
- object detection transforms: `SimpleCopyPaste`, `RandomIoUCrop`
- deprecated transforms: `Grayscale`, `RandomGrayscale`
- outlier transforms: `FiveCrop`, `TenCrop` (we are not supporting anything besides images here so far)
- image-only transforms (by name): `ConvertImageDtype`

For everything else this is fixed.
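A quick way to sanity-check any transform on the list is to run it on a Video and confirm the type and shape survive; the transform choice here is illustrative:

```python
import torch
from torchvision.prototype import features, transforms

video = features.Video(torch.randint(0, 256, (8, 3, 32, 32), dtype=torch.uint8))
out = transforms.RandomHorizontalFlip(p=1.0)(video)
print(type(out), out.shape)  # expected: still a Video with an unchanged shape
```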
] | ||
|
||
|
||
```python
xfails_image_degenerate_or_multi_batch_dims = xfail_all_tests(
```
I've added them here to properly test the video kernels that now have the squash / unsquash "wrappers" and thus support arbitrary batch dimensions. We got breakages on some tests when we threw in batches; we might need to review the mitigations.
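A minimal sketch of the squash / unsquash pattern those wrappers implement, with an illustrative helper and a trivial stand-in kernel:

```python
import torch

def _squash_unsquash(kernel, inpt: torch.Tensor) -> torch.Tensor:
    # fold all leading batch dims into one, apply an image kernel that
    # expects (N, C, H, W), then restore the original leading dims
    shape = inpt.shape
    flat = inpt.reshape(-1, *shape[-3:])
    return kernel(flat).reshape(shape)

# usage on a (B, T, C, H, W) batch of videos with a stand-in kernel
video = torch.randint(0, 256, (2, 8, 3, 32, 32), dtype=torch.uint8)
out = _squash_unsquash(lambda t: t.flip(-1), video)
assert out.shape == video.shape
```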
The breakages that we observed stem from the AutoAugment dispatchers. They effectively unwrap the input (see vision/torchvision/prototype/transforms/_auto_augment.py, lines 514 to 516 at 7eb5d7f), so the input gets dispatched to the image kernel. In theory, that is no issue, since the video kernel added by this PR does the same. However, as detailed in #6670, some image kernels break for degenerate or multiple batch dimensions. Previously, we opted to only fix the video kernels in 0d2ad96. To use this fix, we would need to wrap the inputs to the AA dispatchers into a Video again.
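A hypothetical sketch of that re-wrapping; the helper name and the ndim check are illustrative:

```python
import torch
from torchvision.prototype import features

def as_video(inpt: torch.Tensor) -> torch.Tensor:
    # hypothetical helper: re-wrap a plain tensor so the dispatchers pick the
    # (fixed) video kernels; the Video feature requires at least 4 dims
    if not isinstance(inpt, features.Video) and inpt.ndim >= 4:
        return features.Video(inpt)
    return inpt
```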
LGTM, feel free to merge on green CI.
Summary:

* add video feature
* add video kernels
* add video testing utils
* add one kernel info
* fix kernel names in Video feature
* use only uint8 for video testing
* require at least 4 dims for Video feature
* add TODO for image_size -> spatial_size
* image -> video in feature constructor
* introduce new combined images and video type
* add video to transform utils
* fix transforms test
* fix auto augment
* cleanup
* address review comments
* add remaining video kernel infos
* add batch dimension squashing to some kernels
* fix tests and kernel infos
* add xfails for arbitrary batch sizes on some kernels
* fix test setup
* fix equalize_image_tensor for multi batch dims
* fix adjust_sharpness_image_tensor for multi batch dims
* address review comments

Reviewed By: NicolasHug

Differential Revision: D40427483

fbshipit-source-id: 748602811638a2b9c56134f14ea107714de86040