add Video feature and kernels #6667

Merged: 29 commits merged from the video branch into pytorch:main on Oct 7, 2022

Conversation

@pmeier (Collaborator) commented Sep 29, 2022

No description provided.

@datumbox (Contributor) left a comment:

I saw you left comments for us, so I assume you want us to take a look. If that's not the case, feel free to ignore.

@bjuncek (Contributor) left a comment:

One high-level comment:

At the moment, if I understand it correctly, the Video feature class only deals with sequence-of-images data. How hard (or easy) would it be to extend this to additional video modalities (say, audio)?

@pmeier (Collaborator, Author) commented Oct 4, 2022

> At the moment, if I understand it correctly, the Video feature class only deals with sequence-of-images data. How hard (or easy) would it be to extend this to additional video modalities (say, audio)?

As is, Video is a tensor subclass and thus does not support more than one storage, so I don't think it is possible to bundle video and audio into one tensor. There are "nested" tensors, but they have been in a prototype state for quite some time now. Plus, the tensors ought to have the same dimensionality (not shape), which would be quite awkward for audio. I'm guessing the actual shape should be something like (*, K, L), like we have in our video datasets now:

- audio(Tensor[K, L]): the audio frames, where `K` is the number of channels
and `L` is the number of points in torch.float tensor

(Noob question: we don't see T here because the audio is sampled differently, i.e. at a much higher rate than the video, right?)

Could you be more specific about what kind of transformations on video with audio you have in mind?

@pmeier pmeier marked this pull request as ready for review October 4, 2022 13:33
@pmeier pmeier requested a review from datumbox October 4, 2022 13:48
@datumbox (Contributor) left a comment:

Overall looks good. A few comments below. We also need to adapt the AutoAugment classes.

One concern I have is whether there is a specific transform that doesn't support videos (one that squashes dimensions, for example) or does something unexpected. I'm happy to go over the whole API and check for things we need to do or have missed. Another thing we need after this PR, as clean-up, is to remove the image wording (both from internal method variables and from arguments like image_size).

@bjuncek (Contributor) commented Oct 4, 2022

> As is, Video is a tensor subclass and thus does not support more than one storage, so I don't think it is possible to bundle video and audio into one tensor.

Ah, ok, makes sense.

> Noob question: we don't see T here because the audio is sampled differently, i.e. at a much higher rate than the video, right?

Yeah. Basically, audio should be num_channels (K) x a 1-D signal (L). A signal is typically sampled at 48 kHz, so for a second of stereo audio you'd have a tensor of size 2 x 48000.
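For concreteness, a tiny sketch of that shape:

import torch

# One second of stereo audio at 48 kHz: K=2 channels, L=48000 sample points.
audio = torch.zeros(2, 48_000, dtype=torch.float)
print(audio.shape)  # torch.Size([2, 48000])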

> what kind of transformations on video with audio you have in mind?

Usually the transforms are handled separately (when I was last working in the multimodal video field, we'd apply things separately). So, for example, in the dataset's __getitem__ we'd have:

...
video, audio, info = read_video(path)
if v_transform is not None:
    video = v_transform(video)
if a_transform is not None:
    audio = a_transform(audio)
...
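For reference, the fragment above wrapped into a self-contained dataset sketch (the class name and constructor arguments are illustrative, not part of this PR):

from torch.utils.data import Dataset
from torchvision.io import read_video

class VideoClipDataset(Dataset):
    # Sketch of the per-modality transform pattern from the fragment above.
    def __init__(self, paths, v_transform=None, a_transform=None):
        self.paths = paths
        self.v_transform = v_transform
        self.a_transform = a_transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # read_video returns (video frames, audio frames, metadata)
        video, audio, info = read_video(self.paths[idx])
        if self.v_transform is not None:
            video = self.v_transform(video)
        if self.a_transform is not None:
            audio = self.a_transform(audio)
        return video, audio, info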

This all makes sense though. Let's worry about audio (and whether we'd even need it) down the line.

@pmeier pmeier requested a review from datumbox October 5, 2022 13:44
@datumbox (Contributor) left a comment:

LGTM, after addressing some of the comments below. I'll have another look at the whole API immediately after merging this to ensure we didn't miss anything.

Code under review (the isinstance check, updated to cover Video as well):

    # before
    if isinstance(inpt, torch.Tensor) and (torch.jit.is_scripting() or not isinstance(inpt, features.Image)):

    # after
    ) -> features.ImageOrVideoTypeJIT:
        if isinstance(inpt, torch.Tensor) and (
            torch.jit.is_scripting() or not isinstance(inpt, (features.Image, features.Video))

Review comment (Contributor):

This specific isinstance check for image or video is correct, but I think we missed it in all the other places in the dispatchers. I think we need to add it everywhere.
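To make the pattern concrete, a minimal sketch of such a dispatcher with the combined check; the feature classes and kernels below are self-contained stand-ins, not the actual torchvision source:

import torch

class Image(torch.Tensor): ...   # stand-in for features.Image
class Video(torch.Tensor): ...   # stand-in for features.Video

def adjust_sharpness_image_tensor(inpt, sharpness_factor):  # stub kernel
    return inpt

def adjust_sharpness_video(inpt, sharpness_factor):  # stub kernel
    return inpt

def adjust_sharpness(inpt, sharpness_factor):
    # Plain tensors, and scripted calls (where subclass information is
    # unavailable), go straight to the tensor kernel; Image and Video
    # features fall through to their dedicated kernels.
    if isinstance(inpt, torch.Tensor) and (
        torch.jit.is_scripting() or not isinstance(inpt, (Image, Video))
    ):
        return adjust_sharpness_image_tensor(inpt, sharpness_factor=sharpness_factor)
    elif isinstance(inpt, Video):
        return adjust_sharpness_video(inpt, sharpness_factor=sharpness_factor)
    else:
        return adjust_sharpness_image_tensor(inpt, sharpness_factor=sharpness_factor)

frames = torch.rand(8, 3, 16, 16).as_subclass(Video)
out = adjust_sharpness(frames, 2.0)  # routed to the (stub) video kernel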

@pmeier (Collaborator, Author) replied:

I went through the whole API again and found a few places where we need to decide if we want video support there:

  • classification transforms: MixUp, CutMix
  • object detection transforms: SimpleCopyPaste, RandomIoUCrop
  • deprecated transforms: Grayscale, RandomGrayscale
  • outlier transforms: FiveCrop, TenCrop (we are not supporting anything besides images here so far)
  • image only transforms (by name): ConvertImageDtype

For everything else this is fixed.

@datumbox (Contributor) commented Oct 6, 2022

@pmeier As discussed offline, we should also make sure we handle batches for the video kernels until #6670 is resolved.

Code under discussion:

    xfails_image_degenerate_or_multi_batch_dims = xfail_all_tests(

@pmeier (Collaborator, Author) commented:

I've added them here to properly test video kernels that have the squash / unsquash "wrappers" now and thus do support arbitrary batch dimensions.
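For illustration, a minimal sketch of that squash / unsquash pattern, using the stable torchvision.transforms.functional.adjust_sharpness as the 4D-only inner kernel (the wrapper name here is hypothetical):

import torch
from torchvision.transforms.functional import adjust_sharpness

def adjust_sharpness_video(video, sharpness_factor):
    # Squash all leading batch dimensions into one so the 4D image kernel
    # applies, then restore the original shape afterwards.
    shape = video.shape  # e.g. (B, T, C, H, W) or any (*, C, H, W)
    flat = video.reshape(-1, *shape[-3:])
    out = adjust_sharpness(flat, sharpness_factor)
    return out.reshape(shape)

# A batch of two 8-frame videos passes through with its shape unchanged:
clip = torch.randint(0, 256, (2, 8, 3, 32, 32), dtype=torch.uint8)
assert adjust_sharpness_video(clip, 2.0).shape == clip.shape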

@datumbox (Contributor) commented Oct 7, 2022

We got breakages on some tests when we throw in batches:

TypeError: Input image tensor should have 3 or 4 dimensions, but found 5
_ TestSmoke.test_auto_augment[AugMix-torchvision.prototype.features._video.Video-20] _
Traceback (most recent call last):
  File "/home/runner/work/vision/vision/test/test_prototype_transforms.py", line 183, in test_auto_augment
    transform(input)
  File "/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 517, in forward
    aug, transform_id, magnitude, interpolation=self.interpolation, fill=self.fill
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 133, in _apply_image_or_video_transform
    return F.adjust_sharpness(image, sharpness_factor=1.0 + magnitude)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/functional/_color.py", line 84, in adjust_sharpness
    return adjust_sharpness_image_tensor(inpt, sharpness_factor=sharpness_factor)
  File "/home/runner/work/vision/vision/torchvision/transforms/functional_tensor.py", line 845, in adjust_sharpness
    return _blend(img, _blurred_degenerate_image(img), sharpness_factor)
  File "/home/runner/work/vision/vision/torchvision/transforms/functional_tensor.py", line 825, in _blurred_degenerate_image
    result_tmp = conv2d(result_tmp, kernel, groups=result_tmp.shape[-3])
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 1, 1, 23, 15]
(The same adjust_sharpness failure repeats for the Video-21 and Video-22 parametrizations of TestSmoke.test_auto_augment[AugMix-...], with conv2d input sizes [1, 1, 1, 26, 19] and [1, 3, 3, 24, 22] respectively.)
_ TestSmoke.test_auto_augment[AugMix-torchvision.prototype.features._video.Video-23] _
Traceback (most recent call last):
  File "/home/runner/work/vision/vision/test/test_prototype_transforms.py", line 183, in test_auto_augment
    transform(input)
  File "/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 517, in forward
    aug, transform_id, magnitude, interpolation=self.interpolation, fill=self.fill
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 141, in _apply_image_or_video_transform
    return F.equalize(image)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/functional/_color.py", line 204, in equalize
    return equalize_image_tensor(inpt)
  File "/home/runner/work/vision/vision/torchvision/transforms/functional_tensor.py", line 900, in equalize
    raise TypeError(f"Input image tensor should have 3 or 4 dimensions, but found {img.ndim}")
TypeError: Input image tensor should have 3 or 4 dimensions, but found 5

We might need to review the mitigations.

@pmeier (Collaborator, Author) commented Oct 7, 2022

The breakages that we observed in AugMix were two-fold:

  1. There was an error in the video generation for the tests. I thought I had fixed it in f1e2bfa but I needed d8945e6 as well.

  2. The problem with AugMix is that we don't transform the extracted image directly, but perform some operations on it first:

    batch = image.view([1] * max(4 - image.ndim, 0) + orig_dims)

This effectively unwraps the features.Video into a torch.Tensor. When we call the dispatchers later

    aug = self._apply_image_transform(
        aug, transform_id, magnitude, interpolation=self.interpolation, fill=self.fill
    )

the input gets dispatched to the image kernel. In theory, that is no issue since the video kernel added by this PR does the same. However, as detailed in #6670, equalize_image_tensor and adjust_sharpness_image_tensor only support 4d inputs. Batched videos are 5d and thus the call fails.
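For illustration, a minimal sketch of the unwrapping: a toy stand-in whose __torch_function__ deliberately returns plain tensors, matching the behavior described above (not the actual prototype implementation):

import torch
from torch._C import DisableTorchFunction

class Video(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Run the op with the torch function protocol disabled, so the
        # result comes back as a plain torch.Tensor.
        with DisableTorchFunction():
            return func(*args, **(kwargs or {}))

video = torch.rand(3, 2, 4, 4).as_subclass(Video)
batch = video.view([1, *video.shape])
print(type(video).__name__, "->", type(batch).__name__)  # Video -> Tensor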

Previously, we opted to only fix the video kernels in 0d2ad96. To use this fix, we would need to wrap the inputs to the AA dispatchers into a Video again, which might entail a significant runtime cost until #6681 or #6718 are merged. Thus, I've opted to fix the image kernels here since we need to do this anyway at some point.

@datumbox (Contributor) left a comment:

LGTM, feel free to merge on green CI.

@pmeier pmeier merged commit 3118fb5 into pytorch:main Oct 7, 2022
@pmeier pmeier deleted the video branch October 7, 2022 13:59
facebook-github-bot pushed a commit that referenced this pull request Oct 17, 2022
Summary:
* add video feature

* add video kernels

* add video testing utils

* add one kernel info

* fix kernel names in Video feature

* use only uint8 for video testing

* require at least 4 dims for Video feature

* add TODO for image_size -> spatial_size

* image -> video in feature constructor

* introduce new combined images and video type

* add video to transform utils

* fix transforms test

* fix auto augment

* cleanup

* address review comments

* add remaining video kernel infos

* add batch dimension squashing to some kernels

* fix tests and kernel infos

* add xfails for arbitrary batch sizes on some kernels

* fix test setup

* fix equalize_image_tensor for multi batch dims

* fix adjust_sharpness_image_tensor for multi batch dims

* address review comments

Reviewed By: NicolasHug

Differential Revision: D40427483

fbshipit-source-id: 748602811638a2b9c56134f14ea107714de86040