add Video feature and kernels #6667

Merged: 29 commits merged from the video branch into pytorch:main on Oct 7, 2022

Conversation

@pmeier (Collaborator) commented Sep 29, 2022

No description provided.

@datumbox (Contributor) left a comment:

I saw you left comments for us, so I assume you want us to take a look. If that's not the case, feel free to ignore.

@bjuncek (Contributor) left a comment:

One high-level comment:

At the moment, if I understand it correctly, the Video feature class only deals with sequence-of-images data. How hard (or easy) would it be to extend this to additional video modalities (say, audio)?

@pmeier (Collaborator, Author) commented Oct 4, 2022

> At the moment, if I understand it correctly, the Video feature class only deals with sequence-of-images data. How hard (or easy) would it be to extend this to additional video modalities (say, audio)?

As is, Video is a tensor subclass and thus does not support more than one storage, so I don't think it is possible to bundle video and audio into one tensor. There are "nested" tensors, but they have been in a prototype state for quite some time now. Plus, the tensors ought to have the same dimensionality (not shape), which would be quite awkward for audio. I'm guessing the actual shape should be something like (*, K, L), like we have in our video datasets now:

- audio(Tensor[K, L]): the audio frames, where `K` is the number of channels
and `L` is the number of points in torch.float tensor

(Noob question: we don't see T here because the audio is sampled differently, i.e. at a much higher rate than the video, right?)

Could you be more specific about what kind of transformations on video with audio you have in mind?

@pmeier pmeier marked this pull request as ready for review October 4, 2022 13:33
@pmeier pmeier requested a review from datumbox October 4, 2022 13:48
@datumbox (Contributor) left a comment:

Overall looks good. A few comments below. We also need to adapt the AutoAugment classes.

One concern I have is whether there is a specific transform that doesn't support videos (one that squashes dimensions, for example) or does something unexpected. I'm happy to go over the whole API and check for things we need to do or have missed. Another thing we need after this PR, as clean-up, is to remove the image wording (both from internal method variables and from arguments like image_size).

@bjuncek (Contributor) commented Oct 4, 2022

> As is, Video is a tensor subclass and thus does not support more than one storage, so I don't think it is possible to bundle video and audio into one tensor.

Ah, ok, makes sense.

> Noob question: we don't see T here because the audio is sampled differently, i.e. at a much higher rate than the video, right?

Yeah. Basically, audio should be num_channels (K) x a 1-D signal (L). A signal is typically sampled at 48 kHz, so for a second of stereo audio you'd have a tensor of size 2 x 48000.
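For concreteness, a tiny sketch of that shape:

import torch

# One second of stereo audio at 48 kHz: K=2 channels, L=48000 sample points.
audio = torch.zeros(2, 48_000, dtype=torch.float)
print(audio.shape)  # torch.Size([2, 48000])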

> what kind of transformations on video with audio you have in mind?

Usually the transforms are handled separately (when I was last working in the multimodal video field, we'd apply things separately). So, for example, in the dataset's __getitem__ we'd have:

...
video, audio, info = read_video(path)
if v_transform is not None:
    video = v_transform(video)
if a_transform is not None:
    audio = a_transform(audio)
...
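For reference, the fragment above wrapped into a self-contained dataset sketch (the class name and constructor arguments are illustrative, not part of this PR):

from torch.utils.data import Dataset
from torchvision.io import read_video

class VideoClipDataset(Dataset):
    # Sketch of the per-modality transform pattern from the fragment above.
    def __init__(self, paths, v_transform=None, a_transform=None):
        self.paths = paths
        self.v_transform = v_transform
        self.a_transform = a_transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # read_video returns (video frames, audio frames, metadata)
        video, audio, info = read_video(self.paths[idx])
        if self.v_transform is not None:
            video = self.v_transform(video)
        if self.a_transform is not None:
            audio = self.a_transform(audio)
        return video, audio, info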

This all makes sense though. Let's worry about audio (and whether we'd even need it) down the line.

@pmeier pmeier requested a review from datumbox October 5, 2022 13:44
@datumbox (Contributor) left a comment:

LGTM, after addressing some of the comments below. I'll have another look at the whole API immediately after merging this to ensure we didn't miss anything.

Code under review (the isinstance check, updated to cover Video as well):

    # before
    if isinstance(inpt, torch.Tensor) and (torch.jit.is_scripting() or not isinstance(inpt, features.Image)):

    # after
    ) -> features.ImageOrVideoTypeJIT:
        if isinstance(inpt, torch.Tensor) and (
            torch.jit.is_scripting() or not isinstance(inpt, (features.Image, features.Video))

Review comment (Contributor):

This specific isinstance check for image or video is correct, but I think we missed it in all the other places in the dispatchers. I think we need to add it everywhere.
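To make the pattern concrete, a minimal sketch of such a dispatcher with the combined check; the feature classes and kernels below are self-contained stand-ins, not the actual torchvision source:

import torch

class Image(torch.Tensor): ...   # stand-in for features.Image
class Video(torch.Tensor): ...   # stand-in for features.Video

def adjust_sharpness_image_tensor(inpt, sharpness_factor):  # stub kernel
    return inpt

def adjust_sharpness_video(inpt, sharpness_factor):  # stub kernel
    return inpt

def adjust_sharpness(inpt, sharpness_factor):
    # Plain tensors, and scripted calls (where subclass information is
    # unavailable), go straight to the tensor kernel; Image and Video
    # features fall through to their dedicated kernels.
    if isinstance(inpt, torch.Tensor) and (
        torch.jit.is_scripting() or not isinstance(inpt, (Image, Video))
    ):
        return adjust_sharpness_image_tensor(inpt, sharpness_factor=sharpness_factor)
    elif isinstance(inpt, Video):
        return adjust_sharpness_video(inpt, sharpness_factor=sharpness_factor)
    else:
        return adjust_sharpness_image_tensor(inpt, sharpness_factor=sharpness_factor)

frames = torch.rand(8, 3, 16, 16).as_subclass(Video)
out = adjust_sharpness(frames, 2.0)  # routed to the (stub) video kernel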

@pmeier (Collaborator, Author) replied:

I went through the whole API again and found a few places where we need to decide if we want video support there:

  • classification transforms: MixUp, CutMix
  • object detection transforms: SimpleCopyPaste, RandomIoUCrop
  • deprecated transforms: Grayscale, RandomGrayscale
  • outlier transforms: FiveCrop, TenCrop (we are not supporting anything besides images here so far)
  • image only transforms (by name): ConvertImageDtype

For everything else this is fixed.

@datumbox (Contributor) commented Oct 6, 2022

@pmeier As discussed offline, we should also make sure we handle batches for the video kernels until #6670 is resolved.

Code under discussion:

    xfails_image_degenerate_or_multi_batch_dims = xfail_all_tests(

@pmeier (Collaborator, Author) commented:

I've added them here to properly test video kernels that have the squash / unsquash "wrappers" now and thus do support arbitrary batch dimensions.
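For illustration, a minimal sketch of that squash / unsquash pattern, using the stable torchvision.transforms.functional.adjust_sharpness as the 4D-only inner kernel (the wrapper name here is hypothetical):

import torch
from torchvision.transforms.functional import adjust_sharpness

def adjust_sharpness_video(video, sharpness_factor):
    # Squash all leading batch dimensions into one so the 4D image kernel
    # applies, then restore the original shape afterwards.
    shape = video.shape  # e.g. (B, T, C, H, W) or any (*, C, H, W)
    flat = video.reshape(-1, *shape[-3:])
    out = adjust_sharpness(flat, sharpness_factor)
    return out.reshape(shape)

# A batch of two 8-frame videos passes through with its shape unchanged:
clip = torch.randint(0, 256, (2, 8, 3, 32, 32), dtype=torch.uint8)
assert adjust_sharpness_video(clip, 2.0).shape == clip.shape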

@datumbox (Contributor) commented Oct 7, 2022

We got breakages on some tests when we throw in batches:

TypeError: Input image tensor should have 3 or 4 dimensions, but found 5
_ TestSmoke.test_auto_augment[AugMix-torchvision.prototype.features._video.Video-20] _
Traceback (most recent call last):
  File "/home/runner/work/vision/vision/test/test_prototype_transforms.py", line 183, in test_auto_augment
    transform(input)
  File "/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 517, in forward
    aug, transform_id, magnitude, interpolation=self.interpolation, fill=self.fill
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 133, in _apply_image_or_video_transform
    return F.adjust_sharpness(image, sharpness_factor=1.0 + magnitude)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/functional/_color.py", line 84, in adjust_sharpness
    return adjust_sharpness_image_tensor(inpt, sharpness_factor=sharpness_factor)
  File "/home/runner/work/vision/vision/torchvision/transforms/functional_tensor.py", line 845, in adjust_sharpness
    return _blend(img, _blurred_degenerate_image(img), sharpness_factor)
  File "/home/runner/work/vision/vision/torchvision/transforms/functional_tensor.py", line 825, in _blurred_degenerate_image
    result_tmp = conv2d(result_tmp, kernel, groups=result_tmp.shape[-3])
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 1, 1, 23, 15]
(The same adjust_sharpness failure repeats for the Video-21 and Video-22 parametrizations of TestSmoke.test_auto_augment[AugMix-...], with conv2d input sizes [1, 1, 1, 26, 19] and [1, 3, 3, 24, 22] respectively.)
_ TestSmoke.test_auto_augment[AugMix-torchvision.prototype.features._video.Video-23] _
Traceback (most recent call last):
  File "/home/runner/work/vision/vision/test/test_prototype_transforms.py", line 183, in test_auto_augment
    transform(input)
  File "/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 517, in forward
    aug, transform_id, magnitude, interpolation=self.interpolation, fill=self.fill
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/_auto_augment.py", line 141, in _apply_image_or_video_transform
    return F.equalize(image)
  File "/home/runner/work/vision/vision/torchvision/prototype/transforms/functional/_color.py", line 204, in equalize
    return equalize_image_tensor(inpt)
  File "/home/runner/work/vision/vision/torchvision/transforms/functional_tensor.py", line 900, in equalize
    raise TypeError(f"Input image tensor should have 3 or 4 dimensions, but found {img.ndim}")
TypeError: Input image tensor should have 3 or 4 dimensions, but found 5

We might need to review the mitigations.

@pmeier (Collaborator, Author) commented Oct 7, 2022

The breakages that we observed in AugMix were two-fold:

  1. There was an error in the video generation for the tests. I thought I had fixed it in f1e2bfa but I needed d8945e6 as well.

  2. The problem with AugMix is that we don't transform the extracted image directly, but perform some operations on it first:

    batch = image.view([1] * max(4 - image.ndim, 0) + orig_dims)

This effectively unwraps the features.Video into a torch.Tensor. When we call the dispatchers later

    aug = self._apply_image_transform(
        aug, transform_id, magnitude, interpolation=self.interpolation, fill=self.fill
    )

the input gets dispatched to the image kernel. In theory, that is no issue since the video kernel added by this PR does the same. However, as detailed in #6670, equalize_image_tensor and adjust_sharpness_image_tensor only support 4d inputs. Batched videos are 5d and thus the call fails.
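For illustration, a minimal sketch of the unwrapping: a toy stand-in whose __torch_function__ deliberately returns plain tensors, matching the behavior described above (not the actual prototype implementation):

import torch
from torch._C import DisableTorchFunction

class Video(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Run the op with the torch function protocol disabled, so the
        # result comes back as a plain torch.Tensor.
        with DisableTorchFunction():
            return func(*args, **(kwargs or {}))

video = torch.rand(3, 2, 4, 4).as_subclass(Video)
batch = video.view([1, *video.shape])
print(type(video).__name__, "->", type(batch).__name__)  # Video -> Tensor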

Previously, we opted to only fix the video kernels in 0d2ad96. To use this fix, we would need to wrap the inputs to the AA dispatchers into a Video again, which might entail a significant runtime cost until #6681 or #6718 are merged. Thus, I've opted to fix the image kernels here since we need to do this anyway at some point.

@datumbox (Contributor) left a comment:

LGTM, feel free to merge on green CI.

@pmeier pmeier merged commit 3118fb5 into pytorch:main Oct 7, 2022
@pmeier pmeier deleted the video branch October 7, 2022 13:59
facebook-github-bot pushed a commit that referenced this pull request Oct 17, 2022
Summary:
* add video feature

* add video kernels

* add video testing utils

* add one kernel info

* fix kernel names in Video feature

* use only uint8 for video testing

* require at least 4 dims for Video feature

* add TODO for image_size -> spatial_size

* image -> video in feature constructor

* introduce new combined images and video type

* add video to transform utils

* fix transforms test

* fix auto augment

* cleanup

* address review comments

* add remaining video kernel infos

* add batch dimension squashing to some kernels

* fix tests and kernel infos

* add xfails for arbitrary batch sizes on some kernels

* fix test setup

* fix equalize_image_tensor for multi batch dims

* fix adjust_sharpness_image_tensor for multi batch dims

* address review comments

Reviewed By: NicolasHug

Differential Revision: D40427483

fbshipit-source-id: 748602811638a2b9c56134f14ea107714de86040