Let Resize() handle uint8 images natively for bilinear mode #7497

Closed
NicolasHug opened this issue Apr 4, 2023 · 7 comments

Comments

@NicolasHug
Member

NicolasHug commented Apr 4, 2023

On uint8 tensors, Resize() currently converts the input image to and from torch.float to pass it to interpolate(), because interpolate() didn't support native uint8 inputs in the past. This is suboptimal.

@vfdev-5 and I have recently implemented native uint8 support for interpolate(mode="bilinear") in pytorch/pytorch#90771 and pytorch/pytorch#96848.

We should integrate this native uint8 support into torchvision's Resize(). The benchmarks below show that this could make Resize() at least 3X faster, which saves close to 1 ms per image and reduces the total time of a typical classification pipeline by about 30% (the pipeline includes auto-augment, which becomes the next bottleneck). This would make the Tensor / DataPoint backend significantly faster than PIL.

Some challenges to address before integrating:

  • The native uint8 improvements mostly target AVX2 architectures. Compared to the current (float) Resize() implementation, is performance still acceptable on architectures that don't support AVX2? First, we need to identify whether those non-AVX2 targets are critical.
  • BC: although more strictly correct, the native uint8 path may produce off-by-one differences from the current float path. Mitigation: only integrate native uint8 into the V2 Resize(), where BC commitments are looser.

Benchmarks were made with @pmeier's pmeier/detection-reference-benchmark@0ae9027 and the following patch:

+class ResizeUint8(torch.nn.Module):
+    def __init__(self, force_channels_last):
+        super().__init__()
+        self.force_channels_last = force_channels_last
+
+    def forward(self, img):
+        img = img.unsqueeze(0)
+        if self.force_channels_last:
+            img = img.contiguous(memory_format=torch.channels_last)
+        return torch.nn.functional.interpolate(img, size=[223, 223], mode="bilinear", antialias=True, align_corners=None).squeeze(0)
+
 def classification_complex_pipeline_builder(*, input_type, api_version):
     if input_type == "Datapoint" and api_version == "v1":
         return None
@@ -94,9 +106,15 @@ def classification_complex_pipeline_builder(*, input_type, api_version):
     if api_version == "v1":
         transforms = transforms_v1
         RandomResizedCropWithoutResize = RandomResizedCropWithoutResizeV1
+        resize = transforms.Resize(223, antialias=True)
     elif api_version == "v2":
         transforms = transforms_v2
         RandomResizedCropWithoutResize = RandomResizedCropWithoutResizeV2
+        if input_type in ("Datapoint", "Tensor"):
+            # resize = ResizeUint8(force_channels_last=False)
+            resize = transforms.Resize(223, antialias=True)
+        else:
+            resize = transforms.Resize(223, antialias=True)
     else:
         raise RuntimeError(f"Got {api_version=}")
 
@@ -106,11 +124,14 @@ def classification_complex_pipeline_builder(*, input_type, api_version):
         pipeline.append(transforms.PILToTensor())
     elif input_type == "Datapoint":
         pipeline.append(transforms.ToImageTensor())
+    
+
 
     pipeline.extend(
         [
             RandomResizedCropWithoutResize(224),
-            transforms.Resize(224, antialias=True),
+            # transforms.Resize(223, antialias=True),
+            resize,
             transforms.RandomHorizontalFlip(p=0.5),
             transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
         ]

Without uint8 native support:

############################################################
classification-complex
############################################################
input_type='Tensor', api_version='v1'

Results computed for 1_000 samples

                                  median          std   
PILToTensor                          258 µs +-     24 µs
RandomResizedCropWithoutResizeV1     111 µs +-     22 µs
Resize                              1238 µs +-    311 µs
RandomHorizontalFlip                  53 µs +-     21 µs
AutoAugment                         1281 µs +-    840 µs
RandomErasing                         31 µs +-     66 µs
ConvertImageDtype                    120 µs +-     13 µs
Normalize                            186 µs +-     23 µs

total                               3278 µs
------------------------------------------------------------
input_type='Tensor', api_version='v2'

Results computed for 1_000 samples

                                  median          std   
PILToTensor                          271 µs +-     21 µs
RandomResizedCropWithoutResizeV2     113 µs +-     17 µs
Resize                              1226 µs +-    304 µs
RandomHorizontalFlip                  64 µs +-     24 µs
AutoAugment                         1099 µs +-    738 µs
RandomErasing                         39 µs +-     68 µs
ConvertDtype                          96 µs +-     12 µs
Normalize                            150 µs +-     17 µs

total                               3057 µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

Results computed for 1_000 samples

                                  median          std   
RandomResizedCropWithoutResizeV1     162 µs +-     27 µs
Resize                               787 µs +-    186 µs
RandomHorizontalFlip                  53 µs +-     29 µs
AutoAugment                          585 µs +-    342 µs
PILToTensor                           96 µs +-      9 µs
RandomErasing                         32 µs +-     65 µs
ConvertImageDtype                    125 µs +-     14 µs
Normalize                            850 µs +-     83 µs

total                               2688 µs
------------------------------------------------------------
input_type='PIL', api_version='v2'

Results computed for 1_000 samples

                                  median          std   
RandomResizedCropWithoutResizeV2     166 µs +-     26 µs
Resize                               783 µs +-    185 µs
RandomHorizontalFlip                  61 µs +-     33 µs
AutoAugment                          489 µs +-    355 µs
PILToTensor                          115 µs +-      9 µs
RandomErasing                         37 µs +-     65 µs
ConvertDtype                         101 µs +-     11 µs
Normalize                            825 µs +-     84 µs

total                               2577 µs
------------------------------------------------------------
input_type='Datapoint', api_version='v2'

Results computed for 1_000 samples

                                  median          std   
ToImageTensor                        284 µs +-     22 µs
RandomResizedCropWithoutResizeV2     119 µs +-     17 µs
Resize                              1223 µs +-    302 µs
RandomHorizontalFlip                  62 µs +-     29 µs
AutoAugment                         1100 µs +-    625 µs
RandomErasing                         39 µs +-     72 µs
ConvertDtype                         106 µs +-     13 µs
Normalize                            155 µs +-     16 µs

total                               3089 µs
------------------------------------------------------------

Summaries

           v2 / v1
Tensor        0.93
PIL           0.96

                     [a]   [b]   [c]   [d]   [e]
   Tensor, v1, [a]  1.00  1.07  1.22  1.27  1.06
   Tensor, v2, [b]  0.93  1.00  1.14  1.19  0.99
      PIL, v1, [c]  0.82  0.88  1.00  1.04  0.87
      PIL, v2, [d]  0.79  0.84  0.96  1.00  0.83
Datapoint, v2, [e]  0.94  1.01  1.15  1.20  1.00

Slowdown as row / col
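
For readers who want to check the summary table, each entry is the row's total divided by the column's total. A minimal sketch, using the totals reported above (the dictionary and function names are only illustrative and not part of the benchmark code):

```python
# Total pipeline times in µs, copied from the "total" lines above.
totals = {
    "Tensor, v1": 3278,
    "Tensor, v2": 3057,
    "PIL, v1": 2688,
    "PIL, v2": 2577,
    "Datapoint, v2": 3089,
}


def slowdown_matrix(totals):
    """Return {row: {col: row_total / col_total}}, rounded to 2 decimals."""
    return {
        row: {col: round(t_row / t_col, 2) for col, t_col in totals.items()}
        for row, t_row in totals.items()
    }


matrix = slowdown_matrix(totals)
for row, cols in matrix.items():
    print(f"{row:>14}  " + "  ".join(f"{v:.2f}" for v in cols.values()))
```

A value above 1 means the row's configuration is slower than the column's; a value below 1 means it is faster.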

With uint8 native support for TensorV2 and DatapointV2:

############################################################
classification-complex
############################################################
input_type='Tensor', api_version='v1'

Results computed for 1_000 samples

                                  median          std   
PILToTensor                          255 µs +-     21 µs
RandomResizedCropWithoutResizeV1     110 µs +-     22 µs
Resize                              1230 µs +-    315 µs
RandomHorizontalFlip                  47 µs +-     24 µs
AutoAugment                         1269 µs +-    870 µs
RandomErasing                         31 µs +-     66 µs
ConvertImageDtype                    121 µs +-     13 µs
Normalize                            186 µs +-     23 µs

total                               3249 µs
------------------------------------------------------------
input_type='Tensor', api_version='v2'

Results computed for 1_000 samples

                                  median          std   
PILToTensor                          270 µs +-     20 µs
RandomResizedCropWithoutResizeV2     110 µs +-     17 µs
ResizeUint8                          402 µs +-    109 µs
RandomHorizontalFlip                  66 µs +-     24 µs
AutoAugment                          996 µs +-    539 µs
RandomErasing                         39 µs +-     64 µs
ConvertDtype                          81 µs +-     10 µs
Normalize                            134 µs +-     14 µs

total                               2099 µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

Results computed for 1_000 samples

                                  median          std   
RandomResizedCropWithoutResizeV1     161 µs +-     28 µs
Resize                               779 µs +-    186 µs
RandomHorizontalFlip                  53 µs +-     29 µs
AutoAugment                          576 µs +-    339 µs
PILToTensor                           93 µs +-      8 µs
RandomErasing                         31 µs +-     64 µs
ConvertImageDtype                    123 µs +-     13 µs
Normalize                            843 µs +-     82 µs

total                               2661 µs
------------------------------------------------------------
input_type='PIL', api_version='v2'

Results computed for 1_000 samples

                                  median          std   
RandomResizedCropWithoutResizeV2     163 µs +-     26 µs
Resize                               788 µs +-    180 µs
RandomHorizontalFlip                  62 µs +-     33 µs
AutoAugment                          492 µs +-    355 µs
PILToTensor                          112 µs +-      9 µs
RandomErasing                         37 µs +-     64 µs
ConvertDtype                         100 µs +-     11 µs
Normalize                            826 µs +-     86 µs

total                               2580 µs
------------------------------------------------------------
input_type='Datapoint', api_version='v2'

Results computed for 1_000 samples

                                  median          std   
ToImageTensor                        284 µs +-     22 µs
RandomResizedCropWithoutResizeV2     118 µs +-     17 µs
ResizeUint8                          410 µs +-    109 µs
RandomHorizontalFlip                  68 µs +-     23 µs
AutoAugment                          994 µs +-    542 µs
RandomErasing                         38 µs +-     63 µs
ConvertDtype                          81 µs +-     10 µs
Normalize                            133 µs +-     14 µs

total                               2127 µs
------------------------------------------------------------

Summaries

           v2 / v1
Tensor        0.65
PIL           0.97

                     [a]   [b]   [c]   [d]   [e]
   Tensor, v1, [a]  1.00  1.55  1.22  1.26  1.53
   Tensor, v2, [b]  0.65  1.00  0.79  0.81  0.99
      PIL, v1, [c]  0.82  1.27  1.00  1.03  1.25
      PIL, v2, [d]  0.79  1.23  0.97  1.00  1.21
Datapoint, v2, [e]  0.65  1.01  0.80  0.82  1.00

Slowdown as row / col
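
The speed-up quoted at the top of this issue can be checked from the two Tensor, v2 runs above. The following is a small arithmetic sketch, not benchmark code; the variable names are illustrative.

```python
# Median times in µs for input_type='Tensor', api_version='v2', from the two runs above.
resize_float_us = 1226   # "Resize" without native uint8 support
resize_uint8_us = 402    # "ResizeUint8" with native uint8 support
total_before_us = 3057   # pipeline total without native uint8 support
total_after_us = 2099    # pipeline total with native uint8 support

resize_speedup = resize_float_us / resize_uint8_us
saved_us = resize_float_us - resize_uint8_us
total_reduction = 1 - total_after_us / total_before_us

print(f"Resize speedup: {resize_speedup:.2f}x")
print(f"Saved per image in Resize: {saved_us} µs")
print(f"Total pipeline reduction: {total_reduction:.1%}")
```

Note that the total dropped by more than the Resize saving alone, because the medians of other transforms (e.g. AutoAugment) also differ slightly between the two runs.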
@FrancescoSaverioZuppichini

+1 from my side. This is really important, since data processing is the bottleneck for most CV pipelines.

@pmeier
Collaborator

pmeier commented May 1, 2023

Regarding the perf regression on non-AVX2 platforms: in pytorch/pytorch#100164, we are adding an option to detect the used CPU capability. With that, we can have both code paths and thus get the improved performance on AVX2 platforms "for free". Note that the BC concerns, i.e. the one-off differences, are still valid.
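
The idea of keeping both code paths can be sketched as below. This is only an illustration: `pick_uint8_resize_path` is a hypothetical helper, and the capability string is passed in as a plain argument. It assumes that the capability query from pytorch/pytorch#100164 reports a string containing "AVX2" on AVX2-capable CPUs; the exact API and its return values are not stated in this thread.

```python
def pick_uint8_resize_path(mode: str, cpu_capability: str) -> str:
    """Choose how to resize a uint8 image on CPU.

    Hypothetical helper: `cpu_capability` stands in for whatever string the
    runtime reports (assumed here to contain "AVX2" on AVX2-capable CPUs).
    """
    if mode == "bilinear" and "AVX2" in cpu_capability:
        # The fast native kernel is expected to be available:
        # pass the uint8 tensor to interpolate() directly.
        return "native_uint8"
    # Otherwise keep the existing behaviour:
    # uint8 -> float -> interpolate() -> round -> uint8.
    return "float_round_trip"


for capability in ("AVX2", "DEFAULT"):
    print(capability, "->", pick_uint8_resize_path("bilinear", capability))
```

With such a check, non-AVX2 platforms would keep their current performance, while the one-off differences would only appear on platforms that take the native path.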

@NicolasHug
Member Author

NicolasHug commented May 22, 2023

TL;DR: From torchvision 0.16, V2 transforms are always faster than PIL and V1 for classification pipelines. Speedups of the whole pipeline range from 10% to 40%.
Thanks a ton for the amazing work @vfdev-5!

Update: reported results below are with bilinear mode, but similar speedups are observed for bicubic mode.


I ran a few benchmarks with the latest improvements now that #7557 is merged. Below is a summary, with more details in the expandable section.

  • Simple classification pipeline: PILToTensor -> RandomResizedCrop -> RandomHorizontalFlip -> ConvertDtype -> Normalize
    • V2 is ~40% faster than PIL and ~35% faster than V1
                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.65   0.61
Tensor, v1  [b]   1.53   1.00   0.93
   PIL, v1  [c]   1.64   1.07   1.00

Slowdown computed as row / column
  • Complex classification pipeline: PILToTensor -> RandomResizedCrop -> RandomHorizontalFlip -> AutoAugment -> RandomErasing -> ConvertDtype -> Normalize
                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.72   0.89
Tensor, v1  [b]   1.38   1.00   1.23
   PIL, v1  [c]   1.12   0.81   1.00

Slowdown computed as row / column
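
A note on reading these ratios: "X% faster" can mean either less time per image or more images per second, and the two differ. The sketch below converts the "Tensor, v2" ratios from the two tables above into both; the function names are illustrative.

```python
def time_reduction(slowdown: float) -> float:
    """Fraction of time saved when a pipeline takes `slowdown` times the baseline."""
    return 1.0 - slowdown


def throughput_gain(slowdown: float) -> float:
    """Relative increase in images per second for the same ratio."""
    return 1.0 / slowdown - 1.0


# Ratios taken from the "Tensor, v2" rows of the two tables above.
for name, ratio in [
    ("simple, vs V1", 0.65),
    ("simple, vs PIL", 0.61),
    ("complex, vs V1", 0.72),
    ("complex, vs PIL", 0.89),
]:
    print(f"{name}: {time_reduction(ratio):.0%} less time, "
          f"{throughput_gain(ratio):.0%} more throughput")
```

The "~40% faster than PIL and ~35% faster than V1" figures for the simple pipeline correspond to the time-reduction reading.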

Side notes:

  • Resize() is ~2X faster than PIL on the inputs coming from Crop(). Larger speed-ups (up to 4X) can be seen with more controlled inputs, e.g. tensors created with torch.rand. The variability is explained by the memory layout of the input, to which interpolate() is quite sensitive. It's fairly easy to make a design decision that favors some kinds of inputs while being detrimental to others; we tried our best to balance use-cases.
  • The speed-ups reported above against PIL don't come only from the improvements we made to Resize(), but also from the unexpected and convenient fact that on the tensor backend, in contrast to PIL, the tensors fed to Normalize() are channels-first (CF), which is what Normalize() prefers. The layout is largely uncontrollable by the user, so in the benchmarks below I also benchmark scenarios where we manually convert the input of Normalize() to CF or channels-last (CL), or use torch.compile(normalize). Basically, the results show that it's always better to either manually convert to CF before passing to Normalize(), or to torch.compile() it.
  • Now that we have solved the Resize() bottleneck (for bilinear mode), the new slowest transform is AutoAugment(), which is ~2X slower on tensors than on PIL.
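
To make the CF / CL distinction concrete (reading CF and CL as channels-first and channels-last): both layouts describe the same logical (C, H, W) image, and only the order of the elements in memory differs. A minimal pure-Python sketch that computes element strides for each layout; the helper names are illustrative and not torchvision API.

```python
def channels_first_strides(c: int, h: int, w: int) -> tuple:
    """Strides (in elements) of a (C, H, W) image stored plane by plane."""
    return (h * w, w, 1)


def channels_last_strides(c: int, h: int, w: int) -> tuple:
    """Strides (in elements) of a (C, H, W) image stored pixel by pixel (HWC in memory)."""
    return (1, w * c, c)


def flat_index(strides: tuple, ch: int, y: int, x: int) -> int:
    """Offset of element (ch, y, x) in the underlying flat buffer."""
    return ch * strides[0] + y * strides[1] + x * strides[2]


cf = channels_first_strides(3, 224, 224)
cl = channels_last_strides(3, 224, 224)
print("CF strides:", cf)
print("CL strides:", cl)
# Distance in memory between horizontally adjacent pixels of the same channel:
print("CF pixel step:", flat_index(cf, 0, 0, 1) - flat_index(cf, 0, 0, 0))
print("CL pixel step:", flat_index(cl, 0, 0, 1) - flat_index(cl, 0, 0, 0))
```

In CF, all values of one channel are contiguous, whereas in CL they are interleaved with the other channels. That is one plausible reading of why a per-channel operation such as Normalize() is slower in the CL-Normalize runs below.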

Benchmarks were run with @pmeier's https://github.com/pmeier/detection-reference-benchmark on this custom branch: https://github.com/pmeier/detection-reference-benchmark/compare/main...NicolasHug:detection-reference-benchmark:bench_classif_after_resize_v2_improvements?expand=1 (commit == 621ccaa)


############################################################
Classif simple, Vanilla
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                280
RandomResizedCrop          535
RandomHorizontalFlip        69
ConvertDtype                82
Normalize                  140
--------------------  --------
Total                     1116

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                256
RandomResizedCrop         1127
RandomHorizontalFlip        57
ConvertImageDtype          105
Normalize                  162
--------------------  --------
Total                     1705

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          938
RandomHorizontalFlip        78
PILToTensor                 90
ConvertImageDtype          116
Normalize                  614
--------------------  --------
Total                     1827

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.65   0.61
Tensor, v1  [b]   1.53   1.00   0.93
   PIL, v1  [c]   1.64   1.07   1.00

Slowdown computed as row / column
############################################################
Classif simple, CL-Normalize
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                269
RandomResizedCrop          534
RandomHorizontalFlip        70
ConvertDtype                84
ToCL                       153
Normalize                  581
--------------------  --------
Total                     1692

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                249
RandomResizedCrop         1123
RandomHorizontalFlip        57
ConvertImageDtype          105
Normalize                  163
--------------------  --------
Total                     1709

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          909
RandomHorizontalFlip        78
PILToTensor                 89
ConvertImageDtype          120
Normalize                  618
--------------------  --------
Total                     1819

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.99   0.93
Tensor, v1  [b]   1.01   1.00   0.94
   PIL, v1  [c]   1.08   1.06   1.00

Slowdown computed as row / column
############################################################
Classif simple, CF-Normalize
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                267
RandomResizedCrop          531
RandomHorizontalFlip        67
ConvertDtype                83
ToCF                        17
Normalize                  140
--------------------  --------
Total                     1119

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                263
RandomResizedCrop         1128
RandomHorizontalFlip        57
ConvertImageDtype          105
Normalize                  164
--------------------  --------
Total                     1747

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          880
RandomHorizontalFlip        77
PILToTensor                 89
ConvertImageDtype          114
Normalize                  599
--------------------  --------
Total                     1755

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.64   0.64
Tensor, v1  [b]   1.56   1.00   1.00
   PIL, v1  [c]   1.57   1.00   1.00

Slowdown computed as row / column
############################################################
Classif simple, compiled-Normalize
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                268
RandomResizedCrop          538
RandomHorizontalFlip        67
ConvertDtype                81
CompiledNormalize          117
--------------------  --------
Total                     1084

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                266
RandomResizedCrop         1134
RandomHorizontalFlip        57
ConvertImageDtype          103
Normalize                  159
--------------------  --------
Total                     1744

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          905
RandomHorizontalFlip        79
PILToTensor                 89
ConvertImageDtype          115
Normalize                  605
--------------------  --------
Total                     1790

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.62   0.61
Tensor, v1  [b]   1.61   1.00   0.97
   PIL, v1  [c]   1.65   1.03   1.00

Slowdown computed as row / column
############################################################
Classif complex, Vanilla
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                285
RandomResizedCrop          537
RandomHorizontalFlip        65
AutoAugment               1002
RandomErasing               38
ConvertDtype                83
Normalize                  135
--------------------  --------
Total                     2154

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                249
RandomResizedCrop         1126
RandomHorizontalFlip        52
AutoAugment               1247
RandomErasing               31
ConvertImageDtype          106
Normalize                  167
--------------------  --------
Total                     2973

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          889
RandomHorizontalFlip        47
AutoAugment                582
PILToTensor                 92
RandomErasing               31
ConvertImageDtype          110
Normalize                  603
--------------------  --------
Total                     2419

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.72   0.89
Tensor, v1  [b]   1.38   1.00   1.23
   PIL, v1  [c]   1.12   0.81   1.00

Slowdown computed as row / column
############################################################
Classif complex, CL-Normalize
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                264
RandomResizedCrop          538
RandomHorizontalFlip        63
AutoAugment               1002
RandomErasing               38
ConvertDtype                84
ToCL                       147
Normalize                  582
--------------------  --------
Total                     2734

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                270
RandomResizedCrop         1155
RandomHorizontalFlip        48
AutoAugment               1253
RandomErasing               31
ConvertImageDtype          108
Normalize                  165
--------------------  --------
Total                     2985

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          924
RandomHorizontalFlip        56
AutoAugment                576
PILToTensor                 92
RandomErasing               31
ConvertImageDtype          112
Normalize                  602
--------------------  --------
Total                     2462

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.92   1.11
Tensor, v1  [b]   1.09   1.00   1.21
   PIL, v1  [c]   0.90   0.82   1.00

Slowdown computed as row / column
############################################################
Classif complex, CF-Normalize
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                284
RandomResizedCrop          534
RandomHorizontalFlip        60
AutoAugment               1004
RandomErasing               38
ConvertDtype                85
ToCF                        18
Normalize                  128
--------------------  --------
Total                     2172

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                256
RandomResizedCrop         1143
RandomHorizontalFlip        51
AutoAugment               1211
RandomErasing               31
ConvertImageDtype          107
Normalize                  166
--------------------  --------
Total                     2975

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          908
RandomHorizontalFlip        48
AutoAugment                571
PILToTensor                 92
RandomErasing               31
ConvertImageDtype          110
Normalize                  604
--------------------  --------
Total                     2440

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.73   0.89
Tensor, v1  [b]   1.37   1.00   1.22
   PIL, v1  [c]   1.12   0.82   1.00

Slowdown computed as row / column
############################################################
Classif complex, compiled-Normalize
############################################################
input_type='Tensor', api_version='v2'

transform               median
--------------------  --------
PILToTensor                297
RandomResizedCrop          549
RandomHorizontalFlip        57
AutoAugment                993
RandomErasing               39
ConvertDtype                84
CompiledNormalize          118
--------------------  --------
Total                     2157

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='Tensor', api_version='v1'

transform               median
--------------------  --------
PILToTensor                262
RandomResizedCrop         1138
RandomHorizontalFlip        50
AutoAugment               1228
RandomErasing               31
ConvertImageDtype          107
Normalize                  164
--------------------  --------
Total                     2981

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
input_type='PIL', api_version='v1'

transform               median
--------------------  --------
RandomResizedCrop          911
RandomHorizontalFlip        49
AutoAugment                570
PILToTensor                 92
RandomErasing               31
ConvertImageDtype          112
Normalize                  607
--------------------  --------
Total                     2420

Results computed for 1_000 samples and reported in µs
------------------------------------------------------------
Summary

                   [a]    [b]    [c]
---------------  -----  -----  -----
Tensor, v2  [a]   1.00   0.72   0.89
Tensor, v1  [b]   1.38   1.00   1.23
   PIL, v1  [c]   1.12   0.81   1.00

Slowdown computed as row / column
############################################################
Collecting environment information...
PyTorch version: 2.1.0.dev20230522
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (conda-forge gcc 9.5.0-16) 9.5.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.25.2
Libc version: glibc-2.31

Python version: 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)  [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1019-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB

Nvidia driver version: 525.85.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         2999.998
BogoMIPS:                        5999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        48 MiB
L3 cache:                        71.5 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-pfn-extras==0.5.8
[pip3] pytorch-triton==2.1.0+46672772b4
[pip3] torch==2.1.0.dev20230522
[pip3] torchdata==0.5.0a0+25c6180
[pip3] torchvision==0.16.0a0+6ccc712
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] libblas                   3.9.0            14_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            14_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            14_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            14_linux64_mkl    conda-forge
[conda] mkl                       2022.0.1           h06a4308_117  
[conda] numpy                     1.22.3                   pypi_0    pypi
[conda] pytorch                   2.1.0.dev20230522 py3.9_cuda11.7_cudnn8.5.0_0    pytorch-nightly
[conda] pytorch-cuda              11.7                 h778d358_5    pytorch-nightly
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] pytorch-pfn-extras        0.5.8                    pypi_0    pypi
[conda] pytorch-triton            2.1.0+46672772b4          pypi_0    pypi
[conda] torch                     2.1.0.dev20230403+cu117          pypi_0    pypi
[conda] torchdata                 0.5.0a0+25c6180           dev_0    <develop>
[conda] torchtriton               2.1.0+7d1a95b046            py39    pytorch-nightly
[conda] torchvision               0.16.0a0+6ccc712           dev_0    <develop>

@FrancescoSaverioZuppichini

That is amazing, thanks a lot @NicolasHug. We are talking about single elements on CPU, right? I was wondering about batching on GPU :)

@vfdev-5
Collaborator

vfdev-5 commented May 23, 2023

It's for single image transformations on CPU, single thread (as in the dataloader)

@NicolasHug
Member Author

Yes, those improvements are for CPU. The PyTorch DataLoader is designed in such a way that only single images are transformed (not batches), so that's what we benchmarked above. It will work on batches as well, of course.
We haven't focused on improving GPU pre-processing speed.
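
For illustration, here is a minimal sketch of a uint8-in, uint8-out bilinear resize applied to a single image and to a batch on CPU. It assumes a PyTorch build recent enough to accept uint8 inputs in `interpolate(mode="bilinear")`; the helper name `resize_uint8` and the random input tensors are hypothetical and are not part of torchvision.

```python
import torch
import torch.nn.functional as F

# Hypothetical inputs: random uint8 data standing in for decoded images.
image = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)      # single CHW image
batch = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)   # NCHW batch


def resize_uint8(x, size):
    """Bilinear resize that stays in uint8 (no explicit float round trip).

    Assumption: the installed PyTorch supports uint8 for bilinear interpolate on CPU.
    """
    squeeze = x.ndim == 3
    if squeeze:
        x = x.unsqueeze(0)  # interpolate() expects a leading batch dimension
    out = F.interpolate(x, size=size, mode="bilinear", antialias=True, align_corners=False)
    return out.squeeze(0) if squeeze else out


single_out = resize_uint8(image, (224, 224))  # what a DataLoader worker would do per sample
batch_out = resize_uint8(batch, (224, 224))   # the same call also accepts a batch
```

Both calls return uint8 tensors; only the leading batch dimension differs.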

@FrancescoSaverioZuppichini

Sure. In my experience it is actually quite efficient (depending on the pipeline) to send the uint8 images to the GPU and do the transformations there, or to do the resize/crop in each worker, send the uint8 result, and do the rest on the GPU, since in a lot of use cases the transfer time is a bottleneck (or at least it is on my system).

This addition is actually amazing, I'll test it out as well
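
A rough sketch of the split described above (resize in the workers, transfer uint8, finish on the accelerator) could look like the following. The batch contents, the helper name `to_device_and_normalize`, and the mean/std values are placeholders chosen for illustration, and the code falls back to CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical batch as it would leave the DataLoader workers:
# already resized/cropped, but still uint8.
batch_u8 = torch.randint(0, 256, (8, 3, 224, 224), dtype=torch.uint8)

# Placeholder per-channel statistics (ImageNet-style values, used only as an example).
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)


def to_device_and_normalize(x):
    # Moving uint8 transfers 1 byte per value instead of 4 for float32.
    # non_blocking=True only overlaps the copy when the source is in pinned memory.
    x = x.to(device, non_blocking=True)
    x = x.float().div_(255.0)  # convert to float only after the transfer
    return (x - mean) / std


out = to_device_and_normalize(batch_u8)
```

Whether this is faster than doing everything in the workers depends on the pipeline and hardware, so it is worth measuring on the target system.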
