Let Resize() handle uint8 images natively for bilinear mode #7497
+1 from my side. This is really important, since for most CV pipelines the data processing is the bottleneck.
Regarding the perf regression on non-AVX2 platforms: in pytorch/pytorch#100164, we are adding an option to detect the CPU capability in use. With that, we can have both code paths and thus get the improved performance on AVX2 platforms "for free". Note that the BC concerns, i.e. the one-off differences, are still valid.
TL;DR: From torchvision 0.16, V2 transforms are always faster than PIL and V1 for classification pipelines. Speedups of the whole pipeline range from 10% to 40%.

Update: the results reported below are with bilinear mode, but similar speedups are observed for bicubic mode.

I ran a few benchmarks with the latest improvements now that #7557 is merged. Below is a summary, with more details in the expandable section.
Side notes:
Benchmarks were run with @pmeier's https://github.com/pmeier/detection-reference-benchmark on this custom branch: https://github.com/pmeier/detection-reference-benchmark/compare/main...NicolasHug:detection-reference-benchmark:bench_classif_after_resize_v2_improvements?expand=1 (commit == 621ccaa)
That is amazing, thanks a lot @NicolasHug. We're talking about single elements on CPU, right? Was wondering about batching on GPU :)
It's for single-image transformations on CPU, single thread (as in the dataloader).
Yes, those improvements are for CPU. The PyTorch DataLoader is designed in such a way that only single images are transformed (not batches), so that's what we benchmarked above. It will work on batches as well, of course.
Sure. In my experience it is actually quite efficient (depending on the pipeline) to send the uint8 to the GPU and do the transformations there, or to do the resize/crop in each worker, send the uint8, and do the rest on the GPU, since in a lot of use cases the transfer time is a bottleneck (at least on my system). This addition is actually amazing, I'll test it out as well.
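A quick back-of-envelope for the transfer-time point above: shipping uint8 to the GPU and converting to float there moves 4x fewer bytes over the bus than converting to float32 on the CPU first. The batch shape below is an illustrative assumption, not from the thread.

```python
# Assumed example batch: 256 RGB images at 224x224.
batch, c, h, w = 256, 3, 224, 224
n_elems = batch * c * h * w

uint8_mb = n_elems * 1 / 2**20    # 1 byte per element
float32_mb = n_elems * 4 / 2**20  # 4 bytes per element

print(f"uint8: {uint8_mb:.0f} MiB, float32: {float32_mb:.0f} MiB")
```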
On uint8 tensors, Resize() currently converts the input image to and from torch.float to pass it to interpolate(), because interpolate() didn't support native uint8 inputs in the past. This is suboptimal.

@vfdev-5 and I have recently implemented native uint8 support for interpolate(mode="bilinear") in pytorch/pytorch#90771 and pytorch/pytorch#96848.

We should integrate this native uint8 support into torchvision's Resize(). Benchmarks below show that such integration could lead to at least a 3X improvement in Resize()'s time, which saves 1ms per image and yields a 30% improvement of the total pipeline time for a typical classification pipeline (including auto-augment, which is the next bottleneck). This would make the Tensor / Datapoint backend significantly faster than PIL.

Some current challenges before integration are:
- Compared to the current Resize() implem (float), is the perf still OK on archs that don't support AVX2? First: need to identify whether those non-AVX2 targets are critical or not.

Benchmarks made with @pmeier's pmeier/detection-reference-benchmark@0ae9027 and with the following patch
Without uint8 native support:

With uint8 native support for Tensor V2 and Datapoint V2:
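To make the two code paths concrete, here is a pure-NumPy sketch (my own illustration, not torchvision's or PyTorch's actual kernels) of the float round-trip that Resize() does today versus a fixed-point native-uint8 path. The fixed-point variant also shows where the "one-off differences" mentioned above can come from: once weights are quantized and ties round half-up instead of half-to-even, outputs can differ from the float path by ±1.

```python
import numpy as np

def _grid(in_size, out_size):
    # Half-pixel-center sampling (align_corners=False style).
    s = (np.arange(out_size) + 0.5) * in_size / out_size - 0.5
    s0f = np.floor(s)
    lo = np.clip(s0f, 0, in_size - 1).astype(int)
    hi = np.clip(s0f + 1, 0, in_size - 1).astype(int)
    return lo, hi, s - s0f  # left index, right index, fractional weight

def resize_bilinear_via_float(img_u8, out_h, out_w):
    # Current Resize() behaviour: uint8 -> float32 -> interpolate -> uint8.
    in_h, in_w = img_u8.shape
    x0, x1, wx = _grid(in_w, out_w)
    y0, y1, wy = _grid(in_h, out_h)
    f = img_u8.astype(np.float32)  # 4x the memory traffic of uint8
    top = f[y0][:, x0] * (1 - wx) + f[y0][:, x1] * wx
    bot = f[y1][:, x0] * (1 - wx) + f[y1][:, x1] * wx
    out = top * (1 - wy)[:, None] + bot * wy[:, None]
    return np.clip(np.round(out), 0, 255).astype(np.uint8)

def resize_bilinear_uint8_fixedpoint(img_u8, out_h, out_w, precision=15):
    # Sketch of a native-uint8 path: weights quantized to integers,
    # accumulation in int64, one final rounding shift (round half up).
    in_h, in_w = img_u8.shape
    unit = 1 << precision
    x0, x1, wx = _grid(in_w, out_w)
    y0, y1, wy = _grid(in_h, out_h)
    wxi = np.round(wx * unit).astype(np.int64)
    wyi = np.round(wy * unit).astype(np.int64)
    p = img_u8.astype(np.int64)
    top = p[y0][:, x0] * (unit - wxi) + p[y0][:, x1] * wxi  # scale 2^precision
    bot = p[y1][:, x0] * (unit - wxi) + p[y1][:, x1] * wxi
    acc = top * (unit - wyi)[:, None] + bot * wyi[:, None]  # scale 2^(2*precision)
    out = (acc + (1 << (2 * precision - 1))) >> (2 * precision)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Both paths agree to within one gray level, which is exactly the BC concern: the native path avoids the float32 detour but is not guaranteed bit-identical to the existing implementation.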