Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed Up Jpeg Encoder Color Conversion #1476

Closed
JimBobSquarePants opened this issue Dec 15, 2020 · 14 comments
Closed

Speed Up Jpeg Encoder Color Conversion #1476

JimBobSquarePants opened this issue Dec 15, 2020 · 14 comments

Comments

@JimBobSquarePants
Copy link
Member

Some analysis of the performance of the encoder based upon a breakdown of this benchmark indicates that encoding a large jpeg takes 80% of the entire processing time.

https://github.com/kleisauke/net-vips/tree/master/tests/NetVips.Benchmarks

This is due to the lack of hardware acceleration in our color conversion approach.

The current Jpeg encoder utilizes predefined tables to convert a span of Rgb24 pixels into separate Y Cb Cr Block8x8F planes.

public void Convert(ImageFrame<TPixel> frame, int x, int y, in RowOctet<TPixel> currentRows)
{
this.pixelBlock.LoadAndStretchEdges(frame.PixelBuffer, x, y, currentRows);
Span<Rgb24> rgbSpan = this.rgbBlock.AsSpanUnsafe();
PixelOperations<TPixel>.Instance.ToRgb24(frame.GetConfiguration(), this.pixelBlock.AsSpanUnsafe(), rgbSpan);
ref Block8x8F yBlock = ref this.Y;
ref Block8x8F cbBlock = ref this.Cb;
ref Block8x8F crBlock = ref this.Cr;
ref Rgb24 rgbStart = ref rgbSpan[0];
for (int i = 0; i < 64; i++)
{
ref Rgb24 c = ref Unsafe.Add(ref rgbStart, i);
this.colorTables.ConvertPixelInto(
c.R,
c.G,
c.B,
ref yBlock,
ref cbBlock,
ref crBlock,
i);
}
}

While this is faster than naïve per-pixel floating point calculation it can be heavily optimized.

Short Term Goal

Add AVX2? acceleration directly to the converter to optimize conversion for .NET Core 3.1+. This should be a few hours work for someone with SIMD knowledge.

Long Term Goal

Establish an architecture similar to the Jpeg decoder ColorConverters allowing incremental addition accelerated converters for all platforms and color spaces.

@tannergooding
Copy link
Contributor

DirectX Math is MIT licensed and already provides SIMD accelerated algorithms (for x86/x64 and ARM64) for many standard color conversions: https://docs.microsoft.com/en-us/windows/win32/api/directxmath/nf-directxmath-xmcolorrgbtoyuv

https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMisc.inl#L1728-L1738

It's documented to use ITU-R BT.601/CCIR 601 W(r) = 0.299 W(b) = 0.114 U(max) = 0.436 V(max) = 0.615., which I believe is what you want.

There is likewise https://docs.microsoft.com/en-us/windows/win32/api/directxmath/nf-directxmath-xmcolorrgbtoyuv_hd which uses ITU-R BT.709 W(r) = 0.2126 W(b) = 0.0722 U(max) = 0.436 V(max) = 0.615.

@JimBobSquarePants
Copy link
Member Author

Thanks @tannergooding I'll see how well the code there fits with our existing architecture.

@antonfirsov
Copy link
Member

Note that the referenced netvips benchmark is quite atypical for Resize. Users usually downscale much more than 90%, so I wouldn't worry that much for the encoder being a bottleneck.

We should profile this though, I wonder how do the bottlenecks distribute exactly.

@JimBobSquarePants
Copy link
Member Author

@Sergio0694 was doing some work with 4K images the other day and the benchmarks he showed me indicated that the encoder was a major bottleneck. Not atypical but also not that uncommon.

I’ll try and dig out the screenshot

@JimBobSquarePants
Copy link
Member Author

jpeg-encoder-bench

@JimBobSquarePants
Copy link
Member Author

@tkp1n Has provided us with some great improvements via #1508 and I'll profile an encode to see where further bottlenecks are.

@antonfirsov
Copy link
Member

Ah yeah saving 4K JPEG must be very slow indeed. (Hope noticeably better with #1508).

What I mean is that we got away with a slow encoder this long because typical thumbnail maker code is usually saving a very small output image, so it's not that hot for web content management probably. (Doesn't mean it's not painful in other use-cases.)

@Sergio0694
Copy link
Member

Yeah I was very surprised to see just how much slower ImageSharp was at JPEG encoding/decoding 😥

I was expecting it to be somewhat on par, but especially the encoding part is really a lot, a lot slower. In my case basically just saving the image takes more than the entirety of copying to GPU, processing it and copying it back. But like, it takes 4x times as all those steps combined, and I haven't even optimized them that much either. I was kinda tempted to switch my samples to System.Drawing, though in the end I didn't because, well, I love you guys, and also the API surface of System.Drawing is ugly 😄

Point is, any speed improvements in this area would be a super welcome improvement, especially if you're all concerned about people running comparative benchmarks between ImageSharp and other common image processing libraries. On this point, will make some tests on a few improvements I've been meaning to add to the resize kernel using FMA instructions too 🚀

@tkp1n
Copy link
Contributor

tkp1n commented Jan 21, 2021

I've attached a speedscope dump from PerfView as asked for in #1517 (comment).. The trace is from a BenchmarkDotNet benchmark of a 4K JPEG export after the optimization in #1517.

Unzip it, open it in https://www.speedscope.app/, select "Left heavy" (top menu), scroll all the way down..

JPEG_encode.speedscope.zip

@antonfirsov
Copy link
Member

Not a pro with this tool, but if I'm reading it right, RowOctet constructor, and Emit are the new bottlenecks. @tkp1n can you confirm?

Here only the RowOctet thing is related to color conversion and can be fixed by addressing the following TODO note:

// TODO: Try pushing this to the outer loop!
var currentRows = new RowOctet<TPixel>(pixelBuffer, y + yOff);

@tkp1n
Copy link
Contributor

tkp1n commented Jan 21, 2021

can you confirm?

Exactly, yes.

@antonfirsov
Copy link
Member

@JimBobSquarePants I think we can close this in favor of a general JpegEncoder perf tracking issue.

@JimBobSquarePants
Copy link
Member Author

Ok. Let’s migrate all the relevant info.

@JimBobSquarePants
Copy link
Member Author

Fixed via #2120

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants